PySpark sum() Columns Example

PySpark sum() is an aggregate function that returns the SUM of selected columns. This function should be used on numeric columns. The sum of a column is also referred to as the total of the column’s values.

You can calculate the sum of a column in PySpark in several ways, for example by using pyspark.sql.functions.sum(), pyspark.pandas.DataFrame.sum(), or the SQL SUM aggregate.

First, let’s create a PySpark DataFrame.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java",4000,5), \
    ("Python", 4600,10),  \
    ("Scala", 4100,15),   \
    ("Scala", 4500,15),   \
    ("PHP", 3000,20),  \
  )
columns= ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

PySpark sum() Function

The sum() is a built-in function of PySpark SQL that is used to get the total of a specific column. This function takes a column, either as a Column object or as a column name string, and returns the result as a Column.

The following is the syntax of the sum() function.


# Syntax of functions.sum()
pyspark.sql.functions.sum(col: ColumnOrName) → pyspark.sql.column.Column

Using the sum() function, let’s get the sum of a column. The below example returns the sum of the fee column.


# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()

Yields below output.


# Output
+--------+
|sum(fee)|
+--------+
|   20200|
+--------+
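
Note that sum() also accepts the column name as a string, since its parameter is ColumnOrName. This small sketch over the same DataFrame returns an identical result.


# sum() also accepts the column name as a string
df.select(sum("fee")).show()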

Let’s perform the sum() on multiple columns. In this example, I have also used the PySpark alias() function to assign a custom name to each sum column.


# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"), 
          sum(df.discount).alias("discount_sum")
    ).show()

Yields below output.


# Output
+-------+------------+
|fee_sum|discount_sum|
+-------+------------+
|  20200|          65|
+-------+------------+
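
The same aggregation can also be expressed with DataFrame.agg(), which is convenient when you want to collect several aggregates in a single pass; a minimal sketch:


# Equivalent aggregation using agg()
from pyspark.sql.functions import sum
df.agg(sum("fee").alias("fee_sum"),
       sum("discount").alias("discount_sum")).show()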

GroupBy sum()

Let’s see another example of getting a sum for each group.


# groupby sum on all columns
df.groupBy("CourseName").sum() \
     .show() 

# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
     .show()  

Yields below output (the row order of groupBy() results may vary).


# Output
+----------+--------+-------------+
|CourseName|sum(fee)|sum(discount)|
+----------+--------+-------------+
|      Java|    4000|            5|
|    Python|    4600|           10|
|     Scala|    8600|           30|
|       PHP|    3000|           20|
+----------+--------+-------------+

+----------+--------+
|CourseName|sum(fee)|
+----------+--------+
|      Java|    4000|
|    Python|    4600|
|     Scala|    8600|
|       PHP|    3000|
+----------+--------+
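
To control the names of the aggregated columns when grouping, combine groupBy() with agg() and alias(); a short sketch:


# groupby with agg() to alias the aggregated columns
from pyspark.sql.functions import sum
df.groupBy("CourseName") \
  .agg(sum("fee").alias("fee_sum"),
       sum("discount").alias("discount_sum")) \
  .show()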

PySpark SQL SUM

PySpark SQL also provides a way to run these aggregations as ANSI SQL statements. Hence, let’s perform the group by on coursename and calculate the sum of the remaining numeric columns of the DataFrame.

To run the SQL query, first register the DataFrame as a temporary view using createOrReplaceTempView(), then use the spark.sql() function. This view is available until you end your current SparkSession.

spark.sql() returns a DataFrame, and here I have used show() to display the contents on the console.


# sum() on columns with spark.sql
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
                "from course group by coursename")
df2.show()
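
You can also compute the overall total in SQL without a GROUP BY, against the same temporary view; a minimal sketch:


# Overall sum with SQL (no GROUP BY)
spark.sql("select sum(fee) as fee_total from Course").show()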

Using Pandas API

Finally, if you are using the Pandas API on Spark, use the following. The sum() method returns the sum of each numeric column of the DataFrame as a Series.

Note that a PySpark DataFrame doesn’t have a sum() method; however, we can use the Pandas API on Spark to use the sum() method.
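
Before switching to the Pandas API, note that you can build the same per-column totals in native PySpark by selecting a sum() for each numeric field. A minimal sketch (the list of numeric type names here is an assumption; extend it to cover your schema):


# Native PySpark alternative: sum every numeric column in one pass
from pyspark.sql.functions import sum
numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long", "double")]
df.select([sum(c).alias(c + "_sum") for c in numeric_cols]).show()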


# Imports
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses':["Spark",np.nan,"pandas","Java","Spark"],
    'Fee':[20000,25000,30000,22000,np.nan],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
}

# Create a DataFrame
df = ps.DataFrame(technologies)
print(df)

# Get the SUM
sumRes = df.sum()
print(sumRes)

Yields the below output.


# Output
Fee         97000.0
Discount     9200.0
dtype: float64
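
With the Pandas API you can also sum a single column, or sum row-wise by passing axis=1 (a sketch, assuming your pyspark.pandas version supports the axis parameter):


# Sum of a single column (returns a scalar)
print(df['Fee'].sum())

# Row-wise sum across numeric columns
print(df.sum(axis=1))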

Complete Example

The following is a complete example.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java",4000,5), \
    ("Python", 4600,10),  \
    ("Scala", 4100,15),   \
    ("Scala", 4500,15),   \
    ("PHP", 3000,20),  \
  )
columns= ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()

# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"), 
          sum(df.discount).alias("discount_sum")
    ).show()

# groupby sum on all columns
df.groupBy("CourseName").sum() \
     .show() 

# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
     .show()      

df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
                "from course group by coursename")
df2.show()
     
# Imports to use Pandas API
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses':["Spark",np.nan,"pandas","Java","Spark"],
    'Fee':[20000,25000,30000,22000,np.nan],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
}
df = ps.DataFrame(technologies)

print(df)
print(df.sum())

Frequently Asked Questions on sum()

How to handle null values when using sum() in PySpark?

The sum() function in PySpark ignores null values by default. If you want to treat null values as zero in the sum, you can use the coalesce() function. Below is an example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, coalesce, lit

spark = SparkSession.builder.appName("sum examples").getOrCreate()

# Prepare Data with null values
data = [(100,), (200,), (None,)]
columns = ["marks"]
df = spark.createDataFrame(data, columns)

# Calculate the sum of the 'marks' column, treating nulls as zero
sum_total = df.agg(sum(coalesce(df["marks"], lit(0)))).collect()[0][0]
print(f"Total Sum (including null values): {sum_total}")
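
For comparison, sum() skips nulls by default, so a plain sum() over the same column also returns 300. The coalesce() wrapper mainly matters when every value is null, where a plain sum() returns null instead of 0. A quick sketch:


# Default behavior: sum() ignores nulls, so this also returns 300
df.agg(sum("marks")).show()

# On an all-null column, plain sum() yields null while coalesce() yields 0
all_null_df = spark.createDataFrame([(None,), (None,)], "marks: int")
all_null_df.agg(sum("marks")).show()
all_null_df.agg(sum(coalesce(all_null_df["marks"], lit(0)))).show()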

Conclusion

In this article, you have learned how to calculate the sum of columns in PySpark by using the SQL function sum(), the Pandas API on Spark, group-by sum, and more.
