PySpark sum() Columns Example

PySpark sum() is an aggregate function that returns the SUM of selected columns. This function should be used on numeric columns. The sum of a column is also referred to as the total of the column’s values.

You can calculate the sum of a column in PySpark in several ways, for example by using pyspark.sql.functions.sum(), pyspark.pandas.DataFrame.sum(), or the SQL SUM aggregate.

First, let’s create a PySpark DataFrame.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java",4000,5), \
    ("Python", 4600,10),  \
    ("Scala", 4100,15),   \
    ("Scala", 4500,15),   \
    ("PHP", 3000,20),  \
  )
columns= ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

PySpark sum() Function

The sum() is a built-in function of PySpark SQL that is used to get the total of a specific column. This function takes a column, either as a Column object or as a column name string, and returns the result as a Column.

The following is the syntax of the sum() function.


# Syntax of functions.sum()
pyspark.sql.functions.sum(col: ColumnOrName) → pyspark.sql.column.Column

Using the sum() function, let’s get the sum of a column. The below example returns the sum of the fee column.


# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()

Yields below output.


# Output
+--------+
|sum(fee)|
+--------+
|   20200|
+--------+
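
Note that sum() also accepts the column name as a string, since its parameter is ColumnOrName. This small sketch over the same DataFrame returns an identical result.


# sum() also accepts the column name as a string
df.select(sum("fee")).show()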

Let’s perform the sum() on multiple columns. In this example, I have also used the PySpark alias() function to assign a custom name to each sum column.


# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"), 
          sum(df.discount).alias("discount_sum")
    ).show()

Yields below output.


# Output
+-------+------------+
|fee_sum|discount_sum|
+-------+------------+
|  20200|          65|
+-------+------------+
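
The same aggregation can also be expressed with DataFrame.agg(), which is convenient when you want to collect several aggregates in a single pass; a minimal sketch:


# Equivalent aggregation using agg()
from pyspark.sql.functions import sum
df.agg(sum("fee").alias("fee_sum"),
       sum("discount").alias("discount_sum")).show()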

GroupBy sum()

Let’s see another example of getting a sum for each group.


# groupby sum on all columns
df.groupBy("CourseName").sum() \
     .show() 

# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
     .show()  

Yields below output (the row order of groupBy() results may vary).


# Output
+----------+--------+-------------+
|CourseName|sum(fee)|sum(discount)|
+----------+--------+-------------+
|      Java|    4000|            5|
|    Python|    4600|           10|
|     Scala|    8600|           30|
|       PHP|    3000|           20|
+----------+--------+-------------+

+----------+--------+
|CourseName|sum(fee)|
+----------+--------+
|      Java|    4000|
|    Python|    4600|
|     Scala|    8600|
|       PHP|    3000|
+----------+--------+
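
To control the names of the aggregated columns when grouping, combine groupBy() with agg() and alias(); a short sketch:


# groupby with agg() to alias the aggregated columns
from pyspark.sql.functions import sum
df.groupBy("CourseName") \
  .agg(sum("fee").alias("fee_sum"),
       sum("discount").alias("discount_sum")) \
  .show()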

PySpark SQL SUM

PySpark SQL also provides a way to run these aggregations as ANSI SQL statements. Hence, let’s perform the group by on coursename and calculate the sum of the remaining numeric columns of the DataFrame.

To run the SQL query, first register the DataFrame as a temporary view using createOrReplaceTempView(), then use the spark.sql() function. This view is available until you end your current SparkSession.

spark.sql() returns a DataFrame, and here I have used show() to display the contents on the console.


# sum() on columns with spark.sql
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
                "from course group by coursename")
df2.show()
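
You can also compute the overall total in SQL without a GROUP BY, against the same temporary view; a minimal sketch:


# Overall sum with SQL (no GROUP BY)
spark.sql("select sum(fee) as fee_total from Course").show()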

Using Pandas API

Finally, if you are using the Pandas API on Spark, use the following. The sum() method returns the sum of each numeric column of the DataFrame as a Series.

Note that a PySpark DataFrame doesn’t have a sum() method; however, we can use the Pandas API on Spark to use the sum() method.
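
Before switching to the Pandas API, note that you can build the same per-column totals in native PySpark by selecting a sum() for each numeric field. A minimal sketch (the list of numeric type names here is an assumption; extend it to cover your schema):


# Native PySpark alternative: sum every numeric column in one pass
from pyspark.sql.functions import sum
numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long", "double")]
df.select([sum(c).alias(c + "_sum") for c in numeric_cols]).show()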


# Imports
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses':["Spark",np.nan,"pandas","Java","Spark"],
    'Fee':[20000,25000,30000,22000,np.nan],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
}

# Create a DataFrame
df = ps.DataFrame(technologies)
print(df)

# Get the SUM
sumRes = df.sum()
print(sumRes)

Yields the below output.


# Output
Fee         97000.0
Discount     9200.0
dtype: float64
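
With the Pandas API you can also sum a single column, or sum row-wise by passing axis=1 (a sketch, assuming your pyspark.pandas version supports the axis parameter):


# Sum of a single column (returns a scalar)
print(df['Fee'].sum())

# Row-wise sum across numeric columns
print(df.sum(axis=1))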

Complete Example

The following is a complete example.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java",4000,5), \
    ("Python", 4600,10),  \
    ("Scala", 4100,15),   \
    ("Scala", 4500,15),   \
    ("PHP", 3000,20),  \
  )
columns= ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()

# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"), 
          sum(df.discount).alias("discount_sum")
    ).show()

# groupby sum on all columns
df.groupBy("CourseName").sum() \
     .show() 

# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
     .show()      

df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
                "from course group by coursename")
df2.show()
     
# Imports to use Pandas API
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses':["Spark",np.nan,"pandas","Java","Spark"],
    'Fee':[20000,25000,30000,22000,np.nan],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
}
df = ps.DataFrame(technologies)

print(df)
print(df.sum())

Frequently Asked Questions on sum()

How to handle null values when using sum() in PySpark?

The sum() function in PySpark ignores null values by default. If you want to treat null values as zero in the sum, you can use the coalesce() function. Below is an example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, coalesce, lit

spark = SparkSession.builder.appName("sum examples").getOrCreate()

# Prepare Data with null values
data = [(100,), (200,), (None,)]
columns = ["marks"]
df = spark.createDataFrame(data, columns)

# Calculate the sum of the 'marks' column, treating nulls as zero
sum_total = df.agg(sum(coalesce(df["marks"], lit(0)))).collect()[0][0]
print(f"Total Sum (including null values): {sum_total}")
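
For comparison, sum() skips nulls by default, so a plain sum() over the same column also returns 300. The coalesce() wrapper mainly matters when every value is null, where a plain sum() returns null instead of 0. A quick sketch:


# Default behavior: sum() ignores nulls, so this also returns 300
df.agg(sum("marks")).show()

# On an all-null column, plain sum() yields null while coalesce() yields 0
all_null_df = spark.createDataFrame([(None,), (None,)], "marks: int")
all_null_df.agg(sum("marks")).show()
all_null_df.agg(sum(coalesce(all_null_df["marks"], lit(0)))).show()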

Conclusion

In this article, you have learned how to calculate the sum of columns in PySpark by using the SQL function sum(), the Pandas API on Spark, group-by sum, and more.
