PySpark sum() is an aggregate function that returns the total of the values in a column. It should be used on numeric columns. The sum of a column is also referred to as the total of its values.
You can calculate the sum of a column in PySpark in several ways, for example by using pyspark.sql.functions.sum(), pyspark.pandas.DataFrame.sum(), and the SQL SUM() function.
First, let’s create a PySpark DataFrame.
# Imports
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# Prepare Data
simpleData = (("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20),
  )
columns = ["CourseName", "fee", "discount"]
# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
PySpark sum() Function
The sum() is a built-in function of PySpark SQL that is used to get the total of a specific column. This function takes a column name or a Column object as its argument and returns the result as a Column.
The following is the syntax of the sum() function.
# Syntax of functions.sum()
pyspark.sql.functions.sum(col: ColumnOrName) → pyspark.sql.column.Column
Using the sum() function, let's get the sum of a column. The below example returns the sum of the fee column.
# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()
Yields below output.

Let’s perform the sum() on multiple columns. In this example, I have also used the PySpark alias() to assign a custom name to the sum column.
# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"),
sum(df.discount).alias("discount_sum")
).show()
Yields below output.

GroupBy sum()
Let’s see another example of getting a sum for each group.
# groupby sum on all columns
df.groupBy("CourseName").sum() \
.show()
# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
.show()
Yields below output.

PySpark SQL SUM
PySpark SQL also provides a way to run aggregations using ANSI SQL statements. Let's perform a group by on coursename and calculate the sum of the remaining numeric columns of the DataFrame.
To run a SQL query, first register the DataFrame as a temporary view using createOrReplaceTempView(); this view is available until you end your current SparkSession. Then use spark.sql(), which returns a DataFrame; here, I have used show() to display the contents on the console.
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
"from course group by coursename")
df2.show()
I will leave this to you to run and explore the output.
Using Pandas API
Finally, if you are using pandas with PySpark, use the following. This method returns the sum of each column of the DataFrame as a Series.
Note that a PySpark DataFrame doesn't have a sum() method; however, we can use the pandas API on Spark to call sum().
# Imports
import pyspark.pandas as ps
import numpy as np
technologies = {
    'Courses': ["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}
# Create a DataFrame
df = ps.DataFrame(technologies)
print(df)
# Get the SUM
sumRes = df.sum()
print(sumRes)
Yields the below output.

Complete Example
The following is a complete example.
# Imports
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# Prepare Data
simpleData = (("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20),
  )
columns = ["CourseName", "fee", "discount"]
# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()
# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"),
sum(df.discount).alias("discount_sum")
).show()
# groupby sum on all columns
df.groupBy("CourseName").sum() \
.show()
# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
.show()
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
"from course group by coursename")
df2.show()
# Imports to use Pandas API
import pyspark.pandas as ps
import numpy as np
technologies = {
    'Courses': ["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}
df = ps.DataFrame(technologies)
print(df)
print(df.sum())
Conclusion
In this article, you have learned how to calculate the sum of columns in PySpark by using the SQL function sum(), the pandas API, group by with sum, etc.
Related Articles
- PySpark unionByName() Example
- PySpark between() Example
- PySpark Broadcast Variable
- PySpark Broadcast Join
- PySpark persist() Example
- PySpark lag() Function
- PySpark Random Sample with Example
- PySpark reduceByKey usage with example
- Pyspark – Get substring() from a column
- Show First Top N Rows in Spark | PySpark
- PySpark Create DataFrame from List
- PySpark Concatenate Columns
- PySpark Refer Column Name With Dot (.)