The pyspark.sql.functions.sum() function is used in PySpark to calculate the sum of values in a column or across multiple columns in a DataFrame. It aggregates numerical data, providing a concise way to compute the total of numeric values. This function is often used in combination with other DataFrame transformations, such as groupBy(), agg(), or withColumn(), to perform complex data manipulations and generate summary statistics.
You can calculate the sum of a column in PySpark in several ways, for example by using pyspark.sql.functions.sum(), pyspark.pandas.DataFrame.sum(), or the SQL SUM function.
First, let’s create a PySpark DataFrame.
# Imports
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# Prepare Data
simpleData = (("Java",4000,5), \
("Python", 4600,10), \
("Scala", 4100,15), \
("Scala", 4500,15), \
("PHP", 3000,20), \
)
columns = ["CourseName", "fee", "discount"]
# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
PySpark sum() Function
The sum() is a built-in function of PySpark SQL that is used to get the total of a specific column. This function takes a column (as a Column object or a column name string) and returns the result as a Column.
The following is the syntax of the sum() function.
# Syntax of functions.sum()
pyspark.sql.functions.sum(col: ColumnOrName) → pyspark.sql.column.Column
Using the sum() function, let's get the sum of a column. The below example returns the sum of the fee column.
# Using sum() function
from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum()
df.select(sum(df.fee)).show()
Yields below output.
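As a quick sanity check (not part of the original example), the expected total can be verified with plain Python arithmetic on the same fee values used to build the DataFrame:

```python
# Fee values from the simpleData tuples above
fees = [4000, 4600, 4100, 4500, 3000]

# Plain Python sum matches what sum(df.fee) computes
total_fee = sum(fees)
print(total_fee)  # → 20200
```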
Let’s perform the sum() on multiple columns. In this example, I have also used the PySpark alias() function to assign a custom name to each sum column.
# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"),
sum(df.discount).alias("discount_sum")
).show()
Yields below output.
GroupBy sum()
Let’s see another example of getting a sum for each group.
# groupby sum on all columns
df.groupBy("CourseName").sum() \
.show()
# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
.show()
Yields below output.
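To make it clear what groupBy("CourseName").sum("fee") computes, here is a minimal plain-Python sketch of the same per-group accumulation, using the simpleData rows from above:

```python
# Rows mirroring the simpleData used to build the DataFrame
rows = [("Java", 4000), ("Python", 4600), ("Scala", 4100),
        ("Scala", 4500), ("PHP", 3000)]

# Accumulate the fee per CourseName, as groupBy().sum("fee") does
fee_by_course = {}
for course, fee in rows:
    fee_by_course[course] = fee_by_course.get(course, 0) + fee

print(fee_by_course)
# → {'Java': 4000, 'Python': 4600, 'Scala': 8600, 'PHP': 3000}
```

Note the two "Scala" rows collapse into a single group whose fee sum is 4100 + 4500 = 8600.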
PySpark SQL SUM
PySpark SQL also provides a way to run aggregations using ANSI SQL statements. So, let's perform a group by on CourseName and calculate the sum of the remaining numeric columns of the DataFrame. Note that spark.sql() returns a DataFrame.
# sum() on columns with spark.sql()
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
"from course group by coursename")
df2.show()
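Because this is plain ANSI SQL, the same GROUP BY query can be tried on any SQL engine. As an illustration (using Python's built-in sqlite3 module rather than Spark), the query produces the same per-course sums:

```python
import sqlite3

# In-memory database holding the same rows as the DataFrame
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Course (CourseName TEXT, fee INT, discount INT)")
conn.executemany("INSERT INTO Course VALUES (?, ?, ?)",
                 [("Java", 4000, 5), ("Python", 4600, 10), ("Scala", 4100, 15),
                  ("Scala", 4500, 15), ("PHP", 3000, 20)])

# The same GROUP BY aggregation used with spark.sql()
rows = conn.execute("SELECT CourseName, SUM(fee) fee_sum, SUM(discount) discount_sum "
                    "FROM Course GROUP BY CourseName").fetchall()
print(rows)
```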
Using Pandas API
Finally, if you are using the Pandas API on Spark, use the following. The sum() method returns the sum of each numeric column of the DataFrame as a Series.
Note that a PySpark DataFrame doesn't have a sum() method; however, we can use the Pandas API on Spark to call sum().
# Imports
import pyspark.pandas as ps
import numpy as np
technologies = {
    'Courses': ["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}
# Create a DataFrame
df = ps.DataFrame(technologies)
print(df)
# Get the SUM
sumRes = df.sum()
print(sumRes)
Yields the below output.
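One detail worth noting: like pandas, the Pandas API on Spark skips NaN values when summing by default, so the Fee column above (which contains one NaN) still yields a total. A small stdlib-only sketch of that skip-NaN behavior:

```python
import math

# Fee values from the technologies dict above; one entry is NaN
fees = [20000, 25000, 30000, 22000, float("nan")]

# Sum only the non-NaN values, mirroring the default skipna=True behavior
total = sum(v for v in fees if not math.isnan(v))
print(total)  # → 97000
```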
Complete Example
The following is a complete example.
# Imports
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# Prepare Data
simpleData = (("Java",4000,5), \
("Python", 4600,10), \
("Scala", 4100,15), \
("Scala", 4500,15), \
("PHP", 3000,20), \
)
columns= ["CourseName", "fee", "discount"]
# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()
# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"),
sum(df.discount).alias("discount_sum")
).show()
# groupby sum on all columns
df.groupBy("CourseName").sum() \
.show()
# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
.show()
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
"from course group by coursename")
df2.show()
# Imports to use Pandas API
import pyspark.pandas as ps
import numpy as np
technologies = {
    'Courses': ["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}
df = ps.DataFrame(technologies)
print(df)
print(df.sum())
Conclusion
In this article, you have learned how to calculate the sum of columns in PySpark using the SQL function sum(), the Pandas API on Spark, Spark SQL, and group by sum. PySpark sum() is an aggregate function that returns the SUM of the selected column and should be used on numeric columns. The sum of a column is also referred to as the total of its values.