PySpark sum() is an aggregate function that returns the SUM of a selected column; it should be used on numeric columns. The sum of a column is also referred to as the total of the column's values. You can calculate the sum of a column in PySpark in several ways, for example by using pyspark.sql.functions.sum(), pyspark.pandas.DataFrame.sum(), or SQL SUM.
First, let’s create a PySpark DataFrame.
# Imports
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# Prepare Data
simpleData = (("Java",4000,5), \
("Python", 4600,10), \
("Scala", 4100,15), \
("Scala", 4500,15), \
("PHP", 3000,20), \
)
columns= ["CourseName", "fee", "discount %"]
# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
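For reference, since the fee and discount values are Python integers and the CourseName values are strings, printSchema() should report something like the following (the types are inferred, so this is an expectation rather than captured output).
# Expected schema (inferred from the Python tuples)
root
 |-- CourseName: string (nullable = true)
 |-- fee: long (nullable = true)
 |-- discount: long (nullable = true)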
PySpark sum() Function
The sum() is a built-in function of PySpark SQL that is used to get the total of a specific column. This function takes a column name or a Column object as its argument and returns the result as a Column.
The following is the syntax of the sum() function.
# Syntax of functions.sum()
pyspark.sql.functions.sum(col: ColumnOrName) → pyspark.sql.column.Column
Let's use the sum() function to get the sum of a column. The example below returns the sum of the fee column.
# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()
Yields below output.
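One caveat: importing sum directly from pyspark.sql.functions shadows Python's built-in sum(). A common alternative, shown here as a minimal sketch against the same df, is to import the functions module under an alias.
# Avoid shadowing Python's built-in sum() with a module alias
from pyspark.sql import functions as F
df.select(F.sum("fee")).show()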
Let’s perform the sum() on multiple columns. In this example, I have also used the PySpark alias() to assign a custom name to each sum column.
# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"),
sum(df.discount).alias("discount_sum")
).show()
Yields below output.
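The same multi-column aggregation can also be written with DataFrame.agg(), which is handy when you aggregate without grouping; a small sketch using the same df and aliases:
# Using agg() to compute several sums at once
from pyspark.sql import functions as F
df.agg(F.sum("fee").alias("fee_sum"),
       F.sum("discount").alias("discount_sum")).show()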
GroupBy sum()
Let’s see another example of getting a sum for each group.
# groupby sum on all columns
df.groupBy("CourseName").sum() \
.show()
# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
.show()
Yields below output.
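The sum("fee") shortcut names its result columns sum(fee), sum(discount), and so on. If you want custom names for the grouped results, combine groupBy() with agg() and alias(); a brief sketch, assuming the same df:
# groupby with agg() to control the aggregated column names
from pyspark.sql import functions as F
df.groupBy("CourseName") \
    .agg(F.sum("fee").alias("fee_sum"),
         F.sum("discount").alias("discount_sum")) \
    .show()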
PySpark SQL SUM
PySpark SQL also provides a way to run aggregations as ANSI SQL statements. Hence, let's perform the group by on coursename and calculate the sum of the remaining numeric columns of the DataFrame. Note that spark.sql() returns a DataFrame.
# sum() on columns with spark.sql()
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
"from course group by coursename")
df2.show()
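If you want SQL expressions without registering a temp view, selectExpr() offers a middle ground between the DataFrame API and spark.sql(); a minimal sketch with the same df:
# SQL expressions without a temp view, using selectExpr()
df.selectExpr("sum(fee) as fee_sum", "sum(discount) as discount_sum").show()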
Using Pandas API
Finally, if you are using the pandas API on Spark, use the following. The sum() method returns the sum of each column of the DataFrame as a Series.
Note that the PySpark DataFrame doesn't have a sum() method; however, we can use the pandas API on Spark to call the sum() method.
# Imports
import pyspark.pandas as ps
import numpy as np
technologies = {
    'Courses': ["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}
# Create a DataFrame
df = ps.DataFrame(technologies)
print(df)
# Get the SUM
sumRes = df.sum()
print(sumRes)
Yields the below output.
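The pandas API on Spark also accepts the usual pandas keyword arguments; a short sketch, assuming the same pandas-on-Spark df, that restricts the sum to numeric columns and then sums across rows instead of down columns:
# Sum only the numeric columns
print(df.sum(numeric_only=True))
# Sum across each row instead of down each column
print(df.sum(axis=1, numeric_only=True))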
Complete Example
The following is a complete example.
# Imports
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# Prepare Data
simpleData = (("Java",4000,5), \
("Python", 4600,10), \
("Scala", 4100,15), \
("Scala", 4500,15), \
("PHP", 3000,20), \
)
columns= ["CourseName", "fee", "discount"]
# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()
# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"),
sum(df.discount).alias("discount_sum")
).show()
# groupby sum on all columns
df.groupBy("CourseName").sum() \
.show()
# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
.show()
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
"from course group by coursename")
df2.show()
# Imports to use Pandas API
import pyspark.pandas as ps
import numpy as np
technologies = {
    'Courses': ["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}
df = ps.DataFrame(technologies)
print(df)
print(df.sum())
Conclusion
In this article, you have learned how to calculate the sum of columns in PySpark by using the SQL function sum(), the pandas API on Spark, groupBy().sum(), and Spark SQL.