
PySpark sum() is an aggregate function that returns the SUM of selected columns. It should be used on numeric columns. The sum of a column is also referred to as the total of its values.


You can calculate the sum of a column in PySpark in several ways, for example by using pyspark.sql.functions.sum(), pyspark.pandas.DataFrame.sum(), or SQL SUM.

First, let’s create a PySpark DataFrame.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java",4000,5), \
    ("Python", 4600,10),  \
    ("Scala", 4100,15),   \
    ("Scala", 4500,15),   \
    ("PHP", 3000,20),  \
  )
columns= ["CourseName", "fee", "discount %"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

PySpark sum() Function

The sum() is a built-in function of PySpark SQL that is used to get the total of a specific column. It accepts the column as a string name or a Column object and returns the result as a Column.

The following is the syntax of the sum() function.


# Syntax of functions.sum()
pyspark.sql.functions.sum(col: ColumnOrName) → pyspark.sql.column.Column

Using the sum() function, let's get the sum of a column. The below example returns the sum of the fee column.


# Using sum() function
# Note: this import shadows Python's built-in sum()
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()

Yields below output.

+--------+
|sum(fee)|
+--------+
|   20200|
+--------+
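Since the parameter is a ColumnOrName, you can also pass the column name as a string instead of a Column object. A minimal sketch:


# Passing the column name as a string
df.select(sum("fee")).show()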

Let's perform sum() on multiple columns. In this example, I have also used the PySpark alias() function to assign a custom name to each sum column.


# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"), 
          sum(df.discount).alias("discount_sum")
    ).show()

Yields below output.

+-------+------------+
|fee_sum|discount_sum|
+-------+------------+
|  20200|          65|
+-------+------------+
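If you need to sum many columns, you can build the list of aggregations programmatically instead of writing out each sum() call. Below is a minimal sketch; the num_cols list is just for illustration.


# Sum a list of columns with a comprehension
num_cols = ["fee", "discount"]
df.select([sum(c).alias(c + "_sum") for c in num_cols]).show()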

GroupBy sum()

Let’s see another example of getting a sum for each group.


# groupby sum on all numeric columns
df.groupBy("CourseName").sum() \
     .show() 

# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
     .show()  

Yields below output for the first statement (row order may vary); the second statement returns only the CourseName and sum(fee) columns.

+----------+--------+-------------+
|CourseName|sum(fee)|sum(discount)|
+----------+--------+-------------+
|      Java|    4000|            5|
|    Python|    4600|           10|
|     Scala|    8600|           30|
|       PHP|    3000|           20|
+----------+--------+-------------+
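You can also combine groupBy() with agg(), which lets you alias the grouped sum columns instead of keeping the auto-generated sum(...) names. A minimal sketch:


# groupBy with agg() to alias the sum columns
from pyspark.sql.functions import sum
df.groupBy("CourseName") \
  .agg(sum("fee").alias("fee_sum"),
       sum("discount").alias("discount_sum")) \
  .show()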

PySpark SQL SUM

PySpark SQL also provides a way to run these operations as ANSI SQL statements. Let's perform a group by on CourseName and calculate the sum of the remaining numeric columns of the DataFrame.

spark.sql() returns a DataFrame.


# sum() on columns with spark.sql
df.createOrReplaceTempView("Course")
df2 = spark.sql("select CourseName, sum(fee) fee_sum, sum(discount) discount_sum "
                "from Course group by CourseName")
df2.show()
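If you only need a SQL expression and not a reusable temporary view, selectExpr() accepts the same SUM syntax directly on the DataFrame. A minimal sketch:


# SQL-style sum without a temp view
df.selectExpr("sum(fee) as fee_sum", "sum(discount) as discount_sum").show()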

Using Pandas API

Finally, if you are using the pandas API on Spark, use the following. The DataFrame.sum() method returns the sum of each numeric column as a Series.

Note that a PySpark DataFrame doesn't have a sum() method; however, we can use the pandas API on Spark to call sum().
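On Spark 3.2 or later, which ships the pandas API on Spark, you can convert an existing PySpark DataFrame with DataFrame.pandas_api() and call sum() on the result. A minimal sketch using the df created earlier:


# Convert the PySpark DataFrame to pandas-on-Spark and total it
psdf = df.pandas_api()
print(psdf.sum())  # sums the numeric columns fee and discount

Alternatively, create a pandas-on-Spark DataFrame directly, as below.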


# Imports
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses':["Spark",np.nan,"pandas","Java","Spark"],
    'Fee':[20000,25000,30000,22000,np.nan],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
}

# Create a DataFrame
df = ps.DataFrame(technologies)
print(df)

# Get the SUM
sumRes = df.sum()
print(sumRes)

Yields the below output. Only the numeric columns are summed: Fee totals 97000.0 (NaN values are skipped by default) and Discount totals 9200.
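To total just one column, select it as a Series and call sum() on it:


# Sum a single column as a Series
print(df['Fee'].sum())  # 97000.0 (NaN is skipped by default)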

Complete Example

The following is a complete example.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java",4000,5), \
    ("Python", 4600,10),  \
    ("Scala", 4100,15),   \
    ("Scala", 4500,15),   \
    ("PHP", 3000,20),  \
  )
columns= ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()

# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"), 
          sum(df.discount).alias("discount_sum")
    ).show()

# groupby sum on all numeric columns
df.groupBy("CourseName").sum() \
     .show() 

# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
     .show()      

df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
                "from course group by coursename")
df2.show()
     
# Imports to use Pandas API
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses':["Spark",np.nan,"pandas","Java","Spark"],
    'Fee':[20000,25000,30000,22000,np.nan],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
}
df = ps.DataFrame(technologies)

print(df)
print(df.sum())

Conclusion

In this article, you have learned how to calculate the sum of columns in PySpark by using the SQL function sum(), groupBy().sum(), a SQL query, and the pandas API on Spark.
