PySpark sum() Columns Example

  • Post category: PySpark
  • Post last modified: December 15, 2022

PySpark sum() is an aggregate function that returns the SUM of a selected column; it should be used on numeric columns. The sum of a column is also referred to as the column total.

You can calculate the sum of a column in PySpark in several ways, for example by using pyspark.sql.functions.sum(), pyspark.pandas.DataFrame.sum(), and the SQL SUM() function.

First, let’s create a PySpark DataFrame.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = [("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20)]
columns = ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

PySpark sum() Function

The sum() is a built-in function of PySpark SQL that is used to get the total of a specific column. It takes a column name (or a Column object) as input and returns the result as a Column.

The following is the syntax of the sum() function.


# Syntax of functions.sum()
pyspark.sql.functions.sum(col: ColumnOrName) → pyspark.sql.column.Column

Using the sum() function, let’s get the sum of a single column. The example below returns the sum of the fee column.


# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()

Yields below output.


# Output
+--------+
|sum(fee)|
+--------+
|   20200|
+--------+

Let’s perform sum() on multiple columns. In this example, I have also used the PySpark alias() function to assign a custom name to each sum column.


# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"), 
          sum(df.discount).alias("discount_sum")
    ).show()

Yields below output.


# Output
+-------+------------+
|fee_sum|discount_sum|
+-------+------------+
|  20200|          65|
+-------+------------+

GroupBy sum()

Let’s see another example of getting a sum for each group.


# groupby sum on all columns
df.groupBy("CourseName").sum() \
     .show() 

# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
     .show()  

Yields below output (row order may vary).


# Output of groupby sum on all columns
+----------+--------+-------------+
|CourseName|sum(fee)|sum(discount)|
+----------+--------+-------------+
|      Java|    4000|            5|
|    Python|    4600|           10|
|     Scala|    8600|           30|
|       PHP|    3000|           20|
+----------+--------+-------------+

# Output of groupby sum on selected column
+----------+--------+
|CourseName|sum(fee)|
+----------+--------+
|      Java|    4000|
|    Python|    4600|
|     Scala|    8600|
|       PHP|    3000|
+----------+--------+

PySpark SQL SUM

PySpark SQL also provides a way to run operations as ANSI SQL statements. Hence, let’s perform the group by on CourseName and calculate the sum of the remaining numeric columns of the DataFrame.

To run a SQL query, first create a temporary view using createOrReplaceTempView() and then use the spark.sql() function. This view is available until you end the current SparkSession.

spark.sql() returns a DataFrame; here, I have used show() to display the contents on the console.


df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
                "from course group by coursename")
df2.show()

I will leave this for you to run and explore the output.

Using Pandas API

Finally, if you are using the Pandas API on Spark, use the following. DataFrame.sum() returns the sum of each numeric column as a Series.

Note that a PySpark DataFrame doesn’t have a sum() method; however, the Pandas API on Spark (pyspark.pandas) provides one.


# Imports
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses': ["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}

# Create a DataFrame
df = ps.DataFrame(technologies)
print(df)

# Get the SUM
sumRes = df.sum()
print(sumRes)

Yields the below output. The non-numeric Duration column is excluded, and NaN values are skipped.


# Output
Fee         97000.0
Discount     9200.0
dtype: float64

Complete Example

The following is a complete example.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = [("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20)]
columns = ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using sum() function
from pyspark.sql.functions import sum
df.select(sum(df.fee)).show()

# Using sum() on multiple columns
from pyspark.sql.functions import sum
df.select(sum(df.fee).alias("fee_sum"), 
          sum(df.discount).alias("discount_sum")
    ).show()

# groupby sum on all columns
df.groupBy("CourseName").sum() \
     .show() 

# groupby sum on selected column
df.groupBy("CourseName").sum("fee") \
     .show()      

df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, sum(fee) fee_sum, sum(discount) discount_sum " \
                "from course group by coursename")
df2.show()
     
# Imports to use Pandas API
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses': ["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee': [20000, 25000, 30000, 22000, np.nan],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
    'Discount': [1000, 2500, 1500, 1200, 3000]
}
df = ps.DataFrame(technologies)

print(df)
print(df.sum())

Conclusion

In this article, you have learned how to calculate the sum of columns in PySpark using the SQL function sum(), the Pandas API on Spark, group by sum, etc.
