PySpark max() Usage with Examples

In PySpark, the max() function is a powerful tool for computing the maximum value within a DataFrame column. This function allows users to efficiently identify the largest value present in a specific column, making it invaluable for various data analysis tasks.


To explain max() usage with examples, first, let’s create a DataFrame.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20),
  )
columns = ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

This prints the schema and displays the DataFrame contents: five rows with the CourseName, fee, and discount columns.

1. PySpark max() Function on Column

pyspark.sql.functions.max() is used to compute the maximum value within a DataFrame column. It returns the maximum value present in the specified column. Note that it ignores null/None values in the column when computing the maximum.

To get the max value of multiple columns, apply max() to each column, comma-separated, inside the select() function.


# Using max() function
from pyspark.sql.functions import max
df.select(max(df.fee)).show()

# Using max() on multiple columns
df.select(max(df.fee).alias("fee_max"), 
          max(df.discount).alias("discount_max")
    ).show()

Here, max(df.fee) computes the maximum value present in the “fee” column, which is 4600 for the data above.
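Since max() ignores nulls, a column containing None still returns the largest non-null value. Below is a minimal sketch that demonstrates this with a small hypothetical DataFrame.


# max() ignores null/None values (hypothetical data)
from pyspark.sql.functions import max
nullDf = spark.createDataFrame(
    [("Java", 4000), ("Python", None), ("Scala", 4500)],
    ["CourseName", "fee"])
nullDf.select(max(nullDf.fee)).show()  # 4500; the None row is skipped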

2. Using max() on Grouped Data

In PySpark, GroupedData.max() is a method used to compute the maximum value for each group in a grouped DataFrame. Here’s how it works:

  1. First, you need to create a DataFrame.
  2. Then, apply groupBy() on one or more columns to create a GroupedData object.
  3. Finally, you can call max() on the GroupedData object to compute the maximum value for each group.

Now perform GroupedData.max() to get the max for each course.


# groupby max on all columns
df.groupBy("CourseName").max() \
     .show() 

# groupby max on selected column
df.groupBy("CourseName").max("fee") \
     .show() 

The first statement returns one row per course with the maximum of every numeric column; the second returns only the maximum fee. Note that the result columns are auto-named, for example max(fee).
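If you prefer custom column names, pair groupBy() with agg() and alias() instead of calling max() directly; a minimal sketch:


# groupBy with agg() to rename the aggregated column
from pyspark.sql.functions import max
df.groupBy("CourseName") \
    .agg(max("fee").alias("fee_max")) \
    .show()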

3. agg() with Max

In PySpark, the agg() method with a dictionary argument is used to aggregate multiple columns simultaneously, applying different aggregation functions to each column.

With the dictionary argument, you specify the column name as the key and 'max' as the value to calculate the maximum value of that column.


# Using agg max
df.agg({'discount':'max','fee':'max'}).show()

Here’s how it works:

  1. agg({'discount':'max','fee':'max'}): This expression specifies that we want to aggregate the “discount” and “fee” columns.
  2. 'discount':'max': This specifies that we want to compute the maximum value for the “discount” column.
  3. 'fee':'max': This specifies that we want to compute the maximum value for the “fee” column.
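Because the dictionary maps each column to its own aggregate function, you can also mix functions, for example max for one column and avg for another; a small sketch using built-in aggregate names:


# Different aggregation per column with the dictionary form
df.agg({'fee': 'max', 'discount': 'avg'}).show()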

4. Using MAX in PySpark SQL

In PySpark SQL, you can obtain the maximum value of a DataFrame column with the MAX(column_name) aggregate function. To use SQL, first register the DataFrame as a temporary view using createOrReplaceTempView(); this view remains accessible until the current SparkSession is closed.

To execute the query, use spark.sql(), which runs the specified SQL and returns the result as a new DataFrame.


# PySpark SQL MAX
df.createOrReplaceTempView("COURSE")
spark.sql("SELECT MAX(FEE) FROM COURSE").show()
spark.sql("SELECT MAX(FEE), MAX(DISCOUNT) FROM COURSE").show()
spark.sql("SELECT COURSENAME,MAX(FEE) FROM COURSE GROUP BY COURSENAME").show()

5. Complete Example of PySpark Max

Below is the complete example of PySpark max with all the different approaches covered above.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20),
  )
columns = ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)

# Using max() function
from pyspark.sql.functions import max
df.select(max(df.fee)).show()

# Using max() on multiple columns
df.select(max(df.fee).alias("fee_max"), 
          max(df.discount).alias("discount_max")
    ).show()

# groupby max on all columns
df.groupBy("CourseName").max() \
     .show() 

# groupby max on selected column
df.groupBy("CourseName").max("fee") \
     .show()      

# Using agg max
df.agg({'discount':'max','fee':'max'}).show()


# PySpark SQL MAX
df.createOrReplaceTempView("Course")
df2 = spark.sql("select coursename, max(fee) fee_max, max(discount) discount_max " \
                "from course group by coursename")
df2.show()
     
# Imports to use Pandas API on Spark
import pyspark.pandas as ps
import numpy as np

technologies = {
    'Courses':["Spark", np.nan, "pandas", "Java", "Spark"],
    'Fee':[20000, 25000, 30000, 22000, np.nan],
    'Duration':['30days', '40days', '35days', '60days', '50days'],
    'Discount':[1000, 2500, 1500, 1200, 3000]
}
df = ps.DataFrame(technologies)

print(df)
print(df.max())
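Here, df.max() prints the maximum of each column. To get the max of a single column with the pandas API on Spark, call max() on the Series; a minimal sketch:


# Max of a single column using the pandas API on Spark
print(df['Fee'].max())  # 30000.0; NaN is skipped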

6. Conclusion

In this article, you have learned how to get the max value of a column in a PySpark DataFrame. You can use functions.max(), GroupedData.max(), or agg(), each suited to a different purpose, and you can also use ANSI SQL to get the max.
