Spark min() & max() with Examples

The min() function is used to get the minimum value of a DataFrame column, and the max() function is used to get the maximum value. These functions are also available on RDDs to get the min & max values.

In this article, I will explain some examples of how you can calculate the minimum and maximum values from Spark DataFrame, RDD, and PairRDD.

1. Spark Get Min & Max Value of DataFrame Column

Let’s run through an example of getting the min & max values of a Spark DataFrame column. First, create a DataFrame with a column named “salary”, then find the minimum and maximum values of that column. You can do this with the agg() function, passing in the min() and max() functions:


// Imports
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark = SparkSession.builder()
        .appName("Creating DataFrame")
        .master("local[*]")
        .getOrCreate()

// Needed for the rdd.toDF() conversion below
import spark.implicits._

// Create DataFrame
val customerData = List(
      ("John", "1", 20000),
      ("Jane", "2", 30000),
      ("Bob", "3", 40000)
    )
val rdd = spark.sparkContext.parallelize(customerData)
val df = rdd.toDF("name", "id", "salary")

// Get min & max value of a column
val min_value = df.agg(min("salary")).head().getInt(0)
val max_value = df.agg(max("salary")).head().getInt(0)

// Output
min_value: Int = 20000
max_value: Int = 40000

This code creates a DataFrame, then uses the agg() function to find the minimum and maximum values of the “salary” column. The head() method returns the single row of the aggregated DataFrame, and getInt(0) retrieves its first column value, which is the minimum or maximum salary.

This is just one way to calculate the minimum and maximum values with Apache Spark; there are other ways to accomplish it depending on your specific use case. For example, you can compute both values in a single aggregation, as shown below.
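As a minimal sketch of that single-pass idea (reusing the df and the “salary” column from the example above), one agg() call can return both statistics at once:


// Compute min and max of "salary" in one aggregation pass
val row = df.agg(min("salary").as("min_salary"), max("salary").as("max_salary")).head()
val minSalary = row.getInt(0)  // 20000
val maxSalary = row.getInt(1)  // 40000

This runs a single Spark job instead of two, which matters when the DataFrame is large.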

2. Finding the Minimum and Maximum Values in an RDD

Suppose you have an RDD of integers, and you want to find the Spark min and max values. You can use the min() and max() functions directly on the RDD:


// Create RDD
val rdd = spark.sparkContext.parallelize(Array(1, 2, 3, 4, 5))

// Find min & max values
val min_value = rdd.min()
val max_value = rdd.max()

// Output
min_value: Int = 1
max_value: Int = 5

This code creates an RDD of integers, then uses the Spark min() and max() actions to find the minimum and maximum values. Note that spark.sparkContext is the SparkContext object, which you need before you can create an RDD.
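If you need several statistics in one pass, numeric RDDs also offer the stats() action, which returns a StatCounter with the min, max, mean, standard deviation, and count. A minimal sketch, assuming an RDD of Double values:


// One pass over the data; StatCounter exposes min, max, mean, stdev, count
val nums = spark.sparkContext.parallelize(Array(1.0, 2.0, 3.0, 4.0, 5.0))
val stats = nums.stats()
println(s"min=${stats.min}, max=${stats.max}, mean=${stats.mean}")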

3. Finding the Minimum and Maximum Values in a PairRDD

Suppose you have a PairRDD of (key, value) pairs, and you want to find the Spark min and max values of the values for each key. You can first turn each value into a (min, max) tuple with mapValues(), then merge the tuples per key with reduceByKey():


val pair_rdd = spark.sparkContext.parallelize(Seq((1, 2), (1, 3), (2, 1), (2, 5)))
val min_max_rdd = pair_rdd.mapValues(v => (v, v))
  .reduceByKey((a, b) => (Math.min(a._1, b._1), Math.max(a._2, b._2)))
// min_max_rdd.collect() returns Array((1,(2,3)), (2,(1,5)))

This code creates a PairRDD of (key, value) pairs, seeds each value as a (value, value) tuple with mapValues(), then uses reduceByKey() to merge the tuples for each key. The reduce function receives two (min, max) tuples and returns a new tuple holding the smaller minimum and the larger maximum.
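If you only need one of the two statistics per key, a simpler sketch is to reduce each statistic separately, at the cost of two passes over the data instead of one:


// Two separate reduces, one per statistic
val min_by_key = pair_rdd.reduceByKey((a, b) => Math.min(a, b))
val max_by_key = pair_rdd.reduceByKey((a, b) => Math.max(a, b))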

4. Conclusion

In conclusion, calculating minimum and maximum values is a common operation when working with big data, and Apache Spark provides the min() and max() functions to compute them on DataFrame columns, RDDs, and PairRDDs.
