The min() function is used to get the minimum value of a DataFrame column and the max() function is used to get the maximum value of a column. These functions are also available on RDDs to get the min & max values.
In this article, I will explain, with examples, how to calculate the minimum and maximum values from a Spark DataFrame, an RDD, and a PairRDD.
1. Spark Get Min & Max Value of DataFrame Column
Let’s run through an example of getting the min & max values of a Spark DataFrame column. First, create a DataFrame with a column named “salary”, then find the minimum and maximum values of that column. You can do this using the agg() function, passing in the min() and max() functions:
// Imports
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
// Create SparkSession
val spark = SparkSession.builder()
.appName("Creating DataFrame")
.master("local[*]")
.getOrCreate()
// Import implicits, required for rdd.toDF()
import spark.implicits._
// Create DataFrame
val customerData = List(
  ("John", "1", 20000),
  ("Jane", "2", 30000),
  ("Bob", "3", 40000)
)
val rdd = spark.sparkContext.parallelize(customerData)
val df = rdd.toDF("name", "id", "salary")
// Get min & max value of a column
val min_value = df.agg(min("salary")).head().getInt(0)
val max_value = df.agg(max("salary")).head().getInt(0)
// Output
min_value: Int = 20000
max_value: Int = 40000
This code creates a DataFrame, then uses the agg() function to find the minimum and maximum values of the “salary” column. The head() method returns the first (and only) row of the resulting DataFrame, and getInt(0) retrieves the value of its first column, which is the minimum or maximum value.
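If you need both values at once, you can compute them in a single aggregation so Spark scans the data only once. Below is a minimal sketch against the same df; the aliases min_salary and max_salary are just illustrative names:
// Compute min & max of "salary" in a single aggregation
val row = df.agg(min("salary").as("min_salary"), max("salary").as("max_salary")).head()
val minSalary = row.getInt(0)
val maxSalary = row.getInt(1)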
These are just a few examples of how you can calculate the minimum and maximum values using Apache Spark. There are many other ways to accomplish this depending on your specific use case.
2. Finding the Minimum and Maximum Values in an RDD
Suppose you have an RDD of integers and you want to find the minimum and maximum values. You can use the min() and max() functions directly on the RDD:
// Create RDD
val rdd = spark.sparkContext.parallelize(Array(1, 2, 3, 4, 5))
// Find min & max values
val min_value = rdd.min()
val max_value = rdd.max()
// Output
min_value: Int = 1
max_value: Int = 5
This code creates an RDD of integers, then uses the RDD min() and max() actions to find the minimum and maximum values. Note that spark.sparkContext is the SparkContext object, which you need (via an active SparkSession) before creating the RDD.
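If you want both values in a single pass over the data, one option is the aggregate() action. Here is a minimal sketch; the zero value (Int.MaxValue, Int.MinValue) assumes a non-empty RDD of Ints:
// One pass over the RDD: carry (min, max) in a tuple accumulator
val (minValue, maxValue) = rdd.aggregate((Int.MaxValue, Int.MinValue))(
  (acc, v) => (math.min(acc._1, v), math.max(acc._2, v)), // fold a value into the accumulator
  (a, b) => (math.min(a._1, b._1), math.max(a._2, b._2))  // merge accumulators across partitions
)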
3. Finding the Minimum and Maximum Values in a PairRDD
Suppose you have a PairRDD of (key, value) pairs and you want to find the min and max of the values for each key. Since the function passed to reduceByKey() must return the same type as the values, first map each value to a (value, value) tuple with mapValues(), then use reduceByKey() to merge the tuples per key:
val pair_rdd = spark.sparkContext.parallelize(Seq((1, 2), (1, 3), (2, 1), (2, 5)))
val min_max_rdd = pair_rdd
  .mapValues(v => (v, v)) // seed each value as a (min, max) tuple
  .reduceByKey((a, b) => (Math.min(a._1, b._1), Math.max(a._2, b._2)))
This code creates a PairRDD of (key, value) pairs, seeds each value as a (min, max) tuple with mapValues(), then uses reduceByKey() to merge the tuples for each key. The function passed to reduceByKey() takes two tuples and returns a new tuple holding the smaller minimum and the larger maximum.
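To inspect the per-key results you can collect the RDD to the driver, which is fine for a small example like this (output order may vary):
// Collect and print the per-key (min, max) pairs
min_max_rdd.collect().foreach { case (key, (minV, maxV)) =>
  println(s"key=$key min=$minV max=$maxV")
}
// Output
// key=1 min=2 max=3
// key=2 min=1 max=5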
4. Conclusion
In conclusion, calculating the minimum and maximum values is a common operation when working with big data, and Apache Spark provides the min() and max() functions to compute them on both DataFrame columns and RDDs.