Spark groupByKey() and reduceByKey() are transformation operations on key-value RDDs, but they differ in how they combine the values corresponding to each key. In this article, we shall discuss what groupByKey() is, what reduceByKey() is, and the key differences between Spark groupByKey vs reduceByKey.
To understand the difference between Spark groupByKey() vs reduceByKey(), let’s first understand each function separately and then look at the differences.
1. Spark groupByKey()
Spark RDD groupByKey() is a transformation operation on a key-value RDD (Resilient Distributed Dataset) that groups the values corresponding to each key in the RDD. It returns a new RDD where each key is associated with a sequence of its corresponding values.
In Spark, the syntax for groupByKey() is:
// Syntax of groupByKey()
def groupByKey(): RDD[(K, Iterable[V])]
The groupByKey() method is defined on a key-value RDD, where each element in the RDD is a tuple of (K, V) representing a key-value pair. It returns a new RDD where each key is associated with an iterable collection of its corresponding values.
Here’s an example of using groupByKey():
// groupByKey() example
val rdd1 = spark.sparkContext.parallelize(Seq(("key1", 1), ("key2", 2), ("key1", 3), ("key2", 4)))
val rdd2 = rdd1.groupByKey()
// Result
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize
rdd2: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[1] at groupByKey
In the above code, rdd1 is an RDD of key-value pairs. The groupByKey() transformation is applied on rdd1, which returns a new RDD rdd2 where each key is associated with a sequence of its corresponding values.
To retrieve the values corresponding to a particular key, you can use the lookup() method as follows:
val key1Values = rdd2.lookup("key1")
// Result
key1Values: Seq[Iterable[Int]] = ArrayBuffer(CompactBuffer(1, 3))
The above code retrieves all the values corresponding to the key “key1” in rdd2.
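If you need an aggregate per key rather than the raw grouped values, you can transform the iterables with mapValues(). Below is a minimal sketch, continuing with rdd2 from above (the printed order of keys may vary):

// Sum the grouped values of each key
val summed = rdd2.mapValues(_.sum)
summed.collect().foreach(println)

// Result (order may vary)
(key1,4)
(key2,6)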
2. Spark reduceByKey()
Spark RDD reduceByKey() is another transformation operation on a key-value RDD (Resilient Distributed Dataset) that groups the values corresponding to each key in the RDD and then applies a reduction function to the values of each group. It returns a new RDD where each key is associated with a single reduced value.
The syntax for reduceByKey() is as follows:
// Syntax of reduceByKey()
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
In the above code, func is a function that takes two values of type V (i.e., the values corresponding to the same key) and returns a single value of type V. The function func should be commutative and associative, as it will be applied in a parallel and distributed manner across the values corresponding to each key.
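To see why this matters, consider a function that is not associative, such as subtraction: the outcome then depends on how Spark pairs up the partial values across partitions. The sketch below is purely illustrative (the pairs RDD and its 3-way split are assumptions made for the example):

// Illustrative RDD split across 3 partitions
val pairs = spark.sparkContext.parallelize(Seq(("k", 1), ("k", 2), ("k", 3)), 3)

// Sum is commutative and associative, so the result is always ("k", 6)
val ok = pairs.reduceByKey(_ + _)

// Subtraction is neither, so the result depends on how partial values
// are combined across partitions and may differ between runs
val risky = pairs.reduceByKey(_ - _)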
Here is an example of using reduceByKey() to compute the sum of values for each key:
// reduceByKey() Example
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
val reduced = rdd.reduceByKey(_ + _)
// To get values of the reduce we shall use lookup with key
val aValues = reduced.lookup("a")
val bValues = reduced.lookup("b")
// Result:
aValues: Seq[Int] = ArrayBuffer(4)
bValues: Seq[Int] = ArrayBuffer(6)
In the above code, we create an RDD rdd with four key-value pairs, and then apply reduceByKey with the function _ + _ to compute the sum of values for each key. The resulting RDD reduced has two elements, one for each distinct key in rdd, where each element is a tuple of (K, V) representing the key and its reduced value.
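Since the reduced result is tiny here, you can also materialize the whole RDD on the driver with collect() to inspect it (safe only for small results):

// Bring the reduced pairs back to the driver (small data only)
reduced.collect().foreach(println)

// Result (order may vary)
(a,4)
(b,6)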
3. Spark groupByKey() vs reduceByKey()
In Spark, both groupByKey and reduceByKey are wide transformations on key-value RDDs that result in a data shuffle, but they differ in how they combine the values corresponding to each key.
Let us assume we have 2 partitions with the below data, and try getting the word count for it using both groupByKey and reduceByKey.
Partition1:
+------+
|APPLE |
|ORANGE|
+------+
Partition2:
+------+
|ORANGE|
|ORANGE|
+------+
3.1. Using groupByKey() on the data:
Using groupByKey on the above partitions groups the values corresponding to each key and returns an RDD of key-value pairs, where each value is an iterable collection of the values corresponding to that key. This operation can be useful in situations where you need to process all the values corresponding to each key together, such as computing aggregates or performing complex operations on groups of related data.
Task1:
+--------+
|APPLE 1|
|ORANGE 1|
+--------+
Task2:
+--------+
|ORANGE 1|
|ORANGE 1|
+--------+
All 4 elements from Tasks 1 and 2 will be sent over the network to the task performing the reduce operation.
Task performing reduce
+--------+
|APPLE 1|
|ORANGE 1|
|ORANGE 1|
|ORANGE 1|
+--------+
Final result
+--------+
|APPLE 1|
|ORANGE 3|
+--------+
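For reference, the same word count can be expressed in code as below; this is a minimal sketch, and the two-partition split mirrors the example above:

// Word count with groupByKey(): every ("word", 1) pair is shuffled
val words = spark.sparkContext.parallelize(Seq("APPLE", "ORANGE", "ORANGE", "ORANGE"), 2)
val counts = words.map(w => (w, 1))  // ("APPLE",1), ("ORANGE",1), ...
  .groupByKey()                      // all 4 pairs cross the network
  .mapValues(_.sum)                  // sum the 1s after the shuffle

// Result
counts.collect() // Array((APPLE,1), (ORANGE,3))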
There are two major issues possible when using groupByKey:
- The data is not combined or reduced on the map side; all elements are transferred over the network during the shuffle.
- All elements are sent to the task performing the aggregate operation, so the number of elements handled by that task is larger and could possibly result in an OutOfMemory exception.
3.2. Using reduceByKey() on the data:
Just like groupByKey(), on the same word count problem, since we have two partitions we will end up with 2 tasks. However, with a map-side combine, the output of the tasks will look like the below:
Task1:
+--------+
|APPLE 1|
|ORANGE 1|
+--------+
Task2:
+--------+
|ORANGE 2|
+--------+
With reduceByKey, only 3 elements from Tasks 1 and 2 will be sent over the network to the task performing the reduce operation.
Task performing reduce
+--------+
|APPLE 1|
|ORANGE 1|
|ORANGE 2|
+--------+
Final result
+--------+
|APPLE 1|
|ORANGE 3|
+--------+
reduceByKey groups the values corresponding to each key and then applies a reduction function to the values of each group, returning an RDD of key-value pairs, where each value is the result of the reduction operation for the values corresponding to that key. This operation is more efficient than groupByKey because it performs the reduction operation on each group of values before shuffling the data, reducing the amount of data that needs to be shuffled across the network.
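Reusing the words RDD from the earlier sketch, the reduceByKey version of the same word count looks like this (again, a minimal sketch):

// Word count with reduceByKey(): a map-side combine pre-aggregates
// each partition before the shuffle, so fewer records cross the network
val counts = words.map(w => (w, 1))
  .reduceByKey(_ + _)

// Result
counts.collect() // Array((APPLE,1), (ORANGE,3))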
4. Conclusion
In this article, you have learned the difference between Spark RDD reduceByKey() vs groupByKey(). In general, you should use reduceByKey instead of groupByKey whenever possible, as reduceByKey can significantly reduce the amount of data shuffled across the network and thus improve performance. However, reduceByKey requires a reduction function that is both commutative and associative, whereas groupByKey does not have this requirement and can be used in more general situations.