Spark groupByKey() vs reduceByKey()

Spark groupByKey() and reduceByKey() are transformation operations on key-value RDDs, but they differ in how they combine the values corresponding to each key. In this article, we shall discuss what is groupByKey(), what is reduceByKey, and the key differences between Spark groupByKey vs reduceByKey.

1. Spark groupByKey()

Spark RDD groupByKey() is a transformation operation on a key-value RDD (Resilient Distributed Dataset) that groups the values corresponding to each key in the RDD. It returns a new RDD where each key is associated with a sequence of its corresponding values.

In Spark, the syntax for groupByKey() is:


// Syntax of groupByKey()
def groupByKey(): RDD[(K, Iterable[V])]

The groupByKey() method is defined on a key-value RDD, where each element in the RDD is a tuple of (K, V) representing a key-value pair. It returns a new RDD where each key is associated with an iterable collection of its corresponding values.

Here’s an example of using groupByKey():


// groupByKey() example
val rdd1 = spark.sparkContext.parallelize(Seq(("key1", 1), ("key2", 2), ("key1", 3), ("key2", 4)))
val rdd2 = rdd1.groupByKey()

// Result
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize 
rdd2: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[1] at groupByKey

In the above code, rdd1 is an RDD of key-value pairs. The groupByKey() transformation is applied on rdd1 which returns a new RDD rdd2 where each key is associated with a sequence of its corresponding values.

To retrieve the values corresponding to a particular key, you can use the lookup() method as follows:


val key1Values = rdd2.lookup("key1")

// Result
key1Values: Seq[Iterable[Int]] = ArrayBuffer(CompactBuffer(1, 3))

The above code retrieves all the values corresponding to the key “key1” in rdd2.

2. Spark reduceByKey()

Spark RDD reduceByKey() is another transformation operation on a key-value RDD (Resilient Distributed Dataset) that groups the values corresponding to each key in the RDD and then applies a reduction function to the values of each group. It returns a new RDD where each key is associated with a single reduced value.

The syntax for reduceByKey is as follows:


// Syntax of reduceByKey()
def reduceByKey(func: (V, V) => V): RDD[(K, V)]

In the above code, func is a function that takes two values of type V (i.e., the values corresponding to the same key) and returns a single value of type V. The function func should be commutative and associative, as it will be applied in a parallel and distributed manner across the values corresponding to each key.

Here is an example of using reduceByKey to compute the sum of values for each key:


// reduceByKey() Example
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
val reduced = rdd.reduceByKey(_ + _)

// To get values of the reduce we shall use lookup with key
val aValues = reduced.lookup("a")
val bValues = reduced.lookup("b")

// Result:
aValues: Seq[Int] = ArrayBuffer(4)
bValues: Seq[Int] = ArrayBuffer(6)

In the above code, we create an RDD rdd with four key-value pairs, and then apply reduceByKey with the function _ + _ to compute the sum of values for each key. The resulting RDD reduced has two elements, one for each distinct key in rdd, where each element is a tuple of (K, V) representing the key and its reduced value.

3. Spark groupByKey() vs reduceByKey():

In Spark, both groupByKey and reduceByKey are wide-transformation operations on key-value RDDs resulting in data shuffling, but they differ in how they combine the values corresponding to each key.

Let us assume we have 2 partitions with the below data and try getting the word count for this using groupByKey and reduceByKey


Partition1:
+------+
|APPLE |
|ORANGE|
+------+
Partition2:
+------+
|ORANGE|
|ORANGE|
+------+

3.1. Using groupByKey() on the data:

Using groupByKey on the above partitions, groups the values corresponding to each key and returns an RDD of key-value pairs, where each value is an iterable collection of the values corresponding to that key. This operation can be useful in situations where you need to process all the values corresponding to each key together, such as computing aggregates or performing complex operations on groups of related data.


Task1:
+--------+
|APPLE  1|
|ORANGE 1|
+--------+
Task2:
+--------+
|ORANGE 1|
|ORANGE 1|
+--------+

All 4 elements from Task 1 and 2 will be sent over the network to the Task performing the reduced operation.


Task performing reduce
+--------+
|APPLE  1|
|ORANGE 1|
|ORANGE 1|
|ORANGE 1|
+--------+
Final result
+--------+
|APPLE  1|
|ORANGE 3|
+--------+

There are two major issues possible when using groupByKey–

The data is not combined or reduced on the map side, we transferred all elements over the network during the shuffle.
All elements are sent to the task performing the aggregate operation, the number of elements to be handled by the task will be more and could possibly result in an Out of Memory exception.

3.2. Using reduceByKey() on the data:

Just like groupByKey(), on the same word count problem, since we have two partitions we will end up with 2 tasks. However, with a map side combine, the output of the tasks will look like the below –


Task1:
+--------+
|APPLE  1|
|ORANGE 1|
+--------+
Task2:
+--------+
|ORANGE 2|
+--------+

With reduceByKey, only 3 elements from Task 1 and 2 will be sent over the network to the Task performing the reduce operation.


Task performing reduce
+--------+
|APPLE  1|
|ORANGE 1|
|ORANGE 2|
+--------+
Final result
+--------+
|APPLE  1|
|ORANGE 3|
+--------+

reduceByKey groups the values corresponding to each key and then applies a reduction function to the values of each group, returning an RDD of key-value pairs, where each value is the result of the reduction operation for the values corresponding to that key. This operation is more efficient than groupByKey because it performs the reduction operation on each group of values before shuffling the data, reducing the amount of data that needs to be shuffled across the network.

4. Conclusion

In this article, you have learned the difference between Spark RDD reduceBykey() vs groupByKey(). In general, you should use reduceByKey instead of groupByKey whenever possible, as reduceByKey can significantly reduce the amount of data shuffled across the network and thus improve performance. However, reduceByKey requires a reduction function that is both commutative and associative, whereas groupByKey does not have this requirement and can be used in more general situations.

Table of contents