
Spark groupByKey()

The Spark or PySpark groupByKey() is a frequently used wide transformation that shuffles data across the executors when the data is not already partitioned on the key. It takes key-value pairs (K, V) as input, groups the values by key (K), and returns a dataset of (K, Iterable[V]) pairs as output.

It’s a very expensive operation and consumes a lot of memory if the dataset is huge.

Now let us see how to use the Spark groupByKey() function on an RDD with an example.

1. Spark groupByKey()


// Create an RDD
val rdd = spark.sparkContext.parallelize(Seq(("A",1),("A",3),("B",4),("B",2),("C",5)))  
// Get the data in RDD
val rdd_collect = rdd.collect()

// Output:
rdd_collect: Array[(String, Int)] = Array((A,1), (A,3), (B,4), (B,2), (C,5))

Above, we created an RDD of (name: String, count: Int) pairs. Now we want to group the counts by name using the Spark groupByKey() function, so that each element of the result holds a name and the collection of all counts observed for that name, i.e. (name, Iterable(count1, count2, ...)).


// Group rdd based on key
rdd.groupByKey.collect()

// Output:
res15: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(1, 3)), (B,CompactBuffer(4, 2)), (C,CompactBuffer(5)))

Note: The Spark groupByKey() method is recommended only when no aggregation is required over the values of each key.
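
For example, when you genuinely need every value of a key (say, as a sorted list), groupByKey() is a reasonable fit. The snippet below is a minimal sketch using the same rdd as above; mapValues and the optional numPartitions argument are standard RDD APIs, and the output ordering is illustrative.


// Keep all counts per name, sorted; no aggregation is performed,
// so groupByKey() is appropriate here.
val countsPerName = rdd
  .groupByKey()                      // (name, Iterable[Int])
  .mapValues(_.toList.sorted)        // materialize and sort the values
  .collect()

// Output:
countsPerName: Array[(String, List[Int])] = Array((A,List(1, 3)), (B,List(2, 4)), (C,List(5)))

groupByKey() also accepts a number of partitions (or a Partitioner) to control the parallelism of the shuffle, e.g. rdd.groupByKey(4).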

2. Compare groupByKey vs reduceByKey

When we work on large datasets, the reduceByKey() function is preferred over the Spark groupByKey() function. Let us check it out with an example.


// Using the reduceByKey() function
val nameCountsWithReduce = rdd
  .reduceByKey(_ + _)
  .collect()

// Using the groupByKey() function
val nameCountsWithGroup = rdd
  .groupByKey()
  .map(x => (x._1, x._2.sum))
  .collect()

// Output:
nameCountsWithReduce: Array[(String, Int)] = Array((A,4), (B,6), (C,5))
nameCountsWithGroup: Array[(String, Int)] = Array((A,4), (B,6), (C,5))

From the above example, it is clear that both functions produce the same output, but reduceByKey() performs much better on a large dataset than the Spark groupByKey() function. The main reason for the performance difference is that reduceByKey() shuffles less data, because Spark can combine the values with a common key on each partition before shuffling the data.
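
The same idea extends to other aggregations. The sketch below (an illustrative pattern, not from the original example) computes the average count per name by carrying (sum, count) pairs through reduceByKey(), so partial results are combined on each partition before the shuffle; the output ordering is illustrative.


// Average count per name without groupByKey():
// carry (sum, count) pairs so they can be combined map-side.
val avgByName = rdd
  .mapValues(v => (v, 1))                              // (name, (count, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))   // partial sums merged per partition
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }
  .collect()

// Output:
avgByName: Array[(String, Double)] = Array((A,2.0), (B,3.0), (C,5.0))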

Look at the below diagram to understand what happens when we use reduceByKey() vs Spark groupByKey() on our example dataset.

[Diagram: Spark groupByKey() vs reduceByKey()]

From the above diagram, when calling Spark groupByKey() on the dataset, every (key, value) pair is shuffled across the network so that all values of a key end up on the same executor before they are grouped.

Whereas when we use reduceByKey() on the same dataset, the values with a common key are first combined on each partition, and only these partial results are shuffled, so far less data moves across the network.

3. Conclusion

Both Spark groupByKey() and reduceByKey() are wide transformations that perform shuffling at some point. The main difference is that, when working on larger datasets, reduceByKey() is faster because it shuffles less data than Spark groupByKey(). We can also use combineByKey() and foldByKey() as replacements for groupByKey(), as shown in the sketch below.
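
As a rough illustration of those alternatives, the sketch below computes the same name counts with foldByKey() and combineByKey(); both are standard RDD APIs and, like reduceByKey(), combine values on each partition before shuffling.


// Same per-name sums using foldByKey(): like reduceByKey() with a zero value.
val nameCountsWithFold = rdd.foldByKey(0)(_ + _).collect()

// Same result using combineByKey(): createCombiner, mergeValue, mergeCombiners.
val nameCountsWithCombine = rdd
  .combineByKey(
    (v: Int) => v,                    // createCombiner: start from the first value
    (acc: Int, v: Int) => acc + v,    // mergeValue: add a value within a partition
    (a: Int, b: Int) => a + b         // mergeCombiners: merge partial sums across partitions
  )
  .collect()

// Both produce Array((A,4), (B,6), (C,5)) (ordering may vary).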
