Spark reduceByKey() with RDD Example

The Spark RDD reduceByKey() transformation merges the values of each key using an associative reduce function. It is a wide transformation, as it shuffles data across multiple partitions, and it operates on a pair RDD (an RDD of key/value pairs). reduceByKey() is available in org.apache.spark.rdd.PairRDDFunctions.

The output is partitioned by either the given numPartitions or the default parallelism level. The default partitioner is the hash partitioner.

First, let’s create an RDD from the list.


  val data = Seq(("Project", 1),
  ("Gutenberg’s", 1),
  ("Alice’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1))

  val rdd = spark.sparkContext.parallelize(data)

As you can see, the data is in key/value pairs: the key is the word and the value is its count.
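Before running it on Spark, it can help to see what reduceByKey() computes conceptually. Below is a minimal plain-Scala sketch (no Spark required) of the same idea: group the pairs by key, then fold each group's values with the same associative function.

```scala
// Conceptual equivalent of reduceByKey() on a plain Scala collection:
// group by key, then reduce the values of each group with (_ + _).
val pairs = Seq(("Project", 1), ("Gutenberg’s", 1), ("Project", 1))

val counts: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }
// counts == Map("Project" -> 2, "Gutenberg’s" -> 1)
```

The difference on Spark is that this merge happens per partition first and the partial results are then shuffled and merged again, which is what makes reduceByKey() more efficient than groupByKey() followed by a reduce.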

Spark reduceByKey() Syntax

Below is the syntax of the Spark RDD reduceByKey() transformation. In Scala, PairRDDFunctions provides three overloads:


def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

RDD reduceByKey() Example

In this example, reduceByKey() reduces the values for each word by applying the + operator. The resulting RDD contains the unique words and their counts.


  val rdd2 = rdd.reduceByKey(_ + _)
  rdd2.foreach(println)

This yields the output below (the ordering of the pairs may vary between runs).

(Project,3)
(Gutenberg’s,3)
(Alice’s,1)
(Adventures,2)
(in,2)
(Wonderland,2)
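The numPartitions overload mentioned earlier controls how many partitions the shuffled result has. A short sketch, assuming the same rdd defined above:

```scala
// Pass an explicit partition count as the second argument to control
// the layout of the shuffled output.
val rdd10 = rdd.reduceByKey(_ + _, numPartitions = 10)
println(rdd10.getNumPartitions)  // 10
```

The key-to-partition mapping still uses the default hash partitioner unless you pass an explicit Partitioner via the third overload.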

Complete reduceByKey() Scala Example

Below is a complete example of the RDD reduceByKey() transformation in Scala.


import org.apache.spark.sql.SparkSession

object ReduceByKeyExample extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("Project", 1),
  ("Gutenberg’s", 1),
  ("Alice’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1))

  val rdd = spark.sparkContext.parallelize(data)

  val rdd2 = rdd.reduceByKey(_ + _)

  rdd2.foreach(println)

}
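reduceByKey() accepts any associative and commutative function, not just addition. As a sketch building on the spark session from the example above (the scores data here is made up for illustration):

```scala
// Keep the maximum value seen for each key instead of summing.
val scores = spark.sparkContext.parallelize(
  Seq(("a", 3), ("b", 5), ("a", 7), ("b", 1)))

val maxPerKey = scores.reduceByKey((x, y) => math.max(x, y))
maxPerKey.collect().foreach(println)  // (a,7) and (b,5), in some order
```

Because the function must be associative and commutative, Spark is free to apply it in any order within and across partitions.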

Conclusion

In this article, you learned that the Spark RDD reduceByKey() transformation merges the values of each key using an associative reduce function, and that it is a wide transformation that shuffles data across RDD partitions.

Happy Learning !!

NNK
