Spark reduceByKey() with RDD Example


The Spark RDD reduceByKey() transformation merges the values of each key using an associative reduce function. It is a wider transformation, as it shuffles data across multiple partitions, and it operates on a pair RDD (key/value pairs). The reduceByKey() function is available in org.apache.spark.rdd.PairRDDFunctions.

The output is partitioned by either numPartitions or the default parallelism level, and the default partitioner is HashPartitioner.

First, let’s create an RDD from the list.


  val data = Seq(("Project", 1),
  ("Gutenberg’s", 1),
  ("Alice’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1))

  val rdd = spark.sparkContext.parallelize(data)

As you can see, the data is in key/value pairs: the key is the word and the value is its count.
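
In practice, such pairs are usually produced by splitting lines of text into words and mapping each word to an initial count of 1. A minimal sketch, assuming a hypothetical textRDD of raw lines (textRDD is not part of this example):


  // Hypothetical: textRDD is an RDD[String] of raw text lines
  val pairs = textRDD
    .flatMap(_.split("\\s+")) // split each line into words
    .map(word => (word, 1))   // pair every word with an initial count of 1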


Spark reduceByKey() Syntax

Below is the syntax of the Spark RDD reduceByKey() transformation in Scala. PairRDDFunctions defines three overloads:


def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
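
The optional parameters control how the result is partitioned. A minimal sketch using the rdd created above (the partition count of 10 here is an arbitrary choice for illustration):


  import org.apache.spark.HashPartitioner

  // Specify the number of result partitions explicitly
  val counts10 = rdd.reduceByKey(_ + _, 10)

  // Or pass a Partitioner; HashPartitioner matches the default behavior
  val countsHash = rdd.reduceByKey(new HashPartitioner(10), _ + _)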

RDD reduceByKey() Example

In this example, reduceByKey() is used to reduce the values for each word by applying the + operator to them. The resulting RDD contains unique words and their counts.


  val rdd2 = rdd.reduceByKey(_ + _)
  rdd2.foreach(println)
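
The underscores in _ + _ are Scala shorthand for an anonymous function; the same reduce function can be written with explicit parameter names, which some readers find clearer:


  // Equivalent to rdd.reduceByKey(_ + _)
  val wordCounts = rdd.reduceByKey((accum, count) => accum + count)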

This yields the output below. Note that the ordering may vary between runs, since foreach prints elements in whatever order the partitions produce them.


(Project,3)
(Gutenberg’s,3)
(Alice’s,1)
(Adventures,2)
(in,2)
(Wonderland,2)

Complete reduceByKey() Scala Example

Below is a complete example of the RDD reduceByKey() transformation in Scala.


import org.apache.spark.sql.SparkSession

object ReduceByKeyExample extends App {

  // Create a SparkSession running locally on a single core
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("Project", 1),
    ("Gutenberg’s", 1),
    ("Alice’s", 1),
    ("Adventures", 1),
    ("in", 1),
    ("Wonderland", 1),
    ("Project", 1),
    ("Gutenberg’s", 1),
    ("Adventures", 1),
    ("in", 1),
    ("Wonderland", 1),
    ("Project", 1),
    ("Gutenberg’s", 1))

  // Create a pair RDD of (word, 1) tuples
  val rdd = spark.sparkContext.parallelize(data)

  // Merge the values of each key with the + operator
  val rdd2 = rdd.reduceByKey(_ + _)

  // Print each unique word and its count
  rdd2.foreach(println)

}
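
Because foreach runs on the partitions, the print order is not guaranteed. For a small result like this one, you can get a deterministic, sorted view by collecting to the driver first:


  // Collect the (word, count) pairs to the driver and sort by word
  rdd2.collect().sortBy(_._1).foreach(println)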

Conclusion

In this article, you learned that the Spark RDD reduceByKey() transformation merges the values of each key using an associative reduce function, and that it is a wider transformation that shuffles data across RDD partitions.

Happy Learning !!

