Spark reduceByKey() with RDD Example

Spark RDD reduceByKey() transformation is used to merge the values of each key using an associative and commutative reduce function. It is a wider transformation, as it shuffles data across multiple partitions, and it operates on pair RDDs (key/value pairs). The reduceByKey() function is available in org.apache.spark.rdd.PairRDDFunctions.

The output will be partitioned by either numPartitions or the default parallelism level. The default partitioner is the HashPartitioner.
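
For example, you can control the partitioning of the result explicitly. Here is a minimal sketch, assuming a pair RDD named rdd like the one created below:

  import org.apache.spark.HashPartitioner

  // Request 4 output partitions explicitly
  val byNumPartitions = rdd.reduceByKey(_ + _, 4)

  // Or pass a partitioner; HashPartitioner is the default
  val byPartitioner = rdd.reduceByKey(new HashPartitioner(4), _ + _)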

First, let’s create an RDD from the list.


  val data = Seq(("Project", 1),
  ("Gutenberg’s", 1),
  ("Alice’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1))

  val rdd = spark.sparkContext.parallelize(data)

As you can see, the data here is in key/value pairs: the key is the word and the value is its count.
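
If you want to inspect the pairs, you can bring them back to the driver and print them (safe here since the dataset is tiny):

  // collect() brings all elements to the driver; use only on small RDDs
  rdd.collect().foreach(println)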


Spark reduceByKey() Syntax

Below are the Scala signatures of the Spark RDD reduceByKey() transformation; it has three overloads:


def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
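
The reduce function can be any associative and commutative binary function on the values, not just addition. For instance, a sketch that keeps the maximum value per key:

// Keep the largest value observed for each key
val maxPerKey = rdd.reduceByKey((a, b) => math.max(a, b))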

RDD reduceByKey() Example

In this example, reduceByKey() is used to merge the values for each key by applying the + operator. The resulting RDD contains the unique words and their counts.


  // Merge the values of each key by summing them
  val rdd2 = rdd.reduceByKey(_ + _)
  rdd2.foreach(println)
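
The underscore placeholders are just Scala shorthand for an anonymous function; the following is equivalent:

  // (a, b) => a + b sums the two counts being merged for a key
  val rdd2Explicit = rdd.reduceByKey((a, b) => a + b)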

This yields the output below (the line order may vary between runs):

(Adventures,2)
(Gutenberg’s,3)
(in,2)
(Wonderland,2)
(Project,3)
(Alice’s,1)

Complete reduceByKey() Scala Example

Below is a complete example of the RDD reduceByKey() transformation in Scala.


import org.apache.spark.sql.SparkSession

object ReduceByKeyExample extends App {

  // Create a SparkSession; local[1] runs Spark locally with a single thread
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  // Sample data as (word, 1) key/value pairs
  val data = Seq(("Project", 1),
  ("Gutenberg’s", 1),
  ("Alice’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1))

  // Create a pair RDD from the local collection
  val rdd = spark.sparkContext.parallelize(data)

  // Merge the values of each key by summing them
  val rdd2 = rdd.reduceByKey(_ + _)

  rdd2.foreach(println)

}
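
Note that foreach(println) prints on the executors; in local mode that appears in your console, but on a cluster the output goes to the executor logs. To print a small result on the driver, collect it first:

// Bring the reduced pairs to the driver before printing
rdd2.collect().foreach(println)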

Conclusion

In this article, you have learned that the Spark RDD reduceByKey() transformation merges the values of each key using an associative reduce function, and that it is a wider transformation that shuffles data across RDD partitions.

Happy Learning !!

