Spark reduceByKey() with RDD Example

The Spark RDD reduceByKey() transformation is used to merge the values of each key using an associative reduce function. It is a wider transformation, as it shuffles data across multiple partitions, and it operates on pair RDDs (key/value pairs). The reduceByKey() function is available in org.apache.spark.rdd.PairRDDFunctions.

The output will be partitioned by either numPartitions or, if it is not specified, the default parallelism level. The default partitioner is HashPartitioner.
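As a quick illustration of this behavior, here is a minimal sketch; it assumes the SparkSession `spark` and the pair RDD `rdd` that are created later in this article.


  // Hedged sketch: inspecting the partitioning of a reduceByKey() result
  val counts = rdd.reduceByKey(_ + _)   // no numPartitions given
  println(counts.getNumPartitions)      // driven by the default parallelism level
  println(counts.partitioner)           // Some(HashPartitioner(...)) by default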

First, let’s create an RDD from the list.


  val data = Seq(("Project", 1),
  ("Gutenberg’s", 1),
  ("Alice’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1),
  ("Adventures", 1),
  ("in", 1),
  ("Wonderland", 1),
  ("Project", 1),
  ("Gutenberg’s", 1))

  val rdd = spark.sparkContext.parallelize(data)

As you can see, the data is in key/value pairs: the key is the word and the value is its count.
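In practice, such (word, 1) pairs usually come from mapping over raw text rather than from a hand-written list. A hedged sketch of that pattern, where the input line is purely illustrative:


  // Building the same kind of (word, 1) pair RDD from raw text
  val lines = spark.sparkContext.parallelize(
    Seq("Project Gutenberg’s Alice’s Adventures in Wonderland"))
  val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))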

Spark reduceByKey() Syntax

Below is the syntax of the Spark RDD reduceByKey() transformation, as defined on org.apache.spark.rdd.PairRDDFunctions.


def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
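The last two overloads let you control the output partitioning explicitly. A minimal sketch, assuming the pair RDD `rdd` created above (the value names here are illustrative):


  import org.apache.spark.HashPartitioner

  // Explicit partition count (second overload)
  val byCount = rdd.reduceByKey(_ + _, 4)

  // Explicit partitioner (third overload)
  val byPartitioner = rdd.reduceByKey(new HashPartitioner(4), _ + _)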

RDD reduceByKey() Example

In this example, reduceByKey() is used to reduce the values for each word by applying the + operator. The resulting RDD contains the unique words and their counts.


  // Sum the counts for each unique key
  val rdd2 = rdd.reduceByKey(_ + _)
  rdd2.foreach(println)
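The underscore placeholders are Scala shorthand; the same reduction can be written with an explicit anonymous function if that reads more clearly (the name rdd2Explicit is illustrative):


  // Equivalent to rdd.reduceByKey(_ + _)
  val rdd2Explicit = rdd.reduceByKey((a, b) => a + b)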

This yields the below output (the ordering of rows may vary across runs).

(Adventures,2)
(Gutenberg’s,3)
(Alice’s,1)
(in,2)
(Wonderland,2)
(Project,3)

Complete reduceByKey() Scala Example

Below is a complete example of the RDD reduceByKey() transformation in Scala.


import org.apache.spark.sql.SparkSession

object ReduceByKeyExample extends App {

  // Create a local SparkSession
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  // Sample (word, count) pairs
  val data = Seq(("Project", 1),
    ("Gutenberg’s", 1),
    ("Alice’s", 1),
    ("Adventures", 1),
    ("in", 1),
    ("Wonderland", 1),
    ("Project", 1),
    ("Gutenberg’s", 1),
    ("Adventures", 1),
    ("in", 1),
    ("Wonderland", 1),
    ("Project", 1),
    ("Gutenberg’s", 1))

  // Create a pair RDD from the local collection
  val rdd = spark.sparkContext.parallelize(data)

  // Sum the counts for each unique word
  val rdd2 = rdd.reduceByKey(_ + _)

  rdd2.foreach(println)

}
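Note that on a real cluster, foreach(println) runs on the executors, so the output lands in executor logs rather than on the driver console; it works here only because the example runs in local mode. For a small result such as this one, collecting to the driver first is a safe alternative:


  // Prints on the driver; only safe when the result fits in driver memory
  rdd2.collect().foreach(println)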

Conclusion

In this article, you have learned that the Spark RDD reduceByKey() transformation merges the values of each key using an associative reduce function, and that it is a wider transformation that shuffles the data across RDD partitions.

Happy Learning !!

