Spark RDD reduceByKey()
The Spark RDD reduceByKey() transformation is used to merge the values of each key using an associative reduce function. It is a wider transformation, as it shuffles data across multiple partitions, and it operates on pair RDDs (key/value pairs). The reduceByKey() function is available in org.apache.spark.rdd.PairRDDFunctions.
The output is partitioned by either numPartitions or the default parallelism level, and the default partitioner is the hash partitioner.
First, let’s create an RDD from the list.
val data = Seq(("Project", 1),
("Gutenberg’s", 1),
("Alice’s", 1),
("Adventures", 1),
("in", 1),
("Wonderland", 1),
("Project", 1),
("Gutenberg’s", 1),
("Adventures", 1),
("in", 1),
("Wonderland", 1),
("Project", 1),
("Gutenberg’s", 1))
val rdd = spark.sparkContext.parallelize(data)
As you can see, the data is in key/value pairs: the key is the word and the value is its count.
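To sanity-check the pair RDD before reducing it, here is a minimal sketch; it assumes the rdd created above, and collect() is safe here only because the dataset is tiny.

// Inspect the partitioning and contents of the pair RDD (assumes `rdd` from above)
println(s"Number of partitions: ${rdd.getNumPartitions}")
rdd.collect().foreach(println) // collect() is acceptable only for small datasets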
Spark reduceByKey() Syntax
Below is the syntax of the Spark RDD reduceByKey() transformation.
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
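As a quick sketch of these overloads (assuming the rdd created above), you can let Spark choose the partitioning, request a specific number of output partitions, or pass a partitioner explicitly:

import org.apache.spark.HashPartitioner

// Default: hash-partitioned output at the default parallelism level
val reduced = rdd.reduceByKey(_ + _)

// Request an explicit number of output partitions
val reducedInto4 = rdd.reduceByKey(_ + _, 4)

// Pass a partitioner explicitly (HashPartitioner is Spark's default)
val reducedByPartitioner = rdd.reduceByKey(new HashPartitioner(4), _ + _)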
RDD reduceByKey() Example
In this example, reduceByKey() is used to reduce the values of each key by applying the + operator on them. The resulting RDD contains the unique words and their counts.
val rdd2 = rdd.reduceByKey(_ + _)
rdd2.foreach(println)
This yields the output below. Note that foreach() prints on the executors, so the ordering of records may vary between runs.

(Project,3)
(Gutenberg’s,3)
(Alice’s,1)
(Adventures,2)
(in,2)
(Wonderland,2)
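Since reduceByKey() only requires an associative (and commutative) merge function, the underscore shorthand above is equivalent to passing a named function, as this small sketch shows (it assumes the rdd created above):

// A named associative merge function; equivalent to _ + _
def mergeCounts(a: Int, b: Int): Int = a + b
val rdd3 = rdd.reduceByKey(mergeCounts)
rdd3.collect().foreach(println) // same counts as rdd2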
Complete reduceByKey() Scala Example
Below is a complete example of the RDD reduceByKey() transformation in Scala.
import org.apache.spark.sql.SparkSession

object ReduceByKeyExample extends App {

  // Create a local SparkSession with a single core
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  // Sample (word, count) pairs
  val data = Seq(("Project", 1),
    ("Gutenberg’s", 1),
    ("Alice’s", 1),
    ("Adventures", 1),
    ("in", 1),
    ("Wonderland", 1),
    ("Project", 1),
    ("Gutenberg’s", 1),
    ("Adventures", 1),
    ("in", 1),
    ("Wonderland", 1),
    ("Project", 1),
    ("Gutenberg’s", 1))
  val rdd = spark.sparkContext.parallelize(data)

  // Merge the values of each key with the + operator
  val rdd2 = rdd.reduceByKey(_ + _)
  rdd2.foreach(println)
}
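In practice, the (word, 1) pairs usually come from raw text rather than a hand-written list. As a minimal sketch (it reuses the spark session from the example above; the input line is just for illustration):

// Build (word, 1) pairs from a raw line of text, then reduce by key
val line = "Project Gutenberg’s Alice’s Adventures in Wonderland"
val counts = spark.sparkContext
  .parallelize(Seq(line))
  .flatMap(_.split(" "))   // split the line into words
  .map(word => (word, 1))  // pair each word with an initial count of 1
  .reduceByKey(_ + _)      // merge the counts of each key
counts.collect().foreach(println)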
Conclusion
In this article, you have learned that the Spark RDD reduceByKey() transformation merges the values of each key using an associative reduce function, and that it is a wider transformation that shuffles data across RDD partitions.
Happy Learning !!