Spark sortByKey() with RDD Example

Spark sortByKey() transformation is an RDD operation used to sort an RDD by its keys in ascending or descending order. sortByKey() operates on a pair RDD (key/value pairs) and is available in <a href="https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala">org.apache.spark.rdd.OrderedRDDFunctions</a>.

First, let’s create an RDD from the list.


  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("Project","A", 1),
    ("Gutenberg’s", "X",3),
    ("Alice’s", "C",5),
    ("Adventures","B", 1)
  )
  val rdd=spark.sparkContext.parallelize(data)

As you see, each element of the data here is a tuple of three values: a word, a letter, and a count.

Spark RDD sortByKey() Syntax

Below is the syntax of the Spark RDD sortByKey() transformation, this returns Tuple2 after sorting the data.


sortByKey(ascending:Boolean,numPartitions:int):org.apache.spark.rdd.RDD[scala.Tuple2[K, V]] 

This function takes two optional arguments: ascending as a Boolean and numPartitions as an integer.

ascending specifies the sort order; it defaults to true, meaning ascending order, and false sorts in descending order.

numPartitions specifies the number of partitions the RDD returned by sortByKey() should have.
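As a quick, self-contained sketch of both optional arguments (the object name SortByKeyArgsExample and the sample pairs are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object SortByKeyArgsExample extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  // Illustrative key/value pairs
  val pairs = spark.sparkContext.parallelize(
    Seq(("C", 5), ("A", 1), ("X", 3), ("B", 1)))

  // Sort descending and place the result in a single partition
  val sorted = pairs.sortByKey(ascending = false, numPartitions = 1)
  sorted.collect().foreach(println)
  // (X,3)
  // (C,5)
  // (B,1)
  // (A,1)

  spark.stop()
}
```

Because numPartitions is 1 here, the sorted result lands in one partition; with more partitions, the data is range-partitioned so keys are globally ordered across partitions.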

RDD sortByKey() Example

Our input RDD is not a pair RDD (key/value pairs), so we cannot apply the sortByKey() transformation directly; first, you need to convert it into a pair RDD.

By using the Spark RDD map() transformation you can convert the RDD into a pair RDD. Here, I would like to sort on the second element of each tuple, so I use it as the key.

val rdd2 = rdd.map(f => (f._2, (f._1, f._2, f._3)))
rdd2.foreach(println)
// Prints to console
(A,(Project,A,1))
(X,(Gutenberg’s,X,3))
(C,(Alice’s,C,5))
(B,(Adventures,B,1))

Now let’s use the sortByKey() to sort.


 val rdd3= rdd2.sortByKey()
 rdd3.foreach(println)

Since I have not passed any arguments, it sorts in ascending order by default. This yields the below output on the console.

// Prints to console
(A,(Project,A,1))
(B,(Adventures,B,1))
(C,(Alice’s,C,5))
(X,(Gutenberg’s,X,3))

The example below sorts in descending order by passing false for the ascending argument.


val rdd4= rdd2.sortByKey(false)
rdd4.foreach(println)
// Prints to console
(X,(Gutenberg’s,X,3))
(C,(Alice’s,C,5))
(B,(Adventures,B,1))
(A,(Project,A,1))
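Any type with an Ordering can serve as the sort key. As a hedged sketch (the object name SortByCountExample is made up), here is the same data keyed on the numeric count instead of the letter:

```scala
import org.apache.spark.sql.SparkSession

object SortByCountExample extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("Project", "A", 1),
    ("Gutenberg’s", "X", 3),
    ("Alice’s", "C", 5),
    ("Adventures", "B", 1))
  val rdd = spark.sparkContext.parallelize(data)

  // Key on the count (third element) instead of the letter
  val byCount = rdd.map(f => (f._3, (f._1, f._2)))

  // Sorts by the Int key: counts come out as 1, 1, 3, 5
  byCount.sortByKey().collect().foreach(println)

  spark.stop()
}
```

Note that ties (here, the two rows with count 1) have no guaranteed relative order among themselves.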

Complete sortByKey() Scala Example

Below is a complete example of RDD sortByKey() transformation with Scala Example.


import org.apache.spark.sql.SparkSession
object SortByKeyExample extends App{

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val data = Seq(("Project","A", 1),
    ("Gutenberg’s", "X",3),
    ("Alice’s", "C",5),
    ("Adventures","B", 1)
  )

  val rdd=spark.sparkContext.parallelize(data)
  rdd.foreach(println)
  val rdd2=rdd.map(f=>{(f._2, (f._1,f._2,f._3))})
  rdd2.foreach(println)
  val rdd3 = rdd2.sortByKey()
  rdd3.foreach(println)
}

Conclusion

In this article, you have learned how to use the Spark RDD sortByKey() transformation to sort an RDD in ascending or descending order. If the RDD is not a pair RDD, you need to convert it using the map() transformation before calling sortByKey().

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
