Spark – How to Sort DataFrame column explained

In Spark, you can use either the sort() or orderBy() function of DataFrame/Dataset to sort in ascending or descending order based on one or multiple columns. You can also sort using the Spark SQL sorting functions. In this article, I will explain all these different ways using Scala examples.

Before we start, first let’s create a DataFrame.


  import spark.implicits._

  val simpleData = Seq(("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  )
  val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
  df.printSchema()
  df.show()

This yields the schema and DataFrame output below.


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: integer (nullable = false)
 |-- age: integer (nullable = false)
 |-- bonus: integer (nullable = false)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
|        Raman|   Finance|   CA| 99000| 40|24000|
|        Scott|   Finance|   NY| 83000| 36|19000|
|          Jen|   Finance|   NY| 79000| 53|15000|
|         Jeff| Marketing|   CA| 80000| 25|18000|
|        Kumar| Marketing|   NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+

DataFrame sorting using the sort() function

The Spark DataFrame/Dataset class provides the sort() function to sort on one or more columns. By default, it sorts in ascending order.

Syntax


sort(sortCol : scala.Predef.String, sortCols : scala.Predef.String*) : Dataset[T]
sort(sortExprs : org.apache.spark.sql.Column*) : Dataset[T]

Example


df.sort("department","state").show(false)
df.sort(col("department"),col("state")).show(false)

The above two examples return the same output shown below. The first takes the DataFrame column names as strings, while the second takes columns of Column type. The table is sorted first by the department column and then by the state column.


+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+

DataFrame sorting using the orderBy() function

Alternatively, the Spark DataFrame/Dataset class also provides the orderBy() function to sort on one or more columns. By default, it also sorts in ascending order.

Syntax


orderBy(sortCol : scala.Predef.String, sortCols : scala.Predef.String*) : Dataset[T]
orderBy(sortExprs : org.apache.spark.sql.Column*) : Dataset[T]

Example


df.orderBy("department","state").show(false)
df.orderBy(col("department"),col("state")).show(false)

This returns the same output as the previous section.
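
Note that orderBy() is simply an alias for sort(), so the two are interchangeable. As a quick sanity check (a minimal sketch, assuming the df created above is in scope), you can compare the query plans of both calls:

df.sort("department","state").explain()
df.orderBy("department","state").explain()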

Sort by Ascending (ASC)

If you want to specify the ascending sort order explicitly on the DataFrame, you can use the asc method of the Column class, for example:


df.sort(col("department").asc,col("state").asc).show(false)
df.orderBy(col("department").asc,col("state").asc).show(false)

The above two examples return the same output.


+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+

Sort by Descending (DESC)

If you want to sort the DataFrame in descending order, you can use the desc method of the Column class, for example:


df.sort(col("department").asc,col("state").desc).show(false)
df.orderBy(col("department").asc,col("state").desc).show(false)

This yields the below output.


+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
+-------------+----------+-----+------+---+-----+

Using Sorting functions

Spark SQL provides several sorting functions; below is an example of how to use the asc and desc functions. Besides these, Spark also provides asc_nulls_first and asc_nulls_last functions, along with their descending equivalents desc_nulls_first and desc_nulls_last.


df.select($"employee_name",asc("department"),desc("state"),$"salary",$"age",$"bonus").show(false)

Using Sorting on SQL


df.createOrReplaceTempView("EMP")
spark.sql(" select employee_name,asc('department'),desc('state'),salary,age,bonus from EMP").show(false)

The above two examples return the same output as the descending sort example shown earlier.
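
Spark SQL also supports NULLS FIRST and NULLS LAST in the ORDER BY clause if you need the same control over null placement on the SQL side; a minimal sketch against the EMP view registered above:

spark.sql("select employee_name,department,state,salary,age,bonus from EMP order by department asc nulls last, state desc nulls first").show(false)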

DataFrame Sorting Complete Example


package com.sparkbyexamples.spark.dataframe.functions

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortExample extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  val simpleData = Seq(("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  )
  val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
  df.printSchema()
  df.show()

  df.sort("department","state").show(false)
  df.sort(col("department"),col("state")).show(false)

  df.orderBy("department","state").show(false)
  df.orderBy(col("department"),col("state")).show(false)

  df.sort(col("department").asc,col("state").asc).show(false)
  df.orderBy(col("department").asc,col("state").asc).show(false)

  df.sort(col("department").asc,col("state").desc).show(false)
  df.orderBy(col("department").asc,col("state").desc).show(false)
 df.select($"employee_name",asc("department"),desc("state"),$"salary",$"age",$"bonus").show(false)
  df.createOrReplaceTempView("EMP")
  spark.sql(" select employee_name,asc('department'),desc('state'),salary,age,bonus from EMP").show(false)

}

This complete example is also available at the GitHub project for reference.

Conclusion

In this article, you have learned how to sort Spark DataFrame columns using sort(), orderBy(), and the Spark SQL sorting functions, both through the DataFrame API and Spark SQL, in ascending and descending order.

Happy Learning !!
