Spark SQL – Count Distinct from DataFrame

In this Spark SQL tutorial, you will learn different ways to count the distinct values in every column or selected columns of rows in a DataFrame using methods available on DataFrame and SQL function using Scala examples.

Before we start, first let’s create a DataFrame with some duplicate rows and duplicate values in a column.


  import spark.implicits._

  val simpleData = Seq(("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  )
  val df = simpleData.toDF("employee_name", "department", "salary")
  df.show()

Yields below output


+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

Using DataFrame Count Distinct

On the above DataFrame, we have a total of 9 rows and one row with all values duplicated, performing distinct count ( distinct().count() ) on this DataFrame should get us 8.


 println("Distinct Count: " + df.distinct().count())

This yields output “Distinct Count: 8”

Using SQL Count Distinct

distinct() runs distinct on all columns, if you want to get count distinct on selected columns, use the Spark SQL function countDistinct(). This function returns the number of distinct elements in a group.

In order to use this function, you need to import first using, "import org.apache.spark.sql.functions.countDistinct"


  val df2 = df.select(countDistinct("department", "salary"))
  df2.show(false)

Yields below output


+----------------------------------+
|count(DISTINCT department, salary)|
+----------------------------------+
|8                                 |
+----------------------------------+

Note that countDistinct() function returns a value in a Column type hence, you need to collect it to get the value from the DataFrame. And this function can be used to get the distinct count of any number of columns.


println("Distinct Count of Department & Salary: "+ df2.collect()(0)(0))

This outputs “Distinct Count of Department & Salary: 8”

Source Code of SQL Count Distinct Example


package com.sparkbyexamples.spark.dataframe.functions.aggregate

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DistinctCount extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  val simpleData = Seq(("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  )
  val df = simpleData.toDF("employee_name", "department", "salary")
  df.show()

  println("Distinct Count: " + df.distinct().count())

  val df2 = df.select(countDistinct("department", "salary"))
  df2.show(false)
  println("Distinct Count of Department & Salary: "+df2.collect()(0)(0))

}

The complete example is available at GitHub for reference.

Conclusion

In this article, you have learned how to get count distinct of all columns or selected columns on DataFrame using Spark SQL functions.

Happy Learning !!

NNK

SparkByExamples.com is a BigData and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment using Scala and Maven.

Leave a Reply