Spark SQL – Count Distinct from DataFrame

In this Spark SQL tutorial, you will learn different ways to count distinct values across all columns or over selected columns of a DataFrame, using DataFrame methods and the Spark SQL countDistinct() function, with Scala examples.

Before we start, let’s first create a DataFrame with a duplicate row and duplicate values in some columns.


  import spark.implicits._

  val simpleData = Seq(("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  )
  val df = simpleData.toDF("employee_name", "department", "salary")
  df.show()

This yields the output below.


+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

Using DataFrame Count Distinct

The above DataFrame has a total of 10 rows, one of which duplicates another in every column, so performing a distinct count ( distinct().count() ) on this DataFrame should return 9.


 println("Distinct Count: " + df.distinct().count())

This yields the output “Distinct Count: 9”.
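If you need distinct rows over selected columns rather than all of them, a related approach (standard DataFrame API, not shown in the snippet above) is dropDuplicates(), which keeps one row per unique combination of the given columns; counting the result is an alternative way to get a distinct count on selected columns. A minimal sketch:


  // Keep one row per unique (department, salary) pair, then count them
  val pairCount = df.dropDuplicates("department", "salary").count()
  println("Distinct department/salary pairs: " + pairCount) // 8 for the sample data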

Using SQL Count Distinct

distinct() considers all columns. If you want a distinct count over selected columns only, use the Spark SQL function countDistinct(), which returns the number of distinct combinations of the given columns within a group.

In order to use this function, you first need to import it with "import org.apache.spark.sql.functions.countDistinct".


  val df2 = df.select(countDistinct("department", "salary"))
  df2.show(false)

This yields the output below.


+----------------------------------+
|count(DISTINCT department, salary)|
+----------------------------------+
|8                                 |
+----------------------------------+

Note that countDistinct() returns a Column, so the aggregation result comes back as a DataFrame; you need to collect it to extract the actual value. The function can be applied to any number of columns.


println("Distinct Count of Department & Salary: "+ df2.collect()(0)(0))

This outputs “Distinct Count of Department & Salary: 8”
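Because countDistinct() is the DataFrame counterpart of SQL’s COUNT(DISTINCT ...), you can get the same result with a plain SQL query against a temporary view. A minimal sketch (the view name "EMP" is chosen here just for illustration):


  // Register the DataFrame as a temporary view so it can be queried with SQL
  df.createOrReplaceTempView("EMP")

  // COUNT(DISTINCT ...) over the same two columns; returns 8 for the sample data
  spark.sql("SELECT COUNT(DISTINCT department, salary) AS distinct_cnt FROM EMP").show()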

Source Code of SQL Count Distinct Example


package com.sparkbyexamples.spark.dataframe.functions.aggregate

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DistinctCount extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  val simpleData = Seq(("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  )
  val df = simpleData.toDF("employee_name", "department", "salary")
  df.show()

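  // Count distinct rows, considering all columns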
  println("Distinct Count: " + df.distinct().count())

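  // Count distinct combinations of the selected columns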
  val df2 = df.select(countDistinct("department", "salary"))
  df2.show(false)
  println("Distinct Count of Department & Salary: "+df2.collect()(0)(0))

}

The complete example is available on GitHub for reference.

Conclusion

In this article, you have learned how to get a distinct count over all columns or selected columns of a DataFrame using distinct() and the Spark SQL countDistinct() function.

Happy Learning !!
