PySpark Count Distinct from DataFrame

In PySpark, you can get the distinct count either by chaining distinct().count() on a DataFrame or by using the countDistinct() SQL function.

distinct() eliminates duplicate records (rows matching on all columns) from the DataFrame, and count() returns the number of records in the DataFrame. By chaining these two, you get the distinct count of a PySpark DataFrame.

countDistinct() is a SQL function that can be used to get the distinct count of the selected columns.

Let’s see these approaches with examples.

First, let’s create a DataFrame with some duplicate rows and duplicate values in a column.


from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('SparkByExamples.com') \
         .getOrCreate()

data = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
columns = ["Name","Dept","Salary"]
df = spark.createDataFrame(data=data,schema=columns)
df.show()

Yields the below output.


+-------+---------+------+
|   Name|     Dept|Salary|
+-------+---------+------+
|  James|    Sales|  3000|
|Michael|    Sales|  4600|
| Robert|    Sales|  4100|
|  Maria|  Finance|  3000|
|  James|    Sales|  3000|
|  Scott|  Finance|  3300|
|    Jen|  Finance|  3900|
|   Jeff| Marketing|  3000|
|  Kumar| Marketing|  2000|
|   Saif|    Sales|  4100|
+-------+---------+------+

Using DataFrame distinct() and count()

The above DataFrame has a total of 10 rows, one of which duplicates another row on all columns ("James", "Sales", 3000 appears twice), so performing a distinct count (distinct().count()) on it should return 9.


print("Distinct Count: " + str(df.distinct().count()))

This yields the output “Distinct Count: 9”.
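
As an aside, DataFrame dropDuplicates() gives the same result; with no arguments it behaves like distinct(), and by passing a subset of columns you can count distinct combinations of just those columns using only the DataFrame API. A minimal sketch, reusing the df created above:


# dropDuplicates() with no columns behaves like distinct()
print(df.dropDuplicates().count())                    # 9

# With a subset, it keeps one row per (Dept, Salary) combination
print(df.dropDuplicates(["Dept", "Salary"]).count())  # 8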

Using countDistinct() SQL Function

DataFrame distinct() returns a new DataFrame after eliminating duplicate rows (distinct on all columns). If you want the distinct count of only selected columns, use the PySpark SQL function countDistinct(). This function returns the number of distinct elements in a group.

In order to use this function, you need to import it first.


from pyspark.sql.functions import countDistinct
df2=df.select(countDistinct("Dept", "Salary"))
df2.show()

Yields the below output.


+----------------------------+
|count(DISTINCT Dept, Salary)|
+----------------------------+
|                           8|
+----------------------------+

Note that countDistinct() returns a Column, so the result of the select() is a DataFrame and you need to collect() it to extract the value. The function can be used to get the distinct count of any number of selected columns, or of all columns.


print("Distinct Count of Department & Salary: "+ str(df2.collect()[0][0]))

This outputs “Distinct Count of Department & Salary: 8”
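
If you'd rather avoid the positional collect()[0][0] lookup, you can name the aggregate with alias() and read it by name; a minimal sketch (the name "distinct_count" is just an illustrative choice):


from pyspark.sql.functions import countDistinct

# alias() names the aggregate column so the Row can be read by name
row = df.select(countDistinct("Dept", "Salary").alias("distinct_count")).first()
print(row["distinct_count"])  # 8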

Using SQL to get Count Distinct

You can also register the DataFrame as a temporary view and run a SQL query with count(DISTINCT ...) on the desired columns.


df.createOrReplaceTempView("EMP")
spark.sql("select distinct(count(*)) from EMP").show()

# Displays this on console
+--------+
|count(1)|
+--------+
|      10|
+--------+
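
On very large datasets an exact distinct count can be expensive. As a side note, PySpark also provides approx_count_distinct(), which trades a small, configurable error for speed; a minimal sketch on a single column:


from pyspark.sql.functions import approx_count_distinct

# rsd is the maximum relative standard deviation allowed (default 0.05)
df.select(approx_count_distinct("Salary", rsd=0.05)).show()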

Source Code of PySpark Count Distinct Example


from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('SparkByExamples.com') \
         .getOrCreate()

data = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
columns = ["Name","Dept","Salary"]
df = spark.createDataFrame(data=data,schema=columns)
df.distinct().show()
print("Distinct Count: " + str(df.distinct().count()))

# Using countDistinct()
from pyspark.sql.functions import countDistinct
df2=df.select(countDistinct("Dept","Salary"))
df2.show()

print("Distinct Count of Department & Salary: "+ str(df2.collect()[0][0]))

Conclusion

In this article, you have learned how to get count distinct of all columns or selected columns on PySpark DataFrame.

Happy Learning !!
