PySpark Count Distinct from DataFrame


In PySpark, you can use distinct().count() on a DataFrame or the countDistinct() SQL function to get the count of distinct values.

distinct() eliminates duplicate records (rows that match on all columns) from the DataFrame, and count() returns the number of records in the DataFrame. By chaining these two, you get the distinct count of a PySpark DataFrame.

countDistinct() is a SQL function that returns the count of distinct values over one or more selected columns.

Let’s see these two ways with examples.

Before we start, let's first create a DataFrame with some duplicate rows and duplicate values in a column.


# Create SparkSession and Prepare Data
from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('SparkByExamples.com') \
         .getOrCreate()

data = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
columns = ["Name","Dept","Salary"]
df = spark.createDataFrame(data=data,schema=columns)
df.show()

Yields the below output.


# Output
+-------+---------+------+
|   Name|     Dept|Salary|
+-------+---------+------+
|  James|    Sales|  3000|
|Michael|    Sales|  4600|
| Robert|    Sales|  4100|
|  Maria|  Finance|  3000|
|  James|    Sales|  3000|
|  Scott|  Finance|  3300|
|    Jen|  Finance|  3900|
|   Jeff|Marketing|  3000|
|  Kumar|Marketing|  2000|
|   Saif|    Sales|  4100|
+-------+---------+------+

1. Using DataFrame distinct() and count()

The above DataFrame has a total of 10 rows, one of which is an exact duplicate of another, so performing a distinct count ( distinct().count() ) on this DataFrame should return 9.


# Applying distinct() and count()
df1 = df.distinct()
print(df1.count())
df1.show()

This prints 9 as the distinct count, followed by the deduplicated DataFrame:


# Output
9
+-------+---------+------+
|   Name|     Dept|Salary|
+-------+---------+------+
|Michael|    Sales|  4600|
|  James|    Sales|  3000|
| Robert|    Sales|  4100|
|  Scott|  Finance|  3300|
|  Maria|  Finance|  3000|
|    Jen|  Finance|  3900|
|  Kumar|Marketing|  2000|
|   Jeff|Marketing|  3000|
|   Saif|    Sales|  4100|
+-------+---------+------+

Below is an example that gets the distinct values from a single column and then applies count() to get the number of distinct values.


# Apply distinct() and count() on a single column
df2 = df.select("Name").distinct()
print(df2.count())
df2.show()

# Output
9
+-------+
|   Name|
+-------+
|  James|
|Michael|
| Robert|
|  Scott|
|  Maria|
|    Jen|
|  Kumar|
|   Saif|
|   Jeff|
+-------+

If you want to count distinct values based on multiple columns, pass those column names to the select() method, as in the following example; an equivalent dropDuplicates() approach is sketched after the output.


# Applying distinct(), count()  on multiple columns 
df3 = df.select("Name", "Dept").distinct().count()
print(df3)

# Output
9
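
Alternatively, you can achieve the same result with DataFrame dropDuplicates(), which deduplicates on only the columns you name while keeping all columns in the result. Below is a minimal sketch, equivalent to the select().distinct().count() above:


# Alternative: dropDuplicates() on selected columns
df4 = df.dropDuplicates(["Name", "Dept"])
print(df4.count())

# Output
9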

2. Using countDistinct() SQL Function

DataFrame distinct() returns a new DataFrame after eliminating duplicate rows (distinct on all columns). If you want a distinct count on selected multiple columns, use the PySpark SQL function countDistinct(). This function returns the number of distinct elements in a group.

In order to use this function, you need to import it first.


# Using countDistinct() on multiple columns
from pyspark.sql.functions import countDistinct
df2 = df.select(countDistinct("Dept", "Salary"))
df2.show()

Yields the below output.


# Output
+----------------------------+
|count(DISTINCT Dept, Salary)|
+----------------------------+
|                           8|
+----------------------------+

Note that countDistinct() returns a Column type; selecting it gives you a single-row DataFrame, so you need to collect() it to get the value out. This function can be used to get the distinct count of any number of selected columns.


# Applying collect() after countDistinct()
print("Distinct Count of Department & Salary: "+ str(df2.collect()[0][0]))

This outputs “Distinct Count of Department & Salary: 8”
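
As a side note, computing an exact distinct count over a very large dataset can be expensive. PySpark also provides approx_count_distinct(), which returns an approximate distinct count (backed by the HyperLogLog++ algorithm) for much better performance; the optional rsd parameter sets the maximum relative standard deviation allowed. Below is a minimal sketch on the Salary column, where the exact distinct count is 6:


# Approximate distinct count; exact counting can be costly at scale
from pyspark.sql.functions import approx_count_distinct

df5 = df.select(approx_count_distinct("Salary", rsd=0.05))
print(df5.collect()[0][0])  # 6 on this small dataset

On a tiny DataFrame like this, the approximation matches the exact count; the accuracy/speed trade-off only matters at scale.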

3. Using SQL to get Count Distinct


You can also register the DataFrame as a temporary view and get the distinct count with a SQL query. Note that select distinct(count(*)) would not work here: it computes the total row count (10) first and only then applies distinct to that single value. Instead, run count(*) over a distinct subquery.


df.createOrReplaceTempView("EMP")
spark.sql("select count(*) from (select distinct * from EMP) as t").show()

# Displays this on console
+--------+
|count(1)|
+--------+
|       9|
+--------+
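
Similarly, you can get the distinct count of selected columns in SQL with count(distinct ...). Below is a minimal sketch against the same EMP view, mirroring the countDistinct() example above (on this data it should return 8):


# SQL count(distinct) on selected columns
spark.sql("select count(distinct Dept, Salary) from EMP").show()

# Output
+----------------------------+
|count(DISTINCT Dept, Salary)|
+----------------------------+
|                           8|
+----------------------------+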

4. Source Code of PySpark Count Distinct Example


from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('SparkByExamples.com') \
         .getOrCreate()

data = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
columns = ["Name","Dept","Salary"]
df = spark.createDataFrame(data=data,schema=columns)
df.distinct().show()
print("Distinct Count: " + str(df.distinct().count()))

# Using countDistinct()
from pyspark.sql.functions import countDistinct
df2 = df.select(countDistinct("Dept", "Salary"))
df2.show()

print("Distinct Count of Department & Salary: "+ str(df2.collect()[0][0]))

5. Conclusion

In this article, you have learned how to get the distinct count of all columns, or of selected multiple columns, on a PySpark DataFrame using distinct().count(), the countDistinct() SQL function, and a SQL query over a temporary view.

Happy Learning !!
