You are currently viewing PySpark Groupby Count Distinct

By using countDistinct() PySpark SQL function you can get the count distinct of the DataFrame that resulted from PySpark groupBy(). countDistinct() is used to get the count of unique values of the specified column.

Advertisements

When you perform group by, the data having the same key are shuffled and brought together. Since it involves the data crawling across the network, group by is considered a wider transformation. In this article, I will explain how to count distinct values of the column after groupBy() in PySpark Dataframe.

1. Quick Examples of Groupby Count Distinct

Following are quick examples of groupby count distinct.


# groupby columns & countDistinct
df.groupBy("department").agg(countDistinct('state')) \
    .show(truncate=False)

# Using groupby() and count()
df.groupBy("department","state").count().groupBy("department").count()

# Using SQL
df.createOrReplaceTempView("EMP")
spark.sql("SELECT department, count(DISTINCT state) as count" + \
"FROM EMP group by department").show()

Let’s create a PySpark DataFrame.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
        .master("local[5]").getOrCreate()

# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show(truncate=False)

Yields below output.

PySpark Groupby Count Distinct

2. PySpark Groupby Count Distinct

From the PySpark DataFrame, let’s get the distinct count (unique count) of state‘s for each department, in order to get this first, we need to perform the groupBy() on department column and on top of the group result perform avg(countDistinct()) on the state column.

In order to use countDistinct() method first, you need to import it from pyspark.sql.functions


# groupby columns & countDistinct
from pyspark.sql.functions import countDistinct
df.groupBy("department").agg(countDistinct('state')) \
    .show(truncate=False)

Yields below output. If you look at our data we have 2 distinct states for each department. hence, the below result.


# Output
+----------+------------+
|department|count(state)|
+----------+------------+
|Sales     |2           |
|Finance   |2           |
|Marketing |2           |
+----------+------------+

If you notice the distinct count column name is count(state), you can change the column name after group by using an alias.


+----------+-----+
|department|count|
+----------+-----+
|Sales     |2    |
|Finance   |2    |
|Marketing |2    |
+----------+-----+

3. Using groupby() & count()

To calculate the count of unique values of the group by the result, first, run the PySpark groupby() on two columns and then perform the count and again perform groupby.

This solution is not suggestible to use as it impacts the performance of the query when running on billions of events.


# Using groupby() and count()
df.groupBy("department","state").count().groupBy("department").count()

Yields the same output as above.

3. Using SQL

Let’s run the distinct count example using ANSI SQL query.


df.createOrReplaceTempView("EMP")
    
spark.sql("SELECT department, count(DISTINCT state) as count" + \
"FROM EMP group by department").show()

4. Conclusion

In this PySpark article, you have learned how to get the number of unique values of groupBy results by using countDistinct(), distinct().count() and SQL . All these methods are used to get the count of distinct values of the specified column and apply this to group by results to get Groupby Count Distinct.

References