
By using the countDistinct() PySpark SQL function, you can get the distinct count of a column on the result of a PySpark groupBy(). countDistinct() returns the number of unique values in the specified column.

When you perform a group by, rows having the same key are shuffled and brought together. Since it involves data moving across the network, group by is considered a wide transformation. In this article, I will explain how to count distinct values of a column after groupBy() in a PySpark DataFrame.

1. Quick Examples of Groupby Count Distinct

Following are quick examples of groupby count distinct.


# Import countDistinct
from pyspark.sql.functions import countDistinct

# groupby columns & countDistinct
df.groupBy("department").agg(countDistinct('state')) \
    .show(truncate=False)

# Using groupby() and count()
df.groupBy("department","state").count() \
    .groupBy("department").count().show()

# Using SQL
df.createOrReplaceTempView("EMP")
spark.sql("SELECT department, count(DISTINCT state) as count " + \
    "FROM EMP GROUP BY department").show()

Let’s create a PySpark DataFrame.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
        .master("local[5]").getOrCreate()

# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show(truncate=False)

Yields below output.

# Output
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
+-------------+----------+-----+------+---+-----+

2. PySpark Groupby Count Distinct

From the PySpark DataFrame, let's get the distinct count (unique count) of states for each department. To get this, first perform groupBy() on the department column, then apply countDistinct() on the state column using agg().

To use the countDistinct() function, you first need to import it from pyspark.sql.functions.


# groupby columns & countDistinct
from pyspark.sql.functions import countDistinct
df.groupBy("department").agg(countDistinct('state')) \
    .show(truncate=False)

Yields below output. If you look at our data, each department has 2 distinct states, hence the result below.


# Output
+----------+------------+
|department|count(state)|
+----------+------------+
|Sales     |2           |
|Finance   |2           |
|Marketing |2           |
+----------+------------+

Notice that the distinct count column is named count(state). You can rename it after the group by using alias(), as shown below.
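
# groupby columns & countDistinct with alias
df.groupBy("department") \
    .agg(countDistinct('state').alias("count")) \
    .show(truncate=False)

Yields below output.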


+----------+-----+
|department|count|
+----------+-----+
|Sales     |2    |
|Finance   |2    |
|Marketing |2    |
+----------+-----+

3. Using groupby() & count()

To calculate the count of unique values per group another way, first run PySpark groupBy() on the two columns and count(), then group the result by department and count() again.

This approach is not recommended, as the double grouping hurts query performance when running on billions of events.


# Using groupby() and count()
df.groupBy("department","state").count() \
    .groupBy("department").count().show()

Yields the same output as above.
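
As a related sketch (not part of the original examples), you can get the same result by deduplicating the department/state pairs with distinct() before grouping; this is the distinct().count() pattern referenced in the conclusion. And when an approximate answer is acceptable on very large data, approx_count_distinct() is cheaper than an exact distinct count.


# Alternative: deduplicate (department, state) pairs, then count per department
df.select("department", "state").distinct() \
    .groupBy("department").count() \
    .show(truncate=False)

# Approximate distinct count, useful on very large data
from pyspark.sql.functions import approx_count_distinct
df.groupBy("department") \
    .agg(approx_count_distinct("state").alias("count")) \
    .show(truncate=False)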

4. Using SQL

Finally, let's convert the above code into a PySpark SQL query to get the group by distinct count. To do so, first create a temporary view using createOrReplaceTempView(), then run the query with SparkSession.sql(). The view remains available until you end your SparkSession.


df.createOrReplaceTempView("EMP")

spark.sql("SELECT department, count(DISTINCT state) as count " + \
    "FROM EMP GROUP BY department").show()

5. Conclusion

In this PySpark article, you have learned how to get the number of unique values of groupBy results by using countDistinct(), a double groupBy() with count(), distinct().count(), and SQL. All these methods get the count of distinct values of the specified column and apply it to the groupBy() result to get the groupby count distinct.
