PySpark GroupBy Count – Explained

PySpark groupBy count is used to get the number of records for each group. To perform the count, first call groupBy() on the DataFrame, which groups the records based on single or multiple column values, and then call count() to get the number of records in each group.

In this article, I will explain how to use groupBy() and count() together, with examples. The PySpark groupBy() function collects identical data into groups, after which you can apply aggregate functions such as count() on the grouped data.

1. Quick Examples of Groupby Count

Following are quick examples of how to perform a groupBy() count.


# Quick Examples
# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()

# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()

First, let’s create a DataFrame to work with.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
        .master("local[5]").getOrCreate()

# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()

Yields below output.

[Figure: df.show() output of the employee DataFrame]

2. PySpark GroupBy Count Example

By using DataFrame.groupBy().count() in PySpark, you can get the number of rows for each group. The DataFrame.groupBy() function returns a pyspark.sql.GroupedData object, which provides a set of methods to perform aggregations on the grouped DataFrame.


# PySpark groupBy() count
df2 = df.groupBy("department").count()
df2.show()

Yields below output.

[Figure: output of df.groupBy("department").count()]
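If you want the count column under a different name, or need to combine the count with other aggregations, you can call agg() on the grouped data with the count() function from pyspark.sql.functions. The following is a minimal sketch based on the df created above; the alias emp_count is just an illustrative name.


# Count per group with an explicit alias using agg()
from pyspark.sql.functions import count

df.groupBy("department") \
  .agg(count("*").alias("emp_count")) \
  .show()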

3. PySpark Groupby Count on Multiple Columns

A groupBy count on multiple columns can be performed by passing two or more column names to groupBy() and calling count() on the result. The following example groups on the department and state columns and applies the count() function to the result.


# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

Yields below output.

[Figure: output of df.groupBy("department","state").count()]
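As an extension of the example above, you can also filter the grouped result on the generated count column, for example to keep only department and state combinations that have more than one employee. This is a small sketch based on the same df.


# Keep only groups having more than one record
from pyspark.sql.functions import col

df.groupBy("department","state") \
  .count() \
  .where(col("count") > 1) \
  .show()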

4. PySpark SQL GROUP BY COUNT

Finally, PySpark also lets you run SQL queries, so let’s run the above example as a SQL SELECT with GROUP BY and COUNT(*).

To do so, first create a temporary view using createOrReplaceTempView(), and then use SparkSession.sql() to run the query. The view remains available until you end your SparkSession.


# PySpark SQL Group By Count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
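If you only want groups above a certain size, the SQL HAVING clause can filter on the aggregated count as well. Below is a small sketch that runs against the EMP view created above; the threshold of 1 is just an example.


# Filter groups using HAVING on the aggregated count
sql_str = "select department, state, count(*) as count from EMP " + \
          "group by department, state having count(*) > 1"
spark.sql(sql_str).show()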

5. Complete Example of Groupby Count

Following is a complete example of the groupBy() and count().


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
        .master("local[5]").getOrCreate()

# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()

# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()

# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()

Frequently Asked Questions on groupBy()

What are some common aggregation functions used with groupBy in PySpark?

Common aggregation functions include avg, sum, min, max, count, first, and last, all of which are available in pyspark.sql.functions. You can also define custom aggregations, for example with pandas UDFs.
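For example, here is a minimal sketch that combines several of these in a single agg() call on the df used in this article; the aliases are illustrative.


# Multiple aggregations per group in one agg() call
from pyspark.sql.functions import count, avg, sum, min, max

df.groupBy("department").agg(
    count("*").alias("count"),
    avg("salary").alias("avg_salary"),
    sum("bonus").alias("total_bonus"),
    min("age").alias("min_age"),
    max("age").alias("max_age")
).show()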

How to sort the result of a groupBy transformation in PySpark?

We can use the orderBy() method after a groupBy() operation to sort the result based on one or more columns.
For example:
df.groupBy("col1").agg({"col2": "avg"}).orderBy("col1")
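As a runnable sketch, assuming the df from this article, the grouped counts can also be sorted by the count column itself in descending order.


# Sort grouped counts in descending order
from pyspark.sql.functions import desc

df.groupBy("department").count().orderBy(desc("count")).show()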

Can we use a user-defined function (UDF) with groupBy in PySpark?

A plain Python UDF cannot be used directly as an aggregate function. However, you can define a grouped-aggregate pandas UDF with pandas_udf() and pass it to agg() after groupBy().
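Here is a minimal sketch of a grouped-aggregate pandas UDF used with groupBy().agg(); it assumes pandas and PyArrow are installed, and salary_range is just an illustrative name, not a built-in function.


# Grouped-aggregate pandas UDF applied per group (requires pandas and pyarrow)
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def salary_range(v: pd.Series) -> float:
    # Custom aggregation: salary spread (max - min) within each group
    return float(v.max() - v.min())

df.groupBy("department").agg(salary_range("salary").alias("salary_range")).show()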

6. Conclusion

PySpark DataFrame.groupBy().count() is used to get the number of rows for each group; you can use it to calculate group sizes whether you group on a single column or on multiple columns. You can also get a count per group using PySpark SQL, which requires creating a temporary view first.

