PySpark Groupby Count is used to get the number of records for each group. To perform the count, first call groupBy() on the DataFrame, which groups the records based on single or multiple column values, and then call count() to get the number of records in each group.
In this article, I will explain how to use groupBy() and count() together, with examples. The PySpark groupBy() function collects identical data into groups so that aggregate functions such as count() can be run on the grouped data.
1. Quick Examples of Groupby Count
Following are quick examples of how to perform groupby count.
# Quick Examples
# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()
# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()
# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")
# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
First, let’s create a DataFrame.
# Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
.master("local[5]").getOrCreate()
# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()
Yields below output.
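With the nine sample rows defined above, df.show() prints the DataFrame like this (reconstructed from the sample data):
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
|        Raman|   Finance|   CA| 99000| 40|24000|
|        Scott|   Finance|   NY| 83000| 36|19000|
|          Jen|   Finance|   NY| 79000| 53|15000|
|         Jeff| Marketing|   CA| 80000| 25|18000|
|        Kumar| Marketing|   NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+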

2. PySpark Groupby Count Example
By using DataFrame.groupBy().count() in PySpark you can get the number of rows for each group. DataFrame.groupBy() returns a pyspark.sql.GroupedData object, which provides a set of methods to perform aggregations on the grouped data.
# PySpark groupBy() count
df2 = df.groupBy("department").count()
df2.show()
Yields below output.
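Since the sample data has 3 Sales, 4 Finance, and 2 Marketing employees, the result contains these rows (row order may vary between runs):
+----------+-----+
|department|count|
+----------+-----+
|     Sales|    3|
|   Finance|    4|
| Marketing|    2|
+----------+-----+
Note that the aggregate column is always named count. If you prefer a different name, one option is to use agg() with the count() function and alias(); a minimal sketch (the dept_count name is just an example):
# Rename the aggregate column with count() + alias()
from pyspark.sql.functions import count

df.groupBy("department") \
  .agg(count("*").alias("dept_count")) \
  .show()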

3. PySpark Groupby Count on Multiple Columns
Groupby count on multiple columns can be performed by passing two or more columns to groupBy() and calling count() on the result. The following example groups on the department and state columns and then applies the count() function.
# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()
Yields below output.
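From the sample data, each department/state pair yields the counts below (row order may vary between runs):
+----------+-----+-----+
|department|state|count|
+----------+-----+-----+
|     Sales|   NY|    2|
|     Sales|   CA|    1|
|   Finance|   CA|    2|
|   Finance|   NY|    2|
| Marketing|   CA|    1|
| Marketing|   NY|    1|
+----------+-----+-----+
If you want the largest groups first, you can sort the result on the count column with desc() from pyspark.sql.functions; a minimal sketch:
# Sort the grouped counts in descending order
from pyspark.sql.functions import desc

df.groupBy("department","state") \
  .count() \
  .orderBy(desc("count")) \
  .show()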

4. PySpark SQL GROUP BY COUNT
Finally, with PySpark you can also run SQL queries, so let’s run the above example in SQL with a SELECT ... GROUP BY query using COUNT(*).
To do so, first create a temporary view by using createOrReplaceTempView(), then run the query with SparkSession.sql(). The temporary view stays available until you end your SparkSession.
# PySpark SQL Group By Count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")
# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
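This returns the same counts as the DataFrame example above. Because Spark SQL supports standard GROUP BY syntax, you can also filter groups by size with a HAVING clause; a minimal sketch (the threshold of 1 is arbitrary):
# Keep only groups with more than one employee
sql_str = "select department, state, count(*) as count from EMP " + \
          "group by department, state having count(*) > 1"
spark.sql(sql_str).show()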
5. Complete Example of Groupby Count
Following is a complete example of using groupBy() and count().
# Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
.master("local[5]").getOrCreate()
# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()
# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()
# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()
# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")
# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"spark.sql(sql_str).show()
6. Conclusion
PySpark DataFrame.groupBy().count() is used to get the number of rows for each group; you can use it to calculate group sizes on single or multiple columns. You can also get a count per group by using PySpark SQL; in order to use SQL, you first need to create a temporary view.
Related Articles
- PySpark Column alias after groupBy() Example
- PySpark DataFrame groupBy and Sort by Descending Order
- PySpark Count of Non null, nan Values in DataFrame
- PySpark Count Distinct from DataFrame
- PySpark – Find Count of null, None, NaN Values
- PySpark Groupby Agg (aggregate) – Explained
- PySpark Groupby on Multiple Columns
- PySpark Groupby Count Distinct