PySpark's GroupBy Count is used to get the total number of records within each group. To perform the count, first apply the groupBy() method on the DataFrame, which groups the records based on one or more column values. Then, the count() function is used to get the number of records within each group.
Key Points:
- Use the count() function within the GroupBy operation to calculate the number of records within each group.
- Specify the column(s) to group by within the groupBy() operation. This could be a single column or multiple columns based on your requirements.
- If desired, you can optionally sort the results based on the count or other criteria to analyze the data more effectively.
- Ensure that the DataFrame is properly prepared and cleaned before performing the GroupBy operation to obtain accurate counts per group.
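Before diving into PySpark, it helps to see what a per-group count actually computes: a tally of rows that share the same key. The following plain-Python sketch (not PySpark; shown only to build intuition, using the same sample data as this article) does the equivalent of groupBy("department").count():

```python
from collections import Counter

# Sample rows: (employee_name, department, state, salary, age, bonus)
rows = [
    ("James", "Sales", "NY", 90000, 34, 10000),
    ("Michael", "Sales", "NY", 86000, 56, 20000),
    ("Robert", "Sales", "CA", 81000, 30, 23000),
    ("Maria", "Finance", "CA", 90000, 24, 23000),
    ("Raman", "Finance", "CA", 99000, 40, 24000),
    ("Scott", "Finance", "NY", 83000, 36, 19000),
    ("Jen", "Finance", "NY", 79000, 53, 15000),
    ("Jeff", "Marketing", "CA", 80000, 25, 18000),
    ("Kumar", "Marketing", "NY", 91000, 50, 21000),
]

# Tally rows per department -- the same result groupBy("department").count()
# produces, but computed eagerly in one process instead of distributed.
counts = Counter(row[1] for row in rows)
print(counts)  # Counter({'Finance': 4, 'Sales': 3, 'Marketing': 2})
```

PySpark computes the same tally, but lazily and partitioned across executors.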
1. Quick Examples of Group Count
Below are quick examples of getting the number of records within each group.
# Quick Examples
# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()
# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()
# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")
# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
To explain how to get the groupby count, create the following DataFrame.
# Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
.master("local[5]").getOrCreate()
# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()
Yields below output.
2. GroupBy Count in PySpark
To get the groupby count on PySpark DataFrame, first apply the groupBy() method on the DataFrame, specifying the column you want to group by, and then use the count() function within the GroupBy operation to calculate the number of records within each group.
The groupBy() method returns a pyspark.sql.GroupedData object, which provides the count() function to get the aggregation.
# PySpark groupBy() count
df2 = df.groupBy("department").count()
df2.show()
Yields below output.
3. Groupby Count on Multiple Columns
To get the Group by count on multiple columns, pass two or more columns to the groupBy() function and use the count() to get the result. Here is an example.
# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()
Here, groupBy("department","state").count() groups the DataFrame df by the “department” and “state” columns and gets the count of records in each group. The result is stored in DataFrame df2, containing the columns “department”, “state”, and “count”.
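With multiple grouping columns, the grouping key is simply the tuple of column values. The plain-Python analogue below (again, not PySpark; just the underlying idea) tallies by (department, state) pairs and then sorts by count descending, which mirrors chaining .orderBy() on the “count” column in PySpark:

```python
from collections import Counter

# (employee_name, department, state) -- only the fields used here
rows = [
    ("James", "Sales", "NY"), ("Michael", "Sales", "NY"),
    ("Robert", "Sales", "CA"), ("Maria", "Finance", "CA"),
    ("Raman", "Finance", "CA"), ("Scott", "Finance", "NY"),
    ("Jen", "Finance", "NY"), ("Jeff", "Marketing", "CA"),
    ("Kumar", "Marketing", "NY"),
]

# Group by the (department, state) pair, like groupBy("department", "state")
counts = Counter((dept, state) for _, dept, state in rows)

# Sort by count descending, like .orderBy(desc("count")) in PySpark
for (dept, state), n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(dept, state, n)
```

Each distinct (department, state) combination becomes one output row, just as in the df2 result above.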
4. PySpark SQL GROUP BY COUNT
Finally, since PySpark also supports running SQL queries, let’s run the above example in SQL using SELECT ... GROUP BY with COUNT(*).
Using createOrReplaceTempView(), register the DataFrame as a temporary view and use SparkSession.sql() to run the query. The view is scoped to the SparkSession: it is not accessible outside the SparkSession in which it was created, and it is removed automatically when the SparkSession terminates or when explicitly dropped using the spark.catalog.dropTempView() method.
# PySpark SQL Group By Count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")
# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
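The GROUP BY ... COUNT(*) semantics here are standard SQL, not Spark-specific. If you want to sanity-check the query logic without a Spark session, the very same statement runs against Python’s built-in sqlite3 module (a local sketch of the query semantics, not of how Spark executes it):

```python
import sqlite3

# In-memory SQLite database standing in for the EMP temp view
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (employee_name TEXT, department TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO EMP VALUES (?, ?, ?)",
    [("James", "Sales", "NY"), ("Michael", "Sales", "NY"),
     ("Robert", "Sales", "CA"), ("Maria", "Finance", "CA"),
     ("Raman", "Finance", "CA"), ("Scott", "Finance", "NY"),
     ("Jen", "Finance", "NY"), ("Jeff", "Marketing", "CA"),
     ("Kumar", "Marketing", "NY")],
)

# Same GROUP BY COUNT(*) query as in the PySpark SQL example
sql_str = ("select department, state, count(*) as count from EMP "
           "group by department, state")
for dept, state, n in conn.execute(sql_str):
    print(dept, state, n)
```

Spark distributes the grouping across partitions, but the result rows match what any SQL engine produces for this query.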
5. Complete Example of Groupby Count
Following is a complete example of the groupBy() and count().
# Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
.master("local[5]").getOrCreate()
# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()
# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()
# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()
# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")
# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
6. Frequently Asked Questions
What aggregation functions can be used with groupBy in PySpark? Common aggregation functions include avg, sum, min, max, count, first, last, and custom aggregation functions defined using pyspark.sql.functions.
How can I sort the results of a groupBy transformation in PySpark? We can use the orderBy method after the groupBy operation to sort the result based on one or more columns.
For example:
df.groupBy("col1").agg({"col2": "avg"}).orderBy("col1")
Can I use a User-Defined Function (UDF) with groupBy in PySpark? We can use a User-Defined Function (UDF) with groupBy by registering the UDF and applying it as part of the aggregation functions.
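In PySpark this is typically done with a grouped-aggregate pandas_udf passed to agg(). The per-group mechanics can be sketched in plain Python (this is not Spark’s UDF machinery, only the shape of applying a custom function to each group’s values; salary_range is a hypothetical aggregation for illustration):

```python
from collections import defaultdict

# (employee_name, department, salary) -- a subset of the sample data
rows = [("James", "Sales", 90000), ("Michael", "Sales", 86000),
        ("Robert", "Sales", 81000), ("Maria", "Finance", 90000),
        ("Jeff", "Marketing", 80000)]

# Collect each group's column values, like Spark gathering a group's rows
groups = defaultdict(list)
for _, dept, salary in rows:
    groups[dept].append(salary)

# A custom aggregation function -- the role a grouped-aggregate UDF plays
def salary_range(values):
    return max(values) - min(values)

for dept, salaries in groups.items():
    print(dept, salary_range(salaries))
```

In real PySpark the custom function receives each group’s values as a pandas Series and returns a scalar, and Spark handles the grouping and shuffling for you.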
7. Conclusion
Use groupBy().count() to get the number of rows within each group. In conclusion, PySpark’s GROUP BY COUNT operation offers a powerful mechanism for aggregating and analyzing data based on specified criteria. By grouping data and computing counts within each group, users can gain valuable insights into the distribution and frequency of occurrences within their datasets.
I hope you have learned how to get the number of records for each group, by a single column and by multiple columns, and how to get the GROUP BY COUNT by running an SQL query.
Related Articles
- PySpark Column alias after groupBy() Example
- PySpark Groupby Count Distinct
- PySpark Count of Non null, nan Values in DataFrame
- PySpark Groupby on Multiple Columns
- PySpark Count Distinct from DataFrame
- PySpark Groupby Agg (aggregate) – Explained
- PySpark – Find Count of null, None, NaN Values
- PySpark DataFrame groupBy and Sort by Descending Order