PySpark GroupBy Count – Explained


PySpark's groupBy().count() is used to get the number of records within each group. To perform the count, first apply the groupBy() method on the DataFrame, which groups the records based on one or more column values. Then call count() on the grouped data to get the number of records in each group.

Key Points:

  • Use the count() function within the GroupBy operation to calculate the number of records within each group.
  • Specify the column(s) to group by within the groupBy() operation. This could be a single column or multiple columns based on your requirements.
  • Optionally, sort the results by the count or other criteria to analyze the data more effectively (see the short example after this list).
  • Ensure that the DataFrame is properly prepared and cleaned before performing the GroupBy operation to obtain accurate counts per group.
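
For instance, here is a minimal sketch of sorting grouped counts in descending order; it assumes a DataFrame df with a department column, like the one created later in this article.


# Sort groups by their record count, highest first
from pyspark.sql.functions import desc

df.groupBy("department").count() \
    .orderBy(desc("count")) \
    .show()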

1. Quick Examples of Group Count

Following are quick examples of getting the number of records within each group.


# Quick Examples
# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()

# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()

To explain how to get the groupby count, create the following DataFrame.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
        .master("local[5]").getOrCreate()

# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()

Yields below output.

[Image: df.show() output displaying the employee DataFrame]

2. GroupBy Count in PySpark

To get the groupby count on PySpark DataFrame, first apply the groupBy() method on the DataFrame, specifying the column you want to group by, and then use the count() function within the GroupBy operation to calculate the number of records within each group.

The groupBy() method returns a pyspark.sql.GroupedData object, which provides the count() function to get the aggregations.


# PySpark groupBy() count
df2 = df.groupBy("department").count()
df2.show()

Yields below output.

[Image: output showing the record count for each department]

3. Groupby Count on Multiple Columns

To get the group by count on multiple columns, pass two or more columns to the groupBy() function and use count() to get the result. Here is an example.


# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

Here, groupBy("department","state").count() groups the DataFrame df by the “department” and “state” columns and gets the count of records in each group. The result is stored in the DataFrame df2, which contains the columns “department”, “state”, and “count”.

[Image: output showing the record count for each department and state combination]
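
The result of count() is a regular DataFrame, so further transformations can be chained onto it. As an illustrative sketch (not part of the original example), the following keeps only the groups that contain more than one record.


# Keep only the groups that have more than one record
from pyspark.sql.functions import col

df2 = df.groupBy("department","state").count() \
    .filter(col("count") > 1)
df2.show()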

4. PySpark SQL GROUP BY COUNT

Finally, since PySpark also lets you run SQL queries, let's run the above example in SQL using SELECT with GROUP BY and count(*).

Using createOrReplaceTempView(), register the DataFrame as a temporary view, and then use spark.sql() to run the query. The view is scoped to the SparkSession; it is not accessible outside the SparkSession in which it was created. It is removed automatically when the SparkSession terminates, or when explicitly dropped using the spark.catalog.dropTempView() method.


# PySpark SQL Group By Count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
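
When the temporary view is no longer needed, it can be dropped explicitly; a minimal sketch:


# Drop the temporary view created above
spark.catalog.dropTempView("EMP")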

5. Complete Example of Groupby Count

Following is a complete example of the groupBy() and count().


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
        .master("local[5]").getOrCreate()

# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()

# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()

# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"spark.sql(sql_str).show()

6. Frequently Asked Questions

What are some common aggregation functions used with groupBy in PySpark?

Common aggregation functions include avg, sum, min, max, count, first, last, and custom aggregation functions defined using pyspark.sql.functions.
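
As an illustrative sketch (reusing the example DataFrame from this article), several of these aggregations can be combined with the count in a single agg() call:


# Combine count with other aggregations in one agg() call
from pyspark.sql import functions as F

df.groupBy("department") \
    .agg(F.count("*").alias("count"),
         F.avg("salary").alias("avg_salary"),
         F.max("bonus").alias("max_bonus")) \
    .show()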

How to sort the result of a groupBy transformation in PySpark?

We can use the orderBy() method after the groupBy operation to sort the result based on one or more columns.
For example:
df.groupBy("col1").agg({"col2": "avg"}).orderBy("col1")

Can we use a user-defined function (UDF) with groupBy in PySpark?

A regular Python UDF operates row by row, so it cannot be used directly as an aggregation function. For custom per-group aggregation logic, use a pandas grouped-aggregate UDF (defined with pandas_udf) inside agg(), or apply a row-wise UDF to the columns before or after the groupBy.
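
A minimal sketch of a pandas grouped-aggregate UDF, assuming Spark 3.x with pandas and pyarrow installed (this is not from the original example):


# Custom per-group aggregation using a pandas grouped-aggregate UDF
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def mean_bonus(bonus: pd.Series) -> float:
    # Receives all bonus values of one group as a pandas Series
    return float(bonus.mean())

df.groupBy("department").agg(mean_bonus("bonus").alias("avg_bonus")).show()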

7. Conclusion

Use groupBy().count() to get the number of rows within each group. In conclusion, PySpark’s GROUP BY COUNT operation offers a powerful mechanism for aggregating and analyzing data based on specified criteria. By grouping data and computing counts within each group, users can gain valuable insights into the distribution and frequency of occurrences within their datasets.

I hope you have learned how to get the number of records for each group, by single and multiple columns, and how to get the group by count by running a SQL query.
