PySpark GroupBy Count – Explained

PySpark groupBy count is used to get the number of records for each group. To perform the count, first call groupBy() on the DataFrame, which groups the records based on one or more column values, and then call count() to get the number of records in each group.

In this article, I will explain how to use groupBy() and count() together with examples. The PySpark groupBy() function collects identical data into groups so that aggregate functions such as count() can be applied to the grouped data.

1. Quick Examples of Groupby Count

Following are quick examples of how to perform groupby count.


# Quick Examples
# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()

# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
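Note that count() counts all rows in each group. If you instead need the number of non-null values of a specific column, or the number of distinct values per group, you can pass count() and countDistinct() from pyspark.sql.functions to agg(). A minimal sketch, using the same df that is created below; the column choices are only for illustration.


# Per-group counts of a specific column (count skips nulls,
# countDistinct counts unique values)
from pyspark.sql import functions as F

df.groupBy("department").agg(
    F.count("bonus").alias("bonus_count"),
    F.countDistinct("state").alias("state_count")
).show()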

First, let’s create a DataFrame.


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
        .master("local[5]").getOrCreate()

# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()

Yields below output.


2. PySpark Groupby Count Example

By using DataFrame.groupBy().count() in PySpark you can get the number of rows for each group. The DataFrame.groupBy() function returns a pyspark.sql.GroupedData object, which provides a set of methods to perform aggregations on the grouped data.


# PySpark groupBy() count
df2 = df.groupBy("department").count()
df2.show()

Yields below output.

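By default, the result column produced by count() is named count. If you prefer a different column name or a sorted result, one alternative is to express the same aggregation through agg() on the GroupedData object. A minimal sketch, assuming the same df as above:


# Equivalent aggregation via agg(): rename the count column and sort the result
from pyspark.sql import functions as F

df2 = df.groupBy("department") \
    .agg(F.count("*").alias("employee_count")) \
    .orderBy(F.col("employee_count").desc())
df2.show()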

3. PySpark Groupby Count on Multiple Columns

A groupby count on multiple columns can be performed by passing two or more columns to groupBy() and calling count() on the result. The following example groups on the department and state columns and applies count() to the result.


# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

Yields below output.

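If you only need the groups whose count satisfies a condition (the SQL HAVING equivalent), you can filter the counted result. A minimal sketch building on the example above:


# Keep only department/state groups that have more than one employee
from pyspark.sql import functions as F

df2 = df.groupBy("department","state").count() \
    .filter(F.col("count") > 1)
df2.show()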

4. PySpark SQL GROUP BY COUNT

Finally, with PySpark you can also run SQL queries, so let's run the above example as a SQL SELECT ... GROUP BY with COUNT(*).

In order to do so, first create a temporary view using createOrReplaceTempView() and then run the query with SparkSession.sql(). The view remains available until you end your SparkSession.


# PySpark SQL Group By Count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"
spark.sql(sql_str).show()
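To restrict the result to groups that satisfy a condition on the count, you can add a HAVING clause to the same query. A minimal sketch, still using the EMP view created above:


# SQL group by count with HAVING
spark.sql("""
    select department, state, count(*) as count from EMP
    group by department, state
    having count(*) > 1
""").show()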

5. Complete Example of Groupby Count

Following is a complete example of groupBy() and count().


# Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com') \
        .master("local[5]").getOrCreate()

# Create DataFrame
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.show()

# groupBy on single column
df2 = df.groupBy("department").count()
df2.show()

# groupBy on multiple columns
df2 = df.groupBy("department","state").count()
df2.show()

# SQL group by count
# Create Temporary table in PySpark
df.createOrReplaceTempView("EMP")

# PySpark SQL
sql_str="select department, state, count(*) as count from EMP " + \
"group by department, state"spark.sql(sql_str).show()

6. Conclusion

PySpark DataFrame.groupBy().count() is used to get the number of rows for each group; you can group on a single column or on multiple columns. You can also get a count per group using PySpark SQL, but to use SQL you first need to create a temporary view.
