In PySpark, the count() method is an action operation used to count the number of elements in a distributed dataset, represented as an RDD (Resilient Distributed Dataset) or a DataFrame. Actions are operations that trigger computation on RDDs or DataFrames and return a result to the driver program or write data to an external storage system. Examples of actions include collect(), take(), reduce(), and count().
PySpark provides several count() functions; choose the one that best fits your use case. Note that calling count() on a large dataset can trigger a time-consuming computation, especially when the dataset is partitioned across many nodes (a caching sketch follows the list below).
- pyspark.sql.DataFrame.count() – Get the count of rows in a DataFrame.
- pyspark.sql.functions.count() – Get the count of non-null values in a column.
- pyspark.sql.GroupedData.count() – Get the count of grouped data.
- SQL Query Count – Use an ANSI SQL query to get the DataFrame count or counts by group.
By using these count() functions, you can get the row count, column count, count of values in a column, distinct count, count by group, etc.
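Because count() recomputes the DataFrame's lineage each time it runs, caching can help when you call it repeatedly. A minimal sketch, using the empDF DataFrame created later in this article:

# Cache the DataFrame so repeated actions don't recompute the lineage
empDF.cache()
print(empDF.count())   # first count materializes the cache
print(empDF.count())   # subsequent counts are served from the cache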
1. Quick Examples
Below are quick examples of the different count functions.
# Quick Examples of PySpark count()
# Get row count
rows = empDF.count()
# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Dimensions : {(rows,cols)}")
print(f"DataFrame Rows count : {rows}")
print(f"DataFrame Columns count : {cols}")
unique_count = empDF.distinct().count()
print(f"DataFrame Distinct count : {unique_count}")
# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()
# using agg
empDF.agg({'name':'count','gender':'count'}).show()
# groupby count
empDF.groupBy("dept_id").count().show()
# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()
Now, let's create the empDF DataFrame used in these examples.
# Import pyspark
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
(2,"Rose",20,"M",4000),
(3,"Williams",10,"M",1000),
(4,"Jones",10,"F",2000),
(5,"Brown",30,"",-1),
(6,"Brown",30,"",-1)
]
empColumns = ["emp_id","name","dept_id",
"gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()
Yields below output.
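With the data above, empDF.show() prints a table like the following (nulls may render as null or NULL depending on your Spark version):

+------+--------+-------+------+------+
|emp_id|    name|dept_id|gender|salary|
+------+--------+-------+------+------+
|     1|   Smith|     10|  null|  3000|
|     2|    Rose|     20|     M|  4000|
|     3|Williams|     10|     M|  1000|
|     4|   Jones|     10|     F|  2000|
|     5|   Brown|     30|      |    -1|
|     6|   Brown|     30|      |    -1|
+------+--------+-------+------+------+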
2. DataFrame.count()
The pyspark.sql.DataFrame.count() function is used to get the number of rows present in the DataFrame. count() is an action operation that triggers the transformations to execute. Since transformations are lazy in nature, they do not get executed until an action is called.
In the example below, empDF is a DataFrame object; here is a detailed explanation.
- DataFrame.count() – Returns the number of records in a DataFrame.
- DataFrame.columns – Returns all column names of a DataFrame as a list.
- len() – A built-in Python function that returns the number of elements in a list.
- len(DataFrame.columns) – Returns the number of columns in a DataFrame.
# Get row count
rows = empDF.count()
# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Dimensions : {(rows,cols)}")
print(f"DataFrame Rows count : {rows}")
print(f"DataFrame Columns count : {cols}")
Yields below output.
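With the sample data (6 rows, 5 columns), this prints:

DataFrame Dimensions : (6, 5)
DataFrame Rows count : 6
DataFrame Columns count : 5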
2.1. Unique count
The DataFrame.distinct() function returns the distinct rows of the DataFrame by eliminating all duplicates; calling count() on the result gives the number of unique records.
# Unique count
unique_count = empDF.distinct().count()
print(f"DataFrame Distinct count : {unique_count}")
3. functions.count()
pyspark.sql.functions.count() is a function provided by the PySpark SQL module (pyspark.sql.functions) that allows you to count the number of non-null values in a column of a DataFrame. It operates on DataFrame columns and returns the count of non-null values within the specified column.

functions.count() takes one parameter: the column on which you want to count the non-null values. You can specify the column either by its name (as a string) or by using a column reference.
# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()
In the above example,
- select() is used to select columns from the DataFrame.
- empDF.name refers to the name column of the empDF DataFrame.
- count(empDF.name) returns the number of non-null values in the name column.
Yields below output.
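With the sample data, the results are:

+-----------+
|count(name)|
+-----------+
|          6|
+-----------+

+-----------+-------------+
|count(name)|count(gender)|
+-----------+-------------+
|          6|            5|
+-----------+-------------+

count(gender) is 5 rather than 6 because one row holds a null gender; the empty strings in the gender column are not null, so they are still counted.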
4. GroupedData.count()
GroupedData.count() is a method provided by PySpark's DataFrame API that allows you to count the number of rows in each group after applying a groupBy() operation on a DataFrame. It returns a new DataFrame containing the counts of rows for each group.

Here's how GroupedData.count() works:
- Grouping: Before using count(), you typically apply a groupBy() operation on the DataFrame to group the rows based on one or more columns.
- Counting: Once the DataFrame is grouped, you can call count() on the resulting GroupedData object. This counts the number of rows in each group and returns the counts as a new DataFrame with two columns: the grouping column(s) and the count of rows for each group.
# Count on grouped data
empDF.groupBy("dept_id").count().show()
Yields below output.
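With the sample data, this returns one row per department (row order may vary):

+-------+-----+
|dept_id|count|
+-------+-----+
|     10|    3|
|     20|    1|
|     30|    2|
+-------+-----+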
5. Agg count
DataFrame.agg() is a method in PySpark's DataFrame API used for aggregating data. It allows you to compute aggregate functions on DataFrame columns, with or without grouping.

Here's how DataFrame.agg() works:
- Specifying aggregations: You provide one or more aggregation expressions to agg() as arguments. These expressions define the aggregate functions to be applied to the DataFrame columns. Aggregate functions can include built-in functions like count(), sum(), avg(), min(), max(), etc., as well as user-defined functions.
- Grouping (optional): If you want to perform aggregation on grouped data, first apply a groupBy() operation on the DataFrame to group the data based on one or more columns. The aggregation functions will then be applied to each group separately.
# Using agg count
empDF.agg({'gender':'count','name':'count'}).show()
Yields below output.
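With the sample data, this shows count(gender) as 5 and count(name) as 6. Note that the dict form allows only one aggregate per column and no custom column names; to add aliases, pass column expressions instead. A minimal sketch:

# Using agg with column expressions and aliases
from pyspark.sql.functions import count
empDF.agg(count("name").alias("name_count"),
          count("gender").alias("gender_count")).show()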
6. PySpark SQL Count
You can use the SQL query language to perform various operations on DataFrames, including counting rows, grouping data, filtering data, and more. To count the number of rows in a DataFrame using SQL syntax, execute a query with the COUNT function.

To run the SQL query, use the spark.sql() function; it also returns a DataFrame.
# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()
7. Complete Example of PySpark Count
Below is the complete example of PySpark count with all different functions.
# Import pyspark
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
(2,"Rose",20,"M",4000),
(3,"Williams",10,"M",1000),
(4,"Jones",10,"F",2000),
(5,"Brown",30,"",-1),
(6,"Brown",30,"",-1)
]
empColumns = ["emp_id","name","dept_id",
"gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()
# Get row count
rows = empDF.count()
# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Dimensions : {(rows,cols)}")
print(f"DataFrame Rows count : {rows}")
print(f"DataFrame Columns count : {cols}")
# Get distinct count
distinct_count = empDF.distinct().count()
print(f"DataFrame Distinct count : {distinct_count}")
# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()
# using agg
empDF.agg({'name':'count','gender':'count'}).show()
# groupby count
empDF.groupBy("dept_id").count().show()
# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()
8. Frequently Asked Questions on count()
What is the difference between count() and countDistinct()?
count() returns the total number of rows, while countDistinct() returns the number of distinct values in a column.
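For example, counting the distinct departments in the empDF DataFrame used throughout this article:

# Count distinct values of a single column
from pyspark.sql.functions import countDistinct
empDF.select(countDistinct("dept_id")).show()   # returns 3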
How can I get an approximate distinct count on large datasets?
PySpark provides approx_count_distinct() (also available under its deprecated name approxCountDistinct()) for approximate distinct counting using the HyperLogLog algorithm, which can be faster for large datasets.
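A quick sketch on the same column:

# Approximate distinct count (HyperLogLog-based)
from pyspark.sql.functions import approx_count_distinct
empDF.select(approx_count_distinct("dept_id")).show()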
How do I count the occurrences of each value in a column?
We can use groupBy() and count() to count the occurrences of each value in a column. For example:

counts = df.groupBy("column_name").count()
How do I count null or non-null values in a column?
Use the isNull() or isNotNull() functions combined with count() to count the number of null or non-null values in a column. For example:

null_count = df.filter(df['column_name'].isNull()).count()
non_null_count = df.filter(df['column_name'].isNotNull()).count()
9. Conclusion
In conclusion, counting in PySpark is a fundamental operation that allows users to determine the size of datasets, perform data validation, and gain insights into the distribution of data across different groups. Through various methods such as count() for RDDs and DataFrames, functions.count() for counting non-null values in columns, and GroupedData.count() for counting rows after grouping, PySpark provides versatile tools for efficiently computing counts at scale.
Related Articles
- PySpark Count Distinct from DataFrame
- PySpark sum() Columns Example
- PySpark count() – Different Methods Explained
- PySpark Groupby Count Distinct
- PySpark – Find Count of null, None, NaN Values
- PySpark isNull() & isNotNull()
- PySpark cache() Explained.
- PySpark Groupby on Multiple Columns
- PySpark Groupby Agg (aggregate) – Explained
- PySpark NOT isin() or IS NOT IN Operator
- PySpark isin() & SQL IN Operator