In PySpark, the count() method is an action operation used to count the number of elements in a distributed dataset, represented as an RDD (Resilient Distributed Dataset) or a DataFrame. Actions are operations that trigger computation on RDDs or DataFrames and return a result to the driver program or write data to an external storage system. Examples of actions include collect(), take(), reduce(), and count().
PySpark provides several count() functions; choose the one that best fits your use case. Note that calling count() on a large dataset can trigger a time-consuming computation, especially when the dataset is partitioned across many nodes (a caching sketch follows the list below).
- pyspark.sql.DataFrame.count() – Get the count of rows in a DataFrame.
- pyspark.sql.functions.count() – Get the count of non-null values in a column.
- pyspark.sql.GroupedData.count() – Get the count of grouped data.
- SQL Query Count – Use an ANSI SQL query to get the DataFrame count or counts by group.
By using these count() functions, you can get the row count, column count, count of values in a column, distinct count, count by group, etc.
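Because count() recomputes the DataFrame's lineage each time it runs, caching can help when you call it repeatedly. A minimal sketch, using the empDF DataFrame created later in this article:

# Cache the DataFrame so repeated actions don't recompute the lineage
empDF.cache()
print(empDF.count())   # first count materializes the cache
print(empDF.count())   # subsequent counts are served from the cache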
1. Quick Examples
Below are quick examples of the different count functions.
# Quick Examples of PySpark count()
# Get row count
rows = empDF.count()
# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Dimensions : {(rows,cols)}")
print(f"DataFrame Rows count : {rows}")
print(f"DataFrame Columns count : {cols}")
unique_count = empDF.distinct().count()
print(f"DataFrame Distinct count : {unique_count}")
# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()
# using agg
empDF.agg({'name':'count','gender':'count'}).show()
# groupby count
empDF.groupBy("dept_id").count().show()
# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()
Now, let's create the empDF DataFrame used in these examples.
# Import pyspark
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
(2,"Rose",20,"M",4000),
(3,"Williams",10,"M",1000),
(4,"Jones",10,"F",2000),
(5,"Brown",30,"",-1),
(6,"Brown",30,"",-1)
]
empColumns = ["emp_id","name","dept_id",
"gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()
Yields below output.
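With the data above, empDF.show() prints a table like the following (nulls may render as null or NULL depending on your Spark version):

+------+--------+-------+------+------+
|emp_id|    name|dept_id|gender|salary|
+------+--------+-------+------+------+
|     1|   Smith|     10|  null|  3000|
|     2|    Rose|     20|     M|  4000|
|     3|Williams|     10|     M|  1000|
|     4|   Jones|     10|     F|  2000|
|     5|   Brown|     30|      |    -1|
|     6|   Brown|     30|      |    -1|
+------+--------+-------+------+------+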
2. DataFrame.count()
The pyspark.sql.DataFrame.count() function is used to get the number of rows present in the DataFrame. count() is an action operation that triggers the transformations to execute. Since transformations are lazy in nature, they do not get executed until an action is called.
In the example below, empDF is a DataFrame object; here is a detailed explanation.
- DataFrame.count() – Returns the number of records in a DataFrame.
- DataFrame.columns – Returns all column names of a DataFrame as a list.
- len() – A built-in Python function that returns the number of elements in a list.
- len(DataFrame.columns) – Returns the number of columns in a DataFrame.
# Get row count
rows = empDF.count()
# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Dimensions : {(rows,cols)}")
print(f"DataFrame Rows count : {rows}")
print(f"DataFrame Columns count : {cols}")
Yields below output.
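With the sample data (6 rows, 5 columns), this prints:

DataFrame Dimensions : (6, 5)
DataFrame Rows count : 6
DataFrame Columns count : 5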
2.1. Unique count
The DataFrame.distinct() function returns the distinct rows of the DataFrame by eliminating all duplicates; calling count() on the result gives the number of unique records.
# Unique count
unique_count = empDF.distinct().count()
print(f"DataFrame Distinct count : {unique_count}")
3. functions.count()
pyspark.sql.functions.count() is a function provided by the PySpark SQL module (pyspark.sql.functions) that allows you to count the number of non-null values in a column of a DataFrame. It operates on DataFrame columns and returns the count of non-null values within the specified column.

functions.count() takes one parameter: the column on which you want to count the non-null values. You can specify the column either by its name (as a string) or by using a column reference.
# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()
In the above example,
- select() is used to select columns from the DataFrame.
- empDF.name refers to the name column of the empDF DataFrame.
- count(empDF.name) returns the number of non-null values in the name column.
Yields below output.
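With the sample data, the results are:

+-----------+
|count(name)|
+-----------+
|          6|
+-----------+

+-----------+-------------+
|count(name)|count(gender)|
+-----------+-------------+
|          6|            5|
+-----------+-------------+

count(gender) is 5 rather than 6 because one row holds a null gender; the empty strings in the gender column are not null, so they are still counted.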
4. GroupedData.count()
GroupedData.count() is a method provided by PySpark's DataFrame API that allows you to count the number of rows in each group after applying a groupBy() operation on a DataFrame. It returns a new DataFrame containing the counts of rows for each group.

Here's how GroupedData.count() works:
- Grouping: Before using count(), you typically apply a groupBy() operation on the DataFrame to group the rows based on one or more columns.
- Counting: Once the DataFrame is grouped, you can call count() on the resulting GroupedData object. This counts the number of rows in each group and returns the counts as a new DataFrame with two columns: the grouping column(s) and the count of rows for each group.
# Count on grouped data
empDF.groupBy("dept_id").count().show()
Yields below output.
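With the sample data, this returns one row per department (row order may vary):

+-------+-----+
|dept_id|count|
+-------+-----+
|     10|    3|
|     20|    1|
|     30|    2|
+-------+-----+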
5. Agg count
DataFrame.agg() is a method in PySpark's DataFrame API used for aggregating data. It allows you to compute aggregate functions on DataFrame columns, with or without grouping.

Here's how DataFrame.agg() works:
- Specifying aggregations: You provide one or more aggregation expressions to agg() as arguments. These expressions define the aggregate functions to be applied to the DataFrame columns. Aggregate functions can include built-in functions like count(), sum(), avg(), min(), max(), etc., as well as user-defined functions.
- Grouping (optional): If you want to perform aggregation on grouped data, first apply a groupBy() operation on the DataFrame to group the data based on one or more columns. The aggregation functions will then be applied to each group separately.
# Using agg count
empDF.agg({'gender':'count','name':'count'}).show()
Yields below output.
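With the sample data, this shows count(gender) as 5 and count(name) as 6. Note that the dict form allows only one aggregate per column and no custom column names; to add aliases, pass column expressions instead. A minimal sketch:

# Using agg with column expressions and aliases
from pyspark.sql.functions import count
empDF.agg(count("name").alias("name_count"),
          count("gender").alias("gender_count")).show()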
6. PySpark SQL Count
You can use the SQL query language to perform various operations on DataFrames, including counting rows, grouping data, filtering data, and more. To count the number of rows in a DataFrame using SQL syntax, execute a query with the COUNT function.

To run the SQL query, use the spark.sql() function; it also returns a DataFrame.
# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()
7. Complete Example of PySpark Count
Below is the complete example of PySpark count with all different functions.
# Import pyspark
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
(2,"Rose",20,"M",4000),
(3,"Williams",10,"M",1000),
(4,"Jones",10,"F",2000),
(5,"Brown",30,"",-1),
(6,"Brown",30,"",-1)
]
empColumns = ["emp_id","name","dept_id",
"gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()
# Get row count
rows = empDF.count()
# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Dimensions : {(rows,cols)}")
print(f"DataFrame Rows count : {rows}")
print(f"DataFrame Columns count : {cols}")
# Get distinct count
distinct_count = empDF.distinct().count()
print(f"DataFrame Distinct count : {distinct_count}")
# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()
# using agg
empDF.agg({'name':'count','gender':'count'}).show()
# groupby count
empDF.groupBy("dept_id").count().show()
# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()
8. Frequently Asked Questions on count()
What is the difference between count() and countDistinct()?
count() returns the total number of rows, while countDistinct() returns the number of distinct values in a column.
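For example, counting the distinct departments in the empDF DataFrame used throughout this article:

# Count distinct values of a single column
from pyspark.sql.functions import countDistinct
empDF.select(countDistinct("dept_id")).show()   # returns 3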
How can I get an approximate distinct count on large datasets?
PySpark provides approx_count_distinct() (also available under its deprecated name approxCountDistinct()) for approximate distinct counting using the HyperLogLog algorithm, which can be faster for large datasets.
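A quick sketch on the same column:

# Approximate distinct count (HyperLogLog-based)
from pyspark.sql.functions import approx_count_distinct
empDF.select(approx_count_distinct("dept_id")).show()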
How do I count the occurrences of each value in a column?
We can use groupBy() and count() to count the occurrences of each value in a column. For example:

counts = df.groupBy("column_name").count()
How do I count null or non-null values in a column?
Use the isNull() or isNotNull() functions combined with count() to count the number of null or non-null values in a column. For example:

null_count = df.filter(df['column_name'].isNull()).count()
non_null_count = df.filter(df['column_name'].isNotNull()).count()
9. Conclusion
In conclusion, counting in PySpark is a fundamental operation that allows users to determine the size of datasets, perform data validation, and gain insights into the distribution of data across different groups. Through various methods such as count() for RDDs and DataFrames, functions.count() for counting non-null values in columns, and GroupedData.count() for counting rows after grouping, PySpark provides versatile tools for efficiently computing counts at scale.
Related Articles
- PySpark Count Distinct from DataFrame
- PySpark sum() Columns Example
- PySpark count() – Different Methods Explained
- PySpark Groupby Count Distinct
- PySpark – Find Count of null, None, NaN Values
- PySpark isNull() & isNotNull()
- PySpark cache() Explained.
- PySpark Groupby on Multiple Columns
- PySpark Groupby Agg (aggregate) – Explained
- PySpark NOT isin() or IS NOT IN Operator
- PySpark isin() & SQL IN Operator