How to get the number of rows and columns from a PySpark DataFrame? You can use the PySpark count() function to get the number of rows, and len(df.columns) to get the number of columns from the DataFrame.
Some key points on getting the number of rows and columns in PySpark:
- Use df.count() to return the total number of rows in the PySpark DataFrame. Calling this function triggers all transformations on the DataFrame to execute.
- Use df.distinct().count() to find the number of unique rows in the PySpark DataFrame.
- Use len(df.columns) to get the number of columns in the DataFrame.
- You can also get the column count using len(df.dtypes), which retrieves all column names and data types as a list of tuples and applies len() to that list.
- To count null values in columns, you can use an expression like count(when(isnan(column) | col(column).isNull(), column)) for each column to find the number of null, None, or NaN values.
- To count the values in a column, use pyspark.sql.functions.count(column), which counts the non-null values in a specific column and ignores null/None values.
1. Quick Examples of Getting Number of Rows & Columns
Following are quick examples of getting the number of rows & columns.
# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")
# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")
# Using functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()
# Using agg
empDF.agg({'name':'count','gender':'count'}).show()
# Using groupBy().count()
empDF.groupBy("dept_id").count().show()
# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()
Let’s create a PySpark DataFrame from a list.
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()
# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
(2,"Rose",20,"M",4000),
(3,"Williams",10,"M",1000),
(4,"Jones",10,"F",2000),
(5,"Brown",30,"",-1),
(6,"Brown",30,"",-1)
]
empColumns = ["emp_id","name","dept_id",
"gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()
Yields below output.
2. PySpark Get Row Count Using count() method
To get the number of rows from the PySpark DataFrame, use the count() function. This function returns the total number of rows in the DataFrame. Calling this function triggers all transformations on the DataFrame to execute.
# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")
3. Get Distinct Number of Rows
In PySpark, you can get the distinct number of rows from a DataFrame using a combination of the distinct() and count() methods provided by the PySpark DataFrame API. Apply the count() method after the distinct() transformation on the DataFrame; this combination ensures that you count only the unique rows.
# Get distinct row count using distinct()
rows = empDF.distinct().count()
print(f"DataFrame distinct row count : {rows}")
You can also eliminate duplicate rows from one or more columns in a PySpark DataFrame by using the dropDuplicates() method from pyspark.sql.DataFrame, as shown in the sketch below.
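A minimal sketch of counting rows after deduplicating on a subset of columns; here the name column is used as the deduplication key purely for illustration:
# dropDuplicates() with no arguments considers all columns;
# passing a subset (here "name") deduplicates on those columns only
unique_names = empDF.dropDuplicates(["name"]).count()
print(f"Row count after dropDuplicates on name : {unique_names}")
For the sample empDF above, this drops one of the two Brown rows even though their emp_id values differ.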
4. PySpark Get Column Count Using len() method
To get the number of columns present in the PySpark DataFrame, use DataFrame.columns with the len() function. DataFrame.columns returns all column names of a DataFrame as a list; applying the len() function to that list gives you the count of columns present in the PySpark DataFrame.
# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")
5. PySpark Get Column Count Using dtypes
You can utilize df.dtypes to obtain a list of tuples containing the column names and their corresponding data types in a PySpark DataFrame. You can iterate through this list to extract both the column name and data type from each tuple (see the sketch after the example below). To determine the number of columns in a DataFrame, apply the len() function to df.dtypes, as demonstrated below.
# Get column count using len(df.dtypes)
cols = len(empDF.dtypes)
print(f"DataFrame Column count: {cols}")
6. Count NULL Values
To get the count of null, None, and NaN values, apply count() with a when() condition to each column of the DataFrame.
# Find count of null, None, NaN values for all DataFrame columns
from pyspark.sql.functions import col, isnan, when, count
empDF.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
    for c in empDF.columns]).show()
Yields below output.
# Output
+------+----+-------+------+------+
|emp_id|name|dept_id|gender|salary|
+------+----+-------+------+------+
| 0| 0| 0| 1| 0|
+------+----+-------+------+------+
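To check just one column instead of all of them, the same expression can be applied to that column alone; a minimal sketch using the gender column:
# Count null/None/NaN values in the gender column only
from pyspark.sql.functions import col, isnan, when, count
empDF.select(count(when(isnan("gender") | col("gender").isNull(),
    "gender")).alias("gender_nulls")).show()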
7. Count Values in Column
pyspark.sql.functions.count() is used to get the number of values in a column. Using this, you can count a single column or multiple columns of a DataFrame. During counting, it disregards any null/None values present in the column.
# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()
Yields below output.
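The default output column names, such as count(name), can be renamed with alias() for more readable output; a small sketch:
# Rename the count columns using alias()
from pyspark.sql.functions import count
empDF.select(count(empDF.name).alias("name_count"),
             count(empDF.gender).alias("gender_count")).show()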
8. Frequently Asked Questions on Row and Column Counts of a DataFrame
The size of a PySpark DataFrame can be described in terms of its number of rows and columns (the DataFrame dimensions). To find the row count use df.count(), and for the column count use len(df.columns).
Row counts indicate the number of records in a dataset, providing an overview of its size and helping you evaluate data quality.
Column counts show the number of attributes or features, which affects the dimensionality and complexity of the data. The column count is essential for assessing dimensionality, feature engineering, and understanding the data structure.
Handle missing or null values before counting rows or columns, as they can affect the accuracy of the results. You can use methods like na.drop() to remove rows with missing values, as shown in the sketch below.
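A minimal sketch of counting rows after dropping those that contain nulls:
# Drop rows containing any null/None values, then count the rest
clean_rows = empDF.na.drop().count()
print(f"Row count after na.drop() : {clean_rows}")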
9. Conclusion
In this article, you have learned how to get the total number of rows and columns in a PySpark DataFrame by using the count() and len() functions respectively, and also how to get the row count and the count of each group of rows using SQL queries.