PySpark Get Number of Rows and Columns

How to get the number of rows and columns from a PySpark DataFrame? You can use the PySpark count() function to get the number of rows and len(df.columns) to get the number of columns from the DataFrame.

Key points on getting the number of rows and columns in PySpark:

  • Use df.count() to return the total number of rows in the PySpark DataFrame. This function triggers all transformations on the DataFrame to execute.
  • Use df.distinct().count() to find the number of unique rows in the PySpark DataFrame.
  • Use len(df.columns) to get the number of columns in the DataFrame.
  • You can also get the column count using len(df.dtypes) by retrieving all column names and data types as a list of tuples and applying len() on the list.
  • To count null values in columns, you can use functions like count(when(isnan(column) | col(column).isNull(), column)) for each column to find the number of null, None, or NaN values.
  • For counting values in a column, use pyspark.sql.functions.count(column) to count non-null values in a specific column; it ignores null/None values.

1. Quick Examples of Getting Number of Rows & Columns

Following are quick examples of getting the number of rows & columns.


# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")

# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")

# Using functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()

# Using agg
empDF.agg({'name':'count','gender':'count'}).show()

# Using groupBy().count()
empDF.groupBy("dept_id").count().show()

# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()

Let’s create a PySpark DataFrame from a list.


# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()
         
# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
    (2,"Rose",20,"M",4000),
    (3,"Williams",10,"M",1000),
    (4,"Jones",10,"F",2000),
    (5,"Brown",30,"",-1),
    (6,"Brown",30,"",-1)
  ]
  
empColumns = ["emp_id","name","dept_id",
  "gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()

Yields below output.

# Output
+------+--------+-------+------+------+
|emp_id|    name|dept_id|gender|salary|
+------+--------+-------+------+------+
|     1|   Smith|     10|  null|  3000|
|     2|    Rose|     20|     M|  4000|
|     3|Williams|     10|     M|  1000|
|     4|   Jones|     10|     F|  2000|
|     5|   Brown|     30|      |    -1|
|     6|   Brown|     30|      |    -1|
+------+--------+-------+------+------+

2. PySpark Get Row Count Using count() method

To get the number of rows from the PySpark DataFrame, use the count() function. This function returns the total number of rows in the DataFrame. Since count() is an action, calling it triggers all transformations on the DataFrame to execute.


# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")

3. PySpark Get Distinct Number of Rows and Columns

pyspark.sql.DataFrame.distinct() is used to get the unique rows across all columns of the DataFrame. This function doesn’t take any arguments; by default, it applies distinct on all columns.


# Get distinct row count using distinct()
rows = empDF.distinct().count()
print(f"DataFrame distinct row count : {rows}")

To drop duplicate rows based on one or more columns, use the pyspark.sql.DataFrame.dropDuplicates() method. It returns a new DataFrame with duplicate rows removed; when column names are passed as arguments, only those columns are considered, as sketched below.
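For example, here is a minimal sketch using the empDF created above (the choice of columns is illustrative):


# Count rows after dropping duplicates on selected columns
rows = empDF.dropDuplicates(["name", "dept_id"]).count()
print(f"Distinct rows by name & dept_id : {rows}")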

4. PySpark Get Column Count Using len() method

To get the number of columns present in the PySpark DataFrame, pass DataFrame.columns to the len() function. DataFrame.columns returns all column names of the DataFrame as a list, and len() returns the length of that list, which is the number of columns in the DataFrame.


# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")

5. PySpark Get Column Count Using dtypes

By using df.dtypes you can retrieve all column names and data types of a PySpark DataFrame as a list of tuples; you can iterate the list to get each column name and data type from its tuple. To get the number of columns of a DataFrame, apply len() on df.dtypes as shown below.


# Get Column count Using len(df.dtypes) method
col = len(empDF.dtypes)
print(f"DataFrame Column count: {col}")
 

6. Count NULL Values

To get the count of null, None, and NaN values for each column of the DataFrame, combine count() with when(), isnan(), and isNull().


# Find count of Null, None, and NaN values for all DataFrame columns
from pyspark.sql.functions import col, isnan, when, count
empDF.select([count(when(isnan(c) | col(c).isNull(),
    c)).alias(c) for c in empDF.columns]).show()

Yields below output.


# Output
+------+----+-------+------+------+
|emp_id|name|dept_id|gender|salary|
+------+----+-------+------+------+
|     0|   0|      0|     1|     0|
+------+----+-------+------+------+
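
To count nulls in just one column, the same pattern works on a single selected column; a minimal sketch (the gender column is chosen for illustration):


# Count null values in a single column
from pyspark.sql.functions import col, count, when
empDF.select(count(when(col("gender").isNull(), "gender"))
    .alias("gender_nulls")).show()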

7. Count Values in Column

pyspark.sql.functions.count() is used to get the number of values in a column. You can use it to count a single column or multiple columns of a DataFrame; while counting, it ignores null/None values in the column. In the below example,

  • DataFrame.select() is used to get the DataFrame with the selected columns.
  • empDF.name refers to the name column of the DataFrame.
  • count(empDF.name) counts the number of non-null values in the specified column.

# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()

Yields below output.

# Output
+-----------+
|count(name)|
+-----------+
|          6|
+-----------+

+-----------+-------------+
|count(name)|count(gender)|
+-----------+-------------+
|          6|            5|
+-----------+-------------+

8. Frequently Asked Questions on row counts and column counts of DataFrame

How to find the size of a PySpark Dataframe?

PySpark DataFrame size can be determined in terms of the number of rows and columns (the DataFrame dimensions). To find the row count use df.count(), and for the column count use len(df.columns).
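If you want a pandas-like shape, you can combine the two; a quick sketch using the empDF from above:


# DataFrame "shape" similar to pandas: (rows, columns)
print((empDF.count(), len(empDF.columns)))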

What is the use of row counts and column counts in data analysis?

Row counts indicate the number of records in a dataset, providing an overview of its size and helping you evaluate data quality.
Column counts show the number of attributes or features, which affects the dimensionality and complexity of the data. Column count is essential for assessing dimensionality, planning feature engineering, and understanding the data structure.

What should I do if my DataFrame has missing or null values when calculating row counts or column counts?

Handle missing or null values before counting rows or columns, as they can affect the accuracy of the results. You can use methods like na.drop() to remove rows with missing values, as sketched below.
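
A minimal sketch using the empDF from above (na.drop() with no arguments removes rows containing any null value):


# Drop rows containing any null values, then count
clean_rows = empDF.na.drop().count()
print(f"Row count after na.drop() : {clean_rows}")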

9. Conclusion

In this article, you have learned how to get the total number of rows and columns of a PySpark DataFrame by using the count() and len() functions respectively, and also how to get the count of rows and the count of each group of rows using a SQL query.


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium.