PySpark Get Number of Rows and Columns

How to get the number of rows and columns from a PySpark DataFrame? You can use the PySpark count() function to get the number of rows and len(df.columns) to get the number of columns of the DataFrame.


Key points on getting the number of rows and columns in PySpark:

  • Use df.count() to return the total number of rows in the PySpark DataFrame. This function triggers all transformations on the DataFrame to execute.
  • Use df.distinct().count() to find the number of unique rows in the PySpark DataFrame.
  • Use len(df.columns) to get the number of columns in the DataFrame.
  • You can also get the column count using len(df.dtypes) by retrieving all column names and data types as a list of tuples and applying len() on the list.
  • To count null values in columns, you can use functions like count(when(isnan(column) | col(column).isNull(), column)) for each column to find the number of null, None, or NaN values.
  • For counting values in a column, use pyspark.sql.functions.count(column) to count non-null values in a specific column. It ignores null/None values.

1. Quick Examples of Getting Number of Rows & Columns

Following are quick examples of getting the number of rows & columns.


# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")

# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")

# Using functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()

# Using agg
empDF.agg({'name':'count','gender':'count'}).show()

# Using groupBy().count()
empDF.groupBy("dept_id").count().show()

# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()

Let’s create a PySpark DataFrame from a list.


# Import and create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()
         
# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
    (2,"Rose",20,"M",4000),
    (3,"Williams",10,"M",1000),
    (4,"Jones",10,"F",2000),
    (5,"Brown",30,"",-1),
    (6,"Brown",30,"",-1)
  ]
  
empColumns = ["emp_id","name","dept_id",
  "gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()

Yields below output.


# Output
+------+--------+-------+------+------+
|emp_id|    name|dept_id|gender|salary|
+------+--------+-------+------+------+
|     1|   Smith|     10|  null|  3000|
|     2|    Rose|     20|     M|  4000|
|     3|Williams|     10|     M|  1000|
|     4|   Jones|     10|     F|  2000|
|     5|   Brown|     30|      |    -1|
|     6|   Brown|     30|      |    -1|
+------+--------+-------+------+------+

2. PySpark Get Row Count Using count() method

To get the number of rows from a PySpark DataFrame, use the count() function. This function returns the total number of rows in the DataFrame. Since count() is an action, calling it triggers execution of all transformations on the DataFrame.


# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")

3. Get Distinct Number of Rows

In PySpark, you can get the distinct row count of a DataFrame using a combination of the distinct() and count() methods provided by the DataFrame API.

To get the distinct number of rows, you can use the count method after applying the distinct transformation on the DataFrame. This combination ensures that you count only the unique rows.


# Get distinct row count using distinct()
rows = empDF.distinct().count()
print(f"DataFrame distinct row count : {rows}")

You can also eliminate duplicate rows from one or more columns in a PySpark DataFrame by using the dropDuplicates() method from pyspark.sql.DataFrame, as shown below.
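
For example, here is a minimal sketch that counts rows after dropping duplicates on a subset of columns (the subset ["name", "dept_id"] is just an illustrative choice):


# Drop duplicates on selected columns, then count rows
unique_rows = empDF.dropDuplicates(["name", "dept_id"]).count()
print(f"Row count after dropDuplicates : {unique_rows}")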

4. PySpark Get Column Count Using len() method

To get the number of columns present in a PySpark DataFrame, use DataFrame.columns together with the len() function. DataFrame.columns returns all column names of the DataFrame as a list; applying len() to that list gives the count of columns.


# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")

5. PySpark Get Column Count Using dtypes

You can utilize df.dtypes to obtain a list of tuples containing column names and their corresponding data types in a PySpark DataFrame. You can iterate through this list to extract both the column name and data type from each tuple. To determine the number of columns in a DataFrame, you can apply the len() function to df.dtypes, as demonstrated below.


# Get Column count Using len(df.dtypes) method
cols = len(empDF.dtypes)
print(f"DataFrame Column count: {cols}")
 

6. Count NULL Values

To get the count of null values in each DataFrame column, combine count(), when(), isnan(), and isNull() as shown below.


# Find Count of Null, None, NaN of All DataFrame Columns
from pyspark.sql.functions import col,isnan, when, count
empDF.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
    for c in empDF.columns]).show()

Yields below output.


# Output
+------+----+-------+------+------+
|emp_id|name|dept_id|gender|salary|
+------+----+-------+------+------+
|     0|   0|      0|     1|     0|
+------+----+-------+------+------+
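
To get the null count for a single specified column, a simple filter works as well; for example, on the gender column:


# Count null values in a single column (gender)
null_count = empDF.filter(empDF.gender.isNull()).count()
print(f"Null values in gender column : {null_count}")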

7. Count Values in Column

pyspark.sql.functions.count() is used to get the number of values in a column. With it, you can count a single column or multiple columns of a DataFrame in one select().

During counting, it disregards any null or None values present in the column.


# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()

Yields below output.


# Output
+-----------+
|count(name)|
+-----------+
|          6|
+-----------+

+-----------+-------------+
|count(name)|count(gender)|
+-----------+-------------+
|          6|            5|
+-----------+-------------+

8. Frequently Asked Questions on Row and Column Counts of a DataFrame

How to find the size of a PySpark Dataframe?

The size of a PySpark DataFrame can be described in terms of the number of rows and columns (the DataFrame dimensions). To find the row count use df.count(), and for the column count use len(df.columns).
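
Unlike pandas, a PySpark DataFrame has no built-in shape attribute, but you can sketch a small helper (the name df_shape is hypothetical):


# Hypothetical helper that mimics pandas' shape for a PySpark DataFrame
def df_shape(df):
    return (df.count(), len(df.columns))

print(df_shape(empDF))  # (6, 5)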

What is the use of row counts and column counts in data analysis?

Row counts indicate the number of records in a dataset, providing an overview of its size; they also help you evaluate data quality.
Column counts show the number of attributes or features, which affects the dimensionality and complexity of the data. The column count is essential for assessing dimensionality, planning feature engineering, and understanding the data structure.

What should I do if my DataFrame has missing or null values when calculating row counts or column counts?

Handle missing or null values before counting rows or columns, as they can affect the accuracy of the results. You can use methods like na.drop() to remove rows with missing values, as shown below.
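
For example, a quick sketch comparing row counts before and after dropping rows that contain any null value:


# Compare row counts before and after dropping rows with nulls
print(f"Rows before na.drop() : {empDF.count()}")
print(f"Rows after na.drop()  : {empDF.na.drop().count()}")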

9. Conclusion

In this article, you learned how to get the total number of rows and the total number of columns of a PySpark DataFrame by using the count() and len() functions respectively, and also how to get the row count for each group of rows using a SQL query.
