PySpark Get Number of Rows and Columns

In this article, I will explain different ways to get the number of rows (row count) and the number of columns (column count) of a PySpark/Spark DataFrame, using the count() function and the DataFrame.columns attribute.

1. Quick Examples of Getting Number of Rows & Columns

Following are quick examples of getting the number of rows & columns.


# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")

# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")


# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()

# using agg
empDF.agg({'name':'count','gender':'count'}).show()

# groupby count
empDF.groupBy("dept_id").count().show()

# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()

Let’s create a PySpark DataFrame from a list to run the above examples.


# Import SparkSession and create a session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()
         
# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
    (2,"Rose",20,"M",4000),
    (3,"Williams",10,"M",1000),
    (4,"Jones",10,"F",2000),
    (5,"Brown",30,"",-1),
    (6,"Brown",30,"",-1)
  ]
  
empColumns = ["emp_id","name","dept_id",
  "gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()

Yields below output.


+------+--------+-------+------+------+
|emp_id|    name|dept_id|gender|salary|
+------+--------+-------+------+------+
|     1|   Smith|     10|  null|  3000|
|     2|    Rose|     20|     M|  4000|
|     3|Williams|     10|     M|  1000|
|     4|   Jones|     10|     F|  2000|
|     5|   Brown|     30|      |    -1|
|     6|   Brown|     30|      |    -1|
+------+--------+-------+------+------+

2. PySpark Get Row Count

To get the number of rows from the PySpark DataFrame, use the count() function. This function returns the total number of rows in the DataFrame. Since count() is an action, calling it triggers the execution of all transformations on the DataFrame.


# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")

3. PySpark Get Column Count

To get the number of columns present in the PySpark DataFrame, use DataFrame.columns with the len() function. DataFrame.columns returns all column names of the DataFrame as a list, and len() returns the length of that list, which is the number of columns in the DataFrame.


# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")

4. Count NULL Values

To get the count of null (None) and NaN values in every column of the DataFrame, combine count() with when(), isNull(), and isnan().


# Find Count of Null, None, NaN of All DataFrame Columns
from pyspark.sql.functions import col, isnan, when, count
empDF.select([count(when(isnan(col(c)) | col(c).isNull(), c)).alias(c)
    for c in empDF.columns]).show()

Yields below output.


+------+----+-------+------+------+
|emp_id|name|dept_id|gender|salary|
+------+----+-------+------+------+
|     0|   0|      0|     1|     0|
+------+----+-------+------+------+

5. Count Values in Column

pyspark.sql.functions.count() is used to get the number of values in a column. With it you can count a single column or several columns in one select. While counting, it ignores null/None values in the column. In the below example,

  • DataFrame.select() returns a new DataFrame with the selected columns.
  • empDF.name refers to the name column of the DataFrame.
  • count(empDF.name) counts the number of values in the specified column.

# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()

Yields below output.


+-----------+
|count(name)|
+-----------+
|          6|
+-----------+

+-----------+-------------+
|count(name)|count(gender)|
+-----------+-------------+
|          6|            5|
+-----------+-------------+

6. Conclusion

In this article, you have learned how to get the total number of rows and the total number of columns in a PySpark DataFrame using the count() and len() functions respectively, and also how to get the count of rows and the count of each group of rows using pyspark.sql.functions.count(), groupBy(), and SQL queries.


Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love to design, optimize, and manage Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge of Apache Spark, Hive, PySpark, R, etc.
