PySpark Get Number of Rows and Columns

  • Post category:PySpark
  • Post last modified:July 28, 2022

In this article, I will explain different ways to get the number of rows in a PySpark/Spark DataFrame (row count) and different ways to get the number of columns in the DataFrame (column count), using the PySpark count() function and len().

1. Quick Examples of Getting Number of Rows & Columns

Following are quick examples of getting the number of rows & columns.


# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")

# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")


# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()

# using agg
empDF.agg({'name':'count','gender':'count'}).show()

# groupby count
empDF.groupBy("dept_id").count().show()

# PySpark SQL Count
empDF.createOrReplaceTempView("EMP")
spark.sql("SELECT COUNT(*) FROM EMP").show()
spark.sql("SELECT COUNT(distinct dept_id) FROM EMP").show()
spark.sql("SELECT DEPT_ID,COUNT(*) FROM EMP GROUP BY DEPT_ID").show()

First, let's create a PySpark DataFrame from a list.


# Create SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()

# EMP DataFrame
empData = [(1,"Smith",10,None,3000),
    (2,"Rose",20,"M",4000),
    (3,"Williams",10,"M",1000),
    (4,"Jones",10,"F",2000),
    (5,"Brown",30,"",-1),
    (6,"Brown",30,"",-1)
  ]
  
empColumns = ["emp_id","name","dept_id",
  "gender","salary"]
empDF = spark.createDataFrame(empData,empColumns)
empDF.show()

Yields below output.

+------+--------+-------+------+------+
|emp_id|    name|dept_id|gender|salary|
+------+--------+-------+------+------+
|     1|   Smith|     10|  null|  3000|
|     2|    Rose|     20|     M|  4000|
|     3|Williams|     10|     M|  1000|
|     4|   Jones|     10|     F|  2000|
|     5|   Brown|     30|      |    -1|
|     6|   Brown|     30|      |    -1|
+------+--------+-------+------+------+

2. PySpark Get Row Count

To get the number of rows in a PySpark DataFrame, use the count() function. count() is an action: calling it triggers execution of all pending transformations on the DataFrame and returns the total number of rows.


# Get row count
rows = empDF.count()
print(f"DataFrame Rows count : {rows}")

3. PySpark Get Column Count

To get the number of columns in a PySpark DataFrame, pass DataFrame.columns to the len() function. DataFrame.columns returns all column names of the DataFrame as a Python list, so len() on that list gives the column count. Because this only reads the schema, it does not trigger a Spark job.


# Get columns count
cols = len(empDF.columns)
print(f"DataFrame Columns count : {cols}")

4. Count NULL Values

To get the count of null (None) and NaN values in every column, combine count() with when(), isnan(), and isNull():


# Find count of null, None, and NaN values for all DataFrame columns
from pyspark.sql.functions import col, isnan, when, count
empDF.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
    for c in empDF.columns]).show()

Yields below output.


+------+----+-------+------+------+
|emp_id|name|dept_id|gender|salary|
+------+----+-------+------+------+
|     0|   0|      0|     1|     0|
+------+----+-------+------+------+

5. Count Values in Column

pyspark.sql.functions.count() returns the number of values in a column. With it you can count a single column or several columns of a DataFrame in one select(). While counting, it ignores null/None values in the column. In the example below,

  • DataFrame.select() is used to get the DataFrame with the selected columns.
  • empDF.name refers to the name column of the DataFrame.
  • count(empDF.name) counts the number of non-null values in the name column.

# functions.count()
from pyspark.sql.functions import count
empDF.select(count(empDF.name)).show()
empDF.select(count(empDF.name), count(empDF.gender)).show()

Yields below output.

+-----------+
|count(name)|
+-----------+
|          6|
+-----------+

+-----------+-------------+
|count(name)|count(gender)|
+-----------+-------------+
|          6|            5|
+-----------+-------------+

6. Conclusion

In this article, you have learned how to get the total number of rows in a PySpark DataFrame using count() and the total number of columns using len(DataFrame.columns), and also how to get the count of non-null values in a column and the count of rows per group using groupBy() and SQL queries.
