PySpark Get the Size or Shape of a DataFrame

Similar to Python pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.

PySpark Get Size and Shape of DataFrame

The size of a DataFrame is simply its number of rows, and its shape is the number of rows and columns. If you are using Python pandas, you get this by running pandasDF.shape.


# Create a SparkSession and a small sample DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

data = [('Scott', 50), ('Jeff', 45), ('Thomas', 54), ('Ann', 34)]
sparkDF = spark.createDataFrame(data, ["name", "age"])
sparkDF.printSchema()
sparkDF.show()

# Get the shape of the DataFrame: (row count, column count)
print((sparkDF.count(), len(sparkDF.columns)))
# Displays (4, 2)

This prints (4, 2), meaning 4 rows and 2 columns. Here, sparkDF.count() is an action that returns the number of rows in the DataFrame, sparkDF.columns returns all column names as a Python list, and the len() function returns the length of that list.


# Displays the shape of the DataFrame
# 4 - Rows
# 2 - Columns
(4, 2)
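
Because count() is an action, it launches a Spark job every time it is called, whereas columns is plain schema metadata. If you need the row count more than once, the sketch below caches the DataFrame first (caching is optional and assumes the data fits in memory):


# count() triggers a Spark job; columns is a metadata lookup only.
# cache() is optional here and assumes the DataFrame fits in memory.
sparkDF.cache()
rows = sparkDF.count()         # action: runs a job and materializes the cache
cols = len(sparkDF.columns)    # no job is run for this
print((rows, cols))
# Displays (4, 2)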

Another Example: Add a shape() Method to DataFrame


import pyspark.sql.dataframe

# Monkey patch a pandas-style shape() method onto the DataFrame class
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))

pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(sparkDF.shape())
# Displays (4, 2)
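
Keep in mind that the patch above changes the DataFrame class for the entire Python session, so every DataFrame gains shape(). If you would rather not patch a library class, a plain helper function works just as well. A minimal sketch, where spark_shape is an illustrative name, not part of the PySpark API:


from pyspark.sql import DataFrame

# A plain helper instead of monkey patching; spark_shape is an
# illustrative name, not part of the PySpark API.
def spark_shape(df: DataFrame) -> tuple:
    return (df.count(), len(df.columns))

print(spark_shape(sparkDF))
# Displays (4, 2)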

If you have a small dataset, you can convert the PySpark DataFrame to pandas and call shape, which returns a tuple with the row and column counts. If your dataset doesn't fit in the Spark driver's memory, do not run toPandas(): it is an action that collects all data to the driver, and you may eventually get an OutOfMemory error.


# Enable Apache Arrow to speed up the Spark-to-pandas conversion
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pandasDF = sparkDF.toPandas()
print(pandasDF.shape)
# Displays (4, 2)
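
If you only need a quick pandas-side look at the data rather than the exact shape, cap the rows before collecting so the driver stays safe. A minimal sketch, where the 1000-row limit is an arbitrary illustrative value:


# Defensive sketch: cap the rows collected to the driver.
# The 1000-row limit is an arbitrary illustrative value.
pandasSample = sparkDF.limit(1000).toPandas()
print(pandasSample.shape)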

With this Arrow config enabled, Apache Spark uses Apache Arrow, an in-memory columnar data format, to transfer the data between the JVM and Python, which speeds up toPandas().
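
Note that on Spark 3.x this config was renamed. A short sketch with the current names (the fallback flag lets Spark revert to the non-Arrow path if the Arrow conversion fails):


# Spark 3.x names; spark.sql.execution.arrow.enabled is a deprecated alias.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Optionally fall back to the non-Arrow path if Arrow conversion fails.
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")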

Conclusion

A PySpark DataFrame doesn’t have a shape() method to return its size; however, you can achieve the same result by getting the row count and the column count separately, as shown above.

Happy Learning !!
