PySpark Get the Size or Shape of a DataFrame

Similar to Python pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns (note that columns is an attribute, not a method).

PySpark Get Size and Shape of DataFrame

The size of a PySpark DataFrame is simply its number of rows, and its shape is the number of rows and columns together. If you are using Python pandas, you can get this by running pandasDF.shape.
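For comparison, here is what that looks like in plain pandas, using the same sample names and ages as the PySpark example below:

```python
import pandas as pd

# Same illustrative data as the PySpark example
pandasDF = pd.DataFrame({"name": ["Scott", "Jeff", "Thomas", "Ann"],
                         "age": [50, 45, 54, 34]})

# shape is an attribute, not a method, in pandas
print(pandasDF.shape)  # (4, 2) -> 4 rows, 2 columns
```

Unlike PySpark's count(), pandas computes shape instantly because all data already sits in local memory.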


from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

data = [('Scott', 50), ('Jeff', 45), ('Thomas', 54), ('Ann', 34)]
sparkDF = spark.createDataFrame(data, ["name", "age"])
sparkDF.printSchema()
sparkDF.show()

print((sparkDF.count(), len(sparkDF.columns)))
# Displays (4, 2)

This prints (4, 2), meaning 4 rows and 2 columns. Here, sparkDF.count() is an action that returns the number of rows in the DataFrame, sparkDF.columns returns all column names as a list, and Python's len() function returns the length of that list.


# Displays shape of dataFrame 
# 4 - Rows
# 2 - Columns
(4, 2)

Another Example

Alternatively, you can attach a shape() method to the DataFrame class yourself (monkey patching), so every DataFrame gains a pandas-like shape() call:


from pyspark.sql import DataFrame

def sparkShape(dataFrame):
    # Return (row count, column count) as a tuple
    return (dataFrame.count(), len(dataFrame.columns))

# Monkey patch: attach shape() to every DataFrame instance
DataFrame.shape = sparkShape
print(sparkDF.shape())
# Displays (4, 2)

If you have a small dataset, you can convert the PySpark DataFrame to pandas and read its shape attribute, which returns a tuple with the row and column counts. If your dataset doesn't fit in the Spark driver's memory, do not call toPandas(): it is an action that collects all data to the Spark driver, and you may eventually get an OutOfMemory error.


import pandas as pd  # pandas must be installed for toPandas()

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pandasDF = sparkDF.toPandas()
print(pandasDF.shape)

With the spark.sql.execution.arrow.enabled config set, Apache Spark uses Apache Arrow, an in-memory columnar format, to transfer data efficiently between Python and the JVM.
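Note that in Spark 3.x the Arrow config key was renamed; the old spark.sql.execution.arrow.enabled still works but is deprecated. A minimal config sketch, assuming an existing spark session:

```python
# Spark 3.x key (Spark 2.x used spark.sql.execution.arrow.enabled)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Optionally fall back to the non-Arrow path instead of failing
# if Arrow hits an unsupported type or other error
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
```

With fallback enabled, toPandas() degrades gracefully on data types Arrow cannot handle instead of raising an error.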

Conclusion

A Spark DataFrame doesn't have a shape() method to return its number of rows and columns, but you can achieve the same result by getting the PySpark DataFrame's row count and column count separately.

Happy Learning !!

NNK

