Calculate Size of Spark DataFrame & RDD

Sometimes we need to know or calculate the size of the Spark DataFrame or RDD we are processing. Knowing the size helps us improve Spark job performance, implement better application logic, or even resolve out-of-memory issues. It is also helpful to know the size if you are broadcasting a DataFrame to do a broadcast join.

In this article, we will discuss how we can calculate the size of the Spark RDD/DataFrame.

Quick Example to find the size of a DataFrame using SizeEstimator


//Create a DataFrame
import spark.implicits._
val someDF = Seq(
  (1, "bat"),
  (2, "mouse"),
  (3, "horse")
).toDF("number", "animal")

//Calculate size of the someDF DataFrame using SizeEstimator
import org.apache.spark.util.SizeEstimator
val dfSize = SizeEstimator.estimate(someDF)
println(s"Estimated size of the DataFrame someDF = ${dfSize/1000000} MB")

//Output
//Estimated size of the DataFrame someDF = 33 MB

1. Use Case

  • Determine cache level: Estimating the size of an RDD/DataFrame helps to choose the right storage level (memory or disk) when caching in Spark.
  • Improve performance: As we know, a broadcast join helps improve performance by storing the data of the smaller DataFrame in each executor's memory. Based on the size of the DataFrames we are joining, we can determine whether a broadcast join is really viable, as shown in the sketch after this list.
  • Partition: Calculating the size of the DataFrame also helps in choosing sensible partition sizes, so that data can be distributed evenly across partitions.
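
For example, the broadcast-join decision from the second bullet could look like the minimal sketch below. This is an illustration, not code from this article: the two DataFrames and the join key "id" are made-up placeholders, and the 10 MB threshold mirrors Spark's default spark.sql.autoBroadcastJoinThreshold.


//Decide on a broadcast join based on the estimated size of the smaller side
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.util.SizeEstimator
import spark.implicits._

val largeDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val smallDF = Seq((1, "x"), (2, "y")).toDF("id", "label")

//10 MB, mirroring Spark's default spark.sql.autoBroadcastJoinThreshold
val broadcastThreshold = 10L * 1024 * 1024

val joinedDF =
  if (SizeEstimator.estimate(smallDF) < broadcastThreshold)
    largeDF.join(broadcast(smallDF), "id") //hint Spark to broadcast the smaller side
  else
    largeDF.join(smallDF, "id") //fall back to a regular shuffle join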

2. Calculate the Size of Spark DataFrame

The Spark utils module provides org.apache.spark.util.SizeEstimator, which estimates the size of Java objects (the number of bytes of memory they occupy) for use in memory caches. We can use this class to estimate the size of a Spark DataFrame. See the org.apache.spark.util package for details.

Let us calculate the size of a DataFrame created locally. Below we create a DataFrame using spark implicits and pass it to the estimate function, which yields its size in bytes.


//Create a DataFrame
import spark.implicits._
val someDF = Seq(
  (1, "bat"),
  (2, "mouse"),
  (3, "horse")
).toDF("number", "animal")

//Calculate size of the someDF DataFrame using SizeEstimator
import org.apache.spark.util.SizeEstimator
val dfSize = SizeEstimator.estimate(someDF)
println(s"Estimated size of the DataFrame someDF = ${dfSize/1000000} MB")

//Output
//Estimated size of the DataFrame someDF = 33 MB

Let us use the same approach and calculate the size of a DataFrame in Databricks using the weather datasets.


//Create a DataFrame
val weatherDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/databricks-datasets/weather")

//Calculate size of the weatherDF DataFrame using SizeEstimator
import org.apache.spark.util.SizeEstimator
val weatherDFSize = SizeEstimator.estimate(weatherDF)
println(s"Estimated size of the DataFrame weatherDF = ${weatherDFSize/1000000} MB")

//Output
//Estimated size of the DataFrame weatherDF = 210 MB

In Azure Databricks you will see something like the below.

[Image: Estimated size of the weatherDF DataFrame shown in an Azure Databricks notebook]

In the example above, we estimated the size of the weatherDF DataFrame, which was created in Databricks from the Databricks datasets. We passed the newly created weatherDF DataFrame as a parameter to the estimate function of SizeEstimator, which estimated the size of the object/DataFrame in bytes.
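
As a complementary check, newer Spark versions keep size statistics on the optimized logical plan, which you can read through the queryExecution internals. Treat the sketch below as an assumption on recent Spark versions, since these are internal APIs and the number is only Catalyst's own estimate.


//Catalyst's own size estimate (in bytes) from the optimized logical plan
val planSizeInBytes = weatherDF.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Catalyst size estimate for weatherDF = ${planSizeInBytes / 1000000} MB")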

3. Calculate the Size of Spark RDD

Similarly, let's calculate the size of an RDD: first create an RDD in Spark Scala, then calculate its size.


//Create an RDD of strings
val data = spark.sparkContext.parallelize(
  Seq("maths", "english", "science", "computer", "maths"))

//Calculate size of the RDD by summing the UTF-8 byte length of each string
val size = data.map(_.getBytes("UTF-8").length.toLong).reduce(_+_)
println(s"Estimated size of the RDD data = ${size} bytes")

//Output
//Estimated size of the RDD data = 32 bytes

Here we first created an RDD of strings, then summed the UTF-8 byte length of each element with getBytes to calculate the total size of the data it holds. Note that this counts only the raw string bytes, not the JVM object overhead; a sketch that generalizes the approach follows.
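
The byte count above only works for an RDD of strings. One rough way to generalize it (a sketch, and only an approximation, since estimating elements one by one counts any shared objects repeatedly) is to apply SizeEstimator to each element on the executors; numbersRDD below is an illustrative example.


//Estimate each element's JVM footprint and sum the results
import org.apache.spark.util.SizeEstimator

val numbersRDD = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
val estimatedBytes = numbersRDD
  .map(n => SizeEstimator.estimate(n.asInstanceOf[AnyRef])) //boxes Int to java.lang.Integer
  .reduce(_ + _)
println(s"Estimated size of numbersRDD = ${estimatedBytes} bytes")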

4. Conclusion

SizeEstimator from the Spark utils module helps to estimate the size of the DataFrame/RDD you're working with, or of the result after all your filtering. This is useful for estimating the amount of heap space a broadcast variable will occupy on each executor, or the amount of space each Java object will take when caching objects in deserialized form. Note that this is not the same as the serialized size of the object, which will typically be much smaller; the sketch below contrasts the two.
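
To make the last point concrete, here is a minimal sketch contrasting the in-memory estimate with the serialized size of the same object. It assumes plain Java serialization; Spark often uses Kryo instead, so the exact numbers will differ.


//Compare SizeEstimator's in-memory estimate with the Java-serialized size
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.spark.util.SizeEstimator

val sample = Seq.tabulate(1000)(i => s"payload-$i")

//Estimated deserialized (in-memory) footprint
val inMemoryBytes = SizeEstimator.estimate(sample)

//Serialized size using plain Java serialization
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(sample)
oos.close()
val serializedBytes = bos.size()

println(s"In-memory estimate: ${inMemoryBytes} bytes")
println(s"Serialized size: ${serializedBytes} bytes") //typically much smaller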

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium