Site icon Spark By {Examples}

Calculate Size of Spark DataFrame & RDD

sparksize dataframe

Sometimes we may require to know or calculate the size of the Spark Dataframe or RDD that we are processing, knowing the size we can either improve the Spark job performance or implement better application logic or even resolve the out-of-memory issues. Sometimes it’s also helpful to know the size if you are broadcasting the DataFrame to do broadcast join.

In this article, we will discuss how we can calculate the size of the Spark RDD/DataFrame.

Quick Example to find the size of DataFrame using SizeEstimator


//Create a DataFrame
import spark.implicits._
val someDF = Seq(
  (1, "bat"),
  (2, "mouse"),
  (3, "horse")
).toDF("number", "animal")

//Calculate size of weatherDF dataFrame using SizeEstimator
import org.apache.spark.util.SizeEstimator
val dfSize = SizeEstimator.estimate(someDF)
println(s"Estimated size of the dataFrame weatherDF = ${dfSize/1000000} mb")

//Output
//Estimated size of the dataFrame weatherDF = 33 mb

1. Use Case

2. Calculate the Size of Spark DataFrame

The spark utils module provides org.apache.spark.util.SizeEstimator that helps to Estimate the sizes of Java objects (number of bytes of memory they occupy), for use in-memory caches. We can use this class to calculate the size of the Spark Dataframe. See org.apache.spark.util

Let us calculate the size of the dataframe using the DataFrame created locally. Here below we created a DataFrame using spark implicts and passed the DataFrame to the size estimator function to yield its size in bytes.


//Create a dataFrame
import spark.implicits._
val someDF = Seq(
  (1, "bat"),
  (2, "mouse"),
  (3, "horse")
).toDF("number", "animal")

//Calculate size of weatherDF dataFrame using SizeEstimator
import org.apache.spark.util.SizeEstimator
val dfSize = SizeEstimator.estimate(someDF)
println(s"Estimated size of the dataFrame weatherDF = ${dfSize/1000000} mb")

//Output
# Estimated size of the dataFrame weatherDF = 33 mb

Let us the same approach and calculate the size of DataFrame in databricks using the weather datasets.


//Create a dataFrame
val weatherDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/databricks-datasets/weather")

//Calculate size of weatherDF dataFrame using SizeEstimator
import org.apache.spark.util.SizeEstimator
val weatherDFSize = SizeEstimator.estimate(weatherDF)
println(s"size of the dataFrame weatherDF = ${weatherDFSize/1000000} mb")

//Output
//size of the dataFrame weatherDF = 210 mb

In Azure Databricks you will see something like the below.

Calculate Size Spark DataFrame
DataFrame Size Estimator

Here in the above example, we have tried estimating the size of the weatherDF dataFrame that was created using in databricks using databricks datasets. We passed the newly created weatherDF dataFrame as a parameter to the estimate function of the SizeEstimator which estimated the size of the object/DataFrame in bytes.

3. Calculating the Size of Spark RDD

Similarly, let’s calculate the size of the RDD, so first create an RDD using spark scala and try calculating its size.


//Create a RDD
val data=spark.sparkContext.parallelize(Seq(("maths"),("english"),("science"), ("computer"),("maths")))

//Calculate Size of RDD
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd
val size = data.map(_.getBytes("UTF-8").length.toLong).reduce(_+_)
println(s"Estimated size of the RDD data = ${size} mb")

//Output
//Estimated size of the RDD data = 32 mb

Here we first created an RDD and using getBytes of the results we calculated the size of the RDD.

4. Conclusion

SizeEstimator from the Spark utils modules helps to estimate the size of the Dataframe/RDD you’re working with or the result after all the filtering. This is useful for estimating the size of heap space a broadcast variable will occupy on each executor or the amount of space each java object will take when caching objects in deserialized form. Also, this is not the same as the serialized size of the object, which will typically be much smaller.

Exit mobile version