Spark Get Current Number of Partitions of DataFrame

While working with Spark/PySpark, we often need to know the current number of partitions of a DataFrame/RDD, since tuning the partition size/count is one of the key factors in improving Spark/PySpark job performance. In this article, let's learn how to get the current partition count with examples.

Related: How Spark Shuffle Works

1. Spark with Scala/Java

Spark RDD provides getNumPartitions, partitions.length, and partitions.size, all of which return the number of partitions of the current RDD. To use these on a DataFrame, first convert it to an RDD using df.rdd.


// RDD
rdd.getNumPartitions
rdd.partitions.length
rdd.partitions.size

// For DataFrame, convert to RDD first
df.rdd.getNumPartitions
df.rdd.partitions.length
df.rdd.partitions.size
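
To see these calls in context, here is a minimal, self-contained Scala sketch; the master setting local[4] and the object name PartitionCountExample are illustrative choices, not part of the original example.


// Minimal runnable sketch; local[4] and the object name are illustrative
import org.apache.spark.sql.SparkSession

object PartitionCountExample extends App {
  val spark = SparkSession.builder()
    .master("local[4]")        // 4 threads -> 4 default partitions
    .appName("PartitionCountExample")
    .getOrCreate()

  // RDD: all three calls return the same value
  val rdd = spark.sparkContext.parallelize(1 to 100)
  println(rdd.getNumPartitions)    // 4
  println(rdd.partitions.length)   // 4
  println(rdd.partitions.size)     // 4

  // DataFrame: go through the underlying RDD
  val df = spark.range(100).toDF("id")
  println(df.rdd.getNumPartitions) // 4

  spark.stop()
}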

2. PySpark (Spark with Python)

Similarly, in PySpark you can get the current number of partitions by calling getNumPartitions() on the RDD class; to use it with a DataFrame, first convert the DataFrame to an RDD.


# RDD
rdd.getNumPartitions()

# For DataFrame, convert to RDD first
df.rdd.getNumPartitions()

3. Working with Partitions

  • For shuffle operations like reduceByKey() and join(), the resulting RDD inherits its number of partitions from the parent RDD.
  • For DataFrames, shuffle operations like groupBy() and join() produce a result whose number of partitions is set by spark.sql.shuffle.partitions (200 by default), as the sketch below shows.
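
Here is a short sketch of the DataFrame behaviour, reusing a spark session built with master local[4]; the partition counts in the comments assume adaptive query execution is off, which the snippet sets explicitly.


// Disable AQE so the post-shuffle partition count is deterministic
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "10")

val df = spark.range(100).toDF("id")
println(df.rdd.getNumPartitions)      // 4, inherited from local[4]

// groupBy() shuffles, so the result picks up spark.sql.shuffle.partitions
val grouped = df.groupBy("id").count()
println(grouped.rdd.getNumPartitions) // 10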

4. How does Spark decide the Partitions?

Spark decides the number of partitions based on several factors; the main one is where and how you are running the application: is your master local[x] or a cluster manager such as yarn?

4.1 When Master is local[x]

When you run a Spark application locally with master() set to local[X] (X must be an integer greater than 0), Spark creates that many partitions when creating RDDs, DataFrames, and Datasets. Ideally, the X value should be the number of CPU cores you have.
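
As a small sketch of this behaviour (local[3] and the app name are arbitrary illustrative choices):


// With master local[3], Spark creates 3 partitions by default
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[3]")
  .appName("LocalPartitionsExample")
  .getOrCreate()

println(spark.sparkContext.parallelize(1 to 10).getNumPartitions) // 3
println(spark.range(10).rdd.getNumPartitions)                     // 3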

4.2 When Master is yarn or any Cluster Manager

When you run a Spark application on yarn or any other cluster manager, RDDs/DataFrames/Datasets are created with a default number of partitions equal to the total number of cores on all executor nodes.
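
You can check this default from inside the application through defaultParallelism; a hedged sketch, where the figure in the comment assumes, for illustration, 10 executors with 4 cores each.


// On a cluster, defaultParallelism typically equals the total executor cores,
// e.g. 10 executors x 4 cores each = 40 (illustrative numbers)
println(spark.sparkContext.defaultParallelism)

// RDDs created without an explicit partition count pick up this value
val rdd = spark.sparkContext.parallelize(1 to 1000)
println(rdd.getNumPartitions)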

Happy Learning !!
