While working with Spark/PySpark, we often need to know the current number of partitions of a DataFrame/RDD, as tuning the partition count is one of the key factors in improving Spark/PySpark job performance. In this article, let's learn how to get the current partition count with examples.
Related: How Spark Shuffle works?
1. Spark with Scala/Java
Spark RDD provides getNumPartitions, partitions.length, and partitions.size, which return the number of partitions of the current RDD. To use these on a DataFrame, first convert the DataFrame to an RDD using df.rdd.
// RDD
rdd.getNumPartitions
rdd.partitions.length
rdd.partitions.size
// For DataFrame, convert to RDD first
df.rdd.getNumPartitions
df.rdd.partitions.length
df.rdd.partitions.size
2. PySpark (Spark with Python)
Similarly, in PySpark you can get the current number of partitions by calling getNumPartitions() on the RDD; to use it with a DataFrame, first convert the DataFrame to an RDD.
# RDD
rdd.getNumPartitions()
# For DataFrame, convert to RDD first
df.rdd.getNumPartitions()
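Below is a minimal, self-contained PySpark sketch that puts these calls in context; the app name, master value, and sample data are made up for illustration.

from pyspark.sql import SparkSession

# A minimal sketch; the app name and sample data are arbitrary.
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("partition-count-example") \
    .getOrCreate()

# RDD created with an explicit number of slices
rdd = spark.sparkContext.parallelize(range(100), 6)
print(rdd.getNumPartitions())          # 6

# DataFrame: convert to RDD to read its partition count
df = spark.createDataFrame([(i,) for i in range(100)], ["id"])
print(df.rdd.getNumPartitions())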
3. Working with Partitions
- For shuffle operations on RDDs, such as reduceByKey() and join(), the resulting RDD inherits the number of partitions from the parent RDD.
- For DataFrames, shuffle operations like groupBy() and join() default to the number of partitions set by spark.sql.shuffle.partitions.
- Instead of using the default, if you want to increase or decrease the number of partitions, Spark lets you repartition the RDD/DataFrame at runtime using the repartition() and coalesce() transformations. To decrease the number of partitions, prefer coalesce(), as it is the most efficient way (it avoids a full shuffle); see the sketch after this list.
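A minimal PySpark sketch of these points; the shuffle-partition value and data below are made up for illustration, and the exact shuffle partition count can differ if adaptive query execution coalesces partitions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("repartition-example").getOrCreate()

# DataFrame shuffles default to spark.sql.shuffle.partitions (200 unless changed)
spark.conf.set("spark.sql.shuffle.partitions", "50")   # illustrative value

df = spark.range(0, 1000)
grouped = df.groupBy((df.id % 10).alias("bucket")).count()
print(grouped.rdd.getNumPartitions())   # typically 50; fewer if AQE coalesces partitions

# repartition() can increase or decrease the partition count (full shuffle)
print(df.repartition(8).rdd.getNumPartitions())    # 8

# coalesce() only decreases the partition count and avoids a full shuffle
print(df.coalesce(2).rdd.getNumPartitions())       # 2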
4. How does Spark decide the Partitions?
Spark decides the number of partitions based on several factors; the main one is where and how you are running the application: whether the master is local[x] or yarn (or another cluster manager).
4.1 When Master is Local[x]
When you run a Spark application locally with master() set to local[X] (where X is an integer greater than 0), Spark creates that many partitions when creating an RDD, DataFrame, or Dataset. Ideally, X should be the number of CPU cores you have.
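A minimal sketch to verify this locally; local[4] below is just an example value.

from pyspark.sql import SparkSession

# master set to local[4]; 4 is an arbitrary example value
spark = SparkSession.builder.master("local[4]").appName("local-partitions").getOrCreate()
sc = spark.sparkContext

print(sc.defaultParallelism)                           # 4
print(sc.parallelize(range(100)).getNumPartitions())   # 4 by default
print(spark.range(0, 100).rdd.getNumPartitions())      # typically 4 as well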
4.2 When Master is yarn or any Cluster Manager
When you run a Spark application on yarn or any other cluster manager, RDDs/DataFrames/Datasets are created, by default, with a number of partitions equal to the total number of cores across all executor nodes.
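One way to check this on a cluster is to inspect the default parallelism from the driver; the executor sizing mentioned below is hypothetical and depends on how you submit the job (for example via --num-executors and --executor-cores).

from pyspark.sql import SparkSession

# Assumes the application was submitted to a cluster manager such as YARN,
# e.g. with 5 executors x 4 cores each (hypothetical values).
spark = SparkSession.builder.appName("cluster-partitions").getOrCreate()
sc = spark.sparkContext

# Expected to be roughly the total executor cores, e.g. 5 * 4 = 20 in this example
print(sc.defaultParallelism)
print(sc.parallelize(range(1000)).getNumPartitions())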
Happy Learning !!