While working with Spark/PySpark, we often need to know the current number of partitions of a DataFrame/RDD, since tuning the partition count is one of the key factors in improving Spark/PySpark job performance. In this article, let's learn how to get the current partition count with examples.
Related: How Spark Shuffle works?
1. Spark with Scala/Java
Spark RDD provides getNumPartitions, partitions.length, and partitions.size, which return the number of partitions of the current RDD. To use these on a DataFrame, first access its underlying RDD via df.rdd.

// RDD
rdd.getNumPartitions
rdd.partitions.length
rdd.partitions.size

// For DataFrame, convert to RDD first
df.rdd.getNumPartitions
df.rdd.partitions.length
df.rdd.partitions.size
2. PySpark (Spark with Python)
Similarly, in PySpark you can get the current number of partitions by calling getNumPartitions() of the RDD class, so to use it with a DataFrame you first need to convert the DataFrame to an RDD.

# RDD
rdd.getNumPartitions()

# For DataFrame, convert to RDD first
df.rdd.getNumPartitions()
3. Working with Partitions
- For shuffle operations like join(), RDDs inherit the partition count from the parent RDD.
- For DataFrames, shuffle operations like join() default to the number of partitions set by spark.sql.shuffle.partitions (200 by default).
- If you want to increase or decrease the number of partitions instead of using the default, Spark provides a way to repartition the RDD/DataFrame at runtime using the repartition() and coalesce() transformations. To decrease the number of partitions, prefer coalesce(), as it avoids a full shuffle and is the most efficient way.
4. How does Spark decide the Partitions?
Spark decides the number of partitions based on several factors; the main one is where and how you are running it: whether your master is local[x] or a cluster manager such as yarn.
4.1 When Master is local[x]
When you run a Spark application locally with master() set to local[X] (where X is an integer greater than 0), Spark creates X partitions when creating an RDD, DataFrame, or Dataset. Ideally, X should be the number of CPU cores you have.
4.2 When Master is yarn or any Cluster Manager
When you run a Spark application on yarn or any other cluster manager, the default number of partitions for a newly created RDD/DataFrame/Dataset is the total number of cores across all executor nodes.
Happy Learning !!