While working with Spark/PySpark, we often need to know the current number of partitions of a DataFrame/RDD, as tuning the partition count is one of the key factors in improving Spark/PySpark job performance. In this article, let's learn how to get the current partition count with examples.
Related: How Spark Shuffle works?
1. Spark with Scala/Java
Spark RDD provides getNumPartitions, partitions.length, and partitions.size, which return the number of partitions of the current RDD. To use these on a DataFrame, first convert the DataFrame to an RDD using df.rdd.
// RDD
rdd.getNumPartitions
rdd.partitions.length
rdd.partitions.size
// For DataFrame, convert to RDD first
df.rdd.getNumPartitions
df.rdd.partitions.length
df.rdd.partitions.size
2. PySpark (Spark with Python)
Similarly, in PySpark you can get the current number of partitions by calling getNumPartitions() on the RDD; to use it with a DataFrame, first convert the DataFrame to an RDD.
# RDD
rdd.getNumPartitions()
# For DataFrame, convert to RDD first
df.rdd.getNumPartitions()
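Below is a minimal, self-contained PySpark sketch that puts these calls in context; the app name, master value, and sample data are made up for illustration.

from pyspark.sql import SparkSession

# A minimal sketch; the app name and sample data are arbitrary.
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("partition-count-example") \
    .getOrCreate()

# RDD created with an explicit number of slices
rdd = spark.sparkContext.parallelize(range(100), 6)
print(rdd.getNumPartitions())          # 6

# DataFrame: convert to RDD to read its partition count
df = spark.createDataFrame([(i,) for i in range(100)], ["id"])
print(df.rdd.getNumPartitions())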
3. Working with Partitions
- For shuffle operations on RDDs, such as reduceByKey() and join(), the resulting RDD inherits the number of partitions from the parent RDD.
- For DataFrames, shuffle operations like groupBy() and join() default to the number of partitions set by spark.sql.shuffle.partitions.
- Instead of using the default, if you want to increase or decrease the number of partitions, Spark lets you repartition the RDD/DataFrame at runtime using the repartition() and coalesce() transformations. To decrease the number of partitions, prefer coalesce(), as it is the most efficient way (it avoids a full shuffle); see the sketch after this list.
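A minimal PySpark sketch of these points; the shuffle-partition value and data below are made up for illustration, and the exact shuffle partition count can differ if adaptive query execution coalesces partitions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("repartition-example").getOrCreate()

# DataFrame shuffles default to spark.sql.shuffle.partitions (200 unless changed)
spark.conf.set("spark.sql.shuffle.partitions", "50")   # illustrative value

df = spark.range(0, 1000)
grouped = df.groupBy((df.id % 10).alias("bucket")).count()
print(grouped.rdd.getNumPartitions())   # typically 50; fewer if AQE coalesces partitions

# repartition() can increase or decrease the partition count (full shuffle)
print(df.repartition(8).rdd.getNumPartitions())    # 8

# coalesce() only decreases the partition count and avoids a full shuffle
print(df.coalesce(2).rdd.getNumPartitions())       # 2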
4. How does Spark decide the Partitions?
Spark decides the number of partitions based on several factors; the main one is where and how you are running the application: whether the master is local[x] or yarn (or another cluster manager).
4.1 When Master is Local[x]
When you run a Spark application locally with master() set to local[X] (where X is an integer greater than 0), Spark creates that many partitions when creating an RDD, DataFrame, or Dataset. Ideally, X should be the number of CPU cores you have.
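A minimal sketch to verify this locally; local[4] below is just an example value.

from pyspark.sql import SparkSession

# master set to local[4]; 4 is an arbitrary example value
spark = SparkSession.builder.master("local[4]").appName("local-partitions").getOrCreate()
sc = spark.sparkContext

print(sc.defaultParallelism)                           # 4
print(sc.parallelize(range(100)).getNumPartitions())   # 4 by default
print(spark.range(0, 100).rdd.getNumPartitions())      # typically 4 as well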
4.2 When Master is yarn or any Cluster Manager
When you run a Spark application on yarn or any other cluster manager, RDDs/DataFrames/Datasets are created, by default, with a number of partitions equal to the total number of cores across all executor nodes.
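One way to check this on a cluster is to inspect the default parallelism from the driver; the executor sizing mentioned below is hypothetical and depends on how you submit the job (for example via --num-executors and --executor-cores).

from pyspark.sql import SparkSession

# Assumes the application was submitted to a cluster manager such as YARN,
# e.g. with 5 executors x 4 cores each (hypothetical values).
spark = SparkSession.builder.appName("cluster-partitions").getOrCreate()
sc = spark.sparkContext

# Expected to be roughly the total executor cores, e.g. 5 * 4 = 20 in this example
print(sc.defaultParallelism)
print(sc.parallelize(range(1000)).getNumPartitions())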
Happy Learning !!