Difference between spark.sql.shuffle.partitions vs spark.default.parallelism?

| *** Please Subscribe for Ad Free & Premium Content ***

Post author:Naveen Nelamali
Post category:Apache Spark
Post last modified:August 24, 2023
Reading time:5 mins read

You are currently viewing Difference between spark.sql.shuffle.partitions vs spark.default.parallelism?

Spark provides spark.sql.shuffle.partitions and spark.default.parallelism configurations to work with parallelism or partitions, If you are new to the Spark you might have a big question what is the difference between spark.sql.shuffle.partitions vs spark.default.parallelism properties and when to use one.

1. spark.default.parallelism vs spark.sql.shuffle.partitions

RDD: spark.default.parallelism was introduced with RDD hence this property is only applicable to RDD. The default value for this configuration set to the number of all cores on all nodes in a cluster, on local, it is set to the number of cores on your system.

For RDD, wider transformations like reduceByKey(), groupByKey(), join() triggers the data shuffling. Prior to using these operations, use the below code to set the desired partitions (change the value accordingly) for shuffle operations.


// spark.default.parallelism vs spark.sql.shuffle.partitions
spark.conf.set("spark.default.parallelism",100)
sqlContext.setConf("spark.default.parallelism", "100") // older versions use sqlcontext

DataFrame: Whereas spark.sql.shuffle.partitions was introduced with DataFrame and it only works with DataFrame, the default value for this configuration set to 200.

For DataFrame, wider transformations like groupBy(), join() triggers the data shuffling hence the result of these transformations results in partition size same as the value set in spark.sql.shuffle.partitions. Prior to using these operations, use the below code to get the desired partitions (change the value according to your need).


spark.conf.set("spark.sql.shuffle.partitions",100)
sqlContext.setConf("spark.sql.shuffle.partitions", "100") // older version

Note: If the RDD/DataFrame transformations you are applying don’t trigger the data shuffle then these configurations are ignored by Spark.

In real-time, we usually set these values with spark-submit as shown below


spark-submit --conf spark.sql.shuffle.partitions=100 \
             --conf spark.default.parallelism=100

What value to use?

Based on your dataset size, a number of cores, and memory, Spark shuffling can benefit or harm your jobs. When you dealing with less amount of data, you should typically reduce the shuffle partitions otherwise you will end up with many partitioned files with less number of records in each partition. which results in running many tasks with lesser data to process.

On other hand, when you have too much of data and having less number of partitions results in fewer longer running tasks and some times you may also get out of memory error.

Getting a right size of the shuffle partition is always tricky and takes many runs with different value to achieve the optimized number. This is one of the key property to look for when you have performance issues on Spark jobs.

Happy Learning !!

1. spark.default.parallelism vs spark.sql.shuffle.partitions

What value to use?

Related Articles

Leave a Reply