Difference between spark.sql.shuffle.partitions vs spark.default.parallelism?

Spark provides spark.sql.shuffle.partitions and spark.default.parallelism configurations to work with parallelism or partitions, If you are new to the Spark you might have a big question what is the difference between spark.sql.shuffle.partitions vs spark.default.parallelism properties and when to use one.

Before we jump into the differences let’s understand what is Spark shuffle? The Spark shuffle is a mechanism for redistributing or re-partitioning data so that the data grouped differently across partitions. Spark shuffle is a very expensive operation as it moves the data between executors or even between worker nodes in a cluster. Spark automatically triggers the shuffle when we perform aggregation and join operations on RDD and DataFrame.

As the shuffle operations re-partitions the data, we can use configurations spark.default.parallelism and spark.sql.shuffle.partitions to control the number of partitions shuffle creates.

spark.default.parallelism vs spark.sql.shuffle.partitions

RDD: spark.default.parallelism was introduced with RDD hence this property is only applicable to RDD. The default value for this configuration set to the number of all cores on all nodes in a cluster, on local, it is set to the number of cores on your system.

For RDD, wider transformations like reduceByKey()groupByKey()join() triggers the data shuffling. Prior to using these operations, use the below code to set the desired partitions (change the value accordingly) for shuffle operations.

sqlContext.setConf("spark.default.parallelism", "100") // older versions use sqlcontext

DataFrame: Whereas spark.sql.shuffle.partitions was introduced with DataFrame and it only works with DataFrame, the default value for this configuration set to 200.

For DataFrame, wider transformations like groupBy()join() triggers the data shuffling hence the result of these transformations results in partition size same as the value set in spark.sql.shuffle.partitions. Prior to using these operations, use the below code to get the desired partitions (change the value according to your need).

sqlContext.setConf("spark.sql.shuffle.partitions", "100") // older version

Note: If the RDD/DataFrame transformations you are applying don’t trigger the data shuffle then these configurations are ignored by Spark.

In real-time, we usually set these values with spark-submit as shown below

spark-submit --conf spark.sql.shuffle.partitions=100 \
             --conf spark.default.parallelism=100

What value to use?

Based on your dataset size, a number of cores, and memory, Spark shuffling can benefit or harm your jobs. When you dealing with less amount of data, you should typically reduce the shuffle partitions otherwise you will end up with many partitioned files with less number of records in each partition. which results in running many tasks with lesser data to process.

On other hand, when you have too much of data and having less number of partitions results in fewer longer running tasks and some times you may also get out of memory error.

Getting a right size of the shuffle partition is always tricky and takes many runs with different value to achieve the optimized number. This is one of the key property to look for when you have performance issues on Spark jobs.

Happy Learning !!


SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply