
What does setMaster(local[*]) mean in Spark? In this article, I will explain what the setMaster() function is used for and what the value local[*] means in Spark. Apache Spark is one of the largest open source projects, with a unified analytics engine for large-scale data processing.

Understanding the Spark core architecture helps us know what exactly setMaster() and the parameter local[*] mean in Spark, so let’s go through it first.

1. Spark Architecture

Simply put, Spark works using a master-slave mechanism.

[Figure: Spark Architecture (source: Spark)]

Here, the key concepts to remember are:

Master or Driver: The heart of the Spark application. It runs your main() function, analyzes the code, determines the executors required to complete the work, and distributes the work to the slaves (executors) as tasks.

Cluster Manager: Manages a cluster, or group of machines, pooling the resources of many machines together and allowing us to use all the cumulative resources as if they were one. It provides executor resources as requested by the master or driver.

Executor or Slave: The executors are responsible for carrying out the work that the driver assigns them and reporting the state of the computation on that executor back to the driver.

These are the core elements of Spark. So, at a high level, the master in Spark analyzes the work that is needed, and the executors complete the work assigned to them by the master.

2. What are Task/Cores/Slots/Threads?

Task: A piece of code that an executor gets to run. Cores are the maximum number of tasks an executor can run in parallel.

Slot: Sometimes used to refer to the number of cores available per executor. Although slots are often referred to as cores in Spark, they are implemented as threads that run on a physical core’s hardware threads and do not need to correspond to the number of physical CPU cores on the machine (since different CPU manufacturers architect multi-threaded chips differently).

Threads: Virtual execution units that divide a physical CPU core into multiple virtual (logical) cores.

Let’s put our understanding together using the diagram above, which shows executor nodes with 2 cores each:

  • A task is a piece of code that runs on the executor. Simply put, in Spark one task is created for processing one partition
    1 Task = 1 Partition
    By creating Tasks, the Driver can assign units of work to Slots on each Executor for parallel execution.
  • Number of Cores per executor = Number of Tasks it can run parallel per executor.
    Here in our case, each executor can run 2 tasks
    1 Executor = 2 Cores = 2 Tasks (in parallel)
  • Most processors today are multi-threaded. If your CPU splits 1 physical core into 2 virtual cores, i.e. 2 threads:
    1 Executor = 2 Cores = 4 Threads

So overall, each executor can run n partitions (tasks) in parallel using the n cores available per executor.
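The arithmetic above can be sketched in a few lines of Python (the numbers are the example values from this section, not something Spark computes for you):

```python
# Sketch: how executor cores determine parallel task capacity.
# These numbers match the 2-core executor example above.
num_executors = 1        # executors in the example
cores_per_executor = 2   # slots (cores) per executor
threads_per_core = 2     # hardware threads per core on a multi-threaded CPU

# One task processes one partition, so parallel tasks = total cores.
parallel_tasks = num_executors * cores_per_executor
hardware_threads = parallel_tasks * threads_per_core

print(parallel_tasks)    # 2 tasks can run at once
print(hardware_threads)  # 4 hardware threads back those 2 cores
```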

3. What does setMaster(local[*]) mean in Spark?

It is used when you are creating a SparkSession or SparkConf object.

[Figure: SparkConf example]

Here, setMaster() specifies where to run your Spark application: locally or on a cluster. When you run on a cluster, you need to specify the Spark master URL of the distributed cluster. We usually assign the value local[*] to setMaster() while doing internal testing, for example when running a Spark application on your own system (not a cluster). Below are the possible values you can use with the setMaster() function.

[Table: Spark master URLs]

setMaster(local[*]) runs Spark locally with as many worker threads as there are logical cores on your machine.
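In code, that looks roughly like the PySpark sketch below (PySpark is assumed to be installed; the try/except just keeps the snippet runnable without it):

```python
# Sketch: setting the master URL to local[*] on a SparkConf.
master_url = "local[*]"  # one worker thread per logical core on this machine

try:
    from pyspark import SparkConf
    # setMaster() stores the URL under the spark.master config key
    conf = SparkConf().setAppName("LocalTest").setMaster(master_url)
    print(conf.get("spark.master"))  # local[*]
except ImportError:
    # PySpark not installed; the URL is still what spark-submit expects
    print(master_url)
```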

Alternatively, you can also set this value with the spark-shell or spark-submit command. Below is an example of how to use it with spark shell.


./bin/spark-shell --master local[*]

You can check the number of threads available on a Windows machine using Task Manager:

  • Open Task Manager (press Ctrl+Shift+Esc)
  • Select the Performance tab.
  • Look for Cores and Logical Processors (Threads)
[Figure: Cores and logical processors in Task Manager]

If you run a Spark application on a machine with 4 cores and 8 threads, as shown in the picture above, configuring setMaster(local[*]) will use all 8 threads available on the machine.
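You can also query the logical core count programmatically; here is a small Python sketch (this is the number that local[*] resolves to on your machine):

```python
import os

# os.cpu_count() reports logical processors (hardware threads),
# i.e. the worker-thread count that local[*] resolves to locally.
logical_cores = os.cpu_count()
print(logical_cores)  # e.g. 8 on a 4-core / 8-thread machine
```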

For a better picture, I can set up my config as:

Number of executors (A) = 1 executor
Number of cores per executor (B) = 2 cores (considering the driver has occupied the other 2 cores)
Number of threads per executor (C) = 4 threads (2 × B)
setMaster value = local[2]

Here, Spark runs locally with 2 worker threads (ideally, set this to the number of cores available on your machine).
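If you want an explicit thread count instead of local[*], the master URL is just the string local[N]; a trivial sketch:

```python
# Sketch: building an explicit master URL instead of local[*].
worker_threads = 2  # e.g. the 2 cores reserved for the executor above
master_url = f"local[{worker_threads}]"
print(master_url)  # local[2]
```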

4. Conclusion

In this article, I have explained what setting the setMaster() value to local[*] means in Spark. In Spark, setMaster() takes the address of the master as its value; if the setMaster() value is set to local[*], it means the master runs locally with all the threads available.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium.