What does setMaster(local[*]) mean in Spark? In this article, I will explain what the setMaster() function is used for and what the value local[*] means in Spark. Apache Spark is one of the most widely used open source projects, a unified analytics engine for large-scale data processing.
Understanding the Spark core architecture helps us know exactly what setMaster() and the parameter local[*] mean in Spark, so let's go through it once.
1. Spark Architecture
Simply put, Spark works using a master-slave mechanism.
Here, the key concepts to remember are:
Master or Driver: The heart of the Spark application. It runs your main() function, analyzes the code, determines the executors required to complete the work, and distributes the work to the slaves (executors) as tasks.
Cluster Manager: Manages a cluster, a group of machines whose pooled resources can be used as if they were one machine. It provides executor resources as requested by the Master or Driver.
Executor or Slave: The executors are responsible for carrying out the work that the driver assigns them and reporting the state of the computation on that executor back to the driver.
So, at a high level, the master in Spark analyzes the work that is needed, and the executors complete the work assigned to them by the master.
2. What are Tasks/Cores/Slots/Threads?
Task: The piece of code that an executor runs. Cores: The maximum number of tasks an executor can run in parallel.
Slot: Sometimes referred to as the number of cores available per executor. Although slots are often called cores in Spark, they are implemented as threads running on a physical core's hardware threads and do not need to correspond to the number of physical CPU cores on the machine (since different CPU manufacturers can architect multi-threaded chips differently).
Threads: Virtual components that divide a physical CPU core into multiple virtual cores.
Let's put our understanding together using the diagram above, which shows executor nodes with 2 cores each:
- A task is a piece of code that runs on the executor. Simply put, in Spark one task is created for processing one partition.
By creating Tasks, the Driver can assign units of work to Slots on each Executor for parallel execution.
- Number of cores per executor = number of tasks it can run in parallel per executor. In our case, each executor can run 2 tasks.
- Most processors today are multi-threaded: a CPU may split 1 physical core into 2 virtual cores, i.e. 2 threads.
So overall, each executor can run n partitions (tasks) in parallel using the n cores available to it.
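The slot/task relationship above can be sketched with a plain Python thread pool. This is only an analogy, not Spark code: the pool plays the role of an executor with 2 slots, and each function call plays the role of a task processing one partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Analogy only (not Spark itself): an "executor" with 2 slots
# can run at most 2 "tasks" (partitions) at the same time.
def process_partition(partition_id):
    return f"partition-{partition_id} done"

# 4 partitions but only 2 slots -> the tasks run in two waves of 2.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(process_partition, range(4)))

print(results)
```

With 4 partitions and 2 slots, two tasks run concurrently while the other two wait, which is exactly why the number of cores per executor caps its parallelism.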
3. What does setMaster(local[*]) mean in Spark?
It is used when you are creating a SparkSession or SparkConf object.
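For instance, in PySpark (assuming it is installed; the application name below is a hypothetical placeholder), setMaster() is typically called on a SparkConf that is then passed to the session builder. This is a minimal configuration sketch, not a full application:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Configuration sketch: run Spark locally with as many worker
# threads as there are logical cores ("my-app" is a hypothetical name).
conf = SparkConf().setMaster("local[*]").setAppName("my-app")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```

Replace "local[*]" with a cluster master URL when submitting to a real cluster.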
setMaster() denotes where to run your Spark application: locally or on a cluster. When you run on a cluster, you need to specify the address of the Spark master (the driver URL of the distributed cluster). We usually assign the value local[*] to setMaster() while doing internal testing, for example when running a Spark application on your own system rather than on a cluster. setMaster(local[*]) runs Spark locally with as many worker threads as there are logical cores on your machine:
./bin/spark-shell --master local[*]
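The local master strings follow a simple pattern. As an illustration (a hypothetical helper that mirrors Spark's documented behavior, not Spark's own API), here is how local, local[N], and local[*] map to worker-thread counts:

```python
import os
import re

def local_threads(master: str) -> int:
    """Illustrative mapping of Spark local master strings to thread counts.

    "local"     -> 1 worker thread
    "local[N]"  -> N worker threads
    "local[*]"  -> as many worker threads as logical cores on the machine
    (Mirrors Spark's documented behavior; this is not Spark's own code.)
    """
    if master == "local":
        return 1
    m = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if not m:
        raise ValueError(f"not a local master string: {master}")
    return os.cpu_count() if m.group(1) == "*" else int(m.group(1))

print(local_threads("local"))     # 1
print(local_threads("local[4]"))  # 4
print(local_threads("local[*]"))  # e.g. 8 on a 4-core / 8-thread machine
```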
We can check the number of threads available on a Windows machine using Task Manager:
- Open Task Manager (press Ctrl+Shift+Esc)
- Select the Performance tab.
- Look for Cores and Logical Processors (Threads)
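The same number of logical processors can be read programmatically. In Python, os.cpu_count() reports the logical CPUs (hardware threads) visible to the OS, which is the count local[*] uses for its worker threads:

```python
import os

# Number of logical processors (hardware threads) visible to the OS.
# local[*] starts this many worker threads.
logical_cores = os.cpu_count()
print(logical_cores)  # e.g. 8 on a 4-core / 8-thread machine
```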
If you run the Spark application on a machine with 4 cores and 8 threads, as shown in the picture above, configuring setMaster(local[*]) will use all 8 threads available on the machine.
For a better picture, I can set up my config as:
- Number of executors (A) = 1 executor
- Number of cores per executor (B) = 2 cores (considering the driver has occupied 2 cores)
- Number of threads per executor (C) = 4 threads (2 * B)
- setMaster value = local[4]
This runs Spark locally with 4 worker threads (ideally, set this to the number of threads available on your machine).
In this article, I have explained what setting the setMaster() value to local[*] means in Spark. In Spark, setMaster() usually takes the address of the master as its value; if the setMaster() value is set to local[*], the master runs locally with all the threads available on the machine.