What is the Spark driver in Apache Spark or PySpark? Apache Spark and PySpark use a master (driver) – slave (worker) architecture in which the Spark system runs on a group of machines known as a cluster. These machines coordinate with each other over the network to get the work done. In such a system, one machine governs the cluster, and that machine is the Spark driver.
In this article, let us discuss the Apache Spark architecture, what the driver does and manages, and the Spark/PySpark driver configurations.
1. Apache Spark Architecture
Apache Spark is an open-source framework for processing large amounts of structured, semi-structured, and unstructured data for analytics. It uses a single-master, multiple-slave architecture consisting of a single driver as the master, a cluster manager, and multiple worker nodes.
1.1 Spark Cluster Manager
The Spark cluster manager allocates the resources (workers) required for data processing. There are several cluster managers to choose from, such as Hadoop YARN, Apache Mesos, and the Standalone Scheduler.
The Standalone Scheduler is Spark's built-in cluster manager; it allocates resources from a cluster of machines that have only Spark installed, based on the request.
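Which cluster manager an application uses is chosen through the master setting when the application is configured. A minimal PySpark sketch, assuming PySpark is installed (the host and port values below are placeholders, not real endpoints):

```python
# Choosing a cluster manager via the master URL (assumes PySpark is installed;
# the host:port values are placeholders).
from pyspark import SparkConf

conf = SparkConf().setAppName("cluster-manager-demo")
conf.setMaster("yarn")                    # Hadoop YARN
# conf.setMaster("mesos://host:5050")     # Apache Mesos
# conf.setMaster("spark://host:7077")     # Standalone Scheduler
# conf.setMaster("local[4]")              # local mode, useful for testing
```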
1.2 Spark Driver Program
The Spark driver program is the one that creates the SparkContext object in the application. As soon as we submit a Spark job, the driver program runs the main() method of the application and internally builds a DAG representing the data flow. Based on this DAG, the driver requests the cluster manager to allocate the resources (workers) required for processing. Once the resources are allocated, the driver, using the SparkContext, sends the serialized code and data to the workers to execute as tasks and captures their results.
- Spark Context: It connects to the cluster manager through the driver to acquire the executors required for processing, then sends the serialized tasks to the workers to run.
- RDD: Resilient Distributed Datasets are partitioned collections of data items that can be stored in memory on the worker nodes.
- Directed Acyclic Graph (DAG): a graph that represents the sequence of computations to perform on the data.
Apache Spark processes data in the form of RDDs, using the data flow represented in a Directed Acyclic Graph (DAG).
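The lazy DAG behavior described above can be illustrated with a toy model in plain Python. This is not Spark's API, just a sketch of the idea: transformations only record lineage, and an action replays it.

```python
# A toy model (not Spark's API) of how transformations build up lineage
# lazily and only an action triggers the actual computation.
class ToyRDD:
    def __init__(self, data, lineage=None):
        self.data = data
        self.lineage = lineage or []      # recorded transformations, not yet run

    def map(self, fn):                    # transformation: recorded only
        return ToyRDD(self.data, self.lineage + [("map", fn)])

    def filter(self, pred):               # transformation: recorded only
        return ToyRDD(self.data, self.lineage + [("filter", pred)])

    def collect(self):                    # action: replay the recorded lineage
        out = self.data
        for op, fn in self.lineage:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(rdd.lineage))  # two steps recorded; nothing computed yet
print(rdd.collect())     # [20, 30, 40]
```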
1.3 Spark Executors:
Spark executors, or workers, are distributed across the cluster. Each executor has a number of cores that determine how many tasks it can process in parallel. Based on the cores available, executors pick up tasks from the driver, run your code's logic on the data, and keep the data in memory or on disk. They can read data from both internal and external storage systems.
2. What Does the Spark Driver Do?
As soon as we submit our application as a Spark job, the driver program is launched with its respective configuration. The driver program then runs the main() method of your application and creates a SparkContext. Based on your application logic, transformations and actions are created using the Spark context.
Until an action is called, all the transformations are recorded in the Spark context in the form of a DAG that builds up the RDD lineage. Once an action is called, a job is created with multiple tasks. Based on the tasks created, the driver requests the cluster manager to allocate the executors needed to process them.
Once the resources are allocated, the cluster manager launches the tasks on the worker nodes along with the application configuration; this is done with the help of a class called the task scheduler.
The driver keeps the metadata of the tasks shared with the executors. Once the executors complete the tasks, the results are sent back to the driver.
- The driver runs the main method of our application.
- The driver creates SparkContext and SparkSession.
- It converts the transformations and actions in your application code into tasks using the DAG.
- It helps create the DAG's execution plan, logical plan, and physical plan.
- The driver schedules tasks on the executors with the help of the cluster manager.
- The driver coordinates with the executors and keeps track of the data stored on them.
This is how the driver orchestrates the entire execution of a Spark job.
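The steps above can be sketched as a toy model in plain Python, where a thread pool stands in for the executors. This is only an illustration of the driver/executor split, not Spark's implementation:

```python
# Toy sketch of the driver/executor split (not Spark's implementation):
# the "driver" partitions the data into tasks, the "executors" (pool
# workers) run them in parallel, and the results return to the driver.
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # each task processes one partition of the data
    return sum(x * x for x in partition)

def driver_program(data, num_partitions=4):
    # driver side: split the job into tasks, one per partition
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        results = list(executors.map(run_task, partitions))
    # driver side: collect and combine the task results
    return sum(results)

print(driver_program(list(range(10))))  # 285 = 0^2 + 1^2 + ... + 9^2
```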
3. Spark Driver Configuration
As the Apache Spark driver is itself a JVM process running on a machine in the cluster, we can configure it based on our application's requirements. Let us look at some of the important driver configurations.
3.1 Number of Spark Driver cores:
A core is a unit of CPU computational power. spark.driver.cores sets the number of cores to use for the driver process, and it applies only in cluster mode.
The default value of spark.driver.cores is 1.
Driver cores must be fixed before the driver starts, so we set this property on the SparkSession builder (or with spark-submit); calling spark.conf.set() after the session is created has no effect on the driver, which is already running.
//Set number of cores for the spark driver at launch time
val spark = SparkSession.builder()
  .config("spark.driver.cores", "2")
  .getOrCreate()
3.2 Spark Driver maxResultSize:
This property defines the maximum total size of serialized results that the Spark driver can receive for each action (for example, collect()); jobs are aborted if the total size exceeds this limit.
The default value of spark.driver.maxResultSize is 1GB, the minimum allowed value is 1MB, and setting it to 0 means there is no upper limit.
Like the other driver properties, maxResultSize should be configured before the session is created, for example on the SparkSession builder.
//Set spark driver maxResultSize at launch time
val spark = SparkSession.builder()
  .config("spark.driver.maxResultSize", "8g")
  .getOrCreate()
3.3 Spark Driver Memory
The spark.driver.memory property sets the amount of memory allocated to the Spark driver process (its JVM heap). If the results pulled back to the driver, for example by a large collect(), exceed the available driver memory, the driver can fail with an out-of-memory error.
The default value for spark.driver.memory is 1GB.
Driver memory must be in place before the driver JVM starts; in client mode it cannot be set from application code at all, so use spark-submit's --driver-memory option, or set it on the builder before the session is created.
//Set spark driver memory at launch time
val spark = SparkSession.builder()
  .config("spark.driver.memory", "8g")
  .getOrCreate()
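For PySpark, the same driver properties can be collected on a SparkConf object before the session is created. A minimal sketch, assuming PySpark is installed and using illustrative (not recommended) values:

```python
# Illustrative launch-time driver configuration (assumes PySpark is
# installed; the values below are examples, not recommendations).
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("driver-config-demo")
        .set("spark.driver.cores", "2")           # used in cluster mode only
        .set("spark.driver.maxResultSize", "2g")  # cap on results sent to the driver
        .set("spark.driver.memory", "4g"))        # driver JVM heap size

# Creating the session with this conf starts the driver with these settings:
# spark = SparkSession.builder.config(conf=conf).getOrCreate()
```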
The Apache Spark or PySpark driver is the process that runs our application logic and coordinates the tasks. It is where the main() method of our Scala, Java, or Python program runs. It executes the user code and creates a SparkSession or SparkContext; the SparkSession is responsible for creating DataFrames, Datasets, and RDDs, executing SQL, and performing transformations and actions.