Spark Deploy Modes - Client vs Cluster Explained


Post author: Naveen Nelamali
Post category: Apache Spark
Post last modified: April 30, 2024


The difference between client and cluster deploy modes in Spark/PySpark is one of the most frequently asked Spark interview questions. The deploy mode (--deploy-mode) specifies where the driver program of your Spark application/job runs. Spark provides two deploy modes, client and cluster, and you can use either one to run Java, Scala, and PySpark applications.
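
As a rough illustration of how the two modes differ only in one flag at submission time, here is a minimal Python sketch that assembles a spark-submit command line. The helper name build_spark_submit, the application file my_app.py, and the default values "yarn" and "2g" are assumptions made for this example; only the --master, --deploy-mode, and --driver-memory options come from spark-submit itself.

```python
from typing import List

VALID_DEPLOY_MODES = ("client", "cluster")


def build_spark_submit(app: str, deploy_mode: str = "client",
                       master: str = "yarn",
                       driver_memory: str = "2g") -> List[str]:
    """Return a spark-submit argument list for the given deploy mode.

    This is a hypothetical helper: master and driver_memory defaults
    are placeholders, not recommendations.
    """
    if deploy_mode not in VALID_DEPLOY_MODES:
        raise ValueError(f"deploy_mode must be one of {VALID_DEPLOY_MODES}")
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--driver-memory", driver_memory,
        app,
    ]


if __name__ == "__main__":
    # Print the command line for each mode; nothing is actually submitted.
    for mode in VALID_DEPLOY_MODES:
        print(" ".join(build_spark_submit("my_app.py", deploy_mode=mode)))
```

The returned list could be handed to subprocess.run on a machine where spark-submit is installed; the sketch only builds the arguments so that it can be run anywhere.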

1. Client Deploy Mode in Spark

In client mode, the Spark driver runs on the machine from which the job is submitted.

In a typical Cloudera cluster, you submit the Spark application from the edge node, so the Spark driver runs on the edge node.

In a Spark Standalone cluster, applications are commonly submitted from the master node (a dedicated server with dedicated resources), in which case the driver runs on that node.


# client deployment mode usage
spark-submit --deploy-mode client --driver-memory xxxx  ......

The default deploy mode is client.
In client mode, if the machine or the user session running spark-submit terminates, your application also terminates with a failed status.
Pressing Ctrl-C after running the spark-submit command also terminates your application.
Client mode is generally used for interactive work and testing rather than for production jobs.
Driver logs are accessible on the local machine itself.

Note: Network overhead. Because data has to move across the network between the driver (on the submitting machine) and the worker nodes in the cluster, you may notice performance degradation depending on the network latency.
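
To confirm which mode an application was actually launched in, one option is to read the spark.submit.deployMode property from the application's configuration. The sketch below is a minimal example under that assumption: deploy_mode_of is a hypothetical helper, and sample_conf is a stand-in for the (key, value) pairs that spark.sparkContext.getConf().getAll() returns in a PySpark session.

```python
from typing import Iterable, Tuple


def deploy_mode_of(conf_pairs: Iterable[Tuple[str, str]]) -> str:
    """Return the deploy mode recorded in a Spark configuration.

    conf_pairs is an iterable of (key, value) pairs, for example the
    result of spark.sparkContext.getConf().getAll() in PySpark.
    Falls back to "client", the default mode, when the key is absent.
    """
    return dict(conf_pairs).get("spark.submit.deployMode", "client")


# Stand-in configuration so the sketch runs without a Spark installation.
sample_conf = [
    ("spark.app.name", "demo"),
    ("spark.submit.deployMode", "cluster"),
]
print(deploy_mode_of(sample_conf))  # cluster
print(deploy_mode_of([]))           # client
```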

2. Cluster Deploy Mode in Spark

In cluster deploy mode, the driver program is launched on one of the available nodes in the Spark cluster. Cluster mode is mostly used for large data sets, where the job takes minutes or hours to complete.


# cluster deployment mode usage
spark-submit --deploy-mode cluster --driver-memory xxxx  ........

Terminating the current session doesn’t terminate the application; it keeps running on the cluster. You can get the status of the Spark application by running spark-submit --status [submission ID] (this option applies to the Spark standalone and Mesos cluster managers; on YARN, use yarn application -status [application ID] instead).
Since the Spark driver runs on one of the worker nodes within the cluster, the data movement overhead between the submitting machine and the cluster is reduced.
On a Cloudera cluster, you should use yarn commands to access the driver logs.
In this mode, the chance of a network disconnection between the driver and the Spark infrastructure is lower. Because they reside in the same infrastructure (the cluster), the chance of job failure is greatly reduced.
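
Because the driver no longer runs in your terminal in cluster mode, checking status and reading driver logs takes separate commands. The sketch below collects them in two small Python helpers. The helper names, the application and submission IDs, and the master URL are all hypothetical placeholders; the underlying commands (yarn application -status, yarn logs -applicationId, and spark-submit --status) are the standard ones, and yarn logs assumes log aggregation is enabled on the cluster.

```python
from typing import List, Optional


def status_command(cluster_manager: str, app_id: str,
                   master_url: Optional[str] = None) -> List[str]:
    """Return the command that reports the status of a submitted application."""
    if cluster_manager == "yarn":
        return ["yarn", "application", "-status", app_id]
    if cluster_manager == "standalone":
        if master_url is None:
            raise ValueError("standalone mode needs the master URL")
        return ["spark-submit", "--master", master_url, "--status", app_id]
    raise ValueError(f"unsupported cluster manager: {cluster_manager}")


def driver_logs_command(app_id: str) -> List[str]:
    """YARN only: fetch the aggregated container logs, driver included."""
    return ["yarn", "logs", "-applicationId", app_id]


if __name__ == "__main__":
    # Placeholder IDs and URL; substitute the values your cluster reports.
    print(" ".join(status_command("yarn", "application_1700000000000_0001")))
    print(" ".join(status_command("standalone", "driver-20240430120000-0000",
                                  master_url="spark://master-host:6066")))
    print(" ".join(driver_logs_command("application_1700000000000_0001")))
```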

Hope you like the above explanation of Spark/PySpark Cluster and Client Deploy mode differences !!

Conclusion

In this article, you have learned the difference between Spark/PySpark client and cluster deploy modes. In client mode, Spark runs the driver on the machine from which the job is submitted; in cluster mode, it runs the driver on one of the nodes in the cluster.

Happy Learning !!

Reference

https://stackoverflow.com/questions/37420537/how-to-check-status-of-spark-applications-from-the-command-line/37420931
