Spark Deploy Modes – Client vs Cluster Explained

The difference between client and cluster deploy modes in Spark/PySpark is one of the most frequently asked interview questions. The Spark deploy mode (--deploy-mode) specifies where the driver program of your Spark application/job runs. Spark provides two deploy modes, client and cluster, and you can use either of them to run Java, Scala, and PySpark applications.

Using <strong>spark-submit --deploy-mode <client/cluster></strong>, you can specify where to run the Spark application driver program.

Spark/PySpark Deploy Modes

Value | Description
cluster | In cluster mode, the driver runs on one of the worker nodes, and this node shows as the driver on the Spark Web UI of your application. Cluster mode is used to run production jobs.
client | In client mode, the driver runs locally on the machine from which you submit your application with the spark-submit command. Client mode is mainly used for interactive and debugging purposes. Note that in client mode only the driver runs locally; all tasks run on the cluster worker nodes.
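The two values above map directly onto the command line. As a minimal sketch (the helper name build_spark_submit, the file my_job.py, and the 2g memory value are hypothetical, chosen only for illustration), the following Python function assembles a spark-submit argument list and rejects any deploy mode other than client or cluster. Its default of client mirrors what spark-submit itself does when the flag is omitted.

```python
import shlex

# The only two values accepted by --deploy-mode.
VALID_DEPLOY_MODES = ("client", "cluster")


def build_spark_submit(app_path, deploy_mode="client", driver_memory=None):
    """Return a spark-submit argument list for the given deploy mode.

    Hypothetical helper for illustration; it only builds the command,
    it does not run it.
    """
    if deploy_mode not in VALID_DEPLOY_MODES:
        raise ValueError(
            f"deploy_mode must be one of {VALID_DEPLOY_MODES}, got {deploy_mode!r}"
        )
    argv = ["spark-submit", "--deploy-mode", deploy_mode]
    if driver_memory is not None:
        argv += ["--driver-memory", driver_memory]
    argv.append(app_path)
    return argv


if __name__ == "__main__":
    # my_job.py and 2g are placeholder values, not taken from a real job.
    print(shlex.join(build_spark_submit("my_job.py", "cluster", "2g")))
```

Building the command as a list rather than a single string keeps each argument separate, which is the form subprocess.run expects if you later decide to launch it from Python.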

To find the deploy mode of a running or completed Spark application, open its Spark Web UI (for completed applications, through the Spark History Server UI) and check the spark.submit.deployMode property on the Environment tab.
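The same property can also be looked up programmatically. The sketch below assumes the environment endpoint of Spark's monitoring REST API (/api/v1/applications/[app-id]/environment) returns sparkProperties as a list of [name, value] pairs; verify that shape against your Spark version. The function name and the sample payload are made up for illustration.

```python
import json

DEPLOY_MODE_KEY = "spark.submit.deployMode"


def deploy_mode_from_environment(environment):
    """Return the deploy mode from a parsed environment response.

    Assumes 'sparkProperties' is a list of [name, value] pairs.
    Returns None when the property is not present.
    """
    for name, value in environment.get("sparkProperties", []):
        if name == DEPLOY_MODE_KEY:
            return value
    return None


# Hypothetical sample payload, trimmed to the part this sketch reads.
SAMPLE_RESPONSE = """
{
  "sparkProperties": [
    ["spark.app.name", "example-app"],
    ["spark.submit.deployMode", "cluster"]
  ]
}
"""

if __name__ == "__main__":
    print(deploy_mode_from_environment(json.loads(SAMPLE_RESPONSE)))
```

In practice the JSON would come from an HTTP request to the History Server rather than from an inline string.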

Client Deploy Mode in Spark

In client mode, the Spark driver runs on the machine from which the job is submitted.

In a typical Cloudera cluster, you submit the Spark application from the edge node, so the Spark driver runs on the edge node.

In a Spark Standalone cluster, applications are commonly submitted from the master node, in which case the driver runs on that master node (a dedicated server) with dedicated resources.


spark-submit --deploy-mode client --driver-memory xxxx  ......
  • The default deployment mode is client mode.
  • In client mode, if the machine or user session running spark-submit terminates, your application also terminates with a failed status.
  • Pressing Ctrl-C after running the spark-submit command also terminates your application.
  • Client mode is generally not used for production jobs; it is mainly used for testing.
  • Driver logs are accessible from the local machine itself.

Note: Network Overhead – Because data has to move across the network between the driver (on the submitting machine) and the worker nodes in the cluster, you may notice performance degradation depending on the network latency.

Cluster Deploy Mode in Spark

In cluster deploy mode, the driver program is launched on one of the available nodes in the Spark cluster. Cluster mode is mostly used for large data sets where the job takes minutes or hours to complete.


spark-submit --deploy-mode cluster --driver-memory xxxx  ........
  • Terminating the current session doesn’t terminate the application; it keeps running on the cluster. You can get the status of the Spark application by running <strong>spark-submit --status [submission ID]</strong>.
  • Since the Spark driver runs on one of the worker nodes within the cluster, data movement overhead between the submitting machine and the cluster is reduced.
  • For a Cloudera cluster, use yarn commands to access the driver logs.
  • In this mode, the chance of a network disconnection between the driver and the Spark infrastructure is lower because they reside in the same infrastructure (cluster), which greatly reduces the chance of job failure.
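The status and log commands mentioned above can be sketched as small Python helpers that only build the argument lists. This is a sketch under assumptions: the function names and the sample IDs are hypothetical, spark-submit --status applies to cluster-mode submissions on Spark Standalone (with the master URL passed through --master), and yarn logs -applicationId is the YARN command for fetching aggregated logs, which include the driver's log in cluster mode.

```python
import shlex


def status_command(master_url, submission_id):
    """Build the spark-submit call that queries a cluster-mode submission.

    Hypothetical helper; assumes a Spark Standalone master URL.
    """
    return ["spark-submit", "--master", master_url, "--status", submission_id]


def yarn_logs_command(application_id):
    """Build the yarn call that fetches aggregated application logs."""
    return ["yarn", "logs", "-applicationId", application_id]


if __name__ == "__main__":
    # The master URL and both IDs below are placeholders for illustration.
    print(shlex.join(status_command("spark://master-host:7077",
                                    "driver-20240101000000-0000")))
    print(shlex.join(yarn_logs_command("application_1700000000000_0001")))
```

Either list can be passed to subprocess.run on a machine where the corresponding CLI is installed and configured for the cluster.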

Hope you liked the above explanation of the differences between the Spark/PySpark cluster and client deploy modes!

Conclusion

In this article, you have learned the difference between Spark/PySpark client and cluster modes: in client mode, Spark runs the driver on the local machine from which you submit the application, and in cluster mode, it runs the driver on one of the nodes in the cluster.

Happy Learning !!

NNK
