A cluster in Databricks is a group of virtual machines configured with Spark/PySpark. It combines the computation resources and configuration on which your application runs. Put simply, the cluster executes all of your Databricks code.
Workloads that a Databricks cluster can run include ETL pipelines, machine learning models, streaming, batch analytics, and ad-hoc analytics.
1. Types of Clusters in Databricks
There are two main types of clusters in Databricks:
- Interactive/All-Purpose Clusters: These are mainly used to analyze data interactively in Databricks notebooks. We can create them using the Databricks UI, CLI, or REST API, and we can manually stop and restart them. Multiple users can share these clusters for collaborative interactive analysis.
- Job Clusters: The Databricks job scheduler creates these clusters when we run a job on a new job cluster, and terminates them once the job ends. They are mainly used for running fast and robust automated tasks. These clusters cannot be restarted.
Under the Compute section in the left panel of Databricks, we can see the All-Purpose Clusters option and the status of job compute.
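The difference between the two cluster types also shows up in how they are requested through the REST API. The sketch below only builds the request bodies and does not call any endpoint. The field names follow the Clusters API (`clusters/create`) and Jobs API (`jobs/create`) as I understand them. The runtime version, node type, job name, and notebook path are placeholder assumptions that you would replace with values valid in your workspace.

```python
import json

# Placeholder values (assumptions): choose a runtime and node type
# that actually exist in your workspace and cloud.
SPARK_VERSION = "13.3.x-scala2.12"
NODE_TYPE_ID = "i3.xlarge"


def all_purpose_cluster_request(name, num_workers=2):
    """Body for creating a named, restartable all-purpose cluster."""
    return {
        "cluster_name": name,
        "spark_version": SPARK_VERSION,
        "node_type_id": NODE_TYPE_ID,
        "num_workers": num_workers,
        "autotermination_minutes": 120,
    }


def job_request(job_name, notebook_path, num_workers=2):
    """Body for a job whose single task runs on a new job cluster.

    The cluster is described inline under `new_cluster`; it is created
    for the run and terminated when the run ends.
    """
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": {
                    "spark_version": SPARK_VERSION,
                    "node_type_id": NODE_TYPE_ID,
                    "num_workers": num_workers,
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(all_purpose_cluster_request("adhoc-analysis"), indent=2))
    print(json.dumps(job_request("nightly-etl", "/Shared/etl_notebook"), indent=2))
```

Note that the job cluster has no name of its own and no auto-termination setting, since its lifetime is tied to the job run.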
2. Modes of Databricks Clusters
Based on cluster usage, Databricks supports three cluster modes:
- Standard clusters: These are now called No Isolation Shared access mode clusters.
- High Concurrency clusters: These are now called Shared access mode clusters.
- Single Node clusters.
Keep the following points in mind:
- Once a cluster is created, we cannot change the cluster mode. If we want a different cluster mode, then we must create a new cluster.
- Standard and Single Node clusters terminate automatically after 120 minutes by default.
- High Concurrency clusters do not terminate automatically by default.
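The rules above can be modeled in a few lines of Python. This is only an illustrative sketch: `ClusterSettings`, `new_cluster_settings`, and the mode keys are hypothetical names invented for this example, not part of any Databricks SDK. The frozen dataclass mirrors the fact that the mode cannot be changed after creation, and the lookup table encodes the default auto-termination behavior described in the list.

```python
from dataclasses import dataclass
from typing import Optional

# Default auto-termination per mode, as described in the text.
# None means the cluster does not terminate automatically by default.
DEFAULT_AUTOTERMINATION_MINUTES = {
    "standard": 120,
    "single_node": 120,
    "high_concurrency": None,
}


@dataclass(frozen=True)  # frozen: the mode cannot be reassigned later
class ClusterSettings:
    mode: str
    autotermination_minutes: Optional[int]


def new_cluster_settings(mode, autotermination_minutes=None):
    """Build settings for a new cluster, applying the mode's default timeout."""
    if mode not in DEFAULT_AUTOTERMINATION_MINUTES:
        raise ValueError(f"unknown cluster mode: {mode!r}")
    if autotermination_minutes is None:
        autotermination_minutes = DEFAULT_AUTOTERMINATION_MINUTES[mode]
    return ClusterSettings(mode, autotermination_minutes)


if __name__ == "__main__":
    standard = new_cluster_settings("standard")
    print(standard)
    # To "change" the mode we create new settings instead of mutating:
    shared = new_cluster_settings("high_concurrency")
    print(shared)
```

Trying to assign `standard.mode = "single_node"` raises an error, which matches the requirement to create a new cluster when a different mode is needed.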
2.1. Standard Mode Databricks Cluster
Standard cluster mode is also called No Isolation Shared, which means these clusters can be shared by multiple users with no isolation between the users.
Standard mode is recommended for single users. Standard clusters can run workloads written in Python, SQL, R, and Scala.
2.2. High Concurrency Mode Databricks Cluster
High Concurrency clusters are fully managed cloud resources. They share VMs across the network to provide fine-grained sharing for maximum resource utilization and minimum query latencies.
In addition to performance gains, High Concurrency clusters let us use table access control, which is not supported on Standard clusters. These clusters support workloads in Python, SQL, and R. Scala is not supported, because High Concurrency mode achieves its performance and security by running user code in separate processes, which is not possible in Scala.
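The language support described for Standard and High Concurrency modes can be captured as a small lookup. The helper names below (`SUPPORTED_LANGUAGES`, `can_run`, `high_concurrency_overrides`) are hypothetical and exist only for this sketch. The `spark_conf` keys and the `ResourceClass` tag are the settings I recall from the legacy Clusters API examples for High Concurrency clusters. Treat them as an assumption and check them against the documentation for your workspace.

```python
# Languages each mode can run, as stated in the text.
SUPPORTED_LANGUAGES = {
    "standard": {"python", "sql", "r", "scala"},
    "high_concurrency": {"python", "sql", "r"},  # no Scala
}


def can_run(mode, language):
    """Return True if the given cluster mode supports the language."""
    return language.lower() in SUPPORTED_LANGUAGES[mode]


def high_concurrency_overrides():
    """Extra cluster-spec fields for a legacy High Concurrency cluster.

    Assumption: these keys follow the legacy Clusters API examples.
    """
    return {
        "spark_conf": {
            "spark.databricks.cluster.profile": "serverless",
            "spark.databricks.repl.allowedLanguages": "sql,python,r",
        },
        "custom_tags": {"ResourceClass": "Serverless"},
    }


if __name__ == "__main__":
    for lang in ("Python", "SQL", "R", "Scala"):
        print(lang, can_run("standard", lang), can_run("high_concurrency", lang))
```

Restricting the allowed languages in the cluster configuration is what keeps Scala out of a High Concurrency cluster in this sketch.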
2.3. Single Node Databricks Cluster
As the name suggests, a Single Node cluster has only one node, the driver. No worker nodes are available in this mode, so Spark jobs run on the driver node itself.
This mode is most useful for small-data analysis and single-node machine learning workloads that use Spark to load and save data.
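As a rough illustration, a Single Node cluster request sets the worker count to zero and tells Spark to run locally on the driver. The sketch below builds such a request body without sending it. The `spark_conf` keys and the `ResourceClass` tag follow the Databricks single-node examples as I understand them, while the function name, runtime version, and node type are placeholder assumptions.

```python
import json


def single_node_cluster_request(name):
    """Body for creating a Single Node cluster: a driver and no workers."""
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime (assumption)
        "node_type_id": "i3.xlarge",          # placeholder node type (assumption)
        "num_workers": 0,                     # no worker nodes
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",       # Spark runs on the driver itself
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
        "autotermination_minutes": 120,
    }


if __name__ == "__main__":
    print(json.dumps(single_node_cluster_request("small-data-ml"), indent=2))
```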
In this article, we have learned about the types of Databricks clusters and the cluster modes available. Each mode suits a different kind of usage. For production applications, High Concurrency mode is preferred for its performance and security. For other needs, Standard mode is sufficient when the data size is small.