Sparkling Water can be launched with either the internal backend or the external backend on a Spark cluster. In the external backend, Sparkling Water runs against a separate, external H2O cluster, whereas in the internal backend, H2O is launched inside the Spark executors.
In this Sparkling Water tutorial, we will primarily focus on the external backend.
In external backend mode, the H2O cluster runs outside of the Spark cluster. This makes it more stable, as it does not go down when Spark executors are killed, and it provides higher availability of the H2O cluster. You can enable the external backend using the configuration below.
1. External Backend Configuration
1.1 Using Property
The internal backend is the default behavior. You can switch to the external backend by setting the property spark.ext.h2o.backend.cluster.mode to external.
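For example, you can pass this property on the command line when launching sparkling-shell (the same flag is used again in section 3.1 below):
./bin/sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=external"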
1.2 Using H2OConf
If you want to set this property and control it programmatically, you can do so by using the setExternalClusterMode() method from the H2OConf class. H2OConf is a wrapper around SparkConf; it inherits all of its properties and adds additional properties related to the H2O cluster and Sparkling Water.
import org.apache.spark.h2o._

val h2OConf = new H2OConf(spark).setExternalClusterMode()
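The resulting H2OConf is then passed to H2OContext.getOrCreate() to start or connect to the cluster, as shown in the sections below:
val hc = H2OContext.getOrCreate(spark, h2OConf)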
2. Downloading the Extended Jar
In order to run Sparkling Water jobs on the external H2O cluster, you need to download the extended jar for the Hadoop version you are using. This jar doesn't come with the default distribution. Run the command below without an argument to list the Hadoop versions supported by your Sparkling Water version.
ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4/bin$ ./get-extended-h2o.sh
Download extended H2O driver for Kluster mode.
./get-extended-h2o.sh
Parameters:
HADOOP_VERSION - Hadoop version (e.g., hdp2.1) or "standalone" - see list below
Hadoop distributions supported by H2O:
standalone cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1 iop4.2
3. Running H2O Cluster on Hadoop
Select the Hadoop version you are using and run get-extended-h2o.sh with that version as an argument. I am running Hadoop 3.1, hence I will be using "hdp3.1". This downloads the h2odriver-sw3.28.0-hdp3.1-extended.jar file to your current directory.
ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4/bin$ ./get-extended-h2o.sh hdp3.1
External backend supports two deployment strategies.
- Automatic
- Manual
3.1 Automatic
Automatic mode is the easiest way to deploy the external backend cluster, and it is the recommended approach for production. In automatic mode, the H2O cluster is started automatically in the YARN environment when you call H2OContext.getOrCreate(). When the H2O cluster is started on YARN, it is started as a MapReduce job, and it uses the flat-file approach for the nodes to form a cloud.
As a prerequisite, launch sparkling-shell with the external backend mode enabled:
export MASTER="yarn"
./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=external"
Then create the H2OConf object, specifying useAutoClusterStart() and setH2ODriverPath():
scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._
scala> val conf = new H2OConf(sc).setExternalClusterMode().useAutoClusterStart().setH2ODriverPath("/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar").setClusterSize(2).setMapperXmx("2G").setYARNQueue("default")
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
backend cluster mode : external
cluster start mode : auto
cloudName : Not set yet
cloud representative : Not set, using cloud name only
clientBasePort : 54321
h2oClientLog : INFO
nthreads : -1
val hc = H2OContext.getOrCreate(spark, conf)
3.2 Manual
As the name suggests, in manual mode we need to start the H2O cluster manually before connecting to it. To start it, run the Sparkling Water extended jar with the hadoop command. Let's see it in action.
export H2O_EXTENDED_JAR=/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar
hadoop jar $H2O_EXTENDED_JAR -sw_ext_backend -jobname test -nodes 2 -mapperXmx 2g
By running the above commands, you will start an H2O cluster with 2 nodes. Here, -jobname sets the H2O cloud name (referenced later from setCloudName()), -nodes sets the number of H2O nodes, and -mapperXmx sets the memory per node.

Now, open H2O Flow in your web browser at http://192.168.1.137:54321 and check the status of the cluster. It should show the 2 nodes (in my case, both nodes are started on the same server, hence you see the same IP with different ports).

Connect to the H2O cloud
Open another terminal, start ./sparkling-shell, and run the commands below.
import org.apache.spark.h2o._
val conf = new H2OConf(spark)
.setExternalClusterMode()
.useManualClusterStart()
.setH2OCluster("192.168.1.137", 54321)
.setClusterSize(2)
.setCloudName("test")
val hc = H2OContext.getOrCreate(spark, conf)
scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._
scala> val conf = new H2OConf(spark).setExternalClusterMode().useManualClusterStart().setH2OCluster("192.168.1.137", 54321).setClusterSize(2).setCloudName("test")
2020-02-22 23:27:11,325 WARN h2o.H2OConf: Using external cluster mode!
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
backend cluster mode : external
cluster start mode : manual
cloudName : test
cloud representative : 192.168.1.137:54321
clientBasePort : 54321
h2oClientLog : INFO
nthreads : -1
scala> val hc = H2OContext.getOrCreate(spark, conf)
2020-02-22 23:27:30,071 WARN external.ExternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
2020-02-22 23:27:30,074 WARN external.ExternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
2020-02-22 23:27:31,768 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
2020-02-22 23:27:31,769 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_gpu.so
2020-02-22 23:27:31,770 WARN java.NativeLibrary: Failed to load library from both native path and jar!
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_omp.so
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_omp.so
2020-02-22 23:27:31,776 WARN java.NativeLibrary: Failed to load library from both native path and jar!
hc: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
* Sparkling Water Version: 3.28.0.3-1-2.4
* H2O name: test
* cluster size: 2
* list of used nodes:
(executorId, host, port)
------------------------
(0,192.168.1.137,54321)
(1,192.168.1.137,54323)
------------------------
Open H2O Flow in browser: http://192.168.1.101:54321 (CMD + click in Mac OSX)
scala>
This connects to the H2O cloud running on 192.168.1.137 at port 54321.
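Once connected, you can use the H2OContext to move data between Spark and the external H2O cluster. Below is a minimal sketch, assuming a SparkSession named spark is in scope; asH2OFrame() and asDataFrame() are the Sparkling Water conversion methods between the two frame types.
// Create a small Spark DataFrame
val df = spark.range(10).toDF("id")
// Send it to the external H2O cluster as an H2OFrame
val h2oFrame = hc.asH2OFrame(df)
// Bring it back as a Spark DataFrame
val df2 = hc.asDataFrame(h2oFrame)
df2.show()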
4. Running the H2O Cluster in Standalone Mode
If you do not want to run on Hadoop and instead want to run H2O in standalone mode for testing or POCs, download the standalone extended jar.
./bin/get-extended-h2o.sh standalone
This downloads h2odriver-sw3.28.0-extended.jar into your current directory.
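Starting the standalone cluster is similar to the manual mode on Hadoop, except the nodes are started as plain Java processes instead of a Hadoop job. Below is a minimal sketch, assuming the standalone extended jar accepts the standard h2o.jar options (-name, -port, -flatfile) and that both nodes run on 192.168.1.137; adjust for your environment.
# Assumption: flatfile.txt lists every node of the cluster, one ip:port per line:
#   192.168.1.137:54321
#   192.168.1.137:54323
java -jar h2odriver-sw3.28.0-extended.jar -name test -port 54321 -flatfile flatfile.txt &
java -jar h2odriver-sw3.28.0-extended.jar -name test -port 54323 -flatfile flatfile.txt &
You can then connect from sparkling-shell exactly as in section 3.2, using useManualClusterStart() and setH2OCluster().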
Conclusion
In this article, you learned the difference between the internal and external backend modes, how to start the H2O cloud automatically and manually, and finally how to run it in standalone mode.
Happy Learning !!