In external backend mode, the H2O cluster runs outside the Spark application. This makes the cluster more stable, because it does not go down when Spark executors are killed, and it gives the H2O cluster high availability. You can enable the external backend with the configuration below.
1. Downloading External Jar
To run Sparkling Water jobs on an external H2O cluster, you need to download the extended H2O jar for the Hadoop version you are using. This jar does not come with the default distribution.
First, run the command below without arguments to list the Hadoop versions supported by your Sparkling Water version.
ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4/bin$ ./get-extended-h2o.sh
Download extended H2O driver for Kluster mode.
./get-extended-h2o.sh
Parameters:
HADOOP_VERSION - Hadoop version (e.g., hdp2.1) or "standalone" - see list below
Hadoop distributions supported by H2O:
standalone cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1 iop4.2
2. Running H2O Cluster on Hadoop
Prerequisite:
Select the Hadoop version you are using and run get-extended-h2o.sh with that version as an argument. I am running Hadoop 3.1, so I will use “hdp3.1”. This downloads the h2odriver-sw3.28.0-hdp3.1-extended.jar file to your current directory.
ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4/bin$ ./get-extended-h2o.sh hdp3.1
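Before downloading, it can help to check that the distribution name you pass is one the script actually lists. The following is a minimal sketch, assuming the list printed by get-extended-h2o.sh above; the function names is_supported_hadoop and download_extended_jar are hypothetical helpers, not part of Sparkling Water.

```shell
# Distribution names copied from the get-extended-h2o.sh usage output shown above.
SUPPORTED_HADOOP="standalone cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1 iop4.2"

# Returns 0 if the given name is in the supported list, 1 otherwise.
is_supported_hadoop() {
  wanted="$1"
  for v in $SUPPORTED_HADOOP; do
    if [ "$v" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical wrapper: call the download script only for a known name.
# Must be run from the Sparkling Water bin directory.
download_extended_jar() {
  if is_supported_hadoop "$1"; then
    ./get-extended-h2o.sh "$1"
  else
    echo "Unsupported Hadoop distribution: $1" >&2
    return 1
  fi
}
```

With these functions loaded, download_extended_jar hdp3.1 would run the same download as the command above, while a mistyped name fails early with a message instead of reaching the script.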
The external backend supports two deployment strategies:
- Automatic
- Manual
2.1 Automatic
Automatic mode is the easiest way to deploy an external backend cluster and is the recommended approach for production. In automatic mode, the H2O cluster is started automatically in the YARN environment when you call H2OContext.getOrCreate(). When the H2O cluster is started on YARN, it runs as a MapReduce job and uses the flat-file approach for the nodes to cloud up.
export MASTER="yarn"
./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=external"
Then create an H2OConf object, specifying useAutoClusterStart() and setH2ODriverPath().
scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._
scala> val conf = new H2OConf(sc).setExternalClusterMode().useAutoClusterStart().setH2ODriverPath("/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar").setClusterSize(2).setMapperXmx("2G").setYARNQueue("default")
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
backend cluster mode : external
cluster start mode : auto
cloudName : Not set yet
cloud representative : Not set, using cloud name only
clientBasePort : 54321
h2oClientLog : INFO
nthreads : -1
val hc = H2OContext.getOrCreate(spark, conf)
2.2 Manual
As the name suggests, in manual mode we need to start the H2O cluster manually before connecting to it. To start it, run the Sparkling Water extended jar with the “hadoop” command. Let’s see it in action.
export H2O_EXTENDED_JAR=/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar
hadoop jar $H2O_EXTENDED_JAR -sw_ext_backend -jobname test -nodes 2 -mapperXmx 2g
By running the above commands, you will start an H2O cluster with 2 nodes.
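The start command can be wrapped in a small function so that the jar path, job name, node count, and memory are passed as arguments. This is only a sketch: start_h2o_cluster and the DRY_RUN variable are hypothetical names introduced here, and the flags are exactly the ones used in the command above.

```shell
# Starts an external H2O cluster with the flags shown above.
# Arguments: extended jar path, job name, number of nodes, memory per node.
# With DRY_RUN=1 the command line is printed instead of being executed.
start_h2o_cluster() {
  jar="$1"
  jobname="$2"
  nodes="$3"
  xmx="$4"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hadoop jar $jar -sw_ext_backend -jobname $jobname -nodes $nodes -mapperXmx $xmx"
    return 0
  fi

  # Fail early if the extended jar has not been downloaded yet.
  if [ ! -f "$jar" ]; then
    echo "Extended jar not found: $jar" >&2
    return 1
  fi

  hadoop jar "$jar" -sw_ext_backend -jobname "$jobname" -nodes "$nodes" -mapperXmx "$xmx"
}
```

For example, DRY_RUN=1 start_h2o_cluster "$H2O_EXTENDED_JAR" test 2 2g prints the command line without contacting Hadoop, which is a cheap way to confirm the arguments before launching.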

Now open H2O Flow in your web browser at http://192.168.1.137:54321 and check the status of the cluster. It should show 2 nodes (in my case both nodes are started on the same server, so you see the same IP with different ports).
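If you prefer the command line to the browser, the node count can also be read from the cluster's REST API. The sketch below assumes that the H2O endpoint /3/Cloud is reachable on the same host and port as Flow and that its JSON response contains a numeric "cloud_size" field; check both against your H2O version. The function names are hypothetical helpers.

```shell
# Reads a /3/Cloud JSON response on stdin and prints the value of "cloud_size".
cloud_size_from_json() {
  sed -n 's/.*"cloud_size"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeeds only if the cluster at host:port reports the expected node count.
# Arguments: host, port, expected number of nodes.
check_h2o_cluster() {
  size=$(curl -s "http://$1:$2/3/Cloud" | cloud_size_from_json)
  [ "$size" = "$3" ]
}
```

With the cluster from this article, check_h2o_cluster 192.168.1.137 54321 2 should succeed once both nodes have formed the cloud.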

Connect to the H2O cloud (as a client)
Open another terminal, start ./sparkling-shell, and run the commands below.
import org.apache.spark.h2o._
val conf = new H2OConf(spark)
.setExternalClusterMode()
.useManualClusterStart()
.setH2OCluster("192.168.1.137", 54321)
.setClusterSize(2)
.setCloudName("test")
val hc = H2OContext.getOrCreate(spark, conf)
scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._
scala> val conf = new H2OConf(spark).setExternalClusterMode().useManualClusterStart().setH2OCluster("192.168.1.137", 54321).setClusterSize(2).setCloudName("test")
2020-02-22 23:27:11,325 WARN h2o.H2OConf: Using external cluster mode!
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
backend cluster mode : external
cluster start mode : manual
cloudName : test
cloud representative : 192.168.1.137:54321
clientBasePort : 54321
h2oClientLog : INFO
nthreads : -1
scala> val hc = H2OContext.getOrCreate(spark, conf)
2020-02-22 23:27:30,071 WARN external.ExternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
2020-02-22 23:27:30,074 WARN external.ExternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
2020-02-22 23:27:31,768 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
2020-02-22 23:27:31,769 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_gpu.so
2020-02-22 23:27:31,770 WARN java.NativeLibrary: Failed to load library from both native path and jar!
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_omp.so
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_omp.so
2020-02-22 23:27:31,776 WARN java.NativeLibrary: Failed to load library from both native path and jar!
hc: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
* Sparkling Water Version: 3.28.0.3-1-2.4
* H2O name: test
* cluster size: 2
* list of used nodes:
(executorId, host, port)
------------------------
(0,192.168.1.137,54321)
(1,192.168.1.137,54323)
------------------------
Open H2O Flow in browser: http://192.168.1.101:54321 (CMD + click in Mac OSX)
scala>
This connects to the H2O cloud running on 192.168.1.137 at port 54321.
Conclusion
In this article, you have learned how to start the H2O cloud manually and automatically on Hadoop and YARN, and how to connect to the cloud externally by using the cloud IP and port.
Happy Learning !!