Sparkling Water can be launched with either an internal or an external backend on a Spark cluster. In the external backend, Sparkling Water connects to a separately launched external H2O cluster, whereas in the internal backend, H2O is launched inside the Spark executors.
In this Sparkling Water tutorial, we will primarily focus on using the external backend.
In external backend mode, the H2O cluster runs outside of the Spark cluster. This provides more stability, as the H2O cluster does not go down when Spark executors are killed, giving higher availability. You can enable the external backend by using the configuration below.
1. External Backend Configuration
1.1 Using Property
The internal backend is the default behavior; you can change it by setting the property spark.ext.h2o.backend.cluster.mode to external.
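As a sketch (assuming the sparkling-shell script from your Sparkling Water distribution's bin directory, as used later in this tutorial), the property can be passed on the command line when launching the shell:

```shell
# Launch the Sparkling Water shell with the external backend enabled
# (the internal backend is used when this property is not set).
./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=external"
```

The same --conf flag works with spark-submit for non-interactive jobs.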
1.2 Using H2OConf
If you want to set this property and control it programmatically, you can use the
setExternalClusterMode() method of the H2OConf class. H2OConf is a wrapper over
SparkConf; it inherits all of its properties and provides additional properties related to the H2O cluster and Sparkling Water.
val h2OConf = new H2OConf(spark).setExternalClusterMode()
2. Downloading the Extended Jar
In order to run Sparkling Water jobs on an external H2O cluster, you need to download the extended jar for the Hadoop version you are using. This jar does not come with the default distribution. Run the below command without arguments to list the Hadoop versions supported by your Sparkling Water version.
~/sparkling-water-3.28.0.1-1-2.4/bin$ ./get-extended-h2o.sh
Download extended H2O driver for Kluster mode.
./get-extended-h2o.sh
Parameters:
  HADOOP_VERSION - Hadoop version (e.g., hdp2.1) or "standalone" - see list below
Hadoop distributions supported by H2O:
  standalone
  cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16
  cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0
  hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1
  mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1
  iop4.2
3. Running H2O Cluster on Hadoop
Select the Hadoop version you are using and run
get-extended-h2o.sh with that version as an argument. I am running Hadoop (HDP) 3.1, hence I will use "hdp3.1". This downloads the
h2odriver-sw3.28.0-hdp3.1-extended.jar file to your current directory.
~/sparkling-water-3.28.0.1-1-2.4/bin$ ./get-extended-h2o.sh hdp3.1
The external backend supports two deployment strategies: automatic and manual.
Automatic mode is the easiest way to deploy the external backend cluster, and it is the recommended approach for production. In automatic mode, the H2O cluster is started automatically in the YARN environment when you call
H2OContext.getOrCreate(). When the H2O cluster is started on YARN, it runs as a map-reduce job and uses the flat-file approach for the nodes to form the cloud.
export MASTER="yarn"
./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=external"
Then create an H2OConf object, specifying automatic cluster start, the path to the extended jar, the cluster size, and the YARN settings.
scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val conf = new H2OConf(sc).setExternalClusterMode()
     |   .useAutoClusterStart()
     |   .setH2ODriverPath("/home/ubuntu/sparkling-water-3.28.0.1-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar")
     |   .setClusterSize(2)
     |   .setMapperXmx("2G")
     |   .setYARNQueue("default")
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : external
  cluster start mode   : auto
  cloudName            : Not set yet
  cloud representative : Not set, using cloud name only
  clientBasePort       : 54321
  h2oClientLog         : INFO
  nthreads             : -1
val hc = H2OContext.getOrCreate(spark, conf)
As the name suggests, in manual mode we need to start the H2O cluster manually before connecting to it. To start it, run the Sparkling Water extended jar with the "hadoop" command. Let's see it in action.
export H2O_EXTENDED_JAR=/home/ubuntu/sparkling-water-3.28.0.1-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar
hadoop jar $H2O_EXTENDED_JAR -sw_ext_backend -jobname test -nodes 2 -mapperXmx 2g
By running the above commands, you will start an H2O cluster with 2 nodes.
Now, open H2O Flow in your web browser at http://192.168.1.137:54321 and check the status of the cluster. It should show the 2 nodes (in my case, both nodes were started on the same server, hence you see the same IP but with different ports).
Connect to the H2O cloud
Open another terminal, start
./sparkling-shell, and run the below commands.
import org.apache.spark.h2o._

val conf = new H2OConf(spark)
  .setExternalClusterMode()
  .useManualClusterStart()
  .setH2OCluster("192.168.1.137", 54321)
  .setClusterSize(2)
  .setCloudName("test")

val hc = H2OContext.getOrCreate(spark, conf)
scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val conf = new H2OConf(spark).setExternalClusterMode()
     |   .useManualClusterStart()
     |   .setH2OCluster("192.168.1.137", 54321)
     |   .setClusterSize(2)
     |   .setCloudName("test")
2020-02-22 23:27:11,325 WARN h2o.H2OConf: Using external cluster mode!
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : external
  cluster start mode   : manual
  cloudName            : test
  cloud representative : 192.168.1.137:54321
  clientBasePort       : 54321
  h2oClientLog         : INFO
  nthreads             : -1

scala> val hc = H2OContext.getOrCreate(spark, conf)
2020-02-22 23:27:30,071 WARN external.ExternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins, we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
2020-02-22 23:27:30,074 WARN external.ExternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
2020-02-22 23:27:31,768 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
2020-02-22 23:27:31,769 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_gpu.so
2020-02-22 23:27:31,770 WARN java.NativeLibrary: Failed to load library from both native path and jar!
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_omp.so
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_omp.so
2020-02-22 23:27:31,776 WARN java.NativeLibrary: Failed to load library from both native path and jar!
hc: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.1-1-2.4
 * H2O name: test
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (0,192.168.1.137,54321)
  (1,192.168.1.137,54323)
  ------------------------

  Open H2O Flow in browser: http://192.168.1.101:54321 (CMD + click in Mac OSX)
This connects to the H2O cloud running on 192.168.1.137 at port 54321.
4. Running H2O Cluster on Standalone
If you do not want to run on Hadoop and instead want to run H2O in standalone mode for testing or a POC, download the standalone extended jar by running get-extended-h2o.sh with the "standalone" argument. This downloads
h2odriver-sw3.28.0-extended.jar into your current directory.
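Assuming the same get-extended-h2o.sh script used earlier, passing "standalone" in place of a Hadoop version fetches this jar:

```shell
# "standalone" appears in the script's list of supported distributions;
# this downloads h2odriver-sw3.28.0-extended.jar into the current directory.
./get-extended-h2o.sh standalone
```

From there, the manual-mode steps shown above apply, except the cluster nodes are started directly with java -jar instead of the hadoop command.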
In this article, you have learned the difference between the internal and external backend modes, how to start the H2O cloud both automatically and manually, and finally how to start it in standalone mode.
Happy Learning !!