Running Sparkling Water as External Backend

Sparkling Water can be launched on a Spark cluster using either the internal backend or the external backend. In the external backend, Sparkling Water runs against a separately started, external H2O cluster, whereas in the internal backend, H2O is launched inside the Spark executors.


In this Sparkling Water tutorial, we will primarily focus on using the external backend.

In external backend mode, the H2O cluster runs outside of the Spark cluster. This gives the cluster more stability, as it doesn’t go down when Spark executors are killed, and provides high availability of the H2O cluster. You can set up the external backend by using the configuration below.

1. External Backend Configuration

1.1 Using a Property

The internal backend is the default behavior; you can switch to the external backend by setting the property spark.ext.h2o.backend.cluster.mode=external.
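
For example, this property can also be set on the SparkSession builder before Sparkling Water is used. The snippet below is a minimal sketch; the application name and the spark variable name are placeholders, not part of the original example.


import org.apache.spark.sql.SparkSession

// Build a SparkSession with the external backend property set up front.
// The appName value is a hypothetical placeholder.
val spark = SparkSession.builder()
  .appName("SparklingWaterExternalBackend")
  .config("spark.ext.h2o.backend.cluster.mode", "external")
  .getOrCreate()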

1.2 Using H2OConf

If you want to set this property and control it programmatically, you can do so by using the setExternalClusterMode() method from the H2OConf class. H2OConf is a wrapper around SparkConf; it inherits all Spark properties and provides additional properties related to the H2O cluster and Sparkling Water.


val h2OConf = new H2OConf(spark).setExternalClusterMode()

2. Downloading the Extended H2O Jar

In order to run Sparkling Water jobs on an external H2O cluster, you need to download the extended jar for the Hadoop version you are using. This jar doesn’t come with the default distribution. Run the below command without an argument to see the Hadoop versions supported by your Sparkling Water version.


ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4/bin$ ./get-extended-h2o.sh
Download extended H2O driver for Kluster mode.
 ./get-extended-h2o.sh 
 Parameters:
    HADOOP_VERSION - Hadoop version (e.g., hdp2.1) or "standalone" - see list below
 Hadoop distributions supported by H2O:
    standalone cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1 iop4.2

3. Running H2O Cluster on Hadoop

Select the Hadoop version you are using and run get-extended-h2o.sh with the Hadoop version as an argument. I am running Hadoop 3.1, hence I will be using “hdp3.1”. This downloads the h2odriver-sw3.28.0-hdp3.1-extended.jar file to your current directory.


ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4/bin$ ./get-extended-h2o.sh hdp3.1

The external backend supports two deployment strategies.

  • Automatic
  • Manual

3.1 Automatic

Prerequisite: a YARN cluster and the extended H2O driver jar downloaded in the previous step.

Automatic mode is the easiest way to deploy the external backend cluster and is the recommended approach for production. In automatic mode, the H2O cluster is started automatically in the YARN environment when you call H2OContext.getOrCreate(). When the H2O cluster is started on YARN, it is started as a map-reduce job, and it uses the flat-file approach for the nodes to form the cloud.


 export MASTER="yarn"
 ./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=external"

Then create an H2OConf object, specifying useAutoClusterStart() and setH2ODriverPath().


scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val conf = new H2OConf(sc).setExternalClusterMode().useAutoClusterStart().setH2ODriverPath("/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar").setClusterSize(2).setMapperXmx("2G").setYARNQueue("default")
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : external
  cluster start mode   : auto
  cloudName            : Not set yet
  cloud representative : Not set, using cloud name only
  clientBasePort       : 54321
  h2oClientLog         : INFO
  nthreads             : -1


val hc = H2OContext.getOrCreate(spark, conf)
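
Once the H2OContext is available, you can move data between Spark and the external H2O cluster. The snippet below is a small illustrative sketch; the toy DataFrame and its column names are made up for this example and are not part of the original article.


import spark.implicits._

// Hypothetical toy DataFrame, purely for illustration
val df = Seq((1, 1.5), (2, 3.0), (3, 4.5)).toDF("id", "value")

// Push the Spark DataFrame into the external H2O cluster as an H2OFrame
val h2oFrame = hc.asH2OFrame(df)

// Convert an H2OFrame back to a Spark DataFrame
val backToSpark = hc.asDataFrame(h2oFrame)
backToSpark.show()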

3.2 Manual

As the name suggests, in manual mode we need to start the H2O cluster manually before connecting to it. To start it, run the Sparkling Water extended jar with the “hadoop” command. Let’s see it in action.


export H2O_EXTENDED_JAR=/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar
hadoop jar $H2O_EXTENDED_JAR -sw_ext_backend -jobname test -nodes 2 -mapperXmx 2g

By running the above commands, you will start an H2O cluster with 2 nodes.

H2O cluster in Manual Backend mode

Now, open H2O Flow in your web browser at http://192.168.1.137:54321 and check the status of the cluster. It should show the 2 nodes (in my case both nodes are started on the same server, hence you see the same IP but with different ports).

H2O Flow cluster status

Connect to the H2O cloud

Open another terminal, start ./sparkling-shell, and run the below commands.


import org.apache.spark.h2o._
val conf = new H2OConf(spark)
            .setExternalClusterMode()
            .useManualClusterStart()
            .setH2OCluster("192.168.1.137", 54321)
            .setClusterSize(2)
            .setCloudName("test")
val hc = H2OContext.getOrCreate(spark, conf)

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val conf = new H2OConf(spark).setExternalClusterMode().useManualClusterStart().setH2OCluster("192.168.1.137", 54321).setClusterSize(2).setCloudName("test")
2020-02-22 23:27:11,325 WARN h2o.H2OConf: Using external cluster mode!
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : external
  cluster start mode   : manual
  cloudName            : test
  cloud representative : 192.168.1.137:54321
  clientBasePort       : 54321
  h2oClientLog         : INFO
  nthreads             : -1

scala> val hc = H2OContext.getOrCreate(spark, conf)
2020-02-22 23:27:30,071 WARN external.ExternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
2020-02-22 23:27:30,074 WARN external.ExternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
2020-02-22 23:27:31,768 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
2020-02-22 23:27:31,769 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_gpu.so
2020-02-22 23:27:31,770 WARN java.NativeLibrary: Failed to load library from both native path and jar!
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_omp.so
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_omp.so
2020-02-22 23:27:31,776 WARN java.NativeLibrary: Failed to load library from both native path and jar!
hc: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.3-1-2.4
 * H2O name: test
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (0,192.168.1.137,54321)
  (1,192.168.1.137,54323)
  ------------------------

  Open H2O Flow in browser: http://192.168.1.101:54321 (CMD + click in Mac OSX)



scala>

This connects to the H2O cloud running on 192.168.1.137 at port 54321.

4. Running H2O Cluster on Standalone

If you do not want to run on Hadoop and instead want to run H2O in standalone mode for testing or POCs, download the standalone extended jar.


./bin/get-extended-h2o.sh standalone 

This downloads h2odriver-sw3.28.0-extended.jar into your current directory.

export H2O_EXTENDED_JAR=/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-extended.jar
java -jar $H2O_EXTENDED_JAR -allow_clients -name test
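
Once the standalone cluster is up, you can connect to it from sparkling-shell the same way as in manual mode. The following is a hedged sketch that assumes the standalone cluster is reachable on localhost:54321 and was started with the cloud name “test”.


import org.apache.spark.h2o._

val conf = new H2OConf(spark)
            .setExternalClusterMode()
            .useManualClusterStart()
            .setH2OCluster("localhost", 54321)   // assumed address of the standalone cluster
            .setCloudName("test")
val hc = H2OContext.getOrCreate(spark, conf)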

Conclusion

In this article, you have learned the difference between the internal and external backend modes, how to start the H2O cloud both automatically and manually, and finally how to run it in standalone mode.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium