Start H2O Cluster on Hadoop (External Backend)

In external backend mode, the H2O cluster runs outside of the Spark application. This makes the cluster more stable, since it doesn't go down when Spark executors are killed, and provides higher availability of the H2O cluster. You can enable the external backend by using the configuration below.

1. Downloading the Extended H2O Jar

In order to run Sparkling Water jobs on an external H2O cluster, you need to download the extended H2O driver jar for the Hadoop version you are using. This jar doesn't come with the default distribution.

First, run the below command without any arguments to list the Hadoop versions supported by your Sparkling Water version.


ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4/bin$ ./get-extended-h2o.sh
Download extended H2O driver for Kluster mode.
 ./get-extended-h2o.sh 
 Parameters:
    HADOOP_VERSION - Hadoop version (e.g., hdp2.1) or "standalone" - see list below
 Hadoop distributions supported by H2O:
    standalone cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1 iop4.2

2. Running H2O Cluster on Hadoop

Prerequisite:

Select the Hadoop version you are using and run get-extended-h2o.sh with that version as an argument. I am running HDP 3.1, hence I will be using "hdp3.1". This downloads the h2odriver-sw3.28.0-hdp3.1-extended.jar file to your current directory.


ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4/bin$ ./get-extended-h2o.sh hdp3.1

The external backend supports two deployment strategies:

  • Automatic
  • Manual

2.1 Automatic

Automatic mode is the easiest way to deploy an external backend cluster and is the recommended approach for production. In automatic mode, the H2O cluster is started automatically in the YARN environment when you call H2OContext.getOrCreate(). When the H2O cluster is started on YARN, it runs as a map-reduce job and uses the flat-file approach for the nodes to form a cloud.


 export MASTER="yarn"
 ./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=external"

Then create an H2OConf object by specifying useAutoClusterStart() and setH2ODriverPath():


scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val conf = new H2OConf(sc).setExternalClusterMode().useAutoClusterStart().setH2ODriverPath("/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar").setClusterSize(2).setMapperXmx("2G").setYARNQueue("default")
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : external
  cluster start mode   : auto
  cloudName            : Not set yet
  cloud representative : Not set, using cloud name only
  clientBasePort       : 54321
  h2oClientLog         : INFO
  nthreads             : -1


val hc = H2OContext.getOrCreate(spark, conf)
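
Once getOrCreate() returns, the external H2O cluster is up and the H2OContext can be used like any other. Below is a minimal sketch of publishing Spark data to the cluster; the DataFrame contents and the frame name "demo_frame" are made up for illustration.


import spark.implicits._

// Hypothetical sample data, purely for illustration
val df = Seq((1.0, "a"), (2.0, "b"), (3.0, "c")).toDF("value", "label")

// Publish the Spark DataFrame as an H2OFrame on the external H2O cluster
val demoFrame = hc.asH2OFrame(df, "demo_frame")
println(demoFrame.numRows()) // expect 3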

2.2 Manual

As the name suggests, in manual mode we need to start the H2O cluster manually before connecting to it. To start it, run the Sparkling Water extended jar with the "hadoop" command. Let's see it in action.


export H2O_EXTENDED_JAR=/home/ubuntu/sparkling-water-3.28.0.3-1-2.4/bin/h2odriver-sw3.28.0-hdp3.1-extended.jar
hadoop jar $H2O_EXTENDED_JAR -sw_ext_backend -jobname test -nodes 2 -mapperXmx 2g

Running the above commands starts an H2O cluster with 2 nodes. Note the -jobname value ("test"); we will reuse it as the cloud name when connecting below.

H2O cluster in Manual Backend mode

Now, open H2O Flow in your web browser at http://192.168.1.137:54321 and check the status of the cluster. It should show 2 nodes (in my case, both nodes are started on the same server, hence you see the same IP with different ports).

H2O Flow cluster status
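
If you prefer to verify the cloud from code instead of the browser, H2O exposes its status over the REST API at the /3/Cloud endpoint. A minimal sketch, assuming the same IP and port as above:


// Query H2O's REST API for the cloud status; /3/Cloud returns JSON with
// the cloud name, size, and the address of every node.
import scala.io.Source

val status = Source.fromURL("http://192.168.1.137:54321/3/Cloud").mkString
println(status)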

Connect to the H2O Cloud (as a Client)

Open another terminal, start ./sparkling-shell, and run the below commands.


import org.apache.spark.h2o._
val conf = new H2OConf(spark)
            .setExternalClusterMode()
            .useManualClusterStart()
            .setH2OCluster("192.168.1.137", 54321)
            .setClusterSize(2)
            .setCloudName("test")
val hc = H2OContext.getOrCreate(spark, conf)

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val conf = new H2OConf(spark).setExternalClusterMode().useManualClusterStart().setH2OCluster("192.168.1.137", 54321).setClusterSize(2).setCloudName("test")
2020-02-22 23:27:11,325 WARN h2o.H2OConf: Using external cluster mode!
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : external
  cluster start mode   : manual
  cloudName            : test
  cloud representative : 192.168.1.137:54321
  clientBasePort       : 54321
  h2oClientLog         : INFO
  nthreads             : -1

scala> val hc = H2OContext.getOrCreate(spark, conf)
2020-02-22 23:27:30,071 WARN external.ExternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
2020-02-22 23:27:30,074 WARN external.ExternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
2020-02-22 23:27:31,768 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
2020-02-22 23:27:31,769 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_gpu.so
2020-02-22 23:27:31,770 WARN java.NativeLibrary: Failed to load library from both native path and jar!
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_omp.so
2020-02-22 23:27:31,775 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_omp.so
2020-02-22 23:27:31,776 WARN java.NativeLibrary: Failed to load library from both native path and jar!
hc: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.3-1-2.4
 * H2O name: test
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (0,192.168.1.137,54321)
  (1,192.168.1.137,54323)
  ------------------------

  Open H2O Flow in browser: http://192.168.1.101:54321 (CMD + click in Mac OSX)



scala>

This connects to the H2O cloud running on 192.168.1.137 at port 54321.
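
From here, the H2OContext works the same way as in automatic mode. As a small sketch, you can confirm the Flow URL from code and detach the client when you are done; whether stop() also shuts down an externally started cluster depends on the backend mode, so verify this for your setup.


// Print the Flow URL of the connected cloud
println(hc.flowURL())

// Detach from the cloud when finished. In manual mode the external
// cluster was started by hand, so check whether it keeps running.
hc.stop()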

Conclusion

In this article, you have learned how to start the H2O cloud both automatically and manually on Hadoop/YARN, and how to connect to the cloud externally by using the cloud IP and port.

Happy Learning !!
