H2O Sparkling Water can be launched on a Spark cluster in two deployment modes: the internal backend and the external backend. In the internal backend, H2O is launched inside the Spark executors; in the external backend, it runs separately as a standalone H2O cluster.
In this Sparkling Water tutorial, we focus primarily on the internal backend.
In internal backend mode, creating an H2OContext object builds the H2O cluster automatically: it discovers all Spark executors and starts an H2O instance inside each of them. This mode has a few limitations:
- You can't add additional executors once the cluster has started.
- When Spark executors are killed, the entire H2O cluster goes down with them.
Let’s see how to start Sparkling Water with the internal backend in action.
The internal backend is Sparkling Water's default deployment mode. If your configuration uses a different mode, you can switch back by setting the `spark.ext.h2o.backend.cluster.mode` property to `internal`. For example, go to the Sparkling Water installation directory and run the command below to start sparkling-shell with the internal backend:

```shell
cd sparkling-water-<version>/bin
./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=internal"
```
If you have a Hadoop setup with YARN, you can try the same with YARN as the master:

```shell
cd sparkling-water-<version>/bin
./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=internal" --master yarn
```
If you want to set this property and control it programmatically, use the setInternalClusterMode() method of the H2OConf class. H2OConf is a wrapper around SparkConf: it inherits all Spark properties and adds properties specific to the H2O cluster and Sparkling Water. Then pass the H2OConf object as a parameter when creating the H2OContext.
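As a minimal sketch of building such a configuration programmatically (assuming a running SparkSession named `spark`, and the setCloudName() setter that H2OConf provides in recent Sparkling Water releases; the cluster name used here is hypothetical):

```scala
import org.apache.spark.h2o.H2OConf

// H2OConf wraps the active SparkSession's SparkConf, so all Spark
// properties already set on the session are visible here as well.
val h2OConf = new H2OConf(spark)
  .setInternalClusterMode()             // same effect as spark.ext.h2o.backend.cluster.mode=internal
  .setCloudName("my-sparkling-cluster") // hypothetical name; auto-generated if left unset
```

Because the setters return the H2OConf itself, they can be chained, which keeps the whole H2O configuration in one expression.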
```scala
scala> import org.apache.spark.h2o.H2OConf
import org.apache.spark.h2o.H2OConf

scala> val h2OConf = new H2OConf(spark).setInternalClusterMode()
h2OConf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : internal
  workers              : None
  cloudName            : Not set yet, it will be set automatically before starting H2OContext.
  clientBasePort       : 54321
  nodeBasePort         : 54321
  cloudTimeout         : 60000
  h2oNodeLog           : INFO
  h2oClientLog         : INFO
  nthreads             : -1
  drddMulFactor        : 10
```
H2OContext.getOrCreate() creates an H2O cluster automatically in internal backend mode. It identifies all executors in the Spark application and starts an H2O instance inside each of them. When Spark exits and the executors stop, your H2O cluster exits as well.
```scala
val h2OContext = H2OContext.getOrCreate(spark, h2OConf)
```

This yields the output below.
```scala
scala> import org.apache.spark.h2o.H2OContext
import org.apache.spark.h2o.H2OContext

scala> val h2OContext = H2OContext.getOrCreate(spark, h2OConf)
2020-02-22 17:32:22,537 WARN internal.InternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins, we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1. E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) We also recommend to avoid using broadcast hints in your Spark SQL code.
2020-02-22 17:32:22,539 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 0 (Infinitive) as we need to ensure we run on the nodes with H2O
2020-02-22 17:32:22,541 WARN internal.InternalH2OBackend: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified! We recommend to pass `--conf spark.scheduler.minRegisteredResourcesRatio=1`
2020-02-22 17:32:30,020 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
2020-02-22 17:32:30,022 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_gpu.so
2020-02-22 17:32:30,022 WARN java.NativeLibrary: Failed to load library from both native path and jar!
2020-02-22 17:32:30,023 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_omp.so
2020-02-22 17:32:30,024 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_omp.so
2020-02-22 17:32:30,025 WARN java.NativeLibrary: Failed to load library from both native path and jar!
h2OContext: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
 * Sparkling Water Version: <version>
 * H2O name: sparkling-water-ubuntu_application_1582391614625_0002
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (1,192.168.1.136,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.1.101:54321 (CMD + click in Mac OSX)
 * Yarn App ID of Spark application: application_1582391614625_0002

scala>
```
From the output, grab the H2O Flow URL and open it in a browser: http://localhost:54321 (change it to the host and port it is running on in your system).
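Once the cluster is up, the same H2OContext can move data between Spark and H2O. A minimal sketch, assuming the asH2OFrame() conversion method and the stop() method that H2OContext exposes (the DataFrame here is illustrative, not from the tutorial):

```scala
import spark.implicits._

// Build a small Spark DataFrame (illustrative data only).
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

// Publish the Spark DataFrame to the H2O cluster as an H2OFrame,
// where H2O algorithms and H2O Flow can work with it.
val h2oFrame = h2OContext.asH2OFrame(df)

// When finished, shut the H2O cluster down; by default the
// Spark application itself keeps running.
h2OContext.stop()
```

This round-trip is the typical next step after starting the cluster: Spark handles data preparation, and H2O handles model training on the converted frames.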
I hope you were able to start the cluster in internal backend mode. If you face any issues, please leave me a comment and I will be happy to help.
Happy Learning !!