H2O Sparkling Water can be launched on a Spark cluster in one of two modes: internal backend or external backend. In the internal backend, H2O is launched inside the Spark executors; in the external backend, it is launched separately as an external H2O cluster.
In this Sparkling Water tutorial, we focus primarily on the internal backend.
<image>
In internal backend deploy mode, creating the H2OContext object also creates the H2O cluster: Sparkling Water discovers all Spark executors and starts an H2O instance inside each of them. This mode has a few limitations:
- You can't add executors once the H2O cluster has started.
- When Spark executors are killed, the entire H2O cluster goes down.
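Because of these limitations, it can help to keep the set of executors fixed for the lifetime of the application. The sketch below shows one way to express that with standard Spark properties. The application name and the executor count of 3 are placeholder assumptions, not recommendations, and whether these settings suit your cluster depends on your resource manager.

```scala
import org.apache.spark.SparkConf

// Sketch only: the app name and the executor count are placeholders.
val fixedExecutorsConf = new SparkConf()
  .setAppName("sparkling-water-internal-backend")
  // Stop Spark from adding or removing executors at runtime,
  // since the H2O cluster cannot follow such changes.
  .set("spark.dynamicAllocation.enabled", "false")
  // Request a fixed number of executors up front (used on YARN and Kubernetes).
  .set("spark.executor.instances", "3")

// Print the resulting settings for a quick visual check.
fixedExecutorsConf.getAll.foreach { case (key, value) => println(s"$key=$value") }
```

These properties only take effect when they are set before the Spark application starts, for example when building the SparkSession of a standalone application or as --conf options at launch.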
Let's see how to start Sparkling Water with the internal backend in action.
Using Property
The internal backend is Sparkling Water's default deploy mode. If your configuration overrides this default, you can switch back by setting the spark.ext.h2o.backend.cluster.mode property to internal. For example, go to the Sparkling Water installation directory and run the command below to start in the internal backend:
cd sparkling-water-3.28.0.3-1-2.4/bin
./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=internal"
If you have a Hadoop setup with YARN, you can run the same command with --master=yarn:
cd sparkling-water-3.28.0.3-1-2.4/bin
./sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=internal" --master=yarn
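Once the shell is up, you may want to confirm that the property actually reached the Spark configuration. The snippet below is a small sketch that assumes it is pasted into sparkling-shell, where `spark` is the predefined SparkSession. The fallback string `<not set>` is just an illustrative default.

```scala
// Run inside sparkling-shell, where `spark` is the predefined SparkSession.
// Read the launch-time property from the SparkConf, with a fallback
// in case it was never passed on the command line.
val backendMode = spark.sparkContext.getConf
  .get("spark.ext.h2o.backend.cluster.mode", "<not set>")

println(s"spark.ext.h2o.backend.cluster.mode = $backendMode")
```

If the fallback is printed, the property was not passed explicitly and Sparkling Water uses its default, which is the internal backend.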
Using H2OConf
If you want to set this property programmatically, use the setInternalClusterMode() method of the H2OConf class. H2OConf is a wrapper around SparkConf: it inherits all of its properties and adds properties specific to the H2O cluster and Sparkling Water. When you create the H2OContext, pass the h2OConf object as a parameter to the getOrCreate() method.
scala> import org.apache.spark.h2o.H2OConf
import org.apache.spark.h2o.H2OConf
scala> val h2OConf = new H2OConf(spark).setInternalClusterMode()
h2OConf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
backend cluster mode : internal
workers : None
cloudName : Not set yet, it will be set automatically before starting H2OContext.
clientBasePort : 54321
nodeBasePort : 54321
cloudTimeout : 60000
h2oNodeLog : INFO
h2oClientLog : INFO
nthreads : -1
drddMulFactor : 10
Calling H2OContext.getOrCreate() creates an H2O cluster automatically in internal backend mode. It identifies all Spark executors and starts an H2O instance inside each one. When Spark exits and its executors shut down, the H2O cluster goes down as well.
import org.apache.spark.h2o.H2OContext
val h2OContext = H2OContext.getOrCreate(spark, h2OConf)
This yields the output below, which shows the cluster size, the nodes in use, and the H2O Flow URL.
scala> import org.apache.spark.h2o.H2OContext
import org.apache.spark.h2o.H2OContext
scala> val h2OContext = H2OContext.getOrCreate(spark,h2OConf)
2020-02-22 17:32:22,537 WARN internal.InternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
2020-02-22 17:32:22,539 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 0 (Infinitive) as we need to ensure we run on the nodes with H2O
2020-02-22 17:32:22,541 WARN internal.InternalH2OBackend: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified!
We recommend to pass `--conf spark.scheduler.minRegisteredResourcesRatio=1`
2020-02-22 17:32:30,020 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
2020-02-22 17:32:30,022 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_gpu.so
2020-02-22 17:32:30,022 WARN java.NativeLibrary: Failed to load library from both native path and jar!
2020-02-22 17:32:30,023 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_omp.so
2020-02-22 17:32:30,024 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_omp.so
2020-02-22 17:32:30,025 WARN java.NativeLibrary: Failed to load library from both native path and jar!
h2OContext: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
* Sparkling Water Version: 3.28.0.3-1-2.4
* H2O name: sparkling-water-ubuntu_application_1582391614625_0002
* cluster size: 1
* list of used nodes:
(executorId, host, port)
------------------------
(1,192.168.1.136,54321)
------------------------
Open H2O Flow in browser: http://192.168.1.101:54321 (CMD + click in Mac OSX)
* Yarn App ID of Spark application: application_1582391614625_0002
scala>
From the output, copy the H2O Flow URL and open it in a browser, for example http://localhost:54321 (change the host and port to match where it is running on your system).
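The WARN messages in the output above also suggest a couple of settings. The sketch below applies the first recommendation, disabling broadcast-based joins, using the same call the log message quotes. It assumes it is run inside sparkling-shell, where `spark` is the predefined SparkSession, and ideally before the H2OContext is created.

```scala
// Run inside sparkling-shell, where `spark` is the predefined SparkSession.
// Disable broadcast-based joins, as the first WARN message recommends.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

// Read the value back to confirm it is set for this session.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
```

The other recommendation, spark.scheduler.minRegisteredResourcesRatio, is a launch-time property, so it has to be passed when starting the shell, as the log message itself shows: --conf spark.scheduler.minRegisteredResourcesRatio=1.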
Conclusion
I hope you were able to start the cluster in internal backend mode. If you face any issues, please leave me a comment; I will be happy to help.
Happy Learning !!