Running Sparkling Water as Internal Backend

H2O Sparkling Water can be launched on a Spark cluster in either the internal backend or the external backend. In the internal backend, Sparkling Water is launched inside the Spark executors; in the external backend, it connects to a separately launched external H2O cluster.

In this Sparkling Water tutorial, we will focus on the internal backend.


In internal backend deploy mode, creating the H2OContext object creates the H2O cluster: it discovers all Spark executors and starts an H2O instance inside each of them. This mode has a few limitations:

  • You can’t add additional executors once the cluster has started.
  • When Spark executors are killed, the entire H2O cluster goes down.

Let’s see how to start Sparkling Water as an internal backend in action.

Using Property

The internal backend is Sparkling Water’s default deploy mode. If that is not the case in your configuration, you can change the behavior by setting the spark.ext.h2o.backend.cluster.mode property to internal. For example, go to the Sparkling Water installation directory and run the below command to start in the internal backend.


cd sparkling-water-3.28.0.3-1-2.4/bin
sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=internal"

In case you have a Hadoop setup with Yarn, you can try the same with --master=yarn.


cd sparkling-water-3.28.0.3-1-2.4/bin
sparkling-shell --conf "spark.ext.h2o.backend.cluster.mode=internal"  --master=yarn
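The same property can also be set programmatically on the SparkSession before Sparkling Water starts. Below is a minimal sketch; the application name is an arbitrary example, and the property key is the one shown above:

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession with the internal backend property set up front;
// this is equivalent to passing --conf on the sparkling-shell command line.
val spark = SparkSession.builder()
  .appName("sparkling-water-internal")           // example app name
  .config("spark.ext.h2o.backend.cluster.mode", "internal")
  .getOrCreate()
```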

Using H2OConf

If you want to set this property and control it programmatically, you can use the setInternalClusterMode() method of the H2OConf class. H2OConf is a wrapper over SparkConf: it inherits all Spark properties and provides additional properties related to the H2O cluster and Sparkling Water. Then, when creating the H2OContext, pass the h2OConf object as a parameter to the getOrCreate() method.


scala> import org.apache.spark.h2o.H2OConf
import org.apache.spark.h2o.H2OConf

scala> val h2OConf = new H2OConf(spark).setInternalClusterMode()
h2OConf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : internal
  workers              : None
  cloudName            : Not set yet, it will be set automatically before starting H2OContext.
  clientBasePort       : 54321
  nodeBasePort         : 54321
  cloudTimeout         : 60000
  h2oNodeLog           : INFO
  h2oClientLog         : INFO
  nthreads             : -1
  drddMulFactor        : 10

Calling H2OContext.getOrCreate() creates an H2O cluster automatically in internal backend mode. It identifies all executors in the Spark application and starts an H2O instance inside each executor. When Spark exits or the executors are killed, your H2O cluster exits as well.


val h2OContext = H2OContext.getOrCreate(spark,h2OConf)

This yields the output below.


scala> import org.apache.spark.h2o.H2OContext
import org.apache.spark.h2o.H2OContext

scala> val h2OContext = H2OContext.getOrCreate(spark,h2OConf)
2020-02-22 17:32:22,537 WARN internal.InternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
2020-02-22 17:32:22,539 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 0 (Infinitive) as we need to ensure we run on the nodes with H2O
2020-02-22 17:32:22,541 WARN internal.InternalH2OBackend: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified!
We recommend to pass `--conf spark.scheduler.minRegisteredResourcesRatio=1`
2020-02-22 17:32:30,020 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_gpu.so
2020-02-22 17:32:30,022 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_gpu.so
2020-02-22 17:32:30,022 WARN java.NativeLibrary: Failed to load library from both native path and jar!
2020-02-22 17:32:30,023 WARN java.NativeLibrary: Cannot load library from path lib/linux_64/libxgboost4j_omp.so
2020-02-22 17:32:30,024 WARN java.NativeLibrary: Cannot load library from path lib/libxgboost4j_omp.so
2020-02-22 17:32:30,025 WARN java.NativeLibrary: Failed to load library from both native path and jar!
h2OContext: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.3-1-2.4
 * H2O name: sparkling-water-ubuntu_application_1582391614625_0002
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (1,192.168.1.136,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.1.101:54321 (CMD + click in Mac OSX)


 * Yarn App ID of Spark application: application_1582391614625_0002


scala>

From the output, get the H2O Flow URL and access it from a browser: http://localhost:54321 (change to the host and port it’s running on in your system).
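Once the H2OContext is up, a typical next step is to move data from Spark into H2O. The sketch below uses the asH2OFrame() and stop() methods of H2OContext; the DataFrame here is a trivial example created just for illustration:

```scala
import org.apache.spark.h2o.{H2OConf, H2OContext}

// Reuse (or create) the H2OContext as shown in the steps above.
val h2OConf = new H2OConf(spark).setInternalClusterMode()
val h2OContext = H2OContext.getOrCreate(spark, h2OConf)

// Convert a small Spark DataFrame into an H2OFrame, distributed
// across the H2O instances running inside the Spark executors.
val df = spark.range(0, 100).toDF("id")
val h2oFrame = h2OContext.asH2OFrame(df)

// Shut down the H2O cluster when done; in internal backend mode
// it lives and dies with the Spark executors.
h2OContext.stop()
```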

Conclusion

I hope you were able to start the cluster in internal backend mode. If you face any issues, please leave me a comment; I will be happy to help.

Happy Learning !!

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn.
