H2O Sparkling Water Introduction

Sparkling Water provides the same features and functionality as H2O. It enables users to run the H2O machine learning algorithms API on top of a Spark cluster, allowing H2O to benefit from Spark capabilities such as fast, scalable, distributed in-memory processing.

Sparkling Water also enables users to run H2O machine learning models using the Java, Scala, R, and Python languages.

Integrating these two open-source environments (Spark & H2O) provides a seamless experience for users who want to query data using Spark SQL, feed the results into H2O to build a model and make predictions, and then use the results again in Spark (a short sketch of this round trip follows below). For any given problem, better interoperability between tools provides a better experience.

– H2O Sparkling Water

Sparkling Water Architecture (Source: H2O.ai)
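
To make this workflow concrete, here is a minimal Scala sketch of that round trip. It assumes a running sparkling-shell with the h2oContext created as shown later in this article; the “people” table and its columns are hypothetical, and asH2OFrame/asDataFrame are the H2OContext conversion methods.

// Query data with Spark SQL (the "people" table is hypothetical)
val results = spark.sql("SELECT age, income FROM people")

// Hand the Spark DataFrame to H2O as an H2OFrame to build a model on it
val h2oFrame = h2oContext.asH2OFrame(results)

// ... train an H2O model on h2oFrame and score it ...

// Bring the H2O results back into Spark as a DataFrame
implicit val sqlContext = spark.sqlContext
val resultsDF = h2oContext.asDataFrame(h2oFrame)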

Installing & Running Sparkling Water Shell on Windows

In order to run Sparkling Shell, you need to have Apache Spark installed on your computer and the SPARK_HOME environment variable set to the Spark home directory. If you do not have it installed, download it from the Apache Spark download page, unzip it, and set the SPARK_HOME environment variable to your Spark directory.
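
For example, on Windows you can set the variable for the current Command Prompt session as shown below (this assumes the Spark 2.4.4 path used later in this article; use setx or the System Properties dialog to make it permanent):

set SPARK_HOME=C:\apps\opt\spark-2.4.4-bin-hadoop2.7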

Now, download H2O Sparkling Water and unzip the downloaded file. In my case, I’ve downloaded Sparkling Water version 3.28, which supports Spark 2.4.4, and unzipped it into C:\apps\opt\sparkling-water


cd C:\apps\opt\sparkling-water\bin
C:\apps\opt\sparkling-water\bin>sparkling-shell

-----
  Spark master (MASTER)     : local[*]
  Spark home   (SPARK_HOME) : C:\apps\opt\spark-2.4.4-bin-hadoop2.7
  H2O build version         : 3.28.0.3 (yu)
  Spark build version       : 2.4.4
  Scala version             : 2.11
----

20/02/13 07:34:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DELL-ESUHAO2KAJ:4040
Spark context available as 'sc' (master = local[*], app id = local-1581608102876).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Now let’s create an H2OContext by passing the SparkSession object “spark” as a parameter. This creates an H2O cloud inside the Spark cluster.


scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark)
h2oContext: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.3-1-2.4
 * H2O name: sparkling-water-prabha_local-1581608102876
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,192.168.56.1,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.56.1:54321 (CMD + click in Mac OSX)

scala>

This also starts the H2O Flow web UI for interacting with H2O. Open H2O Flow in a browser at http://192.168.56.1:54321 (change the IP address to your system’s IP).
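
You can also get the Flow address from the shell instead of copying it from the log. A small sketch, assuming the flowURL() and openFlow() helpers on H2OContext:

scala> println(h2oContext.flowURL())  // prints the Flow address, e.g. http://192.168.56.1:54321
scala> h2oContext.openFlow()          // opens H2O Flow in the default browser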

H2O Flow
