H2O Sparkling Water Tutorial for Beginners

In this H2O Sparkling Water tutorial, you will learn Sparkling Water (Spark with Scala) examples; every example explained here is available at the Spark-examples GitHub project for reference.

All Sparkling Water examples provided in this tutorial are basic and simple, easy to practice for beginners who are enthusiastic about learning Machine Learning and Artificial Intelligence and want to become Data Scientists. All of these examples are tested in our development environment.

Note: In case you can’t find the Sparkling Water example you are looking for on this tutorial page, I would recommend using the Search option from the menu bar to find your tutorial.

Before we get started, let’s first learn what Machine Learning and Artificial Intelligence are, how H2O.ai fits into them, and what the role of Sparkling Water is.

What is Machine Learning & Artificial Intelligence

Machine Learning is an application of Artificial Intelligence that learns to perform a specific task from experience by analyzing data. In short, Machine Learning is:

  • Application of Artificial Intelligence to perform a specific task
  • Which automatically learns and improve from past experience
  • Without explicit programming for each dataset
  • ML models look for patterns in data and make better decisions
  • It is a subset of Artificial Intelligence

Artificial Intelligence is a field devoted to building machines that exhibit natural human intelligence: reading and understanding human language (speech recognition), problem-solving, learning from past experience, and much more.

What is H2O

H2O is a leading open-source Machine Learning & Artificial Intelligence platform created by H2O.ai that includes the most widely used Machine Learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naïve Bayes, principal components analysis, k-means clustering, and word2vec.

H2O runs distributed and in-memory, handles billions of data rows, and is designed to run in standalone mode, on Hadoop, or within a Spark cluster.

H2O also comes with Flow (a.k.a. H2O Flow), a web-based interactive user interface that enables you to execute commands and view graphs and plots on a single page.

H2O design and architecture
Source: H2O.ai

What is Apache Spark

Apache Spark is an open-source, reliable, scalable, and distributed general-purpose computing engine used for processing and analyzing big data files from different sources such as HDFS, S3, Azure, etc.

Apache Spark components

Spark cluster overview
Source: Apache Spark

Above is the architecture of a Spark application running on a cluster. For more details on Apache Spark, read https://spark.apache.org/docs/latest/quick-start.html

What is Sparkling Water

Sparkling Water contains the same features and functionality as H2O and enables users to run the H2O machine learning algorithm APIs on top of a Spark cluster, allowing H2O to benefit from Spark capabilities such as fast, scalable, and distributed in-memory processing.

Sparkling Water also enables users to run H2O Machine Learning models using the Java, Scala, R, and Python languages.

Integrating these two open-source environments (Spark & H2O) provides a seamless experience for users who want to make a query using Spark SQL, feed the results into H2O to build a model and make predictions, and then use the results again in Spark. For any given problem, better interoperability between tools provides a better experience.

– H2O Sparkling Water
Sparkling Water Architecture
Source: H2O.ai
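
To make the flow described above concrete, here is a minimal sketch of that round trip, assuming a hypothetical people.csv input file; it uses only the H2OContext conversion methods demonstrated later in this tutorial, and the model-training step is left as a placeholder.


import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SparklingWaterRoundTrip")
  .getOrCreate()
val h2oContext = H2OContext.getOrCreate(spark)

// 1. Query the data with Spark SQL (people.csv is a hypothetical input)
spark.read.option("header", "true").option("inferSchema", "true")
  .csv("src/main/resources/people.csv")
  .createOrReplaceTempView("people")
val adultsDF = spark.sql("SELECT * FROM people WHERE age >= 18")

// 2. Feed the query result into H2O as an H2OFrame to build a model
val adultsHF = h2oContext.asH2OFrame(adultsDF)
// ... train and score an H2O model on adultsHF here ...

// 3. Bring the results back into Spark for further processing
val backToSparkDF = h2oContext.asDataFrame(adultsHF)
backToSparkDF.show(false)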

Installing & Running Sparkling Water Shell on Windows

In order to run Sparkling Shell, you need to have Apache Spark installed on your computer and the SPARK_HOME environment variable set to the Spark home directory. If you do not have it installed, download it from here, unzip it, and set the SPARK_HOME environment variable to your Spark directory.
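
For example, on Windows you can set the variable for the current Command Prompt session as shown below; the directory used here matches the Spark home printed by sparkling-shell later in this tutorial, so adjust it to your own install location.


set SPARK_HOME=C:\apps\opt\spark-2.4.4-bin-hadoop2.7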

Now, download H2O Sparkling Water and unzip the downloaded file. In my case, I’ve downloaded Sparkling Water version 3.28, which supports Spark 2.4.4, and unzipped it into C:\apps\opt\sparkling-water


cd C:\apps\opt\sparkling-water\bin
C:\apps\opt\sparkling-water\bin>sparkling-shell

-----
  Spark master (MASTER)     : local[*]
  Spark home   (SPARK_HOME) : C:\apps\opt\spark-2.4.4-bin-hadoop2.7
  H2O build version         : 3.28.0.3 (yu)
  Spark build version       : 2.4.4
  Scala version             : 2.11
----

20/02/13 07:34:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DELL-ESUHAO2KAJ:4040
Spark context available as 'sc' (master = local[*], app id = local-1581608102876).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Now let’s create an H2OContext by passing the SparkSession object “spark” as a parameter. This creates an H2O cloud inside the Spark cluster.


scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark)
h2oContext: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.3-1-2.4
 * H2O name: sparkling-water-prabha_local-1581608102876
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,192.168.56.1,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.56.1:54321 (CMD + click in Mac OSX)

scala>

This also runs the H2O Flow web UI to interact with H2O. Open H2O Flow in a browser: http://192.168.56.1:54321 (change the IP address to your system’s IP).
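
Alternatively, instead of copying the URL manually, you should be able to open Flow straight from the shell, assuming your Sparkling Water version provides the openFlow() helper on H2OContext:


scala> h2oContext.openFlow()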

Sparkling Water H2O Flow


H2O Sparkling Water supported file formats

Similar to Spark, H2O Sparkling Water can read data stored in the following formats (a GZipped CSV example is shown right after this list):

  • ARFF
  • CSV and GZipped CSV
  • SVMLight
  • XLS
  • XLSX
  • ORC
  • Avro
  • Parquet
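
As a quick illustration, a GZipped CSV is loaded exactly like the plain CSV example shown later in this tutorial; the file name below is hypothetical.


  import java.io.File
  // H2O detects and decompresses the .gz file automatically (hypothetical path)
  val zipCodeGzFrame = new H2OFrame(new File("src/main/resources/small_zipcode.csv.gz"))
  println(zipCodeGzFrame.names().mkString(","))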

H2O Sparkling Water supported Data sources

Also similar to Spark, H2O Sparkling Water can source data from the data sources below (see the sketch after this list):

  • NFS / Local File / List of Files
  • HDFS
  • URL
  • A Directory with many data files but not nested folders
  • Amazon S3/S3N
  • Other clouds (Azure, GCP, etc.)
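
One portable sketch for pulling data from these sources is to read them with Spark first and then convert the result, as in the DataFrame conversion example later in this tutorial; the S3 path below is hypothetical and assumes your Spark build includes the appropriate S3 connector.


  val h2oContext = H2OContext.getOrCreate(spark)
  // Read from S3 with Spark (hypothetical bucket and path), then hand it to H2O
  val s3DF = spark.read.option("header", "true").csv("s3a://my-bucket/zipcodes.csv")
  val s3H2OFrame = h2oContext.asH2OFrame(s3DF)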

Running Sparkling Water from IDE using Maven

One good thing about Sparkling Water is its ease of use, as it needs just one dependency to work with. This one library includes all necessary packages to run H2O on Apache Spark.

 <dependency>
      <groupId>ai.h2o</groupId>
      <artifactId>sparkling-water-package_2.11</artifactId>
      <version>3.28.0.3-1-2.4</version>
 </dependency>

Having said that, you need to explicitly include all Scala and Apache Spark dependencies, along with Spark MLlib, to run the Sparkling Water examples from an IDE.
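
A minimal set of those additional dependencies might look like the following; the versions shown simply match the Spark 2.4.4 / Scala 2.11 combination used in this tutorial, so adjust them for your environment.

 <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.4.4</version>
 </dependency>
 <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.4.4</version>
 </dependency>
 <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>2.4.4</version>
 </dependency>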

The examples I have explained in these Sparkling Water tutorials are present in the GitHub project with all dependencies; all you need to do is clone the project and run the examples as-is.

H2OContext

H2OContext is the entry point to H2O when using Sparkling Water. It uses SparkSession, hence you need to create a SparkSession object before creating an H2OContext by using getOrCreate().


val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExample")
    .getOrCreate();

val h2oContext = H2OContext.getOrCreate(spark)

H2OFrame

What is H2OFrame

An org.apache.spark.h2o.H2OFrame is a 2D array of data where each column is uniformly typed and the data is held either locally or in the H2O cluster. Data in H2O is compressed and held in the JVM heap while processing. An H2OFrame is a wrapped collection object that you can iterate over and operate on, similar to a Spark DataFrame or a Python pandas DataFrame; unlike a Spark DataFrame, the data is not held in Spark’s memory.

Create H2OFrame

We can create an H2OFrame by loading either a CSV or a compressed CSV file.


  import java.io.File
  val dataFile = "src/main/resources/small_zipcode.csv"
  val zipCodeFrame = new H2OFrame(new File(dataFile))
  println(zipCodeFrame.names().mkString(","))

Let’s see another example of creating H2OFrame from a Parquet file.


  val parquetDataFile = "src/main/resources/zipcodes.parquet"
  val zipCodeParquetFrame = new H2OFrame(new File(parquetDataFile))
  println(zipCodeParquetFrame.names().mkString(","))

For a complete example, please refer to the H2OFrame example.

Creating H2OFrame from Spark DataFrame


  val zipCodes = "src/main/resources/small_zipcode.csv"
  val zipCodesDF = spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv(zipCodes)

  zipCodesDF.printSchema()
  zipCodesDF.show(false)
  val h2oContext = H2OContext.getOrCreate(spark)
  val h2oFrame = h2oContext.asH2OFrame(zipCodesDF)
  println(h2oFrame._names.mkString(","))

Converting H2OFrame to Spark DataFrame


  val h2oContext = H2OContext.getOrCreate(spark)
  //Creating H2oFrame
  import java.io.File
  val dataFile = "src/main/resources/small_zipcode.csv"
  val zipH2OFrame = new H2OFrame(new File(dataFile))
  
  //Convert H2OFrame to Spark DataFrame
  val zipDF = h2oContext.asDataFrame(zipH2OFrame)
  zipDF.printSchema()
  zipDF.show(false)


Happy Learning !!
