Install & Run Sparkling Water on Ubuntu

In this tutorial, you will learn how to install H2O Sparkling Water on Ubuntu Linux and how to run the H2O sparkling-shell and the Flow web interface. To run Sparkling Water, you need Apache Spark installed.

Sparkling Water lets users run H2O machine learning algorithms on a Spark cluster, so H2O benefits from Spark capabilities such as fast, scalable, distributed in-memory processing.

1. Install Java

Sparkling Water requires Java. Run the command below to install a JDK; in my case, I am using OpenJDK.

sudo apt-get -y install openjdk-8-jdk-headless

After installing the JDK, check that it installed successfully by running “java -version”.
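One detail worth knowing when scripting this check: java -version writes to stderr, and JDK 8 reports itself as 1.8.x rather than 8. Below is a minimal sketch of such a check, assuming you want to confirm that Java 8 (the version installed above) is the one on your PATH. The helper name java_major_version is hypothetical and only used for illustration.

```shell
# Hypothetical helper: turn a Java version string into its major version.
# JDK 8 and older report strings like "1.8.0_242"; JDK 9+ report "11.0.2".
java_major_version() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

# `java -version` prints to stderr, so redirect it before parsing.
if command -v java >/dev/null 2>&1; then
  version_string=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  echo "Detected Java major version: $(java_major_version "$version_string")"
else
  echo "java not found on PATH; install the JDK first" >&2
fi
```

With the OpenJDK package from this step, the detected major version should be 8.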

2. Download and Install Apache Spark

First, download Apache Spark, extract the archive to a directory on your computer, and set the SPARK_HOME environment variable to the Spark home directory. I downloaded spark-2.4.4-bin-hadoop2.7. Depending on when you are reading this, download the latest version available; the steps should not have changed much. Just make sure the Sparkling Water release you download in the next step is built for the same Spark version.
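The Spark setup described above can be sketched as follows. This is only an illustration under some assumptions: the download URL points at the Apache archive mirror, which is my guess at a stable location for the 2.4.4 release, and the target directory $HOME/spark matches the SPARK_HOME shown later in the sparkling-shell output. The function install_spark is a hypothetical helper and is only defined here, not called.

```shell
# Version and package name used in this tutorial.
SPARK_VERSION="2.4.4"
SPARK_PACKAGE="spark-${SPARK_VERSION}-bin-hadoop2.7"
# Assumption: the Apache archive keeps old releases at this path.
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz"

# Hypothetical helper: download, extract, and move Spark to $HOME/spark.
install_spark() {
  wget "$SPARK_URL" &&
    tar -xzf "${SPARK_PACKAGE}.tgz" &&
    mv "$SPARK_PACKAGE" "$HOME/spark"
}

# Point SPARK_HOME at the Spark directory and expose its binaries.
export SPARK_HOME="$HOME/spark"
export PATH="$SPARK_HOME/bin:$PATH"
```

To keep SPARK_HOME set across sessions, the two export lines can be appended to ~/.bashrc. Call install_spark once to perform the actual download.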

3. Download & Install H2O Sparkling Water

Now, download H2O Sparkling Water

ubuntu@namenode:~$ wget https://s3.amazonaws.com/h2o-release/sparkling-water/spark-2.4/3.28.0.3-1-2.4/sparkling-water-3.28.0.3-1-2.4.zip

Then unzip the downloaded file. If the unzip package is not installed, install it using sudo apt install unzip

ubuntu@namenode:~$ unzip sparkling-water-3.28.0.3-1-2.4.zip

In my case, I downloaded Sparkling Water version 3.28, which supports Spark 2.4.4, and unzipped it into /home/ubuntu/sparkling-water-3.28.0.3-1-2.4
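The Sparkling Water version string encodes the Spark line it targets: in 3.28.0.3-1-2.4, the trailing 2.4 is the Spark major.minor version, which is also why the download URL contains spark-2.4. Here is a small sketch of a compatibility check based on that naming pattern. The function names are hypothetical, and the parsing assumes the version string always ends in a dash followed by the Spark version, as it does in this release.

```shell
# Extract the Spark major.minor suffix from a Sparkling Water version,
# e.g. 3.28.0.3-1-2.4 -> 2.4 (assumes the suffix follows the last dash).
sw_spark_version() {
  echo "${1##*-}"
}

# Reduce a full Spark version to major.minor, e.g. 2.4.4 -> 2.4.
spark_major_minor() {
  echo "$1" | cut -d. -f1,2
}

# Succeed only when the Sparkling Water build targets the given Spark version.
versions_match() {
  [ "$(sw_spark_version "$1")" = "$(spark_major_minor "$2")" ]
}

if versions_match "3.28.0.3-1-2.4" "2.4.4"; then
  echo "Sparkling Water 3.28.0.3-1-2.4 matches Spark 2.4.4"
fi
```

If you installed a different Spark version, run the same check with your versions before downloading, and pick the Sparkling Water build whose suffix matches.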

4. Start Sparkling Shell on Ubuntu

To start the Sparkling shell, cd to /home/ubuntu/sparkling-water-3.28.0.3-1-2.4 and run ./bin/sparkling-shell, which prints output like the following. This also initializes a Spark context with its web UI available at http://192.168.56.1:4040 (replace the IP address with your system's IP).


ubuntu@namenode:~/sparkling-water-3.28.0.3-1-2.4$ ./bin/sparkling-shell

Using Spark defined in the SPARK_HOME=/home/ubuntu/spark environmental property


-----
  Spark master (MASTER)     : local[*]
  Spark home   (SPARK_HOME) : /home/ubuntu/spark
  H2O build version         : 3.28.0.3 (yu)
  Sparkling Water version   : 3.28.0.3-1-2.4
  Spark build version       : 2.4.4
  Scala version             : 2.11
----

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://namenode.socal.rr.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1581895354791).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
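As the first line of the output shows, sparkling-shell picks up Spark from SPARK_HOME, so a missing or wrong value is a common reason for the launch to fail. A minimal sketch of a pre-launch check follows. The names SW_HOME, check_spark_home, and launch_sparkling_shell are hypothetical, and the default path assumes the unzip location used in this tutorial.

```shell
# Assumption: Sparkling Water was unzipped into the home directory.
SW_HOME="${SW_HOME:-$HOME/sparkling-water-3.28.0.3-1-2.4}"

# Verify that SPARK_HOME is set and contains an executable spark-shell.
check_spark_home() {
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$SPARK_HOME/bin/spark-shell" ]; then
    echo "No executable spark-shell found under $SPARK_HOME" >&2
    return 1
  fi
}

# Run the check, then start sparkling-shell from its install directory.
# Any extra arguments are passed along unchanged.
launch_sparkling_shell() {
  check_spark_home || return 1
  cd "$SW_HOME" && ./bin/sparkling-shell "$@"
}
```

Calling launch_sparkling_shell then either starts the shell or prints which part of the Spark setup is missing.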

Now let’s create an H2OContext, passing the SparkSession object “spark” as a parameter. This creates an H2O cloud inside the Spark cluster.


scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark)
2020-02-16 23:53:28,362 WARN internal.InternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
h2oContext: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.3-1-2.4
 * H2O name: sparkling-water-ubuntu_local-1581897180995
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,192.168.56.1,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.56.1:54321 (CMD + click in Mac OSX)
scala>

This also starts the H2O Flow web UI, where you can interact with H2O and run machine learning models. Open Flow in a browser at http://192.168.56.1:54321 (replace the IP address with your system's IP). For now, ignore the warnings you get.
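If you are not sure which address to use, the following sketch builds the Flow URL and checks whether it responds. The helper names are hypothetical, port 54321 is taken from the H2OContext output above, and hostname -I is a Linux-specific flag. Note that H2O chooses its own network interface, so the address printed in the "Open H2O Flow in browser" line is the one to trust if it differs from what first_ip reports.

```shell
# Flow port as printed in the H2OContext output of this tutorial.
H2O_FLOW_PORT=54321

# Build the Flow URL for a host and an optional port.
flow_url() {
  printf 'http://%s:%s\n' "$1" "${2:-$H2O_FLOW_PORT}"
}

# Print the first address reported by `hostname -I` (Linux only).
first_ip() {
  hostname -I | awk '{print $1}'
}

# Succeed if something answers HTTP at the Flow URL within 5 seconds.
flow_is_up() {
  curl -s -o /dev/null --max-time 5 "$(flow_url "$1" "${2:-}")"
}
```

For example, with the H2OContext still running in sparkling-shell, flow_is_up "$(first_ip)" && echo "Flow is reachable" run from a second terminal confirms that the UI is being served.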

[Figure: H2O Flow web UI served by Sparkling Water on Ubuntu]

Conclusion

In this article, you learned how to install H2O Sparkling Water on Ubuntu Linux, run sparkling-shell, and create an H2OContext, which gives you access to the H2O Flow web UI.

Happy Learning !!

Naveen Nelamali
