In this tutorial, you will learn how to install H2O Sparkling Water on Ubuntu Linux, run the H2O sparkling-shell, and open the Flow web interface. In order to run Sparkling Water, you need to have Apache Spark installed.
Sparkling Water enables users to run H2O machine learning algorithms on a Spark cluster, which allows H2O to benefit from Spark capabilities such as fast, scalable, and distributed in-memory processing.
1. Install Java
Sparkling Water needs Java to be installed. Run the command below to install the JDK; in my case, I am using OpenJDK.
sudo apt-get -y install openjdk-8-jdk-headless
After the JDK install, check that it installed successfully by running “java -version”.
2. Download and Install Apache Spark
First, download Apache Spark, unzip the binary to a directory on your computer, and set the SPARK_HOME environment variable to the Spark home directory. I’ve downloaded the spark-2.4.4-bin-hadoop2.7 version; depending on when you are reading this, download the latest version available, and the steps should not have changed much.
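As a minimal sketch, the environment setup can look like the lines below. The $HOME/spark location is an assumption that matches the SPARK_HOME=/home/ubuntu/spark path shown in the sparkling-shell output later in this tutorial; adjust it to wherever you unzipped Spark.

```shell
# Point SPARK_HOME at the directory where Spark was unpacked
# ($HOME/spark is an assumed location; adjust to yours).
export SPARK_HOME="$HOME/spark"

# Add Spark's launcher scripts to PATH so spark-shell and
# spark-submit can be run from any directory.
export PATH="$PATH:$SPARK_HOME/bin"
```

To make the setting permanent across sessions, append the same two lines to ~/.bashrc and reload it with source ~/.bashrc.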
3. Download & Install H2O Sparkling Water
Now, download H2O Sparkling Water and unzip the downloaded file. If you don’t have the unzip package installed, install it using
sudo apt install unzip
In my case, I’ve downloaded Sparkling Water version 3.28, which supports Spark 2.4.4, and unzipped it into the /home/ubuntu directory.
4. Start Sparkling Shell on Ubuntu
To start the Sparkling shell, change into the Sparkling Water directory
cd /home/ubuntu/sparkling-water-3.28.x-1-2.4
and run
./bin/sparkling-shell
which outputs something like below. This also initializes a Spark context, with its web UI available at
http://192.168.56.1:4040 (change the IP address to your system IP).
ubuntu@namenode:~/sparkling-water-3.28.x-1-2.4$ ./bin/sparkling-shell

Using Spark defined in the SPARK_HOME=/home/ubuntu/spark environmental property

-----
  Spark master (MASTER)     : local[*]
  Spark home (SPARK_HOME)   : /home/ubuntu/spark
  H2O build version         : 3.28.x (yu)
  Sparkling Water version   : 3.28.x-1-2.4
  Spark build version       : 2.4.4
  Scala version             : 2.11
----

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://namenode.socal.rr.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1581895354791).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Now let’s create an H2OContext by passing the SparkSession object “spark” as a parameter. This creates an H2O cloud inside the Spark cluster.
scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala> val h2oContext = H2OContext.getOrCreate(spark)
2020-02-16 23:53:28,362 WARN internal.InternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins, we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
h2oContext: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
 * Sparkling Water Version: 3.28.x-1-2.4
 * H2O name: sparkling-water-ubuntu_local-1581897180995
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,192.168.56.1,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.56.1:54321 (CMD + click in Mac OSX)

scala>
This also starts the H2O Flow web UI, which you can use to interact with H2O and run machine learning models. Open Flow in your browser at
http://192.168.56.1:54321 (change the IP address to your system IP). For now, ignore the warnings you get.
In this article, you have learned how to install H2O Sparkling Water on Ubuntu Linux, run sparkling-shell, and finally create an H2OContext, from which you can access the H2O Flow web UI.
Happy Learning !!