Sparkling Water – H2OFrame

Sparkling Water's org.apache.spark.h2o.H2OFrame is a wrapper around the Java H2O Frame (water.fvec.Frame) that makes it usable from Spark and Scala. It is a 2D array of data in which each column is uniformly typed, and the data is held either locally or in the H2O cluster.

Data in H2O is compressed and held in the JVM heap while it is processed. An H2OFrame is essentially a wrapped collection object that you can iterate over and operate on, much like a Spark DataFrame or a Python pandas DataFrame. Unlike a Spark DataFrame, the data is not managed by Spark itself; it is held in the H2O K/V datastore.

The Frame is a collection of named Vecs; a Vec is a collection of numbered Chunks.
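
To make the Frame, Vec, and Chunk layering concrete, here is a minimal toy model in plain Scala. The names ToyChunk, ToyVec, and ToyFrame are hypothetical and are not H2O's water.fvec classes; the sketch only illustrates how a row lookup in a named column resolves to a numbered chunk and an offset inside it, under the assumption that chunks are stored in row order.

```scala
// Toy model only: ToyChunk, ToyVec and ToyFrame are hypothetical names,
// not the real water.fvec.Chunk / Vec / Frame classes.
object ToyFrameModel {
  // A chunk holds a contiguous slice of one column's values.
  final case class ToyChunk(index: Int, values: Vector[Double])

  // A vec is one column, split into numbered chunks.
  final case class ToyVec(chunks: Vector[ToyChunk]) {
    def length: Long = chunks.map(_.values.length.toLong).sum

    // Walk the chunks in order until the one containing the row is found.
    def at(row: Long): Double = {
      require(row >= 0 && row < length, s"row $row out of range")
      var remaining = row
      var i = 0
      while (remaining >= chunks(i).values.length) {
        remaining -= chunks(i).values.length
        i += 1
      }
      chunks(i).values(remaining.toInt)
    }
  }

  // A frame maps column names to vecs of equal length.
  final case class ToyFrame(columns: Vector[(String, ToyVec)]) {
    require(columns.map(_._2.length).distinct.size <= 1, "columns must have equal length")

    def names: Vector[String] = columns.map(_._1)

    def vec(name: String): ToyVec =
      columns.find(_._1 == name).getOrElse(throw new NoSuchElementException(name))._2
  }

  // Split a column into chunks of at most chunkSize values.
  def toVec(values: Seq[Double], chunkSize: Int): ToyVec =
    ToyVec(values.grouped(chunkSize).zipWithIndex.map {
      case (vs, i) => ToyChunk(i, vs.toVector)
    }.toVector)

  def main(args: Array[String]): Unit = {
    val frame = ToyFrame(Vector(
      "zipcode"    -> toVec(Seq(704.0, 709.0, 76166.0, 76177.0, 704.0), chunkSize = 2),
      "population" -> toVec(Seq(0.0, 0.0, 1000.0, 2000.0, 500.0), chunkSize = 2)
    ))
    println(frame.names.mkString(","))   // zipcode,population
    println(frame.vec("zipcode").at(3))  // 76177.0 (second value of chunk 1)
  }
}
```

In this sketch each column of five values is split into three chunks (2, 2, and 1 values), so row 3 is found in chunk 1 at offset 1. The real H2O classes additionally compress chunk data and distribute chunks across cluster nodes, which the toy model does not attempt to show.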

1. Create H2OFrame from constructors


H2OFrame(key : Key[Frame])
H2OFrame(f : Frame)
H2OFrame(s: String)
H2OFrame(file : File)
H2OFrame(uri : URI)

example:


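  import java.io.File
  import org.apache.spark.h2o.H2OFrame
  // Assumes H2OContext.getOrCreate(spark) has already been called, as in section 3
  val dataFile = "src/main/resources/small_zipcode.csv"
  val zipCodeFrame = new H2OFrame(new File(dataFile))
  println(zipCodeFrame.names().mkString(","))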

Let’s see another example of creating H2OFrame from a Parquet file.


  val parquetDataFile = "src/main/resources/zipcodes.parquet"
  val zipCodeParquetFrame = new H2OFrame(new File(parquetDataFile))
  println(zipCodeParquetFrame.names().mkString(","))

2. Converting Spark DataFrame to H2OFrame

The asH2OFrame() method of H2OContext transfers the data from the specified DataFrame into the H2O K/V datastore.


def asH2OFrame(sFr : DataFrame): H2OFrame

example:


import org.apache.spark.h2o.H2OContext

// Assumes an existing SparkSession named spark (for example, in spark-shell)
val zipCodes = "src/main/resources/small_zipcode.csv"
val zipCodesDF = spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv(zipCodes)

val h2oContext = H2OContext.getOrCreate(spark)
val h2oFrame = h2oContext.asH2OFrame(zipCodesDF)
println(h2oFrame.names().mkString(","))

3. Converting H2OFrame to Spark DataFrame

The asDataFrame() method of H2OContext converts an H2OFrame to a Spark DataFrame.


def asDataFrame(fr : H2OFrame): DataFrame

example:


  import java.io.File
  import org.apache.spark.h2o.{H2OContext, H2OFrame}

  val h2oContext = H2OContext.getOrCreate(spark)

  // Create an H2OFrame from a CSV file
  val dataFile = "src/main/resources/small_zipcode.csv"
  val zipH2OFrame = new H2OFrame(new File(dataFile))

  // Convert the H2OFrame to a Spark DataFrame
  val zipDF = h2oContext.asDataFrame(zipH2OFrame)
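
Sections 2 and 3 can be combined into a simple round-trip check: convert a DataFrame to an H2OFrame and back, then compare row counts and column names. The following is a sketch under stated assumptions, not a definitive implementation: it assumes a local SparkSession, the same CSV path used above, and a Sparkling Water version in which H2OContext lives in org.apache.spark.h2o and exposes asH2OFrame and asDataFrame with the call shapes shown in this article. The object name RoundTripCheck is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext

// Hypothetical helper object; requires Spark and Sparkling Water on the classpath.
object RoundTripCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master is sufficient for this sketch.
    val spark = SparkSession.builder()
      .appName("h2oframe-round-trip")
      .master("local[*]")
      .getOrCreate()
    val h2oContext = H2OContext.getOrCreate(spark)

    val sourceDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("src/main/resources/small_zipcode.csv")

    // DataFrame -> H2OFrame -> DataFrame
    val h2oFrame = h2oContext.asH2OFrame(sourceDF)
    val roundTripDF = h2oContext.asDataFrame(h2oFrame)

    // Row counts should agree at every stage of the round trip.
    println(s"rows: ${sourceDF.count()} -> ${h2oFrame.numRows()} -> ${roundTripDF.count()}")
    println(s"columns: ${sourceDF.columns.mkString(",")} -> ${roundTripDF.columns.mkString(",")}")

    // Column types may differ from the source because H2O uses its own type system.
    roundTripDF.printSchema()
  }
}
```

Comparing the printed schema with sourceDF.printSchema() is a quick way to spot columns whose types changed during the conversion.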

4. H2OFrame as a Data source


// frame is an existing H2OFrame; the h2o reader method comes from import org.apache.spark.h2o._
val df = spark.read.h2o(frame.key)
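
The shorthand above relies on an implicit extension of Spark's reader. The same access can usually also be written with Spark's generic data source API. The sketch below assumes that your Sparkling Water version registers a data source under the short name "h2o" that accepts a "key" option; please verify this against the documentation of the version you use. The object name H2ODataSourceSketch and the targetKey parameter are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.h2o.H2OFrame

// Hypothetical helper object; assumes a running H2OContext and an "h2o" data source.
object H2ODataSourceSketch {
  // Load an existing H2OFrame as a DataFrame, addressing it by its key.
  def load(spark: SparkSession, frame: H2OFrame): DataFrame =
    spark.read.format("h2o").option("key", frame.key.toString).load()

  // Save a DataFrame as an H2OFrame stored under targetKey (a key name you choose).
  def save(df: DataFrame, targetKey: String): Unit =
    df.write.format("h2o").option("key", targetKey).save()
}
```

The generic form is convenient when the format name and key come from configuration, since it does not depend on the implicit import being in scope.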

Naveen Nelamali
