
Sparkling Water – H2OFrame


Sparkling Water's org.apache.spark.h2o.H2OFrame is a wrapper around the Java H2O Frame (water.fvec.Frame) for use with Spark and Scala. It is a 2D array of data in which each column is uniformly typed, and the data is held either locally or in an H2O cluster.

Data in H2O is compressed and is held in the JVM heap of the H2O nodes while it is processed. An H2OFrame is a wrapped collection object that you can iterate over and operate on much like a Spark DataFrame or a Python pandas DataFrame. Unlike a local DataFrame such as pandas, the H2OFrame object generally does not hold the data in the client's own memory; it acts as a handle to data stored in the H2O cluster.

The Frame is a collection of named Vecs; a Vec is a collection of numbered Chunks.
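
To make the Frame, Vec and Chunk relationship concrete, here is a minimal sketch in plain Scala. It is a toy model only: ToyChunk, ToyVec, ToyFrame and ToyFrameDemo are hypothetical names invented for illustration, the values are made up, and none of this is the real water.fvec implementation (which also compresses and distributes its chunks).

```scala
// Toy model only: these are NOT the real water.fvec classes.

// A chunk holds a contiguous slice of one column's values.
final case class ToyChunk(index: Int, values: Vector[Double])

// A vec is one uniformly typed column, split into numbered chunks.
final case class ToyVec(chunks: Vector[ToyChunk]) {
  def length: Int = chunks.map(_.values.length).sum

  // Look up a value by its global row number by walking the chunks in order.
  def at(row: Int): Double = {
    require(row >= 0 && row < length, s"row $row out of range")
    chunks.iterator.flatMap(_.values.iterator).drop(row).next()
  }
}

// A frame is a collection of named vecs, all of the same length.
final case class ToyFrame(names: Vector[String], vecs: Vector[ToyVec]) {
  require(names.length == vecs.length, "one name per vec")
  require(vecs.map(_.length).distinct.length <= 1, "all vecs must have the same length")

  def numRows: Int = vecs.headOption.map(_.length).getOrElse(0)

  def vec(name: String): ToyVec = {
    val i = names.indexOf(name)
    require(i >= 0, s"unknown column $name")
    vecs(i)
  }
}

object ToyFrameDemo {
  // Split a column into numbered chunks of at most chunkSize rows each.
  def column(values: Seq[Double], chunkSize: Int): ToyVec = {
    require(chunkSize > 0, "chunkSize must be positive")
    val chunks = values.grouped(chunkSize).zipWithIndex.map {
      case (slice, i) => ToyChunk(i, slice.toVector)
    }.toVector
    ToyVec(chunks)
  }

  // Two columns with made-up values, each stored as chunks of up to 2 rows.
  val frame: ToyFrame = ToyFrame(
    Vector("zipcode", "population"),
    Vector(
      column(Seq(704.0, 709.0, 76166.0, 76177.0, 704.0), chunkSize = 2),
      column(Seq(30100.0, 3030.0, 58000.0, 13000.0, 30100.0), chunkSize = 2)
    )
  )
}
```

In this sketch, ToyFrameDemo.frame.vec("zipcode").at(2) walks the first chunk (2 rows) and returns the first value of the second chunk. Splitting each column into chunks is what lets a system like H2O spread one column across several nodes.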

1. Create H2OFrame from constructors


H2OFrame(key : Key[Frame])
H2OFrame(f : Frame)
H2OFrame(s: String)
H2OFrame(file : File)
H2OFrame(uri : URI)

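Example:


  import java.io.File

  // Build an H2OFrame directly from a local CSV file
  val dataFile = "src/main/resources/small_zipcode.csv"
  val zipCodeFrame = new H2OFrame(new File(dataFile))
  // Print the column names as a comma-separated list
  println(zipCodeFrame.names().mkString(","))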

Let’s see another example, this time creating an H2OFrame from a Parquet file.


  val parquetDataFile = "src/main/resources/zipcodes.parquet"
  val zipCodeParquetFrame = new H2OFrame(new File(parquetDataFile))
  println(zipCodeParquetFrame.names().mkString(","))

2. Converting Spark DataFrame to H2OFrame

Using the asH2OFrame() method provided by H2OContext, we can transfer data from the specified DataFrame into the H2O K/V datastore.


def asH2OFrame(sFr : DataFrame): H2OFrame

Example:


val zipCodes = "src/main/resources/small_zipcode.csv"
val zipCodesDF = spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv(zipCodes)

val h2oContext = H2OContext.getOrCreate(spark)
val h2oFrame = h2oContext.asH2OFrame(zipCodesDF)
println(h2oFrame.names().mkString(","))

3. Converting H2OFrame to Spark DataFrame

Using the asDataFrame() method provided by H2OContext, we can convert an H2OFrame to a Spark DataFrame.


def asDataFrame(fr : H2OFrame): DataFrame

Example:


  import java.io.File

  val h2oContext = H2OContext.getOrCreate(spark)

  // Create an H2OFrame from a CSV file
  val dataFile = "src/main/resources/small_zipcode.csv"
  val zipH2OFrame = new H2OFrame(new File(dataFile))

  // Convert the H2OFrame to a Spark DataFrame
  val zipDF = h2oContext.asDataFrame(zipH2OFrame)

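4. H2OFrame as a Data source

An existing H2OFrame can also be read as a Spark DataFrame through the Spark data source API, using the frame's key.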


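import org.apache.spark.h2o._  // assumed to bring the h2o() reader implicit into scope
val df = spark.read.h2o(frame.key)  // frame is an existing H2OFrame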