Sparkling Water's org.apache.spark.h2o.H2OFrame is a wrapper around the Java H2O Frame (water.fvec.Frame) for working with Spark and Scala. It is a 2D array of data in which each column is uniformly typed, and the data is held either locally or in the H2O cluster.
Data in H2O is compressed and held in the JVM heap while it is being processed. An H2OFrame is essentially a wrapped collection object that you can iterate over and operate on, much like a Spark DataFrame or a Python pandas DataFrame. Unlike those DataFrames, the data lives in the H2O K/V datastore rather than in the DataFrame object itself.
A Frame is a collection of named Vecs; a Vec is a collection of numbered Chunks.
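The Frame, Vec and Chunk relationship can be sketched in plain Scala. This is only an illustrative model under simplifying assumptions: the case classes below are hypothetical stand-ins, not H2O's actual water.fvec classes, they hold uncompressed Double values only, and the sample values are made up for the illustration.

```scala
// Simplified, hypothetical model of the Frame -> Vec -> Chunk layout.
// These classes are NOT H2O's water.fvec classes; they only mirror the idea.
final case class Chunk(index: Int, values: Vector[Double])

final case class Vec(chunks: Vector[Chunk]) {
  // Total number of rows stored across all chunks of this column
  def length: Int = chunks.map(_.values.size).sum

  // Look up a value by its global row index by walking the chunks in order
  def at(row: Int): Double = {
    require(row >= 0 && row < length, s"row $row out of range")
    var offset = row
    val it = chunks.iterator
    var chunk = it.next()
    while (offset >= chunk.values.size) {
      offset -= chunk.values.size
      chunk = it.next()
    }
    chunk.values(offset)
  }
}

final case class Frame(names: Vector[String], vecs: Vector[Vec]) {
  require(names.size == vecs.size, "one name per column")
  def numCols: Int = vecs.size
  def numRows: Int = vecs.headOption.map(_.length).getOrElse(0)
  // Fetch a column by name
  def vec(name: String): Vec = {
    val i = names.indexOf(name)
    require(i >= 0, s"unknown column: $name")
    vecs(i)
  }
}

object FrameSketch {
  def main(args: Array[String]): Unit = {
    // Made-up sample values, split into two chunks per column
    val zip = Vec(Vector(Chunk(0, Vector(704.0, 709.0)), Chunk(1, Vector(76166.0))))
    val pop = Vec(Vector(Chunk(0, Vector(30100.0, 3411.0)), Chunk(1, Vector(68565.0))))
    val frame = Frame(Vector("zipcode", "population"), Vector(zip, pop))
    println(frame.names.mkString(","))
    println(frame.vec("population").at(2))
  }
}
```

In this sketch each column (Vec) is split into numbered pieces (Chunks), which is what would let a cluster spread one column across several nodes while the Frame only tracks the column names.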
1. Create H2OFrame from constructors
H2OFrame(key : Key[Frame])
H2OFrame(f : Frame)
H2OFrame(s: String)
H2OFrame(file : File)
H2OFrame(uri : URI)
example:
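import java.io.File
import org.apache.spark.h2o.H2OFrame
// Build an H2OFrame directly from a local CSV file and print its column names
val dataFile = "src/main/resources/small_zipcode.csv"
val zipCodeFrame = new H2OFrame(new File(dataFile))
println(zipCodeFrame.names().mkString(","))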
Let’s see another example, this time creating an H2OFrame from a Parquet file.
val parquetDataFile = "src/main/resources/zipcodes.parquet"
val zipCodeParquetFrame = new H2OFrame(new File(parquetDataFile))
println(zipCodeParquetFrame.names().mkString(","))
2. Converting Spark DataFrame to H2OFrame
The asH2OFrame() method provided by H2OContext transfers data from the specified DataFrame into the H2O K/V datastore.
def asH2OFrame(sFr : DataFrame): H2OFrame
example:
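import org.apache.spark.h2o.H2OContext
// Read the CSV into a Spark DataFrame, inferring column types from the data
val zipCodes = "src/main/resources/small_zipcode.csv"
val zipCodesDF = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv(zipCodes)
// Start (or reuse) the H2OContext and copy the DataFrame into H2O
val h2oContext = H2OContext.getOrCreate(spark)
val h2oFrame = h2oContext.asH2OFrame(zipCodesDF)
println(h2oFrame._names.mkString(","))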
3. Converting H2OFrame to Spark DataFrame
The asDataFrame() method provided by H2OContext converts an H2OFrame to a Spark DataFrame.
def asDataFrame(fr : H2OFrame): DataFrame
example:
import java.io.File
import org.apache.spark.h2o.{H2OContext, H2OFrame}
val h2oContext = H2OContext.getOrCreate(spark)
// Create an H2OFrame from a CSV file
val dataFile = "src/main/resources/small_zipcode.csv"
val zipH2OFrame = new H2OFrame(new File(dataFile))
// Convert the H2OFrame to a Spark DataFrame
val zipDF = h2oContext.asDataFrame(zipH2OFrame)
4. H2OFrame as a Data source
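An H2OFrame can also be read through Spark's data source API by passing the frame's key, which returns a Spark DataFrame. Here frame is an existing H2OFrame, and the package import is assumed to be what brings the h2o() reader extension into scope.
import org.apache.spark.h2o._
val df = spark.read.h2o(frame.key)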