Convert Spark DataFrame into H2OFrame

In this H2O Sparkling Water tutorial, you will learn how to convert a Spark SQL DataFrame into an H2OFrame. An H2OFrame is the primary data store for H2O; it is similar to a Spark DataFrame, the difference being that it is not held in Spark memory but is instead stored in the H2O cluster.

Here, we will create a Spark DataFrame and convert it to a Sparkling Water H2OFrame using the asH2OFrame() method of the H2OContext object.
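Note: to run the examples below, the Sparkling Water library needs to be on the classpath in addition to Spark. As a rough sketch, an sbt dependency could look like the following; the version below is a placeholder, so pick the Sparkling Water release that matches your Spark and Scala versions.


// build.sbt (sketch): the version is a placeholder, not a specific recommendation
libraryDependencies += "ai.h2o" %% "sparkling-water-core" % "<sparkling-water-version>"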

Create SparkSession object

First, let’s create a SparkSession object.


// Create a SparkSession, the entry point to Spark SQL
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExample")
    .getOrCreate()

Create Spark DataFrame

Using the SparkSession object "spark", read a CSV file into a DataFrame. The example below creates a Spark DataFrame "zipCodesDF".


  // Read the zip code CSV file into a DataFrame, with a header row and inferred schema
  val zipCodes = "src/main/resources/small_zipcode.csv"
  val zipCodesDF = spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv(zipCodes)

  zipCodesDF.printSchema()
  zipCodesDF.show(false)

This yields the schema and output below. For more read options, see Spark read CSV file.


root
 |-- id: integer (nullable = true)
 |-- zipcode: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- population: integer (nullable = true)

+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |null      |
+---+-------+--------+-------------------+-----+----------+

Create H2OContext object

Now, let's create an H2OContext object by passing the SparkSession object as an argument; we need the H2OContext to perform the conversion.


  val h2oContext = H2OContext.getOrCreate(spark)
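
Creating the H2OContext starts (or attaches to) an H2O cluster backed by the Spark executors. As a quick sanity check, printing the context typically shows cluster details such as the H2O Flow UI address; the exact output depends on your Sparkling Water version.


  // Print H2O cluster details such as the Flow UI address (output varies by version)
  println(h2oContext)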

Convert Spark DataFrame into H2OFrame

H2OContext provides asH2OFrame(), which takes a Spark DataFrame as a parameter and converts it to a Sparkling Water H2OFrame.


  val h2OFrame = h2oContext.asH2OFrame(zipCodesDF)

Let's look at a few operations on the H2OFrame, for example:

h2OFrame.names() returns all column names of the H2OFrame.

println(h2OFrame.names().mkString(","))

This returns id,zipcode,type,city,state,population

h2OFrame.numRows() – Returns the number of rows in an H2OFrame
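
For example, the row and column counts of the zip code frame converted above (numCols() returns the number of columns); both counts also appear in the complete example below.


println(h2OFrame.numRows()) // returns 5
println(h2OFrame.numCols()) // returns 6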

h2OFrame.rename() – Renames the column names.


h2OFrame.rename("zipcode","postcode")
println(h2OFrame.names().mkString(","))

id,postcode,type,city,state,population

Complete Example of Converting Spark DataFrame into H2OFrame


package com.sparkbyexamples.spark

import org.apache.spark.h2o.H2OContext
import org.apache.spark.sql.SparkSession

object H2OFrameFromDataFrame extends App {


  // Create a SparkSession
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExample")
    .getOrCreate()

  // Read the zip code CSV file into a Spark DataFrame
  val zipCodes = "src/main/resources/small_zipcode.csv"
  val zipCodesDF = spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv(zipCodes)

  // Create an H2OContext and convert the Spark DataFrame into an H2OFrame
  val h2oContext = H2OContext.getOrCreate(spark)
  val h2oFrame = h2oContext.asH2OFrame(zipCodesDF)

  // Column names, via the _names field and the names() method
  println(h2oFrame._names.mkString(","))

  println(h2oFrame.names().mkString(","))

  println(h2oFrame.numRows()) // returns 5

  println(h2oFrame.numCols()) // returns 6

  // Rename the "zipcode" column to "postcode"
  h2oFrame.rename("zipcode","postcode")
  println(h2oFrame.names().mkString(","))

}

This example, along with its dependencies, is also available in the GitHub project.

Conclusion

In this article, you have learned how to create an H2OContext, what an H2OFrame is, and finally how to convert a Spark SQL DataFrame into a Sparkling Water H2OFrame.

Happy Learning !!

