Spark Convert case class to Schema

Spark SQL provides Encoders to convert a case class to a Spark schema (a StructType object). If you are using an older version of Spark, you can instead create the schema from a case class using a Scala reflection trick. Both options are explained here with examples.

First, let’s create two case classes, “Name” and “Employee”.


case class Name(first:String,last:String,middle:String)
case class Employee(fullName:Name,age:Integer,gender:String)

1. Using Spark Encoders to convert case class to schema

The Encoders class has a product method that takes the case class “Employee” as a type parameter; calling schema on the resulting encoder returns the schema of the Employee class.


// Using Spark Encoders to convert case class to schema
val encoderSchema = Encoders.product[Employee].schema
encoderSchema.printTreeString()

Calling printTreeString() on the schema object outputs the schema below.


// Output:
root
 |-- fullName: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)
 |    |-- middle: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
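A common use for the generated schema is to pass it when reading semi-structured data, so Spark does not have to infer the types. Below is a minimal sketch, assuming you already have a SparkSession named spark; the file path employees.json is hypothetical.


// Applying the encoder-derived schema while reading JSON (sketch)
val df = spark.read
  .schema(encoderSchema)          // use the case-class schema instead of inference
  .json("data/employees.json")    // hypothetical input path
df.printSchema()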

2. Using Scala code to create schema from case class

We can also create a Spark schema from a case class using plain Scala code, without Spark SQL Encoders. To do the conversion, use the ScalaReflection class and its schemaFor method.


// Using Scala code to create schema from case class
import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]
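The value returned here is a regular StructType, so you can inspect it the same way as the encoder-derived schema:


// Print the schema obtained via Scala reflection
schema.printTreeString()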

3. Converting SQL DDL String to the schema

As with loading a structure from a JSON string, we can also create one from a DDL string, and you can generate DDL from an existing schema using toDDL(). Calling printTreeString() on the struct object prints a schema similar to what the printSchema() function returns.


// Converting SQL DDL String to the schema
val ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING, `middle`: STRING>,`age` INT,`gender` STRING"
val ddlSchema = StructType.fromDDL(ddlSchemaStr)
ddlSchema.printTreeString()
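Going the other direction, the toDDL() method mentioned above (available on StructType since Spark 2.4) serializes a schema back into a DDL string. A short sketch using the encoder-derived schema from section 1:


// Generating a DDL string from a schema
val ddlFromSchema = encoderSchema.toDDL
println(ddlFromSchema)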

The complete example is available for download at the Spark-Scala-Examples GitHub project.

4. Complete Code


package com.sparkbyexamples.spark.dataframe

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

object CaseClassSparkSchema extends App{

  case class Name(first:String,last:String,middle:String)
  case class Employee(fullName:Name,age:Integer,gender:String)

  // Option 1: derive the schema with Spark SQL Encoders
  val encoderSchema = Encoders.product[Employee].schema
  encoderSchema.printTreeString()

  // Option 2: derive the schema with Scala reflection
  import org.apache.spark.sql.catalyst.ScalaReflection
  val schema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]

}

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium