Spark Schema – Explained with Examples

A Spark schema defines the structure of a DataFrame, which you can inspect by calling the printSchema() method on the DataFrame object. Spark SQL provides the StructType and StructField classes to specify the schema programmatically.

By default, Spark infers the schema from the data; however, sometimes we may need to define our own schema (column names and data types), especially when working with unstructured and semi-structured data. This article explains how to define simple, nested, and complex schemas with examples.
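
As a quick illustration of the default behavior, here is a minimal sketch that lets Spark infer the schema from a local collection. It assumes an existing SparkSession named spark and the spark.implicits._ import, for example in spark-shell.


import spark.implicits._

// Spark infers the column types (string, integer) from the Scala tuple
val inferredDF = Seq(("James", 3000), ("Maria", 4000)).toDF("name", "salary")
inferredDF.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- salary: integer (nullable = false)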

1. Schema – Defines the Structure of the DataFrame

What is a Spark Schema?

A Spark schema is the structure of a DataFrame or Dataset. We define it using the StructType class, which is a collection of StructField objects that specify the column name (String), column type (DataType), whether the column is nullable (Boolean), and metadata (Metadata).

The rest of the article uses Scala examples; a similar approach works with PySpark, and if time permits I will cover it in the future. If you are looking for PySpark, I would still recommend reading through this article, as it will give you an idea of the usage.

2. Create Schema using StructType & StructField

While creating a Spark DataFrame, we can specify the schema using the StructType and StructField classes. We can also add a nested struct (StructType), ArrayType for arrays, and MapType for key-value pairs, which we will discuss in detail in later sections.

Spark defines the StructType and StructField case classes as follows.


case class StructType(fields: Array[StructField])

case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    metadata: Metadata = Metadata.empty)

The example below demonstrates a simple use of StructType and StructField to define a DataFrame schema, with sample data to go along with it.


import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

// assumes a SparkSession; create one if you are not running in spark-shell
val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

val simpleData = Seq(Row("James","","Smith","36636","M",3000),
    Row("Michael","Rose","","40288","M",4000),
    Row("Robert","","Williams","42114","M",4000),
    Row("Maria","Anne","Jones","39192","F",4000),
    Row("Jen","Mary","Brown","","F",-1)
  )

val simpleSchema = StructType(Array(
    StructField("firstname",StringType,true),
    StructField("middlename",StringType,true),
    StructField("lastname",StringType,true),
    StructField("id", StringType, true),
    StructField("gender", StringType, true),
    StructField("salary", IntegerType, true)
  ))

val df = spark.createDataFrame(
     spark.sparkContext.parallelize(simpleData),simpleSchema)

3. Spark DataFrame printSchema()

To view the schema of a Spark DataFrame, call printSchema() on the DataFrame object.


df.printSchema()
df.show()

In the above example, printSchema() prints the schema to the console (stdout) and show() displays the contents of the DataFrame.


root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    James|          |   Smith|36636|     M|  3000|
|  Michael|      Rose|        |40288|     M|  4000|
|   Robert|          |Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+

4. Create Nested struct Schema

While working with Spark DataFrames we often need to deal with nested struct columns. In the example below, I use a different approach: instantiate an empty StructType and use its add method (instead of StructField constructors) to add column names and data types.


val structureData = Seq(
    Row(Row("James","","Smith"),"36636","M",3100),
    Row(Row("Michael","Rose",""),"40288","M",4300),
    Row(Row("Robert","","Williams"),"42114","M",1400),
    Row(Row("Maria","Anne","Jones"),"39192","F",5500),
    Row(Row("Jen","Mary","Brown"),"","F",-1)
)

val structureSchema = new StructType()
    .add("name",new StructType()
      .add("firstname",StringType)
      .add("middlename",StringType)
      .add("lastname",StringType))
    .add("id",StringType)
    .add("gender",StringType)
    .add("salary",IntegerType)

val df2 = spark.createDataFrame(
    spark.sparkContext.parallelize(structureData),structureSchema)
df2.printSchema()
df2.show()

This prints the schema and DataFrame below. Note that printSchema() displays struct for nested fields.


root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|                name|   id|gender|salary|
+--------------------+-----+------+------+
|    [James, , Smith]|36636|     M|  3100|
|   [Michael, Rose, ]|40288|     M|  4300|
| [Robert, , Willi...|42114|     M|  1400|
| [Maria, Anne, Jo...|39192|     F|  5500|
|  [Jen, Mary, Brown]|     |     F|    -1|
+--------------------+-----+------+------+

5. Loading SQL Schema from JSON

If you have many fields and the structure of the DataFrame changes now and then, it is good practice to load the SQL schema from a JSON file. Note that the JSON definition uses a different layout; you can generate it from an existing schema with schema.prettyJson.
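
For example, here is a minimal sketch that prints the nested schema from the previous section as JSON; you can save the output to a file such as schema.json on the classpath (the file name and location are assumptions).


// print the schema of df2 (from the previous section) as JSON
println(df2.schema.prettyJson)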


{
  "type" : "struct",
  "fields" : [ {
    "name" : "name",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "firstname",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "middlename",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "lastname",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "dob",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "gender",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "salary",
    "type" : "integer",
    "nullable" : true,
    "metadata" : { }
  } ]
}


import scala.io.Source
import org.apache.spark.sql.types.DataType

// schema.json is expected on the classpath (e.g. under src/main/resources)
val url = ClassLoader.getSystemResource("schema.json")
val schemaSource = Source.fromFile(url.getFile).getLines.mkString
val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
val df3 = spark.createDataFrame(
      spark.sparkContext.parallelize(structureData),schemaFromJson)
df3.printSchema()

This prints the same output as the previous section. You could also keep the name, type, and nullable flag in a comma-separated file and use those values to build a StructType programmatically, as sketched below.
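
As an illustration, below is a minimal sketch of that idea; the file name (schema.csv), its layout (name,type,nullable per line), and the toDataType helper are assumptions, and only a couple of type names are mapped for brevity.


import scala.io.Source
import org.apache.spark.sql.types._

// map a type name from the file to a Spark DataType (extend as needed)
def toDataType(typeName: String): DataType = typeName.trim.toLowerCase match {
  case "string"  => StringType
  case "integer" => IntegerType
  case other     => throw new IllegalArgumentException(s"Unsupported type: $other")
}

// hypothetical file with lines like: firstname,string,true
val csvSchema = StructType(
  Source.fromFile("src/main/resources/schema.csv").getLines().map { line =>
    val Array(name, dataType, nullable) = line.split(",").map(_.trim)
    StructField(name, toDataType(dataType), nullable.toBoolean)
  }.toArray)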

6. Using Arrays & Map Columns

Spark SQL also supports ArrayType and MapType to define schemas with array and map collections, respectively. In the example below, column “hobbies” is defined as ArrayType(StringType) and “properties” as MapType(StringType, StringType), meaning both key and value are strings.


import org.apache.spark.sql.types.{ArrayType, MapType}

val arrayStructureData = Seq(
    Row(Row("James","","Smith"),List("Cricket","Movies"),Map("hair"->"black","eye"->"brown")),
    Row(Row("Michael","Rose",""),List("Tennis"),Map("hair"->"brown","eye"->"black")),
    Row(Row("Robert","","Williams"),List("Cooking","Football"),Map("hair"->"red","eye"->"gray")),
    Row(Row("Maria","Anne","Jones"),null,Map("hair"->"blond","eye"->"red")),
    Row(Row("Jen","Mary","Brown"),List("Blogging"),Map("white"->"black","eye"->"black"))
)

val arrayStructureSchema = new StructType()
    .add("name",new StructType()
      .add("firstname",StringType)
      .add("middlename",StringType)
      .add("lastname",StringType))
    .add("hobbies", ArrayType(StringType))
    .add("properties", MapType(StringType,StringType))

val df5 = spark.createDataFrame(
   spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
df5.printSchema()
df5.show(false)

This outputs the schema and DataFrame data below. Note that the “hobbies” field is an array type and “properties” is a map type.


root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- hobbies: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+--------------------+-------------------+------------------------------+
|name                |hobbies            |properties                    |
+--------------------+-------------------+------------------------------+
|[James, , Smith]    |[Cricket, Movies]  |[hair -> black, eye -> brown] |
|[Michael, Rose, ]   |[Tennis]           |[hair -> brown, eye -> black] |
|[Robert, , Williams]|[Cooking, Football]|[hair -> red, eye -> gray]    |
|[Maria, Anne, Jones]|null               |[hair -> blond, eye -> red]   |
|[Jen, Mary, Brown]  |[Blogging]         |[white -> black, eye -> black]|
+--------------------+-------------------+------------------------------+

7. Convert Scala Case Class to Spark Schema

Spark SQL also provides Encoders to convert a case class to a StructType (schema) object. If you are using an older version of Spark, you can also derive the schema from the case class using Scala reflection. Both examples are shown below.


case class Name(first:String,last:String,middle:String)
case class Employee(fullName:Name,age:Integer,gender:String)

import org.apache.spark.sql.catalyst.ScalaReflection
val scalaSchema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]

import org.apache.spark.sql.Encoders
val encoderSchema = Encoders.product[Employee].schema
encoderSchema.printTreeString()

printTreeString() on the schema (StructType) outputs the tree below, similar to what printSchema() prints for a DataFrame.


root
 |-- fullName: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)
 |    |-- middle: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
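
As a usage sketch (the sample rows below are made up for illustration), the derived schema can be passed to createDataFrame just like a hand-written one.


// rows must match the Employee structure: (fullName struct, age, gender)
val employeeData = Seq(
    Row(Row("James", "Smith", ""), 34, "M"),
    Row(Row("Maria", "Jones", "Anne"), 29, "F"))

val empDF = spark.createDataFrame(
    spark.sparkContext.parallelize(employeeData), encoderSchema)
empDF.printSchema()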

8. Creating schema from DDL String

Like loading the structure from a JSON string, we can also create it from a DDL string; conversely, you can generate a DDL string from a schema using toDDL. As before, printTreeString() on the StructType prints the schema the same way printSchema() does.


val ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING, " +
    "`middle`: STRING>,`age` INT,`gender` STRING"
val ddlSchema = StructType.fromDDL(ddlSchemaStr)
ddlSchema.printTreeString()
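
Going the other direction, here is a quick sketch that prints the DDL string for an existing schema (using the encoderSchema from the previous section; toDDL is available on StructType in recent Spark versions).


// generate a DDL string from a StructType
println(encoderSchema.toDDL)
// prints a DDL string similar to ddlSchemaStr above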

9. Checking if a Field Exists in a Schema

We often need to check whether a column is present in a DataFrame schema; we can easily do this using several functions on StructType and StructField.


println(df.schema.fieldNames.contains("firstname"))
println(df.schema.contains(StructField("firstname",StringType,true)))

This example returns true for both checks. For the second one, if you pass IntegerType instead of StringType it returns false, because the data type of the firstname column is String and contains() compares every property of the field. Similarly, you can check whether two schemas are equal, look up a column’s data type, and more.
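
For instance, here is a small sketch of both ideas: looking up a column’s data type by name and comparing two schemas for equality (df and df2 are the DataFrames created earlier).


import org.apache.spark.sql.types.StringType

// look up the data type of a column by name
println(df.schema("firstname").dataType)                // StringType
println(df.schema("firstname").dataType == StringType)  // true

// StructType implements equality, so schemas can be compared directly
println(df.schema == df2.schema)                        // false, df2 nests the name fields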

The complete example explained here is available in the GitHub project.

Conclusion:

In this article, you have learned how to use a Spark SQL schema: creating it programmatically with StructType and StructField, converting a case class to a schema, using ArrayType and MapType, and finally displaying the schema with printSchema() and printTreeString().
