Spark Read JSON with or without Schema

By default, Spark SQL infers the schema while reading a JSON file, but we can ignore this inference and read the JSON with a user-defined schema using the spark.read.schema(schema) method.

What is a Spark Schema

A Spark schema defines the structure of the data: column names, data types, nested columns, nullability, etc. When a schema is specified while reading a file, the DataFrame interprets and reads the file with that structure, and once the DataFrame is created, the schema becomes the structure of the DataFrame. Spark SQL provides the StructType and StructField classes to specify the schema programmatically.

Spark Read JSON with schema

Use the StructType class to create a custom schema. Below we instantiate this class and use its add method to add columns to it by providing the column name, data type, and nullable option.


    //Imports needed to define a custom schema
    import org.apache.spark.sql.types._

    //Define custom schema
    val schema = new StructType()
      .add("City", StringType, true)
      .add("Country", StringType, true)
      .add("Decommisioned", BooleanType, true)
      .add("EstimatedPopulation", LongType, true)
      .add("Lat", DoubleType, true)
      .add("Location", StringType, true)
      .add("LocationText", StringType, true)
      .add("LocationType", StringType, true)
      .add("Long", DoubleType, true)
      .add("Notes", StringType, true)
      .add("RecordNumber", LongType, true)
      .add("State", StringType, true)
      .add("TaxReturnsFiled", LongType, true)
      .add("TotalWages", LongType, true)
      .add("WorldRegion", StringType, true)
      .add("Xaxis", DoubleType, true)
      .add("Yaxis", DoubleType, true)
      .add("Zaxis", DoubleType, true)
      .add("Zipcode", StringType, true)
      .add("ZipCodeType", StringType, true)

    val df_with_schema = spark.read.schema(schema).json("src/main/resources/zipcodes.json")
    df_with_schema.printSchema()
    df_with_schema.show(false)

The above example skips schema inference and applies the custom schema while reading the JSON file. It prints the schema with the printSchema() method and then displays the data.
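
For comparison, here is the "without schema" case: a minimal sketch in which Spark samples the JSON file and infers the column types itself.

    //Read the same file without a schema; Spark infers it by sampling the data
    val df_inferred = spark.read.json("src/main/resources/zipcodes.json")
    df_inferred.printSchema()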

Read Schema from JSON file

If you have too many fields and the structure of the DataFrame changes now and then, it's a good practice to load the Spark SQL schema from a JSON file. Note that the schema definition in JSON uses a different layout; you can get it using schema.prettyJson and put the resulting JSON string in a file.
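
For example, assuming the custom schema variable from the previous section, a minimal sketch that writes the schema's JSON representation to a file (the file path is illustrative):

//Write the schema's JSON layout to a file for later reuse
import java.io.PrintWriter
new PrintWriter("src/main/resources/schema.json") {
  write(schema.prettyJson)
  close()
}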


//Imports for loading a schema from a JSON file
import scala.io.Source
import org.apache.spark.sql.types.{DataType, StructType}

//Load the schema definition from a JSON file on the classpath
val url = ClassLoader.getSystemResource("schema.json")
val schemaSource = Source.fromFile(url.getFile).getLines.mkString
val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
val df2 = spark.read.schema(schemaFromJson)
       .json("src/main/resources/zipcodes.json")
df2.printSchema()
df2.show(false)

This prints the same output as the previous section. You can also keep the name, type, and nullable flag for each column in a comma-separated file and use those entries to create a struct programmatically; a minimal sketch follows below, and I will leave the rest to you to explore.
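
The sketch assumes a hypothetical schema.csv file whose lines hold name,type,nullable triples (for example, City,string,true):

//Build a StructType from name,type,nullable triples in a CSV file
import scala.io.Source
import org.apache.spark.sql.types._

//Map simple type names used in the CSV to Spark data types
val typeMap = Map(
  "string" -> StringType, "long" -> LongType,
  "double" -> DoubleType, "boolean" -> BooleanType)

//One StructField per line; unknown type names would throw here
val fields = Source.fromFile("src/main/resources/schema.csv").getLines()
  .map(_.split(",").map(_.trim))
  .collect { case Array(name, dataType, nullable) =>
    StructField(name, typeMap(dataType), nullable.toBoolean)
  }.toArray

val csvSchema = StructType(fields)
csvSchema.printTreeString()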

Reading schema from a DDL string

Like loading the structure from a JSON string, we can also create it from a DDL string, and you can generate DDL from a schema using toDDL(). Calling printTreeString() on the struct object prints the schema, similar to what the printSchema() function returns.


//Create a StructType from a DDL-formatted string
val ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING, `middle`: STRING>,`age` INT,`gender` STRING"
val ddlSchema = StructType.fromDDL(ddlSchemaStr)
ddlSchema.printTreeString()
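
Going the other way, toDDL() generates a DDL string back from an existing StructType (available in Spark 2.4 and later):

//Generate a DDL string from the schema
println(ddlSchema.toDDL)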

Converting Case class to Schema

If you have a Scala case class representing your input JSON schema, Spark SQL provides Encoders to convert the case class to a StructType schema object. If you are using older versions of Spark, you can also derive the schema from the case class using a Scala reflection hack. Both examples are shown here.


case class Name(first:String, last:String, middle:String)
case class Employee(fullName:Name, age:Integer, gender:String)

//Using the Scala reflection hack (works on older Spark versions)
import org.apache.spark.sql.catalyst.ScalaReflection
val scalaSchema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]

//Using Encoders
import org.apache.spark.sql.Encoders
val encoderSchema = Encoders.product[Employee].schema
encoderSchema.printTreeString()
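
To tie it together, the derived schema can be passed to the reader just like the hand-built one. The file name below is illustrative and assumes JSON records matching the Employee case class:

//Read JSON using the schema derived from the case class
val df3 = spark.read.schema(encoderSchema)
       .json("src/main/resources/employees.json")
df3.printSchema()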

The complete example is available at GitHub.

Conclusion

In this tutorial, you have learned how to read a JSON file with a custom schema and also learned different ways to create a Spark SQL schema.

Happy Learning !!
