By default, Spark SQL infers the schema while reading a JSON file, but we can override this and read the JSON with a user-defined schema using the spark.read.schema(schema)
method.
A Spark schema defines the structure of the data: column names, data types, nested columns, nullability, etc. When a schema is specified while reading a file, the DataFrame interprets and reads the file with that schema, and once the DataFrame is created, the schema becomes the structure of the DataFrame. Spark SQL provides the StructType and StructField classes to specify the schema programmatically.
Spark Read JSON with schema
Use the StructType class to create a custom schema. Below we instantiate this class and use its add method to add columns, providing a column name, data type, and nullable option for each.
//Define custom schema
import org.apache.spark.sql.types._

val schema = new StructType()
.add("City", StringType, true)
.add("Country", StringType, true)
.add("Decommisioned", BooleanType, true)
.add("EstimatedPopulation", LongType, true)
.add("Lat", DoubleType, true)
.add("Location", StringType, true)
.add("LocationText", StringType, true)
.add("LocationType", StringType, true)
.add("Long", DoubleType, true)
.add("Notes", StringType, true)
.add("RecordNumber", LongType, true)
.add("State", StringType, true)
.add("TaxReturnsFiled", LongType, true)
.add("TotalWages", LongType, true)
.add("WorldRegion", StringType, true)
.add("Xaxis", DoubleType, true)
.add("Yaxis", DoubleType, true)
.add("Zaxis", DoubleType, true)
.add("Zipcode", StringType, true)
.add("ZipCodeType", StringType, true)
val df_with_schema = spark.read.schema(schema).json("src/main/resources/zipcodes.json")
df_with_schema.printSchema()
df_with_schema.show(false)
The above example ignores the inferred schema and uses the custom schema while reading the JSON file; printSchema()
shows the schema and show() outputs the data.
Read Schema from JSON file
If you have too many fields and the structure of the DataFrame changes now and then, it's good practice to load the Spark SQL schema from a JSON file. Note that the schema definition in JSON uses a different layout; you can obtain it by calling schema.prettyJson
and saving the resulting JSON string to a file.
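For example, you could write the schema defined above to a file once and load it on later runs; a minimal sketch, where the output path is just an illustration:

```scala
import java.io.PrintWriter

// Serialize the StructType defined earlier to its JSON representation
// and save it so later runs can load it instead of redefining the struct.
val writer = new PrintWriter("src/main/resources/schema.json")
writer.write(schema.prettyJson)
writer.close()
```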
import scala.io.Source
import org.apache.spark.sql.types.DataType

val url = ClassLoader.getSystemResource("schema.json")
val schemaSource = Source.fromFile(url.getFile).getLines.mkString
val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
val df2 = spark.read.schema(schemaFromJson)
.json("src/main/resources/zipcodes.json")
df2.printSchema()
df2.show(false)
This prints the same output as the previous section. You could also keep the name, type, and nullable flag for each column in a comma-separated file and use those entries to create the struct programmatically; I will leave that for you to explore.
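One possible sketch of that idea, assuming a hypothetical file schema.csv where each line looks like City,string,true (name, DDL type name, nullable):

```scala
import org.apache.spark.sql.types._
import scala.io.Source

// Fold each CSV line into a growing StructType.
// DataType.fromDDL parses a type name such as "string" or "bigint".
val csvSchema = Source.fromFile("src/main/resources/schema.csv")
  .getLines()
  .foldLeft(new StructType()) { (struct, line) =>
    val Array(name, dataType, nullable) = line.split(",").map(_.trim)
    struct.add(name, DataType.fromDDL(dataType), nullable.toBoolean)
  }
csvSchema.printTreeString()
```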
Reading schema from DDL string
As with loading the structure from a JSON string, we can also create it from a DDL string, and you can generate a DDL string from an existing schema using toDDL()
. Calling printTreeString() on a struct object prints the schema in the same format that printSchema()
returns.
val ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING, `middle`: STRING>,`age` INT,`gender` STRING"
val ddlSchema = StructType.fromDDL(ddlSchemaStr)
ddlSchema.printTreeString()
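Going the other way, toDDL() on an existing struct produces a DDL string you can store and reuse; a quick sketch, assuming a Spark version (2.4+) where toDDL is available:

```scala
import org.apache.spark.sql.types._

val struct = new StructType()
  .add("age", IntegerType, true)
  .add("gender", StringType, true)

// Serialize to DDL, then parse it back into an equivalent StructType
val ddl = struct.toDDL
val roundTripped = StructType.fromDDL(ddl)
roundTripped.printTreeString()
```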
Converting Case class to Schema
If you have a Scala case class representing your input JSON schema, Spark SQL provides Encoders to convert the case class to a struct schema object. If you are using older versions of Spark, you can also transform the case class to the schema using a Scala reflection hack. Both examples are shown below.
case class Name(first:String,last:String,middle:String)
case class Employee(fullName:Name,age:Integer,gender:String)
//Using Scala Hack
import org.apache.spark.sql.catalyst.ScalaReflection
val scalaSchema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]
//Using Encoders
import org.apache.spark.sql.Encoders

val encoderSchema = Encoders.product[Employee].schema
encoderSchema.printTreeString()
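The derived schema can be passed to the reader just like a hand-written one; a sketch, assuming a hypothetical employees.json whose fields match the Employee case class:

```scala
// Read JSON using the schema derived from the case class
val dfEmp = spark.read.schema(encoderSchema)
  .json("src/main/resources/employees.json")  // hypothetical file
dfEmp.printSchema()
```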
This complete example is available at GitHub
Conclusion
In this tutorial, you learned how to read a JSON file with a custom schema and explored different ways to create a Spark SQL schema.
Happy Learning !!