Spark from_json() – Convert JSON Column to Struct, Map or Multiple Columns

In Spark/PySpark from_json() SQL function is used to convert JSON string from DataFrame column into struct column, Map type, and multiple columns.

1. Spark from_json() Syntax

Following are the different syntaxes of from_json() function.


// Spark from_json() Syntax
from_json(Column jsonStringcolumn, Column schema)
from_json(Column jsonStringcolumn, DataType schema)
from_json(Column jsonStringcolumn, StructType schema)
from_json(Column jsonStringcolumn, DataType schema, Map options)
from_json(Column jsonStringcolumn, Column schema, Map options)
from_json(Column jsonStringcolumn, StructType schema, Map options)
---
---
and more to support Java util Map
  • jsonStringcolumn – DataFrame column where you have a JSON string.
  • schema – JSON schema, supports either DataType, Column, String, and StructType.
  • options – Optional JSON parsing options. Accepts the same options as JSON data source (spark.read.json)

2. Spark from_json() Usage Example

Let’s create a DataFrame with a column contains JSON string and in the next section, I will parse this column and convert it to MapType (map), struct, and multiple columns using the from_json() function.


// Spark from_json() Usage Example
val jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
val data = Seq((1, jsonString))
import spark.implicits._
val df=data.toDF("id","value")
df.show(false)

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+

2.1 Spark Convert JSON Column to Map type Column

By using syntax from_json(Column jsonStringcolumn, DataType schema), you can convert Spark DataFrame with JSON string into MapType (map) column. MapType is a subclass of DataType.


// Spark Convert JSON Column to Map type Column
import org.apache.spark.sql.functions.{from_json,col}
import org.apache.spark.sql.types.{MapType, StringType}
val df2=df.withColumn("value",from_json(col("value"),MapType(StringType,StringType)))
df2.printSchema()
df2.show(false)

root
 |-- id: integer (nullable = false)
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------------+
|id |value                                                                      |
+---+---------------------------------------------------------------------------+
|1  |[Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR]|
+---+---------------------------------------------------------------------------+

Note that by using withColumn() I have converted the value column from JSON string to MapType.

2.1 Spark Convert JSON Column to struct Column

Now by using from_json(Column jsonStringcolumn, StructType schema), you can convert JSON string on the Spark DataFrame column to a struct type. In order to do so, first, you need to create a StructType for the JSON string.


// Spark Convert JSON Column to struct Column
import org.apache.spark.sql.types.{StringType, StructType}
val schema = new StructType()
    .add("Zipcode", StringType, true)
    .add("ZipCodeType", StringType, true)
    .add("City", StringType, true)
    .add("State", StringType, true)

Let’s use this schema on from_json().


val df4=df.withColumn("value",from_json(col("value"),schema))
df4.printSchema()
df4.show(false)

root
 |-- id: integer (nullable = false)
 |-- value: struct (nullable = true)
 |    |-- Zipcode: string (nullable = true)
 |    |-- ZipCodeType: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- State: string (nullable = true)

+---+--------------------------------+
|id |value                           |
+---+--------------------------------+
|1  |[704, STANDARD, PARC PARQUE, PR]|
+---+--------------------------------+

2.2 Spark Convert JSON Column to Multiple Columns

Finally, let’s convert the value struct to individual columns.


// Spark Convert JSON Column to Multiple Columns
val df5=df4.select(col("id"),col("value.*"))
df5.printSchema()
df5.show()

root
 |-- id: integer (nullable = false)
 |-- Zipcode: string (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)

+---+-------+-----------+-----------+-----+
| id|Zipcode|ZipCodeType|       City|State|
+---+-------+-----------+-----------+-----+
|  1|    704|   STANDARD|PARC PARQUE|   PR|
+---+-------+-----------+-----------+-----+

3. PySpark from_json() Syntax

Following is syntax of from_json() syntax.


// PySpark from_json() Syntax
def from_json(col, schema, options={})

4. PySpark from_json() Usage Example

Since I have already explained how to query and parse JSON string column and convert it to MapType, struct type, and multiple columns above, with PySpark I will just provide the complete example.


// PySpark from_json() Usage Example
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#read json from text file
jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
df=spark.createDataFrame([(1, jsonString)],["id","value"])
df.show(truncate=False)

#Convert JSON string column to Map type
from pyspark.sql.types import MapType,StringType
from pyspark.sql.functions import from_json
df2=df.withColumn("value",from_json(df.value,MapType(StringType(),StringType())))
df2.printSchema()
df2.show(truncate=False)

#Create schema for JSON
from pyspark.sql.types import StructType,StructField, StringType
schema = StructType([ 
    StructField("Zipcode",StringType(),True), 
    StructField("ZipCodeType",StringType(),True), 
    StructField("City",StringType(),True), 
    StructField("State", StringType(), True)
  ])

#Convert JSON string column to struct type
from pyspark.sql.functions import col,from_json
df3 = df.withColumn("value",from_json(col("value"),schema))
df3.printSchema()
df3.show(truncate=False)

#Convert to multile columns
df4=df3.select("id", "value.*")
df4.printSchema()
df4.show()

Happy Learning !!

References

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ @ LinkedIn

Leave a Reply