Spark from_json() – Convert JSON Column to Struct, Map or Multiple Columns

In Spark/PySpark, the from_json() SQL function is used to convert a JSON string column on a DataFrame into a struct type, MapType (map), or multiple columns.

1. Spark from_json() Syntax

Following are the different syntaxes of the from_json() function.


from_json(Column jsonStringcolumn, Column schema)
from_json(Column jsonStringcolumn, DataType schema)
from_json(Column jsonStringcolumn, StructType schema)
from_json(Column jsonStringcolumn, DataType schema, Map<String,String> options)
from_json(Column jsonStringcolumn, Column schema, Map<String,String> options)
from_json(Column jsonStringcolumn, StructType schema, Map<String,String> options)
and a few more overloads that accept a java.util.Map for the options.
  • jsonStringcolumn – the DataFrame column that contains the JSON string.
  • schema – the schema of the JSON; accepts a DataType, Column, String (DDL-formatted), or StructType.
  • options – optional JSON parsing options; accepts the same options as the JSON data source (spark.read.json).

2. Spark from_json() Usage Example

Let’s create a DataFrame with a column that contains a JSON string; in the next sections, I will parse this column and convert it to a MapType (map), struct, and multiple columns using the from_json() function.


import spark.implicits._
val jsonString = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
val data = Seq((1, jsonString))
val df = data.toDF("id","value")
df.show(false)

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+

2.1 Spark Convert JSON Column to Map type Column

By using the syntax from_json(Column jsonStringcolumn, DataType schema), you can convert a JSON string column on a Spark DataFrame into a MapType (map) column. MapType is a subclass of DataType.


import org.apache.spark.sql.functions.{from_json,col}
import org.apache.spark.sql.types.{MapType, StringType}
val df2=df.withColumn("value",from_json(col("value"),MapType(StringType,StringType)))
df2.printSchema()
df2.show(false)

root
 |-- id: integer (nullable = false)
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------------+
|id |value                                                                      |
+---+---------------------------------------------------------------------------+
|1  |[Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR]|
+---+---------------------------------------------------------------------------+

Note that by using withColumn() I have replaced the value column, converting it from a JSON string to a MapType column.

2.2 Spark Convert JSON Column to struct Column

Now, by using from_json(Column jsonStringcolumn, StructType schema), you can convert the JSON string on the Spark DataFrame column to a struct type. To do so, first create a StructType that matches the fields in the JSON string.


import org.apache.spark.sql.types.{StringType, StructType}
val schema = new StructType()
    .add("Zipcode", StringType, true)
    .add("ZipCodeType", StringType, true)
    .add("City", StringType, true)
    .add("State", StringType, true)

Let’s use this schema on from_json().


val df4=df.withColumn("value",from_json(col("value"),schema))
df4.printSchema()
df4.show(false)

root
 |-- id: integer (nullable = false)
 |-- value: struct (nullable = true)
 |    |-- Zipcode: string (nullable = true)
 |    |-- ZipCodeType: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- State: string (nullable = true)

+---+--------------------------------+
|id |value                           |
+---+--------------------------------+
|1  |[704, STANDARD, PARC PARQUE, PR]|
+---+--------------------------------+

2.3 Spark Convert JSON Column to Multiple Columns

Finally, let’s convert the value struct to individual columns.


val df5=df4.select(col("id"),col("value.*"))
df5.printSchema()
df5.show()

root
 |-- id: integer (nullable = false)
 |-- Zipcode: string (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)

+---+-------+-----------+-----------+-----+
| id|Zipcode|ZipCodeType|       City|State|
+---+-------+-----------+-----------+-----+
|  1|    704|   STANDARD|PARC PARQUE|   PR|
+---+-------+-----------+-----------+-----+

3. PySpark from_json() Syntax

Following is the syntax of the PySpark from_json() function.


def from_json(col, schema, options={})

4. PySpark from_json() Usage Example

Since I have already explained above how to parse a JSON string column and convert it to MapType, struct type, and multiple columns, for PySpark I will just provide the complete example.


from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Create a DataFrame with a JSON string column
jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
df=spark.createDataFrame([(1, jsonString)],["id","value"])
df.show(truncate=False)

#Convert JSON string column to Map type
from pyspark.sql.types import MapType,StringType
from pyspark.sql.functions import from_json
df2=df.withColumn("value",from_json(df.value,MapType(StringType(),StringType())))
df2.printSchema()
df2.show(truncate=False)

#Create schema for JSON
from pyspark.sql.types import StructType,StructField, StringType
schema = StructType([ 
    StructField("Zipcode",StringType(),True), 
    StructField("ZipCodeType",StringType(),True), 
    StructField("City",StringType(),True), 
    StructField("State", StringType(), True)
  ])

#Convert JSON string column to struct type
from pyspark.sql.functions import col,from_json
df3 = df.withColumn("value",from_json(col("value"),schema))
df3.printSchema()
df3.show(truncate=False)

#Convert to multiple columns
df4=df3.select("id", "value.*")
df4.printSchema()
df4.show()

Happy Learning !!

NNK