In Spark/PySpark, the from_json() SQL function converts a JSON string from a DataFrame column into a struct column, a MapType (map) column, or multiple columns.
1. Spark from_json() Syntax
Following are the different signatures of the from_json() function.
// Spark from_json() Syntax
from_json(Column jsonStringcolumn, Column schema)
from_json(Column jsonStringcolumn, DataType schema)
from_json(Column jsonStringcolumn, StructType schema)
from_json(Column jsonStringcolumn, DataType schema, Map options)
from_json(Column jsonStringcolumn, Column schema, Map options)
from_json(Column jsonStringcolumn, StructType schema, Map options)
There are also further overloads that accept a java.util.Map for the options argument.
jsonStringcolumn – the DataFrame column that holds the JSON string.
schema – the schema to parse against; DataType, Column, String, and StructType are all supported.
options – optional JSON parsing options. Accepts the same options as the JSON data source (spark.read.json).
2. Spark from_json() Usage Example
Let’s create a DataFrame with a column that contains a JSON string; in the following sections, I will parse this column and convert it to a MapType (map), a struct, and multiple columns using the from_json() function.
// Spark from_json() Usage Example
val jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
val data = Seq((1, jsonString))
import spark.implicits._
val df=data.toDF("id","value")
df.show(false)
+---+--------------------------------------------------------------------------+
|id |value |
+---+--------------------------------------------------------------------------+
|1 |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+
2.1 Spark Convert JSON Column to Map type Column
By using the from_json(Column jsonStringcolumn, DataType schema) signature, you can convert a DataFrame column with a JSON string into a MapType (map) column; MapType is a subclass of DataType, so it can be passed directly as the schema.
// Spark Convert JSON Column to Map type Column
import org.apache.spark.sql.functions.{from_json,col}
import org.apache.spark.sql.types.{MapType, StringType}
val df2=df.withColumn("value",from_json(col("value"),MapType(StringType,StringType)))
df2.printSchema()
df2.show(false)
root
|-- id: integer (nullable = false)
|-- value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---+---------------------------------------------------------------------------+
|id |value |
+---+---------------------------------------------------------------------------+
|1 |[Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR]|
+---+---------------------------------------------------------------------------+
Note that by using withColumn() I have converted the value column in place, from a JSON string to a MapType column.
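As a Spark-free sanity check, Python’s stdlib json module gives the same key/value view of this string that from_json() produces as a MapType. One difference worth noting: json.loads keeps Zipcode as an int, while MapType(StringType, StringType) casts every value to a string ("704"):

```python
import json

json_string = '{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}'

# Parse into a plain dict -- the same key/value pairs the MapType column holds,
# except values keep their JSON types (Zipcode stays an int here)
parsed = json.loads(json_string)
print(parsed["City"])    # PARC PARQUE
```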
2.2 Spark Convert JSON Column to struct Column
Now, by using from_json(Column jsonStringcolumn, StructType schema), you can convert the JSON string on the Spark DataFrame column to a struct type. To do so, first create a StructType that matches the fields in the JSON string.
// Spark Convert JSON Column to struct Column
import org.apache.spark.sql.types.{StringType, StructType}
val schema = new StructType()
.add("Zipcode", StringType, true)
.add("ZipCodeType", StringType, true)
.add("City", StringType, true)
.add("State", StringType, true)
Let’s use this schema on from_json().
val df4=df.withColumn("value",from_json(col("value"),schema))
df4.printSchema()
df4.show(false)
root
|-- id: integer (nullable = false)
|-- value: struct (nullable = true)
| |-- Zipcode: string (nullable = true)
| |-- ZipCodeType: string (nullable = true)
| |-- City: string (nullable = true)
| |-- State: string (nullable = true)
+---+--------------------------------+
|id |value |
+---+--------------------------------+
|1 |[704, STANDARD, PARC PARQUE, PR]|
+---+--------------------------------+
2.3 Spark Convert JSON Column to Multiple Columns
Finally, let’s flatten the value struct into individual columns.
// Spark Convert JSON Column to Multiple Columns
val df5=df4.select(col("id"),col("value.*"))
df5.printSchema()
df5.show()
root
|-- id: integer (nullable = false)
|-- Zipcode: string (nullable = true)
|-- ZipCodeType: string (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
+---+-------+-----------+-----------+-----+
| id|Zipcode|ZipCodeType| City|State|
+---+-------+-----------+-----------+-----+
| 1| 704| STANDARD|PARC PARQUE| PR|
+---+-------+-----------+-----------+-----+
3. PySpark from_json() Syntax
Following is the syntax of the PySpark from_json() function.
# PySpark from_json() Syntax
def from_json(col, schema, options={})
4. PySpark from_json() Usage Example
Since I have already explained above how to parse a JSON string column and convert it to a MapType, a struct, and multiple columns, for PySpark I will just provide the complete example.
# PySpark from_json() Usage Example
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
# JSON string to parse
jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
df=spark.createDataFrame([(1, jsonString)],["id","value"])
df.show(truncate=False)
#Convert JSON string column to Map type
from pyspark.sql.types import MapType,StringType
from pyspark.sql.functions import from_json
df2=df.withColumn("value",from_json(df.value,MapType(StringType(),StringType())))
df2.printSchema()
df2.show(truncate=False)
#Create schema for JSON
from pyspark.sql.types import StructType,StructField, StringType
schema = StructType([
StructField("Zipcode",StringType(),True),
StructField("ZipCodeType",StringType(),True),
StructField("City",StringType(),True),
StructField("State", StringType(), True)
])
#Convert JSON string column to struct type
from pyspark.sql.functions import col,from_json
df3 = df.withColumn("value",from_json(col("value"),schema))
df3.printSchema()
df3.show(truncate=False)
#Convert struct to multiple columns
df4=df3.select("id", "value.*")
df4.printSchema()
df4.show()
Happy Learning !!