Let’s say you have the following Spark DataFrame with a StructType (struct) column “properties”, and you want to convert the struct to a MapType (map) column.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StringType, IntegerType}

val structureData = Seq(
  Row("36636","Finance",Row(3000,"USA")),
  Row("40288","Finance",Row(5000,"IND")),
  Row("42114","Sales",Row(3900,"USA")),
  Row("39192","Marketing",Row(2500,"CAN")),
  Row("34534","Sales",Row(6500,"USA"))
)

val structureSchema = new StructType()
  .add("id", StringType)
  .add("dept", StringType)
  .add("properties", new StructType()
    .add("salary", IntegerType)
    .add("location", StringType))

var df = spark.createDataFrame(
  spark.sparkContext.parallelize(structureData), structureSchema)
df.printSchema()
df.show(false)
This outputs the following schema and DataFrame.
root
|-- id: string (nullable = true)
|-- dept: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- salary: integer (nullable = true)
| |-- location: string (nullable = true)
+-----+---------+-----------+
|id |dept |properties |
+-----+---------+-----------+
|36636|Finance |[3000, USA]|
|40288|Finance |[5000, IND]|
|42114|Sales |[3900, USA]|
|39192|Marketing|[2500, CAN]|
|34534|Sales |[6500, USA]|
+-----+---------+-----------+
Now I want to convert the StructType column “properties” to a MapType column on the DataFrame, as shown below.
+-----+---------+---------------------------------+
|id |dept |propertiesMap |
+-----+---------+---------------------------------+
|36636|Finance |[salary -> 3000, location -> USA]|
|40288|Finance |[salary -> 5000, location -> IND]|
|42114|Sales |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+
Solution: By using the map() SQL function you can create a MapType column. map() takes a flat sequence of alternating key and value columns, so to convert, first collect the struct field names and values into a list and pass that list to map().
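To make the alternating key/value signature of map() concrete, here is a minimal sketch with the two field names hardcoded (using the same df created above; the column name propertiesMap is just the one this article uses):

```scala
import org.apache.spark.sql.functions.{col, lit, map}

// map() takes columns in the order key1, value1, key2, value2, ...
// Keys are string literals; values are pulled out of the struct.
val mapped = df.withColumn("propertiesMap",
  map(
    lit("salary"),   col("properties.salary"),
    lit("location"), col("properties.location")
  ))
```

The code below does the same thing generically, by reading the field names from the DataFrame’s schema instead of hardcoding them.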
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.{col, lit, map}
import scala.collection.mutable

// Locate the struct column and get its schema
val index = df.schema.fieldIndex("properties")
val propSchema = df.schema(index).dataType.asInstanceOf[StructType]

// Build an alternating sequence of key and value columns:
// name1, value1, name2, value2, ...
var columns = mutable.LinkedHashSet[Column]()
propSchema.fields.foreach(field => {
  columns.add(lit(field.name))
  columns.add(col("properties." + field.name))
})

df = df.withColumn("propertiesMap", map(columns.toSeq: _*))
df = df.drop("properties")
df.printSchema()
df.show(false)
This yields the following output. Note that the map values are all strings: map() coerces the integer salary and the string location to a common type, which is why the schema shows value: string.
root
|-- id: string (nullable = true)
|-- dept: string (nullable = true)
|-- propertiesMap: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+-----+---------+---------------------------------+
|id |dept |propertiesMap |
+-----+---------+---------------------------------+
|36636|Finance |[salary -> 3000, location -> USA]|
|40288|Finance |[salary -> 5000, location -> IND]|
|42114|Sales |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+
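As a side note (an alternative not covered above, so treat it as a sketch): when the struct field names are known up front, the same conversion can be written as a Spark SQL expression via expr(), which avoids the schema-walking code. This assumes it is applied to the original df, before the properties column was dropped:

```scala
import org.apache.spark.sql.functions.expr

// map() is also available as a Spark SQL built-in function;
// the keys here are SQL string literals.
val df2 = df.withColumn("propertiesMap",
    expr("map('salary', properties.salary, 'location', properties.location)"))
  .drop("properties")
```

Both versions produce the same propertiesMap column; the schema-driven version is preferable when the struct has many fields or its shape changes over time.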
Happy Learning !!
Related Articles
- How to Convert Struct type to Columns in Spark
- Spark from_json() – Convert JSON Column to Struct, Map or Multiple Columns
- Spark SQL – Flatten Nested Struct Column
- Spark Unstructured vs semi-structured vs Structured data
- Spark – Create a DataFrame with Array of Struct column
- Spark – explode Array of Struct to rows
- Spark SQL StructType & StructField with examples
- Spark – Using XStream API to write complex XML structures