Convert Struct to a Map Type in Spark

Let’s say you have the following Spark DataFrame with a StructType (struct) column “properties”, and you want to convert that struct into a Map (MapType) column.


import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val structureData = Seq(
  Row("36636","Finance",Row(3000,"USA")),
  Row("40288","Finance",Row(5000,"IND")),
  Row("42114","Sales",Row(3900,"USA")),
  Row("39192","Marketing",Row(2500,"CAN")),
  Row("34534","Sales",Row(6500,"USA"))
)

val structureSchema = new StructType()
  .add("id",StringType)
  .add("dept",StringType)
  .add("properties", new StructType()
    .add("salary",IntegerType)
    .add("location",StringType))

var df = spark.createDataFrame(
  spark.sparkContext.parallelize(structureData), structureSchema)

df.printSchema()
df.show(false)

This outputs the following schema and DataFrame.


root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- location: string (nullable = true)

+-----+---------+-----------+
|id   |dept     |properties |
+-----+---------+-----------+
|36636|Finance  |[3000, USA]|
|40288|Finance  |[5000, IND]|
|42114|Sales    |[3900, USA]|
|39192|Marketing|[2500, CAN]|
|34534|Sales    |[6500, USA]|
+-----+---------+-----------+

Now, let’s convert the StructType column “properties” into a MapType column on the DataFrame, as shown below.


+-----+---------+---------------------------------+
|id   |dept     |propertiesMap                    |
+-----+---------+---------------------------------+
|36636|Finance  |[salary -> 3000, location -> USA]|
|40288|Finance  |[salary -> 5000, location -> IND]|
|42114|Sales    |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales    |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+

Solution: You can create a Map type with the map() SQL function, which takes a flat sequence of alternating key and value columns. To convert the struct, collect each field name (as a key) and the corresponding struct column (as a value), then pass them all to map().
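
To see the idea in isolation, here is a minimal sketch that hard-codes the two struct fields (dfHardcoded is just an illustrative name); the generic version below builds the same argument list from the schema instead.


import org.apache.spark.sql.functions.{col, lit, map}

// map() takes alternating key/value columns:
// key "salary" -> properties.salary, key "location" -> properties.location
val dfHardcoded = df.withColumn("propertiesMap",
  map(
    lit("salary"),   col("properties.salary"),
    lit("location"), col("properties.location")
  ))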


import scala.collection.mutable
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, map}

// Locate the "properties" struct in the schema
val index = df.schema.fieldIndex("properties")
val propSchema = df.schema(index).dataType.asInstanceOf[StructType]

// Build alternating key/value columns: lit(fieldName), col("properties.fieldName")
var columns = mutable.LinkedHashSet[Column]()
propSchema.fields.foreach(field => {
  columns.add(lit(field.name))
  columns.add(col("properties." + field.name))
})

// Create the MapType column and drop the original struct
df = df.withColumn("propertiesMap", map(columns.toSeq: _*))
df = df.drop("properties")
df.printSchema()
df.show(false)

This yields the following output. Note that salary (integer) and location (string) are coerced to a common type, so the map values come out as strings.


root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- propertiesMap: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+-----+---------+---------------------------------+
|id   |dept     |propertiesMap                    |
+-----+---------+---------------------------------+
|36636|Finance  |[salary -> 3000, location -> USA]|
|40288|Finance  |[salary -> 5000, location -> IND]|
|42114|Sales    |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales    |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+
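
If you would rather not walk the schema manually, here is a more compact sketch of the same conversion. It starts again from the original DataFrame (before “properties” was dropped above), flattens the struct with select("properties.*") to read the field names, and builds the alternating key/value list with flatMap; fieldNames, mapCols, and dfMap are illustrative names.


import org.apache.spark.sql.functions.{col, lit, map}

// Field names of the struct, taken from a flattened projection
val fieldNames = df.select("properties.*").columns

// Alternate lit(name) keys with the matching struct value columns
val mapCols = fieldNames.flatMap(name => Seq(lit(name), col(s"properties.$name")))

val dfMap = df.withColumn("propertiesMap", map(mapCols: _*)).drop("properties")
dfMap.show(false)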

Happy Learning !!
