In this Spark DataFrame article, I will explain how to convert a map (MapType) column into multiple columns (one column for each map key) using a Scala example.
Spark supports multiple map functions to get the keys and values of a map column, and the Column class also has a few methods for working with MapType. Let's see these functions with examples.
Before we proceed with converting a map type column into multiple columns, first, let's create a DataFrame.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
import spark.implicits._
// Creating a DataFrame with a MapType column
val arrayStructureData = Seq(
  Row("James", Map("hair" -> "black", "eye" -> "brown")),
  Row("Michael", Map("hair" -> "gray", "eye" -> "black")),
  Row("Robert", Map("hair" -> "brown"))
)

val arrayStructureSchema = new StructType()
  .add("name", StringType)
  .add("property", MapType(StringType, StringType))

val mapTypeDF = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
mapTypeDF.printSchema()
mapTypeDF.show(false)
This yields the below output.
// Output:
root
|-- name: string (nullable = true)
|-- property: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+-------+-----------------------------+
|name |property |
+-------+-----------------------------+
|James |[hair -> black, eye -> brown]|
|Michael|[hair -> gray, eye -> black] |
|Robert |[hair -> brown] |
+-------+-----------------------------+
From the above DataFrame, let's convert the map values of the property column into individual columns and name them "hair_color" and "eye_color".

By using getItem() of the org.apache.spark.sql.Column class, we can get the value of a map key. This method takes the map key as a parameter. Let's use it to extract the value for each key from the map. Note that in order to use this function, you need to know in advance which keys you want to extract from the MapType column.
mapTypeDF.select(col("name"),
  col("property").getItem("hair").as("hair_color"),
  col("property").getItem("eye").as("eye_color"))
  .show(false)
This yields the below output with the new columns hair_color and eye_color.
// Output:
+-------+----------+---------+
|name |hair_color|eye_color|
+-------+----------+---------+
|James |black |brown |
|Michael|gray |black |
|Robert |brown |null |
+-------+----------+---------+
Note that if a key is not present in a row, getItem() returns a null value for that row.
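As a side note (not part of the original example), Spark also provides the element_at() SQL function (available since Spark 2.4), which does the same lookup and likewise returns null for a missing key. A minimal sketch using the same mapTypeDF:

```scala
import org.apache.spark.sql.functions.{col, element_at}

// element_at(map, key) returns the value for the given key,
// or null when the key is absent from the map
mapTypeDF.select(col("name"),
  element_at(col("property"), "hair").as("hair_color"),
  element_at(col("property"), "eye").as("eye_color"))
  .show(false)
```

This produces the same result as the getItem() version above; which one you prefer is mostly a matter of style.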
If your keys are dynamic and you want to automatically find all the keys and convert them into columns, first collect the distinct keys and then build one column expression per key.
import spark.implicits._
// Collect the distinct map keys to the driver
val keysDF = mapTypeDF.select(explode(map_keys($"property"))).distinct()
val keys = keysDF.collect().map(f => f.get(0))
// Build one column expression per key
val keyCols = keys.map(f => col("property").getItem(f).as(f.toString))
mapTypeDF.select(col("name") +: keyCols: _*).show(false)
This yields the below output.
// Output:
+-------+-----+-----+
|name |hair |eye |
+-------+-----+-----+
|James |black|brown|
|Michael|gray |black|
|Robert |brown|null |
+-------+-----+-----+
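As an alternative to collecting the keys and building a select list, the same result can be obtained by exploding the map into key/value rows and pivoting the keys back into columns. A minimal sketch, again using the same mapTypeDF:

```scala
import org.apache.spark.sql.functions.{col, explode, first}

// explode() on a map column yields one row per entry, with "key" and "value" columns;
// pivot("key") then turns each distinct key into its own column
mapTypeDF
  .select(col("name"), explode(col("property")))
  .groupBy("name")
  .pivot("key")
  .agg(first("value"))
  .show(false)
```

Note that pivot() also triggers a job to find the distinct keys (unless you pass them explicitly), so performance is comparable to the collect()-based approach shown above.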
Complete Example
Below is a complete example of converting a Spark DataFrame MapType column into multiple columns.
package com.sparkbyexamples.spark.dataframe.functions.collection
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
object MapToColumn extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  val arrayStructureData = Seq(
    Row("James", Map("hair" -> "black", "eye" -> "brown")),
    Row("Michael", Map("hair" -> "gray", "eye" -> "black")),
    Row("Robert", Map("hair" -> "brown"))
  )

  val arrayStructureSchema = new StructType()
    .add("name", StringType)
    .add("property", MapType(StringType, StringType))

  val mapTypeDF = spark.createDataFrame(
    spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
  mapTypeDF.printSchema()
  mapTypeDF.show(false)

  // Extract known keys into columns
  mapTypeDF.select(col("name"),
    col("property").getItem("hair").as("hair_color"),
    col("property").getItem("eye").as("eye_color"))
    .show(false)

  // Dynamically find all keys and convert them into columns
  import spark.implicits._
  val keysDF = mapTypeDF.select(explode(map_keys($"property"))).distinct()
  val keys = keysDF.collect().map(f => f.get(0))
  val keyCols = keys.map(f => col("property").getItem(f).as(f.toString))
  mapTypeDF.select(col("name") +: keyCols: _*).show(false)
}
This example is also available at Spark Example Github project for reference.
Happy Learning !!