PySpark Convert Dictionary/Map to Multiple Columns

PySpark's MapType column is used to store a Python dictionary (dict) object, so you can convert a MapType (map) column into multiple columns (a separate DataFrame column for every key).

First, let’s create a DataFrame with a MapType column.


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Each row pairs a name with a dictionary of properties
dataDictionary = [
        ('James',{'hair':'black','eye':'brown'}),
        ('Michael',{'hair':'brown','eye':None}),
        ('Robert',{'hair':'red','eye':'black'}),
        ('Washington',{'hair':'grey','eye':'grey'}),
        ('Jefferson',{'hair':'brown','eye':''})
        ]

# Spark infers the properties column as a map of string to string
df = spark.createDataFrame(data=dataDictionary, schema=['name','properties'])
df.printSchema()
df.show(truncate=False)

This yields the DataFrame schema and table below. properties is a MapType (dict) column, which I am going to convert into individual columns.


root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |[eye -> brown, hair -> black]|
|Michael   |[eye ->, hair -> brown]      |
|Robert    |[eye -> black, hair -> red]  |
|Washington|[eye -> grey, hair -> grey]  |
|Jefferson |[eye -> , hair -> brown]     |
+----------+-----------------------------+
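If you prefer to define the schema explicitly rather than letting Spark infer it, here is a minimal sketch (my own addition, not part of the original example) that declares the same properties column as a MapType.


from pyspark.sql.types import StructType, StructField, StringType, MapType

# Assumed alternative: same data as above, but with an explicit schema instead of inference
schema = StructType([
    StructField("name", StringType(), True),
    StructField("properties", MapType(StringType(), StringType()), True)
])
df2 = spark.createDataFrame(data=dataDictionary, schema=schema)
df2.printSchema()  # properties: map (key: string, value: string)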

Convert Dictionary/MapType to Multiple Columns

From the above PySpark DataFrame, let’s convert the map/dictionary values of the properties column into individual columns, named after the map keys.

By using getItem() on the pyspark.sql.Column class, we can get the value for a map key. This method takes the map key (a string) as a parameter. Using it, let’s extract the value for each key from the map.

To use this approach, you need to know the keys you want to extract from the MapType column. The first example below uses the RDD API and accesses the dictionary values directly; the getItem() approach follows.


# Access the dictionary values directly on each Row via the RDD API
df3 = df.rdd.map(lambda x: \
    (x.name, x.properties["hair"], x.properties["eye"])) \
    .toDF(["name","hair","eye"])
df3.printSchema()
df3.show()

This yields the output below.


root
 |-- name: string (nullable = true)
 |-- hair: string (nullable = true)
 |-- eye: string (nullable = true)

+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+

Below is another approach to convert a PySpark MapType column to multiple columns, using getItem() on the column.


df.withColumn("hair",df.properties.getItem("hair")) \
  .withColumn("eye",df.properties.getItem("eye")) \
  .drop("properties") \
  .show()

Alternatively, you can use dictionary-style bracket notation on the column.


df.withColumn("hair",df.properties["hair"]) \
  .withColumn("eye",df.properties["eye"]) \
  .drop("properties") \
  .show()
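
As a variation (my own sketch, not from the original example), the same extraction can also be done in a single select() with aliases, which avoids chaining multiple withColumn() calls.


from pyspark.sql.functions import col

# Sketch: extract both map keys in one select() and alias them as new columns
df.select(
    col("name"),
    col("properties").getItem("hair").alias("hair"),
    col("properties").getItem("eye").alias("eye")
).show()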

To use any of the above approaches, you need to know all the unique keys in the map column.

Convert All Map Keys into Columns

If the map has many keys, it’s not practical to hardcode the key names, so let’s extract them programmatically instead.


from pyspark.sql.functions import explode, map_keys, col

# Collect the distinct map keys, then build one column expression per key
keysDF = df.select(explode(map_keys(df.properties))).distinct()
keysList = keysDF.rdd.map(lambda x: x[0]).collect()
keyCols = list(map(lambda x: col("properties").getItem(x).alias(str(x)), keysList))
df.select(df.name, *keyCols).show()

This yields the same output as above.
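As an alternative sketch (my own, not from the original post), the distinct keys can also be collected without touching the RDD API, which can be useful where df.rdd is not available (for example, with Spark Connect).


from pyspark.sql.functions import explode, map_keys, col

# Sketch: collect the distinct keys on the driver using collect() instead of df.rdd
keysList = [row[0] for row in
            df.select(explode(map_keys(df.properties))).distinct().collect()]
keyCols = [col("properties").getItem(k).alias(k) for k in keysList]
df.select(df.name, *keyCols).show()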

You can find the complete example in the GitHub PySpark Examples project.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium