PySpark Convert StructType (struct) to Dictionary/MapType (map)

Problem: How do you convert a StructType (struct) DataFrame column to a MapType (map) column, which is similar to a Python dictionary (dict)?

Solution: PySpark provides the create_map() function, which takes alternating key and value columns as arguments and returns a MapType column, so we can use it to convert a DataFrame struct column to a map column. A struct column has the type StructType, and MapType stores key-value pairs, much like a Python dictionary.


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Each row carries a nested tuple that maps onto the properties struct
data = [("36636", "Finance", (3000, "USA")),
    ("40288", "Finance", (5000, "IND")),
    ("42114", "Sales", (3900, "USA")),
    ("39192", "Marketing", (2500, "CAN")),
    ("34534", "Sales", (6500, "USA"))]

# properties is a nested StructType with salary and location fields
schema = StructType([
    StructField('id', StringType(), True),
    StructField('dept', StringType(), True),
    StructField('properties', StructType([
        StructField('salary', IntegerType(), True),
        StructField('location', StringType(), True)
    ]))
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

This yields the below output. Here we have a properties struct column that has two fields, salary and location.


root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- location: string (nullable = true)

+-----+---------+-----------+
|id   |dept     |properties |
+-----+---------+-----------+
|36636|Finance  |[3000, USA]|
|40288|Finance  |[5000, IND]|
|42114|Sales    |[3900, USA]|
|39192|Marketing|[2500, CAN]|
|34534|Sales    |[6500, USA]|
+-----+---------+-----------+

Convert StructType to MapType (map) Column

create_map() is a PySpark SQL function that builds a MapType column from key-value column pairs. Here we pass each struct field name as a literal key and the field itself as the value.


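# Convert struct type to Map
from pyspark.sql.functions import col, lit, create_map

# create_map() takes alternating key and value columns:
# lit() supplies the key name, col() reads the matching struct field
df = df.withColumn("propertiesMap", create_map(
        lit("salary"), col("properties.salary"),
        lit("location"), col("properties.location")
        )).drop("properties")  # drop the original struct column
df.printSchema()
df.show(truncate=False)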

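This yields the below output; the properties struct column has been converted to propertiesMap, which is a MapType (map) column. Note that all values in a map share one type, so the integer salary now appears as a string.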


root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- propertiesMap: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+-----+---------+---------------------------------+
|id   |dept     |propertiesMap                    |
+-----+---------+---------------------------------+
|36636|Finance  |[salary -> 3000, location -> USA]|
|40288|Finance  |[salary -> 5000, location -> IND]|
|42114|Sales    |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales    |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+

You can also achieve this programmatically, without specifying each struct field name individually, but I will cover that later.

Happy Learning !!

NNK
