To convert a StructType (struct) DataFrame column to a MapType (map) column in PySpark, you can use the create_map function from pyspark.sql.functions. This function creates a map column from a set of key-value pairs. Following are the steps:
- Import the required functions from the pyspark.sql.functions module.
- Extract the fields from the struct column.
- Use create_map to combine the extracted fields into a map.
Let’s see with an example.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [ ("36636","Finance",(3000,"USA")),
("40288","Finance",(5000,"IND")),
("42114","Sales",(3900,"USA")),
("39192","Marketing",(2500,"CAN")),
("34534","Sales",(6500,"USA")) ]
schema = StructType([
StructField('id', StringType(), True),
StructField('dept', StringType(), True),
StructField('properties', StructType([
StructField('salary', IntegerType(), True),
StructField('location', StringType(), True)
]))
])
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
This yields the below output. Here, the properties struct column has two fields, salary and location.
# Output:
root
|-- id: string (nullable = true)
|-- dept: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- salary: integer (nullable = true)
| |-- location: string (nullable = true)
+-----+---------+-----------+
|id |dept |properties |
+-----+---------+-----------+
|36636|Finance |[3000, USA]|
|40288|Finance |[5000, IND]|
|42114|Sales |[3900, USA]|
|39192|Marketing|[2500, CAN]|
|34534|Sales |[6500, USA]|
+-----+---------+-----------+
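Before converting, note that individual struct fields can be referenced with dot notation (for example properties.salary). A minimal sketch, using the DataFrame created above:
from pyspark.sql.functions import col

# Select struct fields with dot notation
df.select(col("id"), col("properties.salary"), col("properties.location")).show(truncate=False)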
Convert StructType to MapType (map) Column
create_map() is a PySpark SQL function that is used to convert StructType to MapType column.
# Convert struct type to Map
from pyspark.sql.functions import col,lit,create_map
df = df.withColumn("propertiesMap",create_map(
lit("salary"),col("properties.salary"),
lit("location"),col("properties.location")
)).drop("properties")
df.printSchema()
df.show(truncate=False)
This yields the below output. The properties struct column has been converted to a propertiesMap column of MapType (map). Note that all map values must share a single type, so the integer salary values are cast to string in the resulting map.
root
|-- id: string (nullable = true)
|-- dept: string (nullable = true)
|-- propertiesMap: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+-----+---------+---------------------------------+
|id |dept |propertiesMap |
+-----+---------+---------------------------------+
|36636|Finance |[salary -> 3000, location -> USA]|
|40288|Finance |[salary -> 5000, location -> IND]|
|42114|Sales |[salary -> 3900, location -> USA]|
|39192|Marketing|[salary -> 2500, location -> CAN]|
|34534|Sales |[salary -> 6500, location -> USA]|
+-----+---------+---------------------------------+
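Once the data is in a MapType column, values can be read back by key. A minimal sketch, reusing the propertiesMap column from the example above:
from pyspark.sql.functions import col

# Pull individual values back out of the map column by key
df.select(
    col("id"),
    col("propertiesMap")["salary"].alias("salary"),
    col("propertiesMap").getItem("location").alias("location")
).show(truncate=False)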
You can also achieve this programmatically without specifying each struct field name individually, as shown in the sketch below.
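A minimal sketch of that programmatic approach, assuming df still contains the properties struct column (i.e., before the drop above): read the struct's field names from the schema and build the create_map arguments in a loop.
from itertools import chain
from pyspark.sql.functions import col, lit, create_map

# Read the struct's field names from the DataFrame schema
struct_fields = df.schema["properties"].dataType.fieldNames()

# Build alternating key/value arguments for create_map
kv_pairs = list(chain.from_iterable(
    (lit(f), col(f"properties.{f}")) for f in struct_fields
))

df2 = df.withColumn("propertiesMap", create_map(*kv_pairs)).drop("properties")
df2.printSchema()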
Happy Learning !!
If you have column names and their data types in a list or dict, you can use them to build a StructType; a minimal sketch follows.
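For instance, a sketch that builds a StructType from a hypothetical list of (name, type) pairs:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical list of column names and their data types
columns = [("id", StringType()), ("dept", StringType()), ("salary", IntegerType())]

schema = StructType([StructField(name, dtype, True) for name, dtype in columns])
print(schema.simpleString())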
Can you please tell how to achieve this programmatically without specifying each struct column name individually? Will you cover this as well?