PySpark MapType (map) is a key-value pair type used to create a DataFrame with map columns, similar to the Python dictionary (dict) data structure.
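When you create a DataFrame from Python data that contains dictionaries, PySpark infers the dictionary values and creates a MapType column. PySpark doesn't have a dictionary type; it uses MapType to store the dictionary data. Note that reading a JSON file behaves differently: nested JSON objects are inferred as StructType by default, so you need to supply a schema that declares MapType to get a map column.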
In this article, I will explain how to manually create a PySpark DataFrame from a Python dict, how to read dict elements by key, and some map operations using SQL functions. First, let's create data as a list of tuples. The example below has two columns, a string and a dictionary of the form {key:value,key:value}.
dataDictionary = [
    ('James',{'hair':'black','eye':'brown'}),
    ('Michael',{'hair':'brown','eye':None}),
    ('Robert',{'hair':'red','eye':'black'}),
    ('Washington',{'hair':'grey','eye':'grey'}),
    ('Jefferson',{'hair':'brown','eye':''})
]
Create DataFrame from Dictionary (Dict) Example
Now create a PySpark DataFrame from this list and name the dictionary column properties. In PySpark, the key and value types of a map can be any Spark type that extends org.apache.spark.sql.types.DataType.
df = spark.createDataFrame(data=dataDictionary, schema = ["name","properties"])
df.printSchema()
df.show(truncate=False)
This displays the DataFrame schema and its contents. Notice that the dictionary column properties is represented as map in the schema below.
root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |[eye -> brown, hair -> black]|
|Michael   |[eye ->, hair -> brown]      |
|Robert    |[eye -> black, hair -> red]  |
|Washington|[eye -> grey, hair -> grey]  |
|Jefferson |[eye -> , hair -> brown]     |
+----------+-----------------------------+
Create a DataFrame Dictionary Column Using StructType
As mentioned at the beginning, PySpark doesn't have a dictionary type; it uses MapType to store dictionary objects. Below is an example of how to define a MapType column using pyspark.sql.types.StructType. In MapType(StringType(),StringType()), both the key and the value are StringType.
from pyspark.sql.types import StructField, StructType, StringType, MapType
schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(),StringType()),True)
])
df2 = spark.createDataFrame(data=dataDictionary, schema = schema)
This creates a DataFrame with the same schema as above.
Extract Values from DataFrame Dictionary Column
Let's see how to extract the keys and values from the DataFrame dictionary column. Here I use the RDD map() transformation to read the values of properties (a MapType column).
# Each row becomes a (name, hair, eye) tuple, so toDF() needs three column names
df.rdd.map(lambda x:
    (x.name,x.properties["hair"],x.properties["eye"])
    ).toDF(["name","hair","eye"]).show()
+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+
Another way to get the value of a key from a map is getItem() on the Column type. This method takes a key as its argument and returns the corresponding value. Indexing the column with square brackets gives the same result.
# Using getItem()
df.withColumn("hair",df.properties.getItem("hair")) \
  .withColumn("eye",df.properties.getItem("eye")) \
  .drop("properties") \
  .show()

# Using square-bracket indexing on the map column
df.withColumn("hair",df.properties["hair"]) \
  .withColumn("eye",df.properties["eye"]) \
  .drop("properties") \
  .show()
Conclusion
Spark doesn't have a dict type; instead it provides MapType (also referred to as map) to store Python dictionary elements. In this article you have learned how to create a MapType column using StructType and how to retrieve values from a map column.
Happy Learning !!