PySpark MapType (map) is a key-value pair type used to create DataFrame columns that behave like the Python Dictionary (Dict) data structure.
When you create a DataFrame from data that contains Python dictionaries, PySpark by default infers the dictionary (Dict) data as a MapType column. Note that when reading a JSON file, nested objects are inferred as struct columns unless you supply a schema that declares them as MapType. PySpark doesn't have a dictionary type; instead, it uses MapType to store dictionary data.
In this article, I will explain how to create a PySpark DataFrame from Python dictionary data manually and how to read Dict elements by key, using both an RDD transformation and Column methods. First, let's create data with a list of Python Dictionary (Dict) objects; the example below has two columns, of type String and Dictionary, the latter in the form {key:value,key:value}.
dataDictionary = [
('James',{'hair':'black','eye':'brown'}),
('Michael',{'hair':'brown','eye':None}),
('Robert',{'hair':'red','eye':'black'}),
('Washington',{'hair':'red','eye':'grey'}),
('Jefferson',{'hair':'red','eye':''})
]
Create DataFrame from Dictionary (Dict) Example
Now create a PySpark DataFrame from the dictionary data and name the map column properties. In PySpark, the key and value types can be any Spark type that extends org.apache.spark.sql.types.DataType.
df = spark.createDataFrame(data=dataDictionary, schema = ["name","properties"])
df.printSchema()
df.show(truncate=False)
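This displays the schema and the contents of the PySpark DataFrame. Notice that the dictionary column properties is represented as map in the schema below. The exact rendering of map values (brackets versus braces, and how nulls are shown) can differ between Spark versions.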
root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |[eye -> brown, hair -> black]|
|Michael   |[eye ->, hair -> brown]      |
|Robert    |[eye -> black, hair -> red]  |
|Washington|[eye -> grey, hair -> red]   |
|Jefferson |[eye -> , hair -> red]       |
+----------+-----------------------------+
Create a DataFrame Dictionary Column Using StructType
As I said in the beginning, PySpark doesn't have a Dictionary type; instead, it uses MapType to store the dictionary object. Below is an example of how to define a MapType DataFrame column using pyspark.sql.types.StructType.
In MapType(StringType(),StringType()), both the key and the value are of type StringType.
from pyspark.sql.types import StructField, StructType, StringType, MapType
schema = StructType([
StructField('name', StringType(), True),
StructField('properties', MapType(StringType(),StringType()),True)
])
df2 = spark.createDataFrame(data=dataDictionary, schema = schema)
This creates a DataFrame with the same schema as above.
Extract Values from DataFrame Dictionary Column
Let's see how to extract the keys and values from the PySpark DataFrame Dictionary column. Here I have used the RDD map transformation to read the values of properties (the MapType column) by key.
# Each tuple has three elements, so toDF() needs three column names
df.rdd.map(lambda x:
(x.name,x.properties["hair"],x.properties["eye"])
).toDF(["name","hair","eye"]).show()
+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington|  red| grey|
| Jefferson|  red|     |
+----------+-----+-----+
Another way to get the value of a key from a map column is the getItem() method of the Column type; it takes a key as its argument and returns the corresponding value.
df.withColumn("hair",df.properties.getItem("hair")) \
.withColumn("eye",df.properties.getItem("eye")) \
.drop("properties") \
.show()
Indexing the map column with square brackets is an equivalent shorthand for getItem().
df.withColumn("hair",df.properties["hair"]) \
.withColumn("eye",df.properties["eye"]) \
.drop("properties") \
.show()
Conclusion
Spark doesn't have a Dict type. Instead, it provides MapType, also referred to as a map, to store Python Dictionary elements. In this article, you have learned how to create a MapType column, both by inference and with an explicit StructType schema, and how to retrieve values from the map column by key.
Happy Learning !!