PySpark MapType (also called map type) is a data type used to represent a Python dictionary (dict) and store key-value pairs. A MapType object comprises three fields: keyType (a DataType), valueType (a DataType), and valueContainsNull (a boolean).
MapType extends the DataType class, which is the superclass of all types in PySpark. Its constructor takes two mandatory arguments, keyType and valueType, of type DataType, and one optional boolean argument, valueContainsNull. keyType and valueType can be any type that extends the DataType class, e.g. StringType, IntegerType, ArrayType, MapType, StructType (struct), etc.
1. Create PySpark MapType
In order to use the MapType data type, first import MapType from pyspark.sql.types, then use the MapType() constructor to create a map object.
from pyspark.sql.types import StringType, MapType
mapCol = MapType(StringType(),StringType(),False)
MapType Key Points:
- The first param, keyType, is used to specify the type of the keys in the map.
- The second param, valueType, is used to specify the type of the values in the map.
- The third param, valueContainsNull, is an optional boolean used to specify whether the values can accept Null/None (see the sketch below).
- The keys of the map won’t accept None/Null values.
- PySpark provides several SQL functions to work with MapType.
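To make these points concrete, here is a minimal sketch (the object names are illustrative) that builds two MapType objects and inspects their fields.
from pyspark.sql.types import MapType, StringType, IntegerType

# Values may be null by default (valueContainsNull=True)
scoresMap = MapType(StringType(), IntegerType())

# Declare that values are never null with the optional third argument
strictMap = MapType(StringType(), IntegerType(), valueContainsNull=False)

print(scoresMap.simpleString())      # map<string,int>
print(strictMap.valueContainsNull)   # False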
2. Create MapType From StructType
Let’s see how to create a MapType column by using PySpark StructType & StructField. The StructType() constructor takes a list of StructFields, and each StructField takes a field name and the type of its value.
from pyspark.sql.types import StructField, StructType, StringType, MapType
schema = StructType([
StructField('name', StringType(), True),
StructField('properties', MapType(StringType(),StringType()),True)
])
Now let’s create a DataFrame using the above StructType schema.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
dataDictionary = [
('James',{'hair':'black','eye':'brown'}),
('Michael',{'hair':'brown','eye':None}),
('Robert',{'hair':'red','eye':'black'}),
('Washington',{'hair':'grey','eye':'grey'}),
('Jefferson',{'hair':'brown','eye':''})
]
df = spark.createDataFrame(data=dataDictionary, schema = schema)
df.printSchema()
df.show(truncate=False)
df.printSchema() prints the schema and df.show() displays the DataFrame output below.
root
|-- name: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James |[eye -> brown, hair -> black]|
|Michael |[eye ->, hair -> brown] |
|Robert |[eye -> black, hair -> red] |
|Washington|[eye -> grey, hair -> grey] |
|Jefferson |[eye -> , hair -> brown] |
+----------+-----------------------------+
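As a side note, the same schema can also be written as a DDL-formatted string, which spark.createDataFrame() accepts in place of a StructType. A minimal sketch, assuming Spark 2.3 or later (df2 is an illustrative name):
# Equivalent schema expressed as a DDL string
df2 = spark.createDataFrame(data=dataDictionary,
    schema="name string, properties map<string,string>")
df2.printSchema()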
3. Access PySpark MapType Elements
Let’s see how to extract the keys and values from the PySpark DataFrame MapType (dictionary) column. Here I have used the PySpark map() transformation on the underlying RDD to read the values of properties (a MapType column).
# Read map values by key with an RDD map() transformation
df3 = df.rdd.map(lambda x: \
    (x.name, x.properties["hair"], x.properties["eye"])) \
    .toDF(["name", "hair", "eye"])
df3.printSchema()
df3.show()
root
|-- name: string (nullable = true)
|-- hair: string (nullable = true)
|-- eye: string (nullable = true)
+----------+-----+-----+
| name| hair| eye|
+----------+-----+-----+
| James|black|brown|
| Michael|brown| null|
| Robert| red|black|
|Washington| grey| grey|
| Jefferson|brown| |
+----------+-----+-----+
Let’s use another way to get the value of a key from a map: getItem() of the Column type. This method takes a key as an argument and returns its value. Using square-bracket notation on the column, as in the second example below, is equivalent.
df.withColumn("hair",df.properties.getItem("hair")) \
.withColumn("eye",df.properties.getItem("eye")) \
.drop("properties") \
.show()
df.withColumn("hair",df.properties["hair"]) \
.withColumn("eye",df.properties["eye"]) \
.drop("properties") \
.show()
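Another option, assuming Spark 2.4 or later, is the element_at() SQL function, which takes the map column and a key and returns the corresponding value.
from pyspark.sql.functions import element_at

# Look up map values by key with element_at()
df.withColumn("hair", element_at(df.properties, "hair")) \
  .withColumn("eye", element_at(df.properties, "eye")) \
  .drop("properties") \
  .show()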
4. Functions
Below are some of the SQL functions available to work with MapType, with examples.
4.1 explode() – Explode Map into Rows
Let’s apply the explode() function to the map column in the DataFrame to expand each key-value pair into separate rows.
from pyspark.sql.functions import explode
df.select(df.name,explode(df.properties)).show()
+----------+----+-----+
| name| key|value|
+----------+----+-----+
| James| eye|brown|
| James|hair|black|
| Michael| eye| null|
| Michael|hair|brown|
| Robert| eye|black|
| Robert|hair| red|
|Washington| eye| grey|
|Washington|hair| grey|
| Jefferson| eye| |
| Jefferson|hair|brown|
+----------+----+-----+
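Note that explode() drops rows whose map column is null. If your data may contain null maps and you want to keep those rows (with null key and value columns), explode_outer() is a drop-in alternative, available since Spark 2.3.
from pyspark.sql.functions import explode_outer

# Like explode(), but keeps rows where the map column is null
df.select(df.name, explode_outer(df.properties)).show()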
4.2 map_keys() – Get All Map Keys
Use the map_keys() function to get all the keys of the map column as an array.
from pyspark.sql.functions import map_keys
df.select(df.name,map_keys(df.properties)).show()
+----------+--------------------+
| name|map_keys(properties)|
+----------+--------------------+
| James| [eye, hair]|
| Michael| [eye, hair]|
| Robert| [eye, hair]|
|Washington| [eye, hair]|
| Jefferson| [eye, hair]|
+----------+--------------------+
If you want to get all map keys as a Python list, you can collect them with the snippet below. Be aware that this collects data to the driver and runs a distinct over the whole dataset, so it can be very slow on large data.
from pyspark.sql.functions import explode,map_keys
keysDF = df.select(explode(map_keys(df.properties))).distinct()
keysList = keysDF.rdd.map(lambda x:x[0]).collect()
print(keysList)
#['eye', 'hair']
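One practical use of the collected keys is to turn each map key into its own column dynamically; a short sketch reusing keysList from above.
# Create one column per distinct map key
df.select("name",
    *[df.properties[k].alias(k) for k in keysList]) \
  .show()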
4.3 map_values() – Get All Map Values
Similarly, use the map_values() function to get all the values of the map column as an array.
from pyspark.sql.functions import map_values
df.select(df.name,map_values(df.properties)).show()
+----------+----------------------+
| name|map_values(properties)|
+----------+----------------------+
| James| [brown, black]|
| Michael| [, brown]|
| Robert| [black, red]|
|Washington| [grey, grey]|
| Jefferson| [, brown]|
+----------+----------------------+
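Newer Spark versions also provide higher-order functions for maps. For example, map_filter(), available in the Python API from Spark 3.1, can remove the null and empty values seen above; a minimal sketch:
from pyspark.sql.functions import map_filter

# Keep only entries whose value is neither null nor an empty string
df.select(df.name,
    map_filter(df.properties, lambda k, v: v.isNotNull() & (v != ""))
      .alias("properties")) \
  .show(truncate=False)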
Conclusion
MapType is a map data structure used to store key-value pairs, similar to a Python dictionary (dict). The key and value types of a map must be types that extend DataType. Map keys won’t accept null/None values, whereas map values can be None/Null.
Related Articles
- Convert DataFrame Columns to Map in PySpark
- PySpark StructType & StructField Explained with Examples
- Convert StructType to Map in PySpark
- Convert Dictionary/Map to Multiple Columns in PySpark
- Create PySpark DataFrame From List of Dictionary (Dict) Objects
- PySpark Convert DataFrame Columns to MapType (Dict)
- PySpark Convert StructType (struct) to Dictionary/MapType (map)
- Explain PySpark element_at() with Examples
- Iterate over Elements of Array in PySpark DataFrame