PySpark MapType (also called map type) is a data type that represents a Python dictionary (dict) for storing key-value pairs. A MapType object comprises three fields: keyType (a DataType), valueType (a DataType), and valueContainsNull (a boolean).
MapType extends the DataType class, which is the superclass of all types in PySpark. It takes two mandatory arguments, keyType and valueType, both of type DataType, and one optional boolean argument, valueContainsNull. keyType and valueType can be any type that extends the DataType class, e.g. StringType, IntegerType, ArrayType, MapType, StructType (struct), etc.
1. Create PySpark MapType
To use the MapType data type, first import it from pyspark.sql.types, then use the MapType() constructor to create a map object.
from pyspark.sql.types import StringType, MapType
mapCol = MapType(StringType(),StringType(),False)
MapType Key Points:
- The first param, keyType, specifies the type of the key in the map.
- The second param, valueType, specifies the type of the value in the map.
- The third param, valueContainsNull, is an optional boolean that specifies whether the values can be null/None. It defaults to True.
- The key of the map won’t accept null/None values.
- PySpark provides several SQL functions to work with MapType.
2. Create MapType From StructType
Let’s see how to create a MapType column using PySpark StructType and StructField. The StructType() constructor takes a list of StructField objects, and each StructField takes a field name and the type of the value.
from pyspark.sql.types import StructField, StructType, StringType, MapType
schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(),StringType()),True)
])
Now let’s create a DataFrame using the above StructType schema.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('').getOrCreate()
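dataDictionary = [
    ('James', {'hair': 'black', 'eye': 'brown'}),
    ('Michael', {'hair': 'brown', 'eye': None}),
    ('Robert', {'hair': 'red', 'eye': 'black'}),
    ('Washington', {'hair': 'grey', 'eye': 'grey'}),
    ('Jefferson', {'hair': 'brown', 'eye': ''})
]
df = spark.createDataFrame(data=dataDictionary, schema = schema)
df.printSchema()
df.show(truncate=False)
df.printSchema() yields the schema and df.show() yields the DataFrame output.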
root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |[eye -> brown, hair -> black]|
|Michael   |[eye ->, hair -> brown]      |
|Robert    |[eye -> black, hair -> red]  |
|Washington|[eye -> grey, hair -> grey]  |
|Jefferson |[eye -> , hair -> brown]     |
+----------+-----------------------------+
3. Access PySpark MapType Elements
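Let’s see how to extract the keys and values from the PySpark DataFrame dictionary column. Here I have used the PySpark map transformation to read the values of properties (MapType column).
df3 = df.rdd.map(lambda x: \
    (x.name, x.properties["hair"], x.properties["eye"])) \
    .toDF(["name", "hair", "eye"])
df3.printSchema()
df3.show()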
root
 |-- name: string (nullable = true)
 |-- hair: string (nullable = true)
 |-- eye: string (nullable = true)

+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+
Another way to get the value of a key from a map is the getItem() method of the Column type. This method takes a key as an argument and returns the corresponding value.
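df.withColumn("hair", df.properties.getItem("hair")) \
  .withColumn("eye", df.properties.getItem("eye")) \
  .drop("properties") \
  .show()

# Equivalent, using bracket syntax on the column
df.withColumn("hair", df.properties["hair"]) \
  .withColumn("eye", df.properties["eye"]) \
  .drop("properties") \
  .show()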
4. Functions
Below are some of the MapType Functions with examples.
4.1 – explode
from pyspark.sql.functions import explode
df.select(df.name, explode(df.properties)).show()
+----------+----+-----+
|      name| key|value|
+----------+----+-----+
|     James| eye|brown|
|     James|hair|black|
|   Michael| eye| null|
|   Michael|hair|brown|
|    Robert| eye|black|
|    Robert|hair|  red|
|Washington| eye| grey|
|Washington|hair| grey|
| Jefferson| eye|     |
| Jefferson|hair|brown|
+----------+----+-----+
4.2 map_keys() – Get All Map Keys
from pyspark.sql.functions import map_keys
df.select(df.name, map_keys(df.properties)).show()
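+----------+--------------------+
|      name|map_keys(properties)|
+----------+--------------------+
|     James|         [eye, hair]|
|   Michael|         [eye, hair]|
|    Robert|         [eye, hair]|
|Washington|         [eye, hair]|
| Jefferson|         [eye, hair]|
+----------+--------------------+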
To get all distinct map keys as a Python list, use the following. WARNING: This runs very slowly.
from pyspark.sql.functions import explode,map_keys
keysDF = df.select(explode(map_keys(df.properties))).distinct()
keysList = keysDF.rdd.map(lambda x:x[0]).collect()
print(keysList)
#['eye', 'hair']
4.3 map_values() – Get All Map Values
from pyspark.sql.functions import map_values
df.select(df.name, map_values(df.properties)).show()
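+----------+----------------------+
|      name|map_values(properties)|
+----------+----------------------+
|     James|        [brown, black]|
|   Michael|             [, brown]|
|    Robert|          [black, red]|
|Washington|          [grey, grey]|
| Jefferson|             [, brown]|
+----------+----------------------+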
MapType is a map data structure used to store key-value pairs, similar to a Python dictionary (dict). The key and value types of a map must extend DataType. Keys won’t accept null/None values, whereas the values of a map can be None/null.