PySpark Create DataFrame From Dictionary (Dict)

PySpark MapType (map) is a key-value pair type used to create DataFrame map columns, similar to the Python dictionary (dict) data structure.

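When you create a DataFrame from data that contains Python dictionaries, PySpark by default infers the dictionary (dict) values and creates a DataFrame with a MapType column. (When reading a JSON file, nested objects are inferred as struct columns unless you supply a schema that declares them as MapType.) Note that PySpark doesn’t have a dictionary type; instead it uses MapType to store dictionary data.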

In this article, I will explain how to manually create a PySpark DataFrame from a Python dict, how to read dict elements by key, and how to perform some map operations using SQL functions. First, let’s create data as a list of rows that each contain a Python dictionary (dict) object. The example below has 2 columns, of type String and Dictionary, with the dictionary written as {key:value,key:value}.

dataDictionary = [
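The listing above stops after its opening line. A possible reconstruction is sketched below; the row values are inferred from the output tables later in this article (Michael's eye is assumed to be None and Jefferson's an empty string, matching how those tables render them), so treat the exact literals as an assumption rather than the original listing.

```python
# Hypothetical reconstruction of the sample data, inferred from the
# output tables shown later in this article.
dataDictionary = [
    ("James", {"hair": "black", "eye": "brown"}),
    ("Michael", {"hair": "brown", "eye": None}),      # null eye colour
    ("Robert", {"hair": "red", "eye": "black"}),
    ("Washington", {"hair": "grey", "eye": "grey"}),
    ("Jefferson", {"hair": "brown", "eye": ""}),      # empty string, not null
]

# Each element is a (name, properties) tuple: a string and a dict.
for name, properties in dataDictionary:
    print(name, properties)
```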

Create DataFrame from Dictionary (Dict) Example

Now create a PySpark DataFrame from the dictionary object and name the dictionary column properties. In PySpark, the key and value types can be any Spark type that extends org.apache.spark.sql.types.DataType.

df = spark.createDataFrame(data=dataDictionary, schema = ["name","properties"])

Calling df.printSchema() and df.show(truncate=False) displays the schema and the contents of the DataFrame. Notice that the dictionary column properties is represented as map in the schema below.

root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |[eye -> brown, hair -> black]|
|Michael   |[eye ->, hair -> brown]      |
|Robert    |[eye -> black, hair -> red]  |
|Washington|[eye -> grey, hair -> grey]  |
|Jefferson |[eye -> , hair -> brown]     |
+----------+-----------------------------+

Create a DataFrame Dictionary Column Using StructType

As I said in the beginning, PySpark doesn’t have a Dictionary type; instead it uses MapType to store the dictionary object. Below is an example of how to create a DataFrame MapType column using pyspark.sql.types.StructType.

MapType(StringType(),StringType()) – Here both the key and the value are of type StringType.

from pyspark.sql.types import StructField, StructType, StringType, MapType
schema = StructType([
  StructField('name', StringType(), True),
  StructField('properties', MapType(StringType(),StringType()),True)
])
df2 = spark.createDataFrame(data=dataDictionary, schema = schema)

This creates a DataFrame with the same schema as above.

Extract Values from DataFrame Dictionary Column

Let’s see how to extract the keys and values from the PySpark DataFrame dictionary column. Here I have used the PySpark map() transformation to read the values of properties (the MapType column) by key, which produces the following result.

+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+

Let’s use another way to get the value of a key from a map: getItem() of the Column type. This method takes a key as its argument and returns the corresponding value. Indexing the column with square brackets does the same thing.

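# Look up each key with Column.getItem()
df.withColumn("hair", df.properties.getItem("hair")) \
  .withColumn("eye", df.properties.getItem("eye")) \
  .drop("properties") \
  .show()

# Equivalent lookup using bracket syntax on the map column
df.withColumn("hair", df.properties["hair"]) \
  .withColumn("eye", df.properties["eye"]) \
  .drop("properties") \
  .show()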


Spark doesn’t have a Dict type; instead it provides MapType (also referred to as map) to store Python dictionary elements. In this article you have learned how to create a MapType column using StructType and how to retrieve values from a map column.

Happy Learning !!


Naveen (NNK)
