• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:6 mins read
You are currently viewing PySpark Create DataFrame From Dictionary (Dict)

PySpark MapType (map) is a key-value pair that is used to create a DataFrame with map columns similar to Python Dictionary (Dict) data structure.

While reading a JSON file with dictionary data, PySpark by default infers the dictionary (Dict) data and create a DataFrame with MapType column, Note that PySpark doesn’t have a dictionary type instead it uses MapType to store the dictionary data.

In this article, I will explain how to create a PySpark DataFrame from Python manually, and explain how to read Dict elements by key, and some map operations using SQL functions. First, let’s create data with a list of Python Dictionary (Dict) objects; below example has two columns of type String & Dictionary as {key:value,key:value}.


dataDictionary = [
        ('James',{'hair':'black','eye':'brown'}),
        ('Michael',{'hair':'brown','eye':None}),
        ('Robert',{'hair':'red','eye':'black'}),
        ('Washington',{'hair':'red','eye':'grey'}),
        ('Jefferson',{'hair':'red','eye':''})
        ]

Create DataFrame from Dictionary (Dict) Example

Now create a PySpark DataFrame from Dictionary object and name it as properties, In Pyspark key & value types can be any Spark type that extends org.apache.spark.sql.types.DataType.


df = spark.createDataFrame(data=dataDictionary, schema = ["name","properties"])
df.printSchema()
df.show(truncate=False)

This displays the PySpark DataFrame schema & result of the DataFrame. Notice that the dictionary column properties is represented as map on below schema.


root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |[eye -> brown, hair -> black]|
|Michael   |[eye ->, hair -> brown]      |
|Robert    |[eye -> black, hair -> red]  |
|Washington|[eye -> grey, hair -> grey]  |
|Jefferson |[eye -> , hair -> brown]     |
+----------+-----------------------------+

Create a DataFrame Dictionary Column Using StructType

As I said in the beginning, PySpark doesn’t have a Dictionary type instead it uses MapType to store the dictionary object, below is an example of how to create a DataFrame column MapType using pyspark.sql.types.StructType.

MapType(StringType(),StringType()) – Here both key and value is a StringType.


from pyspark.sql.types import StructField, StructType, StringType, MapType
schema = StructType([
  StructField('name', StringType(), True),
  StructField('properties', MapType(StringType(),StringType()),True)
])
df2 = spark.createDataFrame(data=dataDictionary, schema = schema)

This creates a DataFrame with the same schema as above.

Extract Values from DataFrame Dictionary Column

Let’s see how to extract the key and values from the PySpark DataFrame Dictionary column. Here I have used PySpark map transformation to read the values of properties (MapType column)


df.rdd.map(lambda x: 
    (x.name,x.properties["hair"],x.properties["eye"])
  ).toDF(["hair","eye"]).show()
+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+

Let’s use another way to get the value of a key from Map using getItem() of Column type, this method takes key as argument and returns a value.


df.withColumn("hair",df.properties.getItem("hair")) \
  .withColumn("eye",df.properties.getItem("eye")) \
  .drop("properties") \
  .show()

df.withColumn("hair",df.properties["hair"]) \
  .withColumn("eye",df.properties["eye"]) \
  .drop("properties") \
  .show()

Conclusion

Spark doesn’t have a Dict type. Instead, it contains a MapType, also referred to as a map, to store Python Dictionary elements; in this article, you have learned how to create a MapType column using StructType and retrieve values from the map column.

Happy Learning !!

          References

          Naveen Nelamali

          Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium