PySpark Create DataFrame From Dictionary (Dict)

  • Post category: PySpark
  • Post last modified: March 27, 2024

PySpark MapType (map) is a key-value data type, similar to the Python Dictionary (Dict) data structure, that is used to create DataFrame columns holding map data.

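When you create a DataFrame from Python data that contains dictionaries, PySpark by default infers the dictionary (Dict) values and creates a MapType column. Note that PySpark doesn’t have a dictionary type; it uses MapType to store dictionary data. When reading a JSON file, by contrast, nested objects are inferred as StructType by default, so you need to supply a schema to get a MapType column.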

In this article, I will explain how to create a PySpark DataFrame from Python dictionaries manually, how to read Dict elements by key, and how to retrieve values from the map column. First, let’s create the data as a list of tuples, each holding a name and a Python Dictionary (Dict) object; the example below has two columns, of type String and Dictionary ({key:value,key:value}).


dataDictionary = [
        ('James',{'hair':'black','eye':'brown'}),
        ('Michael',{'hair':'brown','eye':None}),
        ('Robert',{'hair':'red','eye':'black'}),
        ('Washington',{'hair':'grey','eye':'grey'}),
        ('Jefferson',{'hair':'brown','eye':''})
        ]

Create DataFrame from Dictionary (Dict) Example

Now create a PySpark DataFrame from the dictionary data and name the map column properties. In PySpark, the key and value types of a map can be any Spark type that extends org.apache.spark.sql.types.DataType.


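from pyspark.sql import SparkSession
# Create (or reuse) a SparkSession; the pyspark shell already provides one as spark
spark = SparkSession.builder.appName("dict-to-dataframe").getOrCreate()
df = spark.createDataFrame(data=dataDictionary, schema=["name","properties"])
df.printSchema()
df.show(truncate=False)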

This displays the schema and the contents of the PySpark DataFrame. Notice that the dictionary column properties is represented as map in the schema below.


root
 |-- name: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |[eye -> brown, hair -> black]|
|Michael   |[eye ->, hair -> brown]      |
|Robert    |[eye -> black, hair -> red]  |
|Washington|[eye -> grey, hair -> grey]  |
|Jefferson |[eye -> , hair -> brown]     |
+----------+-----------------------------+

Create a DataFrame Dictionary Column Using StructType

As mentioned at the beginning, PySpark doesn’t have a Dictionary type; it uses MapType to store dictionary objects. Below is an example of how to define a MapType column explicitly using pyspark.sql.types.StructType.

MapType(StringType(),StringType()) – Here both the key and the value are of StringType.


from pyspark.sql.types import StructField, StructType, StringType, MapType
schema = StructType([
  StructField('name', StringType(), True),
  StructField('properties', MapType(StringType(),StringType()),True)
])
df2 = spark.createDataFrame(data=dataDictionary, schema = schema)

This creates a DataFrame with the same schema as above.

Extract Values from DataFrame Dictionary Column

Let’s see how to extract the keys and values from the PySpark DataFrame Dictionary column. Here I have used the RDD map() transformation to read the values of properties (MapType column) by key.


df.rdd.map(lambda x: 
    (x.name,x.properties["hair"],x.properties["eye"])
  ).toDF(["name","hair","eye"]).show()

+----------+-----+-----+
|      name| hair|  eye|
+----------+-----+-----+
|     James|black|brown|
|   Michael|brown| null|
|    Robert|  red|black|
|Washington| grey| grey|
| Jefferson|brown|     |
+----------+-----+-----+

Another way to get the value of a key from a map is getItem() of the Column type; this method takes the key as an argument and returns the corresponding value.


df.withColumn("hair",df.properties.getItem("hair")) \
  .withColumn("eye",df.properties.getItem("eye")) \
  .drop("properties") \
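  .show()

The same lookup can also be written with square-bracket indexing on the column, which returns the same result:

df.withColumn("hair",df.properties["hair"]) \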
  .withColumn("eye",df.properties["eye"]) \
  .drop("properties") \
  .show()

Conclusion

Spark doesn’t have a Dict type. Instead, it provides MapType, also referred to as a map, to store Python Dictionary elements. In this article, you have learned how to create a DataFrame with a MapType column, both by letting PySpark infer it and by defining it with StructType, and how to retrieve values from the map column.

Happy Learning !!
