PySpark MapType (Dict) Usage with Examples

PySpark MapType (also called map type) is a data type used to represent a Python dictionary (dict) and store key-value pairs. A MapType object comprises three fields: keyType (a DataType), valueType (a DataType), and valueContainsNull (a BooleanType). What is PySpark MapType? PySpark MapType is used to represent map key-value pairs similar to…

Continue Reading PySpark MapType (Dict) Usage with Examples

PySpark Create DataFrame From Dictionary (Dict)

PySpark MapType (map) is a key-value pair type used to create a DataFrame with map columns, similar to the Python dictionary (dict) data structure. While reading a JSON file with dictionary data, PySpark by default infers the dictionary (dict) data and creates a DataFrame with a MapType column. Note that PySpark…

Continue Reading PySpark Create DataFrame From Dictionary (Dict)

PySpark StructType & StructField Explained with Examples

PySpark StructType & StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested struct, array, and map columns. StructType is a collection of StructFields, each of which defines a column name, a column data type, and a boolean to specify whether the field can be nullable or not…

Continue Reading PySpark StructType & StructField Explained with Examples

Spark Schema – Explained with Examples

A Spark schema defines the structure of a DataFrame, which you can view by calling the printSchema() method on the DataFrame object. Spark SQL provides the StructType & StructField classes to programmatically specify the schema. By default, Spark infers the schema from the data; however, sometimes we may need to define our own…

Continue Reading Spark Schema – Explained with Examples

Spark – explode Array of Map to rows

Problem: How do you explode Array-of-Map DataFrame columns to rows using Spark? Solution: The Spark explode function can be used to explode Array-of-Map (ArrayType(MapType)) columns to rows on a Spark DataFrame, shown here with a Scala example. Before we start, let's create a DataFrame with a map column in an array…

Continue Reading Spark – explode Array of Map to rows

Working with Spark MapType DataFrame Column

In this article, I will explain how to create a Spark DataFrame MapType (map) column using the org.apache.spark.sql.types.MapType class and how to apply some DataFrame SQL functions to the map column, using Scala examples. While working with Spark structured (Avro, Parquet, etc.) or semi-structured (JSON) files, we often get data with complex…

Continue Reading Working with Spark MapType DataFrame Column

Spark SQL Map functions – complete list

In this article, I will explain the usage of the Spark SQL map functions map(), map_keys(), map_values(), map_concat(), and map_from_entries() on DataFrame columns, using Scala examples. Though I've explained them here with Scala, similar methods can be used to work with Spark SQL map functions in PySpark, and if time permits I will cover that in the future. If…

Continue Reading Spark SQL Map functions – complete list

Spark SQL StructType & StructField with examples

Spark SQL StructType & StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested struct, array, and map columns. StructType is a collection of StructFields. Using StructField, we can define the column name, column data type, and nullable column (a boolean to specify whether the…

Continue Reading Spark SQL StructType & StructField with examples