Site icon Spark By {Examples}

How to Convert Struct type to Columns in Spark

Spark Convert Struct Columns

I have a Spark DataFrame with StructType and would like to convert it to Columns, could you please explain how to do it?

Converting Struct type to columns is one of the most commonly used transformations in Spark DataFrame. In order to explain I will create the Spark DataFrame with Struct columns


val structureData = Seq(
    Row(Row("James ","","Smith"),Row(Row("CA","Los Angles"),Row("CA","Sandiago"))),
    Row(Row("Michael ","Rose",""),Row(Row("NY","New York"),Row("NJ","Newark"))),
    Row(Row("Robert ","","Williams"),Row(Row("DE","Newark"),Row("CA","Las Vegas"))),
    Row(Row("Maria ","Anne","Jones"),Row(Row("PA","Harrisburg"),Row("CA","Sandiago"))),
    Row(Row("Jen","Mary","Brown"),Row(Row("CA","Los Angles"),Row("NJ","Newark")))
  )

val structureSchema = new StructType()
    .add("name",new StructType()
      .add("firstname",StringType)
      .add("middlename",StringType)
      .add("lastname",StringType))
    .add("address",new StructType()
      .add("current",new StructType()
        .add("state",StringType)
        .add("city",StringType))
      .add("previous",new StructType()
        .add("state",StringType)
        .add("city",StringType)))

val df = spark.createDataFrame(
    spark.sparkContext.parallelize(structureData),structureSchema)
df.printSchema()

From the above example, df.printSchema() yields the below output.


root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- current: struct (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |-- previous: struct (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |    |-- city: string (nullable = true)

As you see, the above DataFrame schema consists of two struct columns name and address. Let’s convert name struct type these into columns.


val df2 = df.select(col("name.*"),
    col("address.current.*"),
    col("address.previous.*"))
val df2Flatten = df2.toDF("fname","mename","lname","currAddState",
    "currAddCity","prevAddState","prevAddCity")
df2Flatten.printSchema()
df2Flatten.show(false)

The above example converts the Spark DataFrame struct column into multiple columns.


root
 |-- name_firstname: string (nullable = true)
 |-- name_middlename: string (nullable = true)
 |-- name_lastname: string (nullable = true)
 |-- address_current_state: string (nullable = true)
 |-- address_current_city: string (nullable = true)
 |-- address_previous_state: string (nullable = true)
 |-- address_previous_city: string (nullable = true)

+--------+------+--------+------------+-----------+------------+-----------+
|fname   |mename|lname   |currAddState|currAddCity|prevAddState|prevAddCity|
+--------+------+--------+------------+-----------+------------+-----------+
|James   |      |Smith   |CA          |Los Angles |CA          |Sandiago   |
|Michael |Rose  |        |NY          |New York   |NJ          |Newark     |
|Robert  |      |Williams|DE          |Newark     |CA          |Las Vegas  |
|Maria   |Anne  |Jones   |PA          |Harrisburg |CA          |Sandiago   |
|Jen     |Mary  |Brown   |CA          |Los Angles |NJ          |Newark     |
+--------+------+--------+------------+-----------+------------+-----------

In case you wanted to Flatten many struct columns automatically, refer to this page.


Hope this helps !!

Exit mobile version