Spark Flatten Nested Array to Single Array Column

Problem: How to flatten an array of arrays (nested array) DataFrame column into a single array column using Spark.

Solution: Spark SQL provides the flatten function, which converts an array-of-arrays column (nested array), ArrayType(ArrayType(StringType)), into a single array column on a Spark DataFrame. The examples below use Scala.

First, let’s create a DataFrame with an array column nested inside another array column. In the example below, the column “subjects” is an array of ArrayType that holds all the subjects learned.


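  // `spark` is assumed to be an existing SparkSession
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

  // Each row holds a name and a nested list of subjects
  val arrayArrayData = Seq(
    Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
    Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
    Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
  )

  // "subjects" is declared as an array of string arrays
  val arrayArraySchema = new StructType().add("name",StringType)
    .add("subjects",ArrayType(ArrayType(StringType)))

  val df = spark.createDataFrame(
     spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
  df.printSchema()
  df.show(false) // false disables truncation of long cell values

df.printSchema() and df.show(false) return the following schema and table.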


root
 |-- name: string (nullable = true)
 |-- subjects: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)


+-------+-----------------------------------+
|name   |subjects                           |
+-------+-----------------------------------+
|James  |[[Java, Scala, C++], [Spark, Java]]|
|Michael|[[Spark, Java, C++], [Spark, Java]]|
|Robert |[[CSharp, VB], [Spark, Python]]    |
+-------+-----------------------------------+

Flatten – Nested array to single array

The flatten function creates a single array from an array of arrays (nested array). If the nested arrays are more than two levels deep, only one level of nesting is removed. The snippet below converts the “subjects” column to a single array.


Syntax: flatten(e: Column): Column

import spark.implicits._ // enables the $"column" syntax
df.select($"name",flatten($"subjects")).show(false)

Outputs:


+-------+-------------------------------+
|name   |flatten(subjects)              |
+-------+-------------------------------+
|James  |[Java, Scala, C++, Spark, Java]|
|Michael|[Spark, Java, C++, Spark, Java]|
|Robert |[CSharp, VB, Spark, Python]    |
+-------+-------------------------------+
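The note that only one level of nesting is removed can be checked with a small sketch. The three-level data, the app name, and the variable names below are hypothetical and not part of the original example. It is written script-style, for instance for spark-shell or a notebook, and simply applies flatten twice to get down to a flat array.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, flatten}

val spark = SparkSession.builder().appName("FlattenDepthSketch")
  .master("local[1]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: "subjects" is ArrayType(ArrayType(ArrayType(StringType)))
val deepDf = Seq(
  ("James", Seq(Seq(Seq("Java", "Scala"), Seq("C++")), Seq(Seq("Spark"))))
).toDF("name", "subjects")

// First flatten removes one level: the result is still an array of arrays
val once = deepDf.select(col("name"), flatten(col("subjects")).as("subjects"))

// Second flatten removes the remaining level: the result is an array of strings
val twice = once.select(col("name"), flatten(col("subjects")).as("subjects"))

once.printSchema()
twice.show(false)
```

After the first call the column holds [[Java, Scala], [C++], [Spark]]; only the second call yields [Java, Scala, C++, Spark].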

Complete Spark Flatten Nested Array Example (flatten function)


package com.sparkbyexamples.spark.dataframe.functions.collection

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.flatten
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

object ArrayOfArrayType extends App {

  val spark = SparkSession.builder().appName("SparkByExamples.com")
    .master("local[1]")
    .getOrCreate()

  // Required for the $"column" syntax used below
  import spark.implicits._

  val arrayArrayData = Seq(
    Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
    Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
    Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
  )

  val arrayArraySchema = new StructType().add("name",StringType)
    .add("subjects",ArrayType(ArrayType(StringType)))

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
  df.printSchema()
  df.show(false)

  //Convert Array of Array into Single array
  df.select($"name",flatten($"subjects")).show(false)

}
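The example above leaves the result column named flatten(subjects). As a possible variation, the sketch below keeps the original column name, once through the Column API with withColumn and once through a SQL expression with selectExpr. The app name, the variable names, and the shortened data set are assumptions made for illustration, and the code is written script-style rather than as an App object.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, flatten}

val spark = SparkSession.builder().appName("FlattenAliasSketch")
  .master("local[1]")
  .getOrCreate()
import spark.implicits._

// Shortened version of the article's data, built with toDF instead of Row + schema
val df = Seq(
  ("James", Seq(Seq("Java", "Scala", "C++"), Seq("Spark", "Java"))),
  ("Robert", Seq(Seq("CSharp", "VB"), Seq("Spark", "Python")))
).toDF("name", "subjects")

// Column API: overwrite the nested column with its flattened form
val viaColumnApi = df.withColumn("subjects", flatten(col("subjects")))

// SQL expression: the same function, with an explicit alias
val viaSqlExpr = df.selectExpr("name", "flatten(subjects) AS subjects")

viaColumnApi.printSchema()
viaSqlExpr.show(false)
```

Both variants should produce a "subjects" column of type array of string, which avoids having to refer to a column literally named flatten(subjects) in later steps.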

Conclusion

In this article, you have learned how to define a nested array using StructType and how to flatten the nested array into a single array using the Spark flatten function, with a Scala example.

Happy Learning !!
