Spark Flatten Nested Array to Single Array Column

Problem: How to flatten the Array of Array or Nested Array DataFrame column into a single array column using Spark.

Solution: Spark SQL provides flatten function to convert an Array of Array column (nested Array) ArrayType(ArrayType(StringType)) to single array column on Spark DataFrame using scala example.

Related:

First, let’s create a DataFrame with an array column within another array column, from below example column “subjects” is an array of ArraType which holds all subjects learned.


  val arrayArrayData = Seq(
    Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
    Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
    Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
  )

  val arrayArraySchema = new StructType().add("name",StringType)
    .add("subjects",ArrayType(ArrayType(StringType)))

  val df = spark.createDataFrame(
     spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
  df.printSchema()
  df.show()

df.printSchema() and df.show() returns the following schema and table.


root
 |-- name: string (nullable = true)
 |-- subjects: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)


+-------+-----------------------------------+
|name   |subjects                           |
+-------+-----------------------------------+
|James  |[[Java, Scala, C++], [Spark, Java]]|
|Michael|[[Spark, Java, C++], [Spark, Java]]|
|Robert |[[CSharp, VB], [Spark, Python]]    |
+-------+-----------------------------------+

1. Flatten – Nested array to single array

Flatten – Creates a single array from an array of arrays (nested array). If a structure of nested arrays is deeper than two levels then only one level of nesting is removed. below snippet convert “subjects” column to a single array.


// Flatten - Nested array to single array
Syntax : flatten(e: Column): Column

df.select($"name",flatten($"subjects")).show(false)

Outputs:


// Outputs:
+-------+-------------------------------+
|name   |flatten(subjects)              |
+-------+-------------------------------+
|James  |[Java, Scala, C++, Spark, Java]|
|Michael|[Spark, Java, C++, Spark, Java]|
|Robert |[CSharp, VB, Spark, Python]    |
+-------+-------------------------------+

2. Complete Spark Flatten Nested Array Example (flatten function)


package com.sparkbyexamples.spark.dataframe.functions.collection

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{explode, flatten}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

object ArrayOfArrayType extends App {

  val spark = SparkSession.builder().appName("SparkByExamples.com")
    .master("local[1]")
    .getOrCreate()

  val arrayArrayData = Seq(
    Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
    Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
    Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
  )

  val arrayArraySchema = new StructType().add("name",StringType)
    .add("subjects",ArrayType(ArrayType(StringType)))

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
  df.printSchema()
  df.show(false)

  // Convert Array of Array into Single array
  df.select($"name",flatten($"subjects")).show(false)

}

Conclusion

In this article, you have learned how to defined nested array using StructType and how to flatten the nested array to a single array using Spark Flatten function and Scala example.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply