Problem: How to flatten the Array of Array or Nested Array DataFrame column into a single array column using Spark.
Solution: Spark SQL provides flatten
function to convert an Array of Array column (nested Array) ArrayType(ArrayType(StringType))
to single array column on Spark DataFrame using scala example.
Related:
First, let’s create a DataFrame with an array column within another array column, from below example column “subjects” is an array of ArraType which holds all subjects learned.
val arrayArrayData = Seq(
Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
)
val arrayArraySchema = new StructType().add("name",StringType)
.add("subjects",ArrayType(ArrayType(StringType)))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
df.printSchema()
df.show()
df.printSchema() and df.show() returns the following schema and table.
root
|-- name: string (nullable = true)
|-- subjects: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
+-------+-----------------------------------+
|name |subjects |
+-------+-----------------------------------+
|James |[[Java, Scala, C++], [Spark, Java]]|
|Michael|[[Spark, Java, C++], [Spark, Java]]|
|Robert |[[CSharp, VB], [Spark, Python]] |
+-------+-----------------------------------+
Flatten – Nested array to single array
Flatten – Creates a single array from an array of arrays (nested array). If a structure of nested arrays is deeper than two levels then only one level of nesting is removed. below snippet convert “subjects” column to a single array.
Syntax : flatten(e: Column): Column
df.select($"name",flatten($"subjects")).show(false)
Outputs:
+-------+-------------------------------+
|name |flatten(subjects) |
+-------+-------------------------------+
|James |[Java, Scala, C++, Spark, Java]|
|Michael|[Spark, Java, C++, Spark, Java]|
|Robert |[CSharp, VB, Spark, Python] |
+-------+-------------------------------+
Complete Spark Flatten Nested Array Example (flatten function)
package com.sparkbyexamples.spark.dataframe.functions.collection
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{explode, flatten}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
object ArrayOfArrayType extends App {
val spark = SparkSession.builder().appName("SparkByExamples.com")
.master("local[1]")
.getOrCreate()
val arrayArrayData = Seq(
Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
)
val arrayArraySchema = new StructType().add("name",StringType)
.add("subjects",ArrayType(ArrayType(StringType)))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
df.printSchema()
df.show(false)
//Convert Array of Array into Single array
df.select($"name",flatten($"subjects")).show(false)
}
Conclusion
In this article, you have learned how to defined nested array using StructType and how to flatten the nested array to a single array using Spark Flatten function and Scala example.
Happy Learning !!