Problem: How do you define a Spark DataFrame with a nested array column (an array of arrays)?
Solution: Using StructType, we can define a nested array column as ArrayType(ArrayType(StringType)) on a
DataFrame, shown here with a Scala example.
The example below creates a DataFrame with a nested array column: the "subjects" column is an ArrayType of ArrayType holding the subjects each person has learned.
val arrayArrayData = Seq(
Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
)
val arrayArraySchema = new StructType().add("name",StringType)
.add("subjects",ArrayType(ArrayType(StringType)))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
df.printSchema()
df.show()
df.printSchema() and df.show() return the following schema and table.
root
|-- name: string (nullable = true)
|-- subjects: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
+-------+-----------------------------------+
|name |subjects |
+-------+-----------------------------------+
|James |[[Java, Scala, C++], [Spark, Java]]|
|Michael|[[Spark, Java, C++], [Spark, Java]]|
|Robert |[[CSharp, VB], [Spark, Python]] |
+-------+-----------------------------------+
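As a side note, if you do not need explicit control over nullability, the same DataFrame can be built more concisely by letting Spark infer the nested array type from a Seq of tuples. This is a sketch of an alternative construction, not part of the original example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkByExamples.com")
  .master("local[1]").getOrCreate()
import spark.implicits._

// The schema (array<array<string>>) is inferred from the Scala types
val dfInferred = Seq(
  ("James",   Seq(Seq("Java","Scala","C++"), Seq("Spark","Java"))),
  ("Michael", Seq(Seq("Spark","Java","C++"), Seq("Spark","Java"))),
  ("Robert",  Seq(Seq("CSharp","VB"),        Seq("Spark","Python")))
).toDF("name", "subjects")

dfInferred.printSchema()
```

This avoids writing Row objects and an explicit StructType, at the cost of control over column nullability.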
1. Explode Nested Array
The explode function creates a new row for every element of the array, placing the element in a new column named 'col'. Since "subjects" is an array of arrays, each exploded row still holds an inner array.
import spark.implicits._
val df2 = df.select($"name",explode($"subjects"))
df2.printSchema()
df2.show(false)
Outputs:
root
|-- name: string (nullable = true)
|-- col: array (nullable = true)
| |-- element: string (containsNull = true)
+-------+------------------+
|name |col |
+-------+------------------+
|James |[Java, Scala, C++]|
|James |[Spark, Java] |
|Michael|[Spark, Java, C++]|
|Michael|[Spark, Java] |
|Robert |[CSharp, VB] |
|Robert |[Spark, Python] |
+-------+------------------+
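Because each exploded row still contains an inner array, you can apply explode a second time to get one row per individual subject. This chained version is a sketch using the same df; the column aliases "subjectArray" and "subject" are illustrative:

```scala
import org.apache.spark.sql.functions.explode
import spark.implicits._

// First explode unnests the outer array; the second unnests the inner one
val df3 = df.select($"name", explode($"subjects").as("subjectArray"))
            .select($"name", explode($"subjectArray").as("subject"))
df3.show(false)
```

Each resulting row pairs a name with a single subject string.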
2. Flatten Nested Array
If you want to flatten the arrays, use the flatten function, which converts an array-of-array column into a single array column on the DataFrame. It is similar to Scala's flatten method on nested collections.
// Flatten Nested Array
df.select($"name",flatten($"subjects")).show(false)
Outputs:
+-------+-------------------------------+
|name |flatten(subjects) |
+-------+-------------------------------+
|James |[Java, Scala, C++, Spark, Java]|
|Michael|[Spark, Java, C++, Spark, Java]|
|Robert |[CSharp, VB, Spark, Python] |
+-------+-------------------------------+
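Note that flatten keeps duplicates ("Java" appears twice for James, "Spark" twice for Michael). If you also want unique values, you can combine flatten with array_distinct (both available since Spark 2.4). A sketch, with the alias "distinctSubjects" chosen for illustration:

```scala
import org.apache.spark.sql.functions.{array_distinct, flatten}
import spark.implicits._

// Flatten the nested array, then drop duplicate subjects
df.select($"name", array_distinct(flatten($"subjects")).as("distinctSubjects"))
  .show(false)
```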
3. Complete Explode and Flatten Example
// Complete Explode and Flatten Example
package com.sparkbyexamples.spark.dataframe.functions.collection
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{explode, flatten}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
object ArrayOfArrayType extends App {
val spark = SparkSession.builder().appName("SparkByExamples.com")
.master("local[1]")
.getOrCreate()
val arrayArrayData = Seq(
Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
)
val arrayArraySchema = new StructType().add("name",StringType)
.add("subjects",ArrayType(ArrayType(StringType)))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
df.printSchema()
df.show(false)
import spark.implicits._
val df2 = df.select($"name",explode($"subjects"))
df2.printSchema()
df2.show(false)
// Convert Array of Array into Single array
df.select($"name",flatten($"subjects")).show(false)
}
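For reference, the same explode and flatten operations can also be expressed in Spark SQL by registering the DataFrame as a temporary view. This sketch assumes the df and spark from the complete example above; the view name "student" is illustrative:

```scala
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("student")

// SQL equivalents of the DataFrame explode and flatten calls
spark.sql("SELECT name, explode(subjects) AS subjectArray FROM student").show(false)
spark.sql("SELECT name, flatten(subjects) AS subjects FROM student").show(false)
```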
Happy Learning !!