Spark – Define DataFrame with Nested Array

Problem: How do you define a Spark DataFrame with a nested array column (an array of arrays)?

Solution: Using StructType, we can define a nested array (array of array) DataFrame column as ArrayType(ArrayType(StringType)), shown here with a Scala example.

The example below creates a DataFrame with a nested array column. The column “subjects” is an ArrayType whose elements are themselves arrays of strings, each holding a group of subjects learned.


  val arrayArrayData = Seq(
    Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
    Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
    Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
  )

  val arrayArraySchema = new StructType().add("name",StringType)
    .add("subjects",ArrayType(ArrayType(StringType)))

  val df = spark.createDataFrame(
     spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
  df.printSchema()
  df.show(false)

df.printSchema() and df.show(false) return the following schema and table.


root
 |-- name: string (nullable = true)
 |-- subjects: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)


+-------+-----------------------------------+
|name   |subjects                           |
+-------+-----------------------------------+
|James  |[[Java, Scala, C++], [Spark, Java]]|
|Michael|[[Spark, Java, C++], [Spark, Java]]|
|Robert |[[CSharp, VB], [Spark, Python]]    |
+-------+-----------------------------------+
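If you prefer schema inference over an explicit StructType, the same nested structure can also be built from tuples with toDF. This is a minimal alternative sketch (the variable name df1 is my addition, not part of the original example); Spark infers Seq[Seq[String]] as array<array<string>>.


// Alternative: let Spark infer the nested array schema from tuples
import spark.implicits._

val df1 = Seq(
  ("James", Seq(Seq("Java", "Scala", "C++"), Seq("Spark", "Java"))),
  ("Michael", Seq(Seq("Spark", "Java", "C++"), Seq("Spark", "Java"))),
  ("Robert", Seq(Seq("CSharp", "VB"), Seq("Spark", "Python")))
).toDF("name", "subjects")

df1.printSchema() // prints the same array<array<string>> schema as above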

1. Explode Nested Array

The explode function creates a new row for each element of the array and returns that element in a column named ‘col’ (unless you alias it). Here, each element of “subjects” is itself an array, so col holds one inner array per row.


  import spark.implicits._
  val df2 = df.select($"name",explode($"subjects"))
   
  df2.printSchema()
  df2.show(false)

Outputs:


// Output:
root
 |-- name: string (nullable = true)
 |-- col: array (nullable = true)
 |    |-- element: string (containsNull = true)


+-------+------------------+
|name   |col               |
+-------+------------------+
|James  |[Java, Scala, C++]|
|James  |[Spark, Java]     |
|Michael|[Spark, Java, C++]|
|Michael|[Spark, Java]     |
|Robert |[CSharp, VB]      |
|Robert |[Spark, Python]   |
+-------+------------------+
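To go one step further and get one row per individual subject, you can apply explode twice. Below is a hedged sketch; the aliases subjectGroup and subject are illustrative names I have chosen, not part of the original example.


// Explode twice to get one row per individual subject
val df3 = df
  .select($"name", explode($"subjects").as("subjectGroup"))
  .select($"name", explode($"subjectGroup").as("subject"))
df3.show(false)
// James now gets one row each for Java, Scala, C++, Spark and Java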

2. Flatten Nested Array

If you want to flatten the arrays, use the flatten function, which converts an array-of-arrays column into a single array column. It is similar to Scala's flatten method on collections.


// Flatten Nested Array
df.select($"name",flatten($"subjects")).show(false)

Outputs:


// Output:
+-------+-------------------------------+
|name   |flatten(subjects)              |
+-------+-------------------------------+
|James  |[Java, Scala, C++, Spark, Java]|
|Michael|[Spark, Java, C++, Spark, Java]|
|Robert |[CSharp, VB, Spark, Python]    |
+-------+-------------------------------+
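Note that flatten keeps duplicates, so James ends up with “Java” twice. If that matters, flatten can be combined with array_distinct (available, like flatten, since Spark 2.4). A small sketch:


// Flatten, then drop duplicate subjects
import org.apache.spark.sql.functions.array_distinct

df.select($"name", array_distinct(flatten($"subjects")).as("subjects"))
  .show(false)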

3. Complete Explode and Flatten Example


// Complete Explode and Flatten Example
package com.sparkbyexamples.spark.dataframe.functions.collection

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{explode, flatten}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

object ArrayOfArrayType extends App {

  val spark = SparkSession.builder().appName("SparkByExamples.com")
    .master("local[1]")
    .getOrCreate()

  val arrayArrayData = Seq(
    Row("James",List(List("Java","Scala","C++"),List("Spark","Java"))),
    Row("Michael",List(List("Spark","Java","C++"),List("Spark","Java"))),
    Row("Robert",List(List("CSharp","VB"),List("Spark","Python")))
  )

  val arrayArraySchema = new StructType().add("name",StringType)
    .add("subjects",ArrayType(ArrayType(StringType)))

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(arrayArrayData),arrayArraySchema)
  df.printSchema()
  df.show(false)

  import spark.implicits._
  val df2 = df.select($"name",explode($"subjects"))


  df2.printSchema()
  df2.show(false)

  // Convert Array of Array into Single array
  df.select($"name",flatten($"subjects")).show(false)

}
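If you prefer SQL, both explode and flatten are also available as Spark SQL built-in functions. A minimal sketch, assuming the DataFrame is registered under the hypothetical view name student_subjects:


// The same operations expressed in Spark SQL
df.createOrReplaceTempView("student_subjects")
spark.sql("SELECT name, explode(subjects) FROM student_subjects").show(false)
spark.sql("SELECT name, flatten(subjects) FROM student_subjects").show(false)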

Happy Learning !!
