Spark ArrayType
(array) is a collection data type that extends the DataType class. In this article, I will explain how to create a DataFrame ArrayType column using the Spark SQL org.apache.spark.sql.types.ArrayType class, and how to apply some SQL functions to the array column, using Scala examples.
While working with Spark structured (Avro, Parquet, etc.) or semi-structured (JSON) files, we often get data with complex structures like MapType, ArrayType, and Array[StructType]. I will try my best to cover the most commonly used functions on ArrayType columns.
- What is Spark ArrayType
- Creating ArrayType Column on Spark DataFrame
- Using DataTypes.createArrayType()
- Using ArrayType case class
Spark ArrayType is a collection data type that extends the DataType class, which is the superclass of all types in Spark. All elements of an ArrayType column must be of the same type.
1. Creating Spark ArrayType Column on DataFrame
You can create an array column of type ArrayType on a Spark DataFrame using DataTypes.createArrayType() or using the ArrayType Scala case class.
2. Using DataTypes.createArrayType()
The DataTypes.createArrayType() method returns an instance of ArrayType. It takes one argument of type DataType, meaning any type that extends the DataType class.
val arrayCol = DataTypes.createArrayType(StringType)
It also has an overloaded method, DataTypes.createArrayType(DataType, Boolean), which takes an additional boolean argument to specify whether the elements of the array can be null.
val arrayColWithNulls = DataTypes.createArrayType(StringType, true)
3. Using ArrayType case class
We can also create an instance of ArrayType using the ArrayType() case class. It takes the element type as its first argument and an optional second argument, containsNull, to specify whether the elements can be null.
// Using ArrayType case class
val caseArrayCol = ArrayType(StringType, false)
4. Example of Spark ArrayType Column on DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import spark.implicits._

val arrayStructureData = Seq(
  Row("James,,Smith", List("Java","Scala","C++"), List("Spark","Java"), "OH", "CA"),
  Row("Michael,Rose,", List("Spark","Java","C++"), List("Spark","Java"), "NY", "NJ"),
  Row("Robert,,Williams", List("CSharp","VB"), List("Spark","Python"), "UT", "NV")
)

val arrayStructureSchema = new StructType()
  .add("name", StringType)
  .add("languagesAtSchool", ArrayType(StringType))
  .add("languagesAtWork", ArrayType(StringType))
  .add("currentState", StringType)
  .add("previousState", StringType)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
df.printSchema()
df.show()
This snippet creates two array columns, “languagesAtSchool” and “languagesAtWork”, which hold the languages learned at school and the languages used at work. The rest of the article uses this DataFrame to demonstrate several Spark SQL array functions. printSchema() and show() from the above snippet display the following output.
root
|-- name: string (nullable = true)
|-- languagesAtSchool: array (nullable = true)
| |-- element: string (containsNull = true)
|-- languagesAtWork: array (nullable = true)
| |-- element: string (containsNull = true)
|-- currentState: string (nullable = true)
|-- previousState: string (nullable = true)
+----------------+------------------+---------------+------------+-------------+
| name| languagesAtSchool|languagesAtWork|currentState|previousState|
+----------------+------------------+---------------+------------+-------------+
| James,,Smith|[Java, Scala, C++]| [Spark, Java]| OH| CA|
| Michael,Rose,|[Spark, Java, C++]| [Spark, Java]| NY| NJ|
|Robert,,Williams| [CSharp, VB]|[Spark, Python]| UT| NV|
+----------------+------------------+---------------+------------+-------------+
5. Spark ArrayType (Array) Functions
Spark SQL provides several array functions to work with ArrayType columns. In this section, we will see some of the most commonly used ones.
5.1 explode()
Use the explode() function to create a new row for each element in the given array column. There are several Spark SQL explode variants available to work with array columns.
df.select($"name",explode($"languagesAtSchool")).show(false)
+----------------+------+
|name |col |
+----------------+------+
|James,,Smith |Java |
|James,,Smith |Scala |
|James,,Smith |C++ |
|Michael,Rose, |Spark |
|Michael,Rose, |Java |
|Michael,Rose, |C++ |
|Robert,,Williams|CSharp|
|Robert,,Williams|VB |
+----------------+------+
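Among the explode variants mentioned above, explode_outer() keeps rows whose array is null or empty, and posexplode() also returns each element's position. Below is a minimal, self-contained sketch of posexplode(); the session setup and sample data here are illustrative, not the article's df.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[1]").appName("posexplodeDemo").getOrCreate()
import spark.implicits._

val demo = Seq(("James,,Smith", Seq("Java", "Scala", "C++")))
  .toDF("name", "languagesAtSchool")

// posexplode() yields two columns per element: its position (pos) and its value (col)
val exploded = demo.select($"name", posexplode($"languagesAtSchool"))
exploded.show(false)
```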
5.2 split()
split() splits a string column around matches of the given pattern and returns an array column.
df.select(split($"name",",").as("nameAsArray") ).show(false)
+--------------------+
|nameAsArray |
+--------------------+
|[James, , Smith] |
|[Michael, Rose, ] |
|[Robert, , Williams]|
+--------------------+
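Once a string has been split into an array, a single element can be pulled back out with Column.getItem(). A minimal sketch follows; the session and sample data are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[1]").appName("splitGetItem").getOrCreate()
import spark.implicits._

val names = Seq("James,,Smith", "Robert,,Williams").toDF("name")

// getItem(n) extracts the n-th element of the array produced by split()
val parts = names.select(
  split($"name", ",").getItem(0).as("firstName"),
  split($"name", ",").getItem(2).as("lastName"))
parts.show(false)
```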
5.3 array()
Creates a new array column. All input columns must have the same data type.
df.select($"name",array($"currentState",$"previousState").as("States") ).show(false)
+----------------+--------+
|name |States |
+----------------+--------+
|James,,Smith |[OH, CA]|
|Michael,Rose, |[NY, NJ]|
|Robert,,Williams|[UT, NV]|
+----------------+--------+
5.4 array_contains()
array_contains() returns null if the array is null, true if the array contains `value`, and false otherwise.
df.select($"name",array_contains($"languagesAtSchool","Java")
.as("array_contains")).show(false)
+----------------+--------------+
|name |array_contains|
+----------------+--------------+
|James,,Smith |true |
|Michael,Rose, |true |
|Robert,,Williams|false |
+----------------+--------------+
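Because array_contains() returns a boolean column, it also works as a filter predicate to keep only the rows whose array holds a given value. A minimal sketch follows; the session and sample data are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[1]").appName("arrayContainsFilter").getOrCreate()
import spark.implicits._

val people = Seq(
  ("James", Seq("Java", "Scala", "C++")),
  ("Robert", Seq("CSharp", "VB"))
).toDF("name", "languagesAtSchool")

// Keep only the rows whose languagesAtSchool array contains "Java"
val javaSpeakers = people.filter(array_contains($"languagesAtSchool", "Java"))
javaSpeakers.show(false)
```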
Conclusion
You have learned that Spark ArrayType is a collection type, similar to an array in other languages, used to store elements of the same type. ArrayType extends the DataType class (the superclass of all types in Spark), and you have also seen some commonly used functions on ArrayType columns.
Happy Learning !!