Spark SQL Array Functions Complete List

Spark SQL provides built-in standard array functions defines in DataFrame API, these come in handy when we need to make operations on array (ArrayType) column. All these accept input as, array column and several other arguments based on the function.

When possible try to leverage standard library as they are little bit more compile-time safety, handles null and perform better when compared to UDF’s. If your application is critical on performance try to avoid using custom UDF at all costs as these are not guarantee on performance.

Spark SQL array functions are grouped as collection functions “collection_funcs” in spark SQL along with several map functions. These array functions come handy when we want to perform some operations and transformations on array columns.

Though I’ve explained here with Scala, a similar methods could be used to work Spark SQL array function with PySpark and if time permits I will cover it in the future. If you are looking for PySpark, I would still recommend reading through this article as it would give you an Idea on Spark array functions and usage.

 Spark SQL Array Functions:

ARRAY FUNCTION SYNTAXARRAY FUNCTION DESCRIPTION
array_contains(column: Column, value: Any)Check if a value presents in an array column. Return below values.
true – Returns if value presents in an array.
false – When valu eno presents.
null – when array is null.
array_distinct(e: Column)Return distinct values from the array after removing duplicates.
array_except(col1: Column, col2: Column)Returns all elements from col1 array but not in col2 array.
array_intersect(col1: Column, col2: Column)Returns all elements that are present in col1 and col2 arrays.
array_join(column: Column, delimiter: String, nullReplacement: String)
array_join(column: Column, delimiter: String)
Concatenates all elments of array column with using provided delimeter. When Null valeus are present, they replaced with ‘nullReplacement’ string
array_max(e: Column)Return maximum values in an array
array_min(e: Column)Return minimum values in an array
array_position(column: Column, value: Any)Returns a position/index of first occurrence of the ‘value’ in the given array. Returns position as long type and the position is not zero based instead starts with 1.
Returns zero when value is not found.
Returns null when any of the arguments are null.
array_remove(column: Column, element: Any)Returns an array after removing all provided ‘value’ from the given array.
array_repeat(e: Column, count: Int)Creates an array containing the first argument repeated the number of times given by the second argument.
array_repeat(left: Column, right: Column)Creates an array containing the first argument repeated the number of times given by the second argument.
array_sort(e: Column)Returns the sorted array of the given input array. All null values are placed at the end of the array.
array_union(col1: Column, col2: Column)Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.
arrays_overlap(a1: Column, a2: Column)true – if `a1` and `a2` have at least one non-null element in common
false – if `a1` and `a2` have completely different elements.
null – if both the arrays are non-empty and any of them contains a `null`
arrays_zip(e: Column*)Returns a merged array of structs in which the N-th struct contains all N-th values of input
concat(exprs: Column*)Concatenates all elements from a given columns
element_at(column: Column, value: Any)Returns an element of an array located at the ‘value’ input position.
exists(column: Column, f: Column => Column)Checks if the column presents in an array column.
explode(e: Column)Create a row for each element in the array column
explode_outer ( e : Column )Create a row for each element in the array column. Unlike explode, if the array is null or empty, it returns null.
filter(column: Column, f: Column => Column)
filter(column: Column, f: (Column, Column) => Column)
Returns an array of elements for which a predicate holds in a given array
flatten(e: Column)Creates a single array from an array of arrays column.
forall(column: Column, f: Column => Column)Returns whether a predicate holds for every element in the array.
posexplode(e: Column)Creates a row for each element in the array and creaes a two columns “pos’ to hold the position of the array element and the ‘col’ to hold the actual array value.
posexplode_outer(e: Column)Creates a row for each element in the array and creaes a two columns “pos’ to hold the position of the array element and the ‘col’ to hold the actual array value. Unlike posexplode, if the array is null or empty, it returns null,null for pos and col columns.
reverse(e: Column)Returns the array of elements in a reverse order.
sequence(start: Column, stop: Column)Generate the sequence of numbers from start to stop number.
sequence ( start : Column , stop : Column , step : Column )Generate the sequence of numbers from start to stop number by incrementing with given step value.
shuffle(e: Column)Shuffle the given array
size(e: Column)Return the length of an array.
slice(x: Column, start: Int, length: Int)Returns an array of elements from position ‘start’ and the given length.
sort_array(e: Column)Sorts the array in an ascending order. Null values are placed at the beginning.
sort_array(e: Column, asc: Boolean)Sorts the array in an ascending or descending order based of the boolean parameter. For assending, Null values are placed at the beginning. And for desending they are places at the end.
transform(column: Column, f: Column => Column)
transform(column: Column, f: (Column, Column) => Column)
Returns an array of elments after applying transformation.
zip_with(left: Column, right: Column, f: (Column, Column) => Column)Merges two input arrays.
aggregate(
expr: Column,
zero: Column,
merge: (Column, Column) => Column,
finish: Column => Column)
Aggregates

Array function Examples

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply

This Post Has 3 Comments

  1. Sor

    array_distinct works for me, thanks!!!!

  2. Szymon

    Hey, thanks for good job putting all of this together, but I think you have date functions here instead of array functions.

    1. Naveen (NNK)

      Hi Szymon, Thanks for pointing it out. I have fixed it now.