Spark SQL provides built-in standard array functions defines in DataFrame API, these come in handy when we need to make operations on array (ArrayType) column. All these accept input as, array column and several other arguments based on the function.
When possible try to leverage standard library as they are little bit more compile-time safety, handles null and perform better when compared to UDF’s. If your application is critical on performance try to avoid using custom UDF at all costs as these are not guarantee on performance.
Spark SQL array functions are grouped as collection functions “collection_funcs” in spark SQL along with several map functions. These array functions come handy when we want to perform some operations and transformations on array columns.
Though I’ve explained here with Scala, a similar methods could be used to work Spark SQL array function with PySpark and if time permits I will cover it in the future. If you are looking for PySpark, I would still recommend reading through this article as it would give you an Idea on Spark array functions and usage.
Spark SQL Array Functions:
Array Function Syntax | Array Function Description |
---|---|
array_contains(column: Column, value: Any) | Check if a value presents in an array column. Return below values. true - Returns if value presents in an array. false - When valu eno presents. null - when array is null. |
array_distinct(e: Column) | Return distinct values from the array after removing duplicates. |
array_except(col1: Column, col2: Column) | Returns all elements from col1 array but not in col2 array. |
array_intersect(col1: Column, col2: Column) | Returns all elements that are present in col1 and col2 arrays. |
array_join(column: Column, delimiter: String, nullReplacement: String) array_join(column: Column, delimiter: String) | Concatenates all elments of array column with using provided delimeter. When Null valeus are present, they replaced with 'nullReplacement' string |
array_max(e: Column) | Return maximum values in an array |
array_min(e: Column) | Return minimum values in an array |
array_position(column: Column, value: Any) | Returns a position/index of first occurrence of the 'value' in the given array. Returns position as long type and the position is not zero based instead starts with 1. Returns zero when value is not found. Returns null when any of the arguments are null. |
array_remove(column: Column, element: Any) | Returns an array after removing all provided 'value' from the given array. |
array_repeat(e: Column, count: Int) | Creates an array containing the first argument repeated the number of times given by the second argument. |
array_repeat(left: Column, right: Column) | Creates an array containing the first argument repeated the number of times given by the second argument. |
array_sort(e: Column) | Returns the sorted array of the given input array. All null values are placed at the end of the array. |
array_union(col1: Column, col2: Column) | Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates. |
arrays_overlap(a1: Column, a2: Column) | true - if `a1` and `a2` have at least one non-null element in common false - if `a1` and `a2` have completely different elements. null - if both the arrays are non-empty and any of them contains a `null` |
arrays_zip(e: Column*) | Returns a merged array of structs in which the N-th struct contains all N-th values of input |
concat(exprs: Column*) | Concatenates all elements from a given columns |
element_at(column: Column, value: Any) | Returns an element of an array located at the 'value' input position. |
exists(column: Column, f: Column => Column) | Checks if the column presents in an array column. |
explode(e: Column) | Create a row for each element in the array column |
explode_outer ( e : Column ) | Create a row for each element in the array column. Unlike explode, if the array is null or empty, it returns null. |
filter(column: Column, f: Column => Column) filter(column: Column, f: (Column, Column) => Column) | Returns an array of elements for which a predicate holds in a given array |
flatten(e: Column) | Creates a single array from an array of arrays column. |
forall(column: Column, f: Column => Column) | Returns whether a predicate holds for every element in the array. |
posexplode(e: Column) | Creates a row for each element in the array and creaes a two columns "pos' to hold the position of the array element and the 'col' to hold the actual array value. |
posexplode_outer(e: Column) | Creates a row for each element in the array and creaes a two columns "pos' to hold the position of the array element and the 'col' to hold the actual array value. Unlike posexplode, if the array is null or empty, it returns null,null for pos and col columns. |
reverse(e: Column) | Returns the array of elements in a reverse order. |
sequence(start: Column, stop: Column) | Generate the sequence of numbers from start to stop number. |
sequence ( start : Column , stop : Column , step : Column ) | Generate the sequence of numbers from start to stop number by incrementing with given step value. |
shuffle(e: Column) | Shuffle the given array |
size(e: Column) | Return the length of an array. |
slice(x: Column, start: Int, length: Int) | Returns an array of elements from position 'start' and the given length. |
sort_array(e: Column) | Sorts the array in an ascending order. Null values are placed at the beginning. |
sort_array(e: Column, asc: Boolean) | Sorts the array in an ascending or descending order based of the boolean parameter. For assending, Null values are placed at the beginning. And for desending they are places at the end. |
transform(column: Column, f: Column => Column) transform(column: Column, f: (Column, Column) => Column) | Returns an array of elments after applying transformation. |
zip_with(left: Column, right: Column, f: (Column, Column) => Column) | Merges two input arrays. |
aggregate( expr: Column, zero: Column, merge: (Column, Column) => Column, finish: Column => Column) | Aggregates |
Hey, thanks for good job putting all of this together, but I think you have date functions here instead of array functions.
Hi Szymon, Thanks for pointing it out. I have fixed it now.