Spark SQL Array Functions Complete List

Spark SQL provides built-in standard array functions defines in DataFrame API, these come in handy when we need to make operations on array (ArrayType) column. All these accept input as, array column and several other arguments based on the function.

When possible try to leverage standard library as they are little bit more compile-time safety, handles null and perform better when compared to UDF’s. If your application is critical on performance try to avoid using custom UDF at all costs as these are not guarantee on performance.

Spark SQL array functions are grouped as collection functions “collection_funcs” in spark SQL along with several map functions. These array functions come handy when we want to perform some operations and transformations on array columns.

 

Though I’ve explained here with Scala, a similar methods could be used to work Spark SQL array function with PySpark and if time permits I will cover it in the future. If you are looking for PySpark, I would still recommend reading through this article as it would give you an Idea on Spark array functions and usage.

 

Spark SQL Array Functions:

Date Functions SyntaxDate Function Description
current_date () : ColumnReturns the current date as a date column.
date_format(dateExpr: Column, format: String): ColumnConverts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
to_date(e: Column): ColumnConverts the column into `DateType` by casting rules to `DateType`.
to_date(e: Column, fmt: String): ColumnConverts the column into a `DateType` with a specified format
add_months(startDate: Column, numMonths: Int): ColumnReturns the date that is `numMonths` after `startDate`.
date_add(start: Column, days: Int): Column
date_sub(start: Column, days: Int): Column
Returns the date that is `days` days after `start`
datediff(end: Column, start: Column): ColumnReturns the number of days from `start` to `end`.
months_between(end: Column, start: Column): ColumnReturns number of months between dates `start` and `end`. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
months_between(end: Column, start: Column, roundOff: Boolean): ColumnReturns number of months between dates `end` and `start`. If `roundOff` is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.
next_day(date: Column, dayOfWeek: String): ColumnReturns the first date which is later than the value of the `date` column that is on the specified day of the week.
For example, `next_day('2015-07-27', "Sunday")` returns 2015-08-02 because that is the first Sunday after 2015-07-27.
trunc(date: Column, format: String): ColumnReturns date truncated to the unit specified by the format.
For example, `trunc("2018-11-19 12:01:19", "year")` returns 2018-01-01
format: 'year', 'yyyy', 'yy' to truncate by year,
'month', 'mon', 'mm' to truncate by month
date_trunc(format: String, timestamp: Column): ColumnReturns timestamp truncated to the unit specified by the format.
For example, `date_trunc("year", "2018-11-19 12:01:19")` returns 2018-01-01 00:00:00
format: 'year', 'yyyy', 'yy' to truncate by year,
'month', 'mon', 'mm' to truncate by month,
'day', 'dd' to truncate by day,
Other options are: 'second', 'minute', 'hour', 'week', 'month', 'quarter'
year(e: Column): ColumnExtracts the year as an integer from a given date/timestamp/string
quarter(e: Column): ColumnExtracts the quarter as an integer from a given date/timestamp/string.
month(e: Column): ColumnExtracts the month as an integer from a given date/timestamp/string
dayofweek(e: Column): ColumnExtracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday
dayofmonth(e: Column): ColumnExtracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(e: Column): ColumnExtracts the day of the year as an integer from a given date/timestamp/string.
weekofyear(e: Column): ColumnExtracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601
last_day(e: Column): ColumnReturns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
from_unixtime(ut: Column): ColumnConverts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(ut: Column, f: String): ColumnConverts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp(): ColumnReturns the current Unix timestamp (in seconds) as a long
unix_timestamp(s: Column): ColumnConverts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
unix_timestamp(s: Column, p: String): ColumnConverts time string with given pattern to Unix timestamp (in seconds).

Array function Examples

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing Spark SQL Array Functions Complete List