Spark SQL provides several built-in standard functions in the org.apache.spark.sql.functions package to work with DataFrame/Dataset and SQL queries. All of these Spark SQL functions return the org.apache.spark.sql.Column type.
In order to use these SQL standard functions, you need to import the following package into your application.
import org.apache.spark.sql.functions._
Spark also includes more built-in functions that are less common and are not defined here. You can still access them (and all the functions defined here) using the functions.expr()
API and calling them through a SQL expression string. You can find the entire list of functions in the SQL API documentation. regr_count is an example of a function that is built-in but not defined here, because it is less commonly used. To invoke it, use expr("regr_count(yCol, xCol)").
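For example, here is a minimal sketch of calling a SQL-only function through expr(), assuming a hypothetical DataFrame df with numeric columns y and x:

```scala
import org.apache.spark.sql.functions.expr

// regr_count is built in but only reachable through a SQL expression string
val pairCount = df.agg(expr("regr_count(y, x)").as("pair_count"))
pairCount.show()
```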
When possible, try to leverage the Spark SQL standard library functions, as they offer a bit more compile-time safety, handle nulls, and perform better than UDFs. If your application is performance-critical, avoid custom UDFs as much as possible, since their performance is not guaranteed.
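To illustrate the difference, here is a minimal sketch comparing a built-in function with an equivalent UDF, assuming a hypothetical DataFrame df with a string column name:

```scala
import org.apache.spark.sql.functions.{col, udf, upper}

// Preferred: the built-in upper() handles nulls and is visible to the optimizer
val withBuiltIn = df.withColumn("name_upper", upper(col("name")))

// Equivalent custom UDF: opaque to the optimizer and needs explicit null handling
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
val withUdf = df.withColumn("name_upper", upperUdf(col("name")))
```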
Spark groups all these functions into the categories below. Click on a category for the list of functions, syntax, descriptions, and examples.
Note: This page is work in progress, please visit again if you are looking for more functions.
Spark SQL Functions – Contents
- String Functions
- Date & Time Functions
- Collection Functions
- Math Functions
- Aggregate Functions
- Window Functions
Spark SQL String Functions
String functions are grouped as “string_funcs” in Spark SQL. Below is a list of functions defined under this group. Click on each link to learn with a Scala example.
String Function Signature | String Function Description |
---|---|
ascii(e: Column): Column | Computes the numeric value of the first character of the string column, and returns the result as an int column. |
base64(e: Column): Column | Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64. |
concat_ws(sep: String, exprs: Column*): Column | Concatenates multiple input string columns together into a single string column, using the given separator. |
decode(value: Column, charset: String): Column | Computes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). |
encode(value: Column, charset: String): Column | Computes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). |
format_number(x: Column, d: Int): Column | Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column. |
format_string(format: String, arguments: Column*): Column | Formats the arguments in printf-style and returns the result as a string column. |
initcap(e: Column): Column | Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, "hello world" will become "Hello World". |
instr(str: Column, substring: String): Column | Locates the position of the first occurrence of substring in the given string column. Returns null if either of the arguments is null. |
length(e: Column): Column | Computes the character length of a given string or number of bytes of a binary string. The length of character strings include the trailing spaces. The length of binary strings includes binary zeros. |
lower(e: Column): Column | Converts a string column to lower case. |
levenshtein(l: Column, r: Column): Column | Computes the Levenshtein distance of the two given string columns. |
locate(substr: String, str: Column): Column | Locate the position of the first occurrence of substr. |
locate(substr: String, str: Column, pos: Int): Column | Locate the position of the first occurrence of substr in a string column, after position pos. |
lpad(str: Column, len: Int, pad: String): Column | Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. |
ltrim(e: Column): Column | Trim the spaces from left end for the specified string value. |
regexp_extract(e: Column, exp: String, groupIdx: Int): Column | Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned. |
regexp_replace(e: Column, pattern: String, replacement: String): Column | Replace all substrings of the specified string value that match regexp with rep. |
regexp_replace(e: Column, pattern: Column, replacement: Column): Column | Replace all substrings of the specified string value that match regexp with rep. |
unbase64(e: Column): Column | Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64. |
rpad(str: Column, len: Int, pad: String): Column | Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. |
repeat(str: Column, n: Int): Column | Repeats a string column n times, and returns it as a new string column. |
rtrim(e: Column): Column | Trim the spaces from right end for the specified string value. |
rtrim(e: Column, trimString: String): Column | Trim the specified character string from right end for the specified string column. |
soundex(e: Column): Column | Returns the soundex code for the specified expression. |
split(str: Column, regex: String): Column | Splits str around matches of the given regex. |
split(str: Column, regex: String, limit: Int): Column | Splits str around matches of the given regex. The `limit` parameter controls the size of the resulting array: if `limit` is greater than 0, the array contains at most `limit` elements. |
substring(str: Column, pos: Int, len: Int): Column | Substring starts at `pos` and is of length `len` when str is String type or returns the slice of byte array that starts at `pos` in byte and is of length `len` when str is Binary type |
substring_index(str: Column, delim: String, count: Int): Column | Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim. |
overlay(src: Column, replaceString: String, pos: Int, len: Int): Column | Overlays the specified portion of `src` with `replaceString`, starting from byte position `pos` of `src` and proceeding for `len` bytes. |
overlay(src: Column, replaceString: String, pos: Int): Column | Overlays the specified portion of `src` with `replaceString`, starting from byte position `pos` of `src`. |
translate(src: Column, matchingString: String, replaceString: String): Column | Translates any character in `src` that matches a character in `matchingString` to the corresponding character in `replaceString`. |
trim(e: Column): Column | Trim the spaces from both ends for the specified string column. |
trim(e: Column, trimString: String): Column | Trim the specified character from both ends for the specified string column. |
upper(e: Column): Column | Converts a string column to upper case. |
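As a quick illustration, here is a minimal sketch combining a few of these string functions, assuming a hypothetical DataFrame df with string columns first_name and last_name:

```scala
import org.apache.spark.sql.functions._

val result = df.select(
  concat_ws(" ", col("first_name"), col("last_name")).as("full_name"),
  initcap(col("first_name")).as("first_name_cap"),
  length(col("last_name")).as("last_name_len"),
  regexp_replace(col("first_name"), "[aeiou]", "*").as("masked_name")
)
```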
Spark SQL Date and Time Functions
Date Functions Syntax | Date Function Description |
---|---|
current_date(): Column | Returns the current date as a date column. |
date_format(dateExpr: Column, format: String): Column | Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument. |
to_date(e: Column): Column | Converts the column into `DateType` by casting rules to `DateType`. |
to_date(e: Column, fmt: String): Column | Converts the column into a `DateType` with a specified format |
add_months(startDate: Column, numMonths: Int): Column | Returns the date that is `numMonths` after `startDate`. |
date_add(start: Column, days: Int): Column date_sub(start: Column, days: Int): Column | Returns the date that is `days` days after `start` (date_add) or `days` days before `start` (date_sub). |
datediff(end: Column, start: Column): Column | Returns the number of days from `start` to `end`. |
months_between(end: Column, start: Column): Column | Returns number of months between dates `start` and `end`. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month. |
months_between(end: Column, start: Column, roundOff: Boolean): Column | Returns number of months between dates `end` and `start`. If `roundOff` is set to true, the result is rounded off to 8 digits; it is not rounded otherwise. |
next_day(date: Column, dayOfWeek: String): Column | Returns the first date which is later than the value of the `date` column that is on the specified day of the week. For example, `next_day('2015-07-27', "Sunday")` returns 2015-08-02 because that is the first Sunday after 2015-07-27. |
trunc(date: Column, format: String): Column | Returns date truncated to the unit specified by the format. For example, `trunc("2018-11-19 12:01:19", "year")` returns 2018-01-01 format: 'year', 'yyyy', 'yy' to truncate by year, 'month', 'mon', 'mm' to truncate by month |
date_trunc(format: String, timestamp: Column): Column | Returns timestamp truncated to the unit specified by the format. For example, `date_trunc("year", "2018-11-19 12:01:19")` returns 2018-01-01 00:00:00 format: 'year', 'yyyy', 'yy' to truncate by year, 'month', 'mon', 'mm' to truncate by month, 'day', 'dd' to truncate by day, Other options are: 'second', 'minute', 'hour', 'week', 'month', 'quarter' |
year(e: Column): Column | Extracts the year as an integer from a given date/timestamp/string |
quarter(e: Column): Column | Extracts the quarter as an integer from a given date/timestamp/string. |
month(e: Column): Column | Extracts the month as an integer from a given date/timestamp/string |
dayofweek(e: Column): Column | Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday |
dayofmonth(e: Column): Column | Extracts the day of the month as an integer from a given date/timestamp/string. |
dayofyear(e: Column): Column | Extracts the day of the year as an integer from a given date/timestamp/string. |
weekofyear(e: Column): Column | Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601 |
last_day(e: Column): Column | Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015. |
from_unixtime(ut: Column): Column | Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format. |
from_unixtime(ut: Column, f: String): Column | Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format. |
unix_timestamp(): Column | Returns the current Unix timestamp (in seconds) as a long |
unix_timestamp(s: Column): Column | Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale. |
unix_timestamp(s: Column, p: String): Column | Converts time string with given pattern to Unix timestamp (in seconds). |
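The following minimal sketch combines a few of these date functions, assuming a hypothetical DataFrame df with a string column order_date in yyyy-MM-dd format:

```scala
import org.apache.spark.sql.functions._

val result = df
  .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
  .withColumn("ship_by", date_add(col("order_date"), 7))
  .withColumn("age_in_days", datediff(current_date(), col("order_date")))
  .withColumn("order_year", year(col("order_date")))
```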
Timestamp Function Signature | Timestamp Function Description |
---|---|
current_timestamp() | Returns the current timestamp as a timestamp column |
hour(column) | Extracts the hours as an integer from a given date/timestamp/string. |
minute(column) | Extracts the minutes as an integer from a given date/timestamp/string. |
second(column) | Extracts the seconds as an integer from a given date/timestamp/string. |
to_timestamp(column) | Converts to a timestamp by casting rules to `TimestampType`. |
to_timestamp(column, fmt) | Converts time string with the given pattern to timestamp. |
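A minimal sketch using these timestamp functions, assuming a hypothetical DataFrame df with a string column event_time such as "2019-07-24 12:01:19":

```scala
import org.apache.spark.sql.functions._

val result = df
  .withColumn("event_ts", to_timestamp(col("event_time"), "yyyy-MM-dd HH:mm:ss"))
  .withColumn("event_hour", hour(col("event_ts")))
  .withColumn("event_minute", minute(col("event_ts")))
  .withColumn("ingested_at", current_timestamp())
```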
Spark SQL Collection Functions
Collection functions (Array, Map)
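Commonly used collection functions include array_contains, explode, size, map_keys, and map_values. A minimal sketch, assuming a hypothetical DataFrame df with a string column name, an array column languages, and a map column properties:

```scala
import org.apache.spark.sql.functions._

val withStats = df.select(
  col("name"),
  size(col("languages")).as("num_languages"),
  array_contains(col("languages"), "Scala").as("knows_scala"),
  map_keys(col("properties")).as("property_names")
)

// explode() produces one output row per array element
val exploded = df.select(col("name"), explode(col("languages")).as("language"))
```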
Spark SQL Math Functions
Below is a subset of the mathematical and statistical functions.
Math Trigonometric Function Syntax | Math Trigonometric Function Description |
---|---|
asin(e: Column): Column asin(columnName: String): Column | Returns the arc sine (inverse sine) of the input argument, same as the java.lang.Math.asin() function. |
acos(e: Column): Column acos(columnName: String): Column | Returns the arc cosine (inverse cosine) of the input argument, same as the java.lang.Math.acos() function. |
atan(e: Column): Column atan(columnName: String): Column | Returns the arc tangent (inverse tangent) of the input argument, same as the java.lang.Math.atan() function. |
sin(e: Column): Column sin(columnName: String): Column | Returns the sine of the angle, same as the java.lang.Math.sin() function. |
sinh(e: Column): Column sinh(columnName: String): Column | Returns the hyperbolic sine of the given value, same as the java.lang.Math.sinh() function. |
cos(e: Column): Column cos(columnName: String): Column | Returns the cosine of the angle, same as the java.lang.Math.cos() function. |
cosh(e: Column): Column cosh(columnName: String): Column | Returns the hyperbolic cosine of the angle, same as the java.lang.Math.cosh() function. |
tan(e: Column): Column tan(columnName: String): Column | Returns the tangent of the given value, same as the java.lang.Math.tan() function. |
tanh(e: Column): Column tanh(columnName: String): Column | Returns the hyperbolic tangent of the given value, same as the java.lang.Math.tanh() function. |
Spark SQL Log Math Functions Signature | Spark Functions Description |
---|---|
log(columnName: String): Column | Computes the natural logarithm of the given column. |
log(base: Double, a: Column): Column log(base: Double, columnName: String): Column | Returns the first argument-base logarithm of the second argument. |
log10(e: Column): Column log10(columnName: String): Column | Computes the logarithm of the given value in base 10. |
log1p(e: Column): Column log1p(columnName: String): Column | Computes the natural logarithm of the given value plus one. |
log2(expr: Column): Column log2(columnName: String): Column | Computes the logarithm of the given column in base 2. |
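A minimal sketch combining a few trigonometric and logarithmic functions, assuming a hypothetical DataFrame df with a numeric column value:

```scala
import org.apache.spark.sql.functions._

val result = df.select(
  sin(col("value")).as("sin_value"),
  cos(col("value")).as("cos_value"),
  log(col("value")).as("ln_value"),          // natural logarithm
  log10(col("value")).as("log10_value"),
  log(2.0, col("value")).as("log2_value")    // base-2 logarithm
)
```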
Aggregate Functions
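Commonly used aggregate functions include count, sum, avg, min, and max. A minimal sketch, assuming a hypothetical DataFrame df with columns department and salary:

```scala
import org.apache.spark.sql.functions._

val result = df
  .groupBy(col("department"))
  .agg(
    count(lit(1)).as("employees"),          // row count per department
    sum(col("salary")).as("total_salary"),
    avg(col("salary")).as("avg_salary"),
    max(col("salary")).as("max_salary")
  )
```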
Window Functions
Date & Time Window Function Syntax | Date & Time Window Function Description |
---|---|
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column | Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. |
window(timeColumn: Column, windowDuration: String, slideDuration: String): Column | Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC |
window(timeColumn: Column, windowDuration: String): Column | Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. |
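A minimal sketch of time-window bucketing, assuming a hypothetical DataFrame df with a timestamp column event_ts and a numeric column amount:

```scala
import org.apache.spark.sql.functions._

// 10-minute tumbling windows, summing "amount" per window
val tumbling = df
  .groupBy(window(col("event_ts"), "10 minutes"))
  .agg(sum(col("amount")).as("total_amount"))

// 10-minute windows that slide every 5 minutes
val sliding = df
  .groupBy(window(col("event_ts"), "10 minutes", "5 minutes"))
  .agg(sum(col("amount")).as("total_amount"))
```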
Sorting Functions
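The sorting helpers in org.apache.spark.sql.functions include asc, asc_nulls_first, asc_nulls_last, desc, desc_nulls_first, and desc_nulls_last. A minimal sketch, assuming a hypothetical DataFrame df with columns name and salary:

```scala
import org.apache.spark.sql.functions._

// Ascending by name, then descending by salary
val sorted = df.orderBy(asc("name"), desc("salary"))

// Control where null values sort relative to non-null values
val nullsLast = df.orderBy(asc_nulls_last("salary"))
```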
Conclusion
In this post, I’ve listed links to several commonly used built-in standard library functions, where you can read about their usage, syntax, and examples. If you find this post helpful and easy to understand, please leave me a comment.
Happy Learning !!