Spark SQL Function Introduction
Spark SQL functions are a set of built-in functions provided by Apache Spark for performing various operations on DataFrame and Dataset objects in Spark SQL. These functions enable users to manipulate and analyze data within Spark SQL queries, providing a wide range of functionalities similar to those found in traditional SQL databases.
Spark also includes additional built-in functions that are less common and are not covered here. You can still access them through the expr() API by calling them in a SQL expression string. For example, regr_count is built in but not listed here because it is less commonly used; to invoke it, use expr("regr_count(yCol, xCol)").
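Here is a minimal sketch of calling such a function through expr(). The DataFrame and its column names are made up for illustration, and regr_count as a SQL function requires a reasonably recent Spark version (3.3+).

// Set up a local SparkSession (for illustration)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder()
  .appName("ExprExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: (yCol, xCol) pairs
val df = Seq((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).toDF("yCol", "xCol")

// regr_count has no dedicated Scala function, so call it via a SQL expression string
df.select(expr("regr_count(yCol, xCol)")).show()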
Utilize these standard library functions provided by Spark SQL whenever feasible. They offer enhanced compile-time safety, manage null values efficiently, and generally deliver superior performance compared to User-Defined Functions (UDFs). In scenarios where performance is crucial, it’s advisable to minimize reliance on custom UDFs, as their performance is not assured.
Spark groups these SQL functions into the following categories.
- String Functions
- Date & Time Functions
- Collection Functions
- Math Functions
- Aggregate Functions
- Window Functions
- Sorting Functions
Importing SQL Functions in Scala
In Spark with Scala, all of these functions are part of org.apache.spark.sql.functions and return the org.apache.spark.sql.Column type. To use them, add the following import.
// Import all SQL Functions
import org.apache.spark.sql.functions._
Alternatively, you can import a specific function in Scala using the snippet below.
// Import a specific function
// This imports col() function
import org.apache.spark.sql.functions.col
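For instance, with col() imported you can reference columns by name. A minimal sketch with made-up data, assuming the SparkSession spark and spark.implicits._ from the earlier snippet are in scope:

// col() returns a Column, usable in select(), filter(), etc.
val people = Seq(("James", 30), ("Anna", 25)).toDF("name", "age")
people.select(col("name")).show()
people.filter(col("age") > 26).show()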
String Functions
String functions are used to manipulate string data within DataFrame and Dataset objects. With functions like substring, concat, and length, you can extract substrings, concatenate strings, and determine string lengths, among other operations.
These enable efficient string manipulation in Spark SQL queries, facilitating tasks such as data cleansing, transformation, and analysis. Leveraging these ensures concise, readable code and efficient processing of string data within Apache Spark’s distributed computing framework.
These are grouped as “string_funcs” in Spark SQL; a short example follows the table below.
STRING FUNCTIONS | DESCRIPTION |
---|---|
ascii(column) | Returns the ASCII code of the first character of the string as an int column. |
base64(column) | Returns the BASE64 encoding of a binary column as a string column. To reverse, use unbase64. |
concat_ws(sep, exprs*) | Concatenates multiple input string columns together into a single string column, using the given separator. |
decode(column, charset) | Decodes the first argument from binary into a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). |
encode(column, charset) | Encodes the first argument from a string into binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). |
format_number(column, d) | Formats the numeric column to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column. |
format_string(string, arguments*) | Formats the arguments in printf-style and returns the result as a string column. |
initcap(column) | Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, “hello world” will become “Hello World”. |
instr(str, substring) | Locates the position of the first occurrence of substring in the given string column. Returns null if either argument is null. |
length(column) | Computes the character length of a given string or the number of bytes of a binary string. The length of character strings includes trailing spaces; the length of binary strings includes binary zeros. |
lower(column) | Converts a string column to lower case. |
levenshtein(l, r) | Computes the Levenshtein distance of the two given string columns. |
locate(substr, column) | Locates the position of the first occurrence of substr in a string column. |
locate(substr, column, pos) | Locates the position of the first occurrence of substr in a string column, after position pos. |
lpad(column, len, pad) | Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. |
ltrim(column) | Trim the spaces from left end for the specified string value. |
regexp_extract(column, exp, groupIdx) | Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned. |
regexp_replace(column, pattern, replacement) | Replace all substrings of the specified string value that match regexp with rep. |
unbase64(column) | Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64. |
rpad(column, len, pad) | Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. |
repeat(column, n: Int) | Repeats a string column n times, and returns it as a new string column. |
rtrim(column) | Trim the spaces from right end for the specified string value. |
rtrim(column, trimString) | Trim the specified character string from right end for the specified string column. |
soundex(column) | Returns the Soundex code for the specified expression. |
split(column, regex) | Splits str around matches of the given regex. |
split(column, regex, limit) | Splits str around matches of the given regex. The limit parameter controls the number of times the regex is applied. |
substring(column, pos, len) | Returns the substring that starts at `pos` and is of length `len` when str is String type, or the slice of the byte array that starts at `pos` and is of length `len` when str is Binary type. |
substring_index(column, delim, count) | Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim. |
overlay(column, replaceString, pos, len) | Overlays the specified portion of the source string with `replaceString`, starting from byte position `pos` and proceeding for `len` bytes. |
overlay(column, replaceString, pos) | Overlays the specified portion of the source string with `replaceString`, starting from byte position `pos`. |
translate(column, matchingString, replaceString) | Translates any character in the source string that matches a character in `matchingString` with the corresponding character in `replaceString`. |
trim(column) | Trim the spaces from both ends for the specified string column. |
trim(column, trimString) | Trim the specified character from both ends for the specified string column. |
upper(column) | Converts a string column to upper case. |
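To tie a few of these together, here is a minimal sketch using concat_ws, initcap, length, and substring. The data is made up, and the SparkSession spark, spark.implicits._, and the functions import from the earlier snippets are assumed to be in scope.

// Combine, capitalize, measure, and slice string columns
val names = Seq(("john", "doe"), ("jane", "smith")).toDF("first", "last")
names.select(
  initcap(concat_ws(" ", col("first"), col("last"))).as("full_name"), // e.g. "John Doe"
  length(col("last")).as("last_len"),         // character length of the last name
  substring(col("first"), 1, 2).as("prefix")  // first two characters
).show()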
Date and Time Functions
These Date and Time functions enable users to perform operations such as date arithmetic, formatting, extracting components (e.g., year, month), and converting between different date formats. They support various date and time units, including days, months, years, hours, minutes, and seconds. A short sketch follows each of the tables below.
The default date format is “yyyy-MM-dd” for date data types. When you work with date columns in Spark DataFrames, dates are typically represented and manipulated using this default format.
Timestamps represent date and time together. The default timestamp format in Spark is “yyyy-MM-dd HH:mm:ss”. This format includes both date and time components, with hours (HH), minutes (mm), and seconds (ss) represented in the 24-hour clock format.
DATE FUNCTIONS | DESCRIPTION |
---|---|
current_date() | Returns the current date as a date column. |
date_format(dateExpr,format) | Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument. |
to_date(column) | Converts the column into `DateType` by casting rules to `DateType`. |
to_date(column, format) | Converts the column into a `DateType` with a specified format |
add_months(column, numMonths) | Returns the date that is `numMonths` after `startDate`. |
date_add(column, days) date_sub(column, days) | date_add returns the date that is `days` days after `start`, while date_sub returns the date that is `days` days before `start`. |
datediff(end, start) | Returns the number of days from `start` to `end`. |
months_between(end, start) | Returns number of months between dates `start` and `end`. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month. |
months_between(end, start, roundOff) | Returns number of months between dates `end` and `start`. If `roundOff` is set to true, the result is rounded off to 8 digits; it is not rounded otherwise. |
next_day(column, dayOfWeek) | Returns the first date which is later than the value of the `date` column that is on the specified day of the week. For example, `next_day(‘2015-07-27’, “Sunday”)` returns 2015-08-02 because that is the first Sunday after 2015-07-27. |
trunc(column, format) | Returns date truncated to the unit specified by the format. For example, `trunc(“2018-11-19 12:01:19”, “year”)` returns 2018-01-01 format: ‘year’, ‘yyyy’, ‘yy’ to truncate by year, ‘month’, ‘mon’, ‘mm’ to truncate by month |
date_trunc(format, timestamp) | Returns timestamp truncated to the unit specified by the format. For example, `date_trunc(“year”, “2018-11-19 12:01:19”)` returns 2018-01-01 00:00:00 format: ‘year’, ‘yyyy’, ‘yy’ to truncate by year, ‘month’, ‘mon’, ‘mm’ to truncate by month, ‘day’, ‘dd’ to truncate by day, Other options are: ‘second’, ‘minute’, ‘hour’, ‘week’, ‘month’, ‘quarter’ |
year(column) | Extracts the year as an integer from a given date/timestamp/string. |
quarter(column) | Extracts the quarter as an integer from a given date/timestamp/string. |
month(column) | Extracts the month as an integer from a given date/timestamp/string. |
dayofweek(column) | Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday |
dayofmonth(column) | Extracts the day of the month as an integer from a given date/timestamp/string. |
dayofyear(column) | Extracts the day of the year as an integer from a given date/timestamp/string. |
weekofyear(column) | Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601 |
last_day(column) | Returns the last day of the month which the given date belongs to. For example, input “2015-07-27” returns “2015-07-31” since July 31 is the last day of the month in July 2015. |
from_unixtime(column) | Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format. |
from_unixtime(column, f) | Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format. |
unix_timestamp() | Returns the current Unix timestamp (in seconds) as a long. |
unix_timestamp(column) | Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale. |
unix_timestamp(column, p) | Converts time string with given pattern to Unix timestamp (in seconds). |
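A minimal sketch of a few date functions in action (made-up data; assumes the earlier SparkSession and imports):

// Parse, shift, diff, and format dates
val orders = Seq("2024-01-15", "2024-03-01").toDF("order_date")
orders.select(
  to_date(col("order_date")).as("d"),
  date_add(to_date(col("order_date")), 10).as("plus_10_days"),
  datediff(current_date(), to_date(col("order_date"))).as("age_in_days"),
  date_format(to_date(col("order_date")), "MMM yyyy").as("month_label")
).show()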
TIMESTAMP FUNCTIONS | DESCRIPTION |
---|---|
current_timestamp() | Returns the current timestamp as a timestamp column. |
hour(column) | Extracts the hours as an integer from a given date/timestamp/string. |
minute(column) | Extracts the minutes as an integer from a given date/timestamp/string. |
second(column) | Extracts the seconds as an integer from a given date/timestamp/string. |
to_timestamp(column) | Converts to a timestamp by casting rules to `TimestampType`. |
to_timestamp(column, fmt) | Converts time string with the given pattern to timestamp. |
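And a short sketch of the timestamp functions (same assumptions as above):

// Parse a timestamp string and extract its time components
val events = Seq("2024-01-15 13:45:30").toDF("ts_str")
events.select(
  to_timestamp(col("ts_str")).as("ts"), // uses the default yyyy-MM-dd HH:mm:ss format
  hour(to_timestamp(col("ts_str"))).as("hour"),
  minute(to_timestamp(col("ts_str"))).as("minute"),
  second(to_timestamp(col("ts_str"))).as("second")
).show()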
SQL Collection Functions
Collection functions in Spark SQL are used when working with array and map columns in DataFrames. These functions enable users to perform various operations on array and map columns efficiently, such as filtering, transforming, aggregating, and accessing elements. A short sketch follows each of the tables below.
Array Functions | Description |
---|---|
array() | Creates a new array from the given input columns. |
array_contains() | Returns true if the array contains the given value. |
array_append() | Appends the element to the source array and returns an array containing all elements. The new element/column is added at the end of the array. |
array_insert() | Returns an array after adding the element at the specified position. |
arrays_overlap() | Returns true if a1 and a2 have at least one non-null element in common. |
array_distinct() | Removes duplicate values from the array. |
array_except() | Computes the set difference of the two arrays. |
array_intersect() | Computes the set intersection of the two arrays. |
array_join() | Concatenates the elements of the given array using a delimiter. |
array_max() | Returns the maximum value in the array. |
array_min() | Returns the minimum value in the array. |
array_size() | Returns the size of an array. |
array_position() | Returns the (1-based) index of the first occurrence of the given value in the array. |
array_remove() | Removes all elements that are equal to the given element from the array. |
array_compact() | Removes all null elements from the input array. |
array_prepend() | Returns an array with the given element added at the beginning. |
array_repeat() | Creates an array containing the given element repeated n times. |
array_sort() | Sorts the input array in ascending order according to the natural ordering of the array elements. |
array_union() | Computes the union of the two arrays. |
element_at() | Returns element of the array at the given index. |
slice() | Returns sliced array. |
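A minimal sketch of a few array functions (made-up data; assumes the earlier SparkSession and imports):

// Deduplicate, test membership, measure, and flatten an array column
val tags = Seq((1, Seq("spark", "sql", "spark")), (2, Seq("scala"))).toDF("id", "tags")
tags.select(
  col("id"),
  array_distinct(col("tags")).as("unique_tags"),
  array_contains(col("tags"), "spark").as("has_spark"),
  size(col("tags")).as("tag_count")
).show()

// explode() turns each array element into its own row
tags.select(col("id"), explode(col("tags")).as("tag")).show()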
Map Functions | Description |
---|---|
map() | Creates a new map column. |
map_keys() | Returns an array containing the keys of the map. |
map_values() | Returns an array containing the values of the map. |
map_concat() | Merges maps specified in arguments. |
map_from_entries() | Returns a map from the given array of StructType entries. |
map_entries() | Returns an array of all StructType in the given map. |
explode() | Creates a new row for every key-value pair in the map, ignoring null & empty maps. It creates two new columns, one for the key and one for the value. |
explode_outer() | Creates a new row for every key-value pair in the map, including null & empty maps. It creates two new columns, one for the key and one for the value. |
posexplode() | Creates a new row for each key-value pair in a map, ignoring null & empty maps. It creates three columns: “pos” to hold the position of the map element, plus “key” and “value” columns for every row. |
posexplode_outer() | Creates a new row for each key-value pair in a map, including null & empty maps. It creates three columns: “pos” to hold the position of the map element, plus “key” and “value” columns for every row. |
transform_keys() | Transforms a map by applying a function to every key-value pair and returns a transformed map. |
transform_values() | Transforms a map by applying a function to every key-value pair and returns a transformed map. |
map_zip_with() | Merges two maps into a single map, key-wise, using a function. |
element_at() | Returns a value of a key in a map. |
size() | Returns the length of a map column. |
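A minimal sketch of a few map functions (made-up data; assumes the earlier SparkSession and imports):

// Extract keys and values, then flatten the map into rows
val props = Seq((1, Map("color" -> "red", "size" -> "M"))).toDF("id", "attrs")
props.select(
  col("id"),
  map_keys(col("attrs")).as("keys"),
  map_values(col("attrs")).as("values")
).show()

// explode() produces one row per key-value pair, in `key` and `value` columns
props.select(col("id"), explode(col("attrs"))).show()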
Log Functions
LOG MATH FUNCTIONS | DESCRIPTION |
---|---|
log(columnName) | Computes the natural logarithm of the given column. |
log(base, column) log(base, columnName) | Returns the first argument-base logarithm of the second argument. |
log10(column) log10(columnName) | Computes the logarithm of the given value in base 10. |
log1p(column) log1p(columnName) | Computes the natural logarithm of the given value plus one. |
log2(column) log2(columnName) | Computes the logarithm of the given column in base 2. |
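A minimal sketch of the log functions (made-up data; assumes the earlier SparkSession and imports):

// Natural, base-2, base-10, and log(1 + x) logarithms
val nums = Seq(1.0, 10.0, 100.0).toDF("x")
nums.select(
  log(col("x")).as("ln_x"),
  log(2.0, col("x")).as("log2_x"), // two-argument form: first argument is the base
  log10(col("x")).as("log10_x"),
  log1p(col("x")).as("ln_1_plus_x")
).show()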
Aggregate Functions
AGGREGATE FUNCTIONS | DESCRIPTION |
---|---|
approx_count_distinct(column) | Returns the approximate count of distinct items in a group. |
approx_count_distinct(column, rsd) | Returns the approximate count of distinct items in a group, where rsd is the maximum relative standard deviation allowed. |
avg(column) | Returns the average of values in the input column. |
collect_list(column) | Returns all values from an input column with duplicates. |
collect_set(column) | Returns all values from an input column with duplicate values eliminated. |
corr(column1, column2) | Returns the Pearson Correlation Coefficient for two columns. |
count(e: Column) | Returns number of elements in a column. |
countDistinct(column, exprs: Column*) | Returns number of distinct elements in the columns. |
covar_pop(column1, column2) | Returns the population covariance for two columns. |
covar_samp(column1, column2) | Returns the sample covariance for two columns. |
first(column, ignoreNulls) | Returns the first element in a column. When ignoreNulls is set to true, it returns the first non-null element. |
first(column) | Returns the first element in a column. |
grouping(column) | Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set. |
kurtosis(column) | Returns the kurtosis of the values in a group. |
last(column, ignoreNulls) | Returns the last element in a column. When ignoreNulls is set to true, it returns the last non-null element. |
last(column) | Returns the last element in a column. |
max(column) | Returns the maximum value in a column. |
mean(column) | Alias for avg(). Returns the average of the values in a column. |
min(column) | Returns the minimum value in a column. |
skewness(column) | Returns the skewness of the values in a group. |
stddev(column) | Alias for `stddev_samp`. |
stddev_samp(column) | Returns the sample standard deviation of values in a column. |
stddev_pop(column) | Returns the population standard deviation of the values in a column. |
sum(column) | Returns the sum of all values in a column. |
sumDistinct(column) | Returns the sum of all distinct values in a column. |
variance(column) | Alias for `var_samp`. |
var_samp(column) | Returns the unbiased variance of the values in a column. |
var_pop(column) | Returns the population variance of the values in a column. |
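A minimal sketch combining several aggregate functions with groupBy() (made-up data; assumes the earlier SparkSession and imports):

// Aggregate salaries per department
val salaries = Seq(
  ("Sales", "James", 3000),
  ("Sales", "Anna", 4600),
  ("IT", "Robert", 4100)
).toDF("dept", "name", "salary")

salaries.groupBy("dept").agg(
  avg("salary").as("avg_salary"),
  max("salary").as("max_salary"),
  countDistinct("name").as("employees"),
  collect_list("name").as("names")
).show()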
Window Functions
Spark SQL window functions are used for performing calculations across a group of rows, known as a window, within a DataFrame. These functions allow users to perform aggregations, ranking, and analytical calculations over partitions of data defined by a window specification. Window functions operate on a set of rows related to the current row, rather than on the entire dataset. The table below covers the time-based window() function, which buckets rows into fixed time windows; a short sketch follows the table.
WINDOW FUNCTIONS | DESCRIPTION |
---|---|
window(timeColumn, windowDuration, slideDuration, startTime) | Bucketizes rows into one or more time windows given a timestamp column. Window starts are inclusive but window ends are exclusive. |
window(timeColumn, windowDuration, slideDuration) | Bucketizes rows into one or more time windows given a timestamp column. Window starts are inclusive but window ends are exclusive. |
window(timeColumn, windowDuration) | Generates tumbling time windows given a timestamp column. Window starts are inclusive but window ends are exclusive. |
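A minimal sketch of time-window bucketing with window() (made-up data; assumes the earlier SparkSession and imports):

// Count clicks per 10-minute tumbling window
val clicks = Seq(
  ("2024-01-15 10:01:00", 1),
  ("2024-01-15 10:04:00", 1),
  ("2024-01-15 10:12:00", 1)
).toDF("ts_str", "clicks")
 .withColumn("ts", to_timestamp(col("ts_str")))

clicks.groupBy(window(col("ts"), "10 minutes")).sum("clicks").show(false)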
Sorting Functions
SORT FUNCTIONS | DESCRIPTION |
---|---|
asc(columnName) | asc is used to specify the ascending order of the sorting column on a DataFrame or Dataset. |
asc_nulls_first(columnName) | Similar to asc, but null values are returned first, followed by non-null values. |
asc_nulls_last(columnName) | Similar to asc, but non-null values are returned first, followed by null values. |
desc(columnName) | desc is used to specify the descending order of the sorting column on a DataFrame or Dataset. |
desc_nulls_first(columnName) | Similar to desc, but null values are returned first, followed by non-null values. |
desc_nulls_last(columnName) | Similar to desc, but non-null values are returned first, followed by null values. |
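A minimal sketch of the sort functions, including null placement (made-up data; assumes the earlier SparkSession and imports):

// Sort descending by score with nulls last, then ascending by name
val scores = Seq(("Anna", Some(90)), ("James", None), ("Robert", Some(75)))
  .toDF("name", "score")
scores.sort(desc_nulls_last("score")).show()
scores.sort(asc("name")).show()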
Conclusion
In conclusion, this article has provided a comprehensive overview of Spark SQL functions, showcasing their versatility and power in data processing tasks within the Apache Spark framework. By using these functions you can efficiently manipulate and transform data using Spark SQL.
Happy Learning!!
Related Articles
- Spark Date and Time Functions
- Spark String Functions
- Spark Array Functions
- Spark Map Functions
- Spark Aggregate Functions
- Spark Window Functions
- Spark Sort Functions