
Spark SQL defines built-in standard String functions in DataFrame API, these String functions come in handy when we need to make operations on Strings. In this article, we will learn the usage of some functions with scala example. You can access the standard functions using the following import statement.
// Import spark sql functions
import org.apache.spark.sql.functions._
When possible try to leverage standard library functions as they are little bit more compile-time safety, handles null and performs better when compared to user-defined functions. If your application is critical on performance try to avoid using custom UDF functions at all costs as these are not guarantee on performance.
Spark SQL String Functions:
Click on each link from below table for more explanation and working examples of String Function with Scala example.
String Function Signature | String Function Description |
---|---|
ascii(e: Column): Column | Computes the numeric value of the first character of the string column, and returns the result as an int column. |
base64(e: Column): Column | Computes the BASE64 encoding of a binary column and returns it as a string column.This is the reverse of unbase64. |
concat_ws(sep: String, exprs: Column*): Column | Concatenates multiple input string columns together into a single string column, using the given separator. |
decode(value: Column, charset: String): Column | Computes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). |
encode(value: Column, charset: String): Column | Computes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). |
format_number(x: Column, d: Int): Column | Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column. |
format_string(format: String, arguments: Column*): Column | Formats the arguments in printf-style and returns the result as a string column. |
initcap(e: Column): Column | Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, "hello world" will become "Hello World". |
instr(str: Column, substring: String): Column | Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null. |
length(e: Column): Column | Computes the character length of a given string or number of bytes of a binary string. The length of character strings include the trailing spaces. The length of binary strings includes binary zeros. |
lower(e: Column): Column | Converts a string column to lower case. |
levenshtein ( l : Column , r : Column ) : Column | Computes the Levenshtein distance of the two given string columns. |
locate(substr: String, str: Column): Column | Locate the position of the first occurrence of substr. |
locate(substr: String, str: Column, pos: Int): Column | Locate the position of the first occurrence of substr in a string column, after position pos. |
lpad(str: Column, len: Int, pad: String): Column | Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. |
ltrim(e: Column): Column | Trim the spaces from left end for the specified string value. |
regexp_extract(e: Column, exp: String, groupIdx: Int): Column | Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned. |
regexp_replace(e: Column, pattern: String, replacement: String): Column | Replace all substrings of the specified string value that match regexp with rep. |
regexp_replace(e: Column, pattern: Column, replacement: Column): Column | Replace all substrings of the specified string value that match regexp with rep. |
unbase64(e: Column): Column | Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64. |
rpad(str: Column, len: Int, pad: String): Column | Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. |
repeat(str: Column, n: Int): Column | Repeats a string column n times, and returns it as a new string column. |
rtrim(e: Column): Column | Trim the spaces from right end for the specified string value. |
rtrim(e: Column, trimString: String): Column | Trim the specified character string from right end for the specified string column. |
soundex(e: Column): Column | Returns the soundex code for the specified expression |
split(str: Column, regex: String): Column | Splits str around matches of the given regex. |
split(str: Column, regex: String, limit: Int): Column | Splits str around matches of the given regex. |
substring(str: Column, pos: Int, len: Int): Column | Substring starts at `pos` and is of length `len` when str is String type or returns the slice of byte array that starts at `pos` in byte and is of length `len` when str is Binary type |
substring_index(str: Column, delim: String, count: Int): Column | Returns the substring from string str before count occurrences of the delimiter delim. * If count is positive, everything the left of the final delimiter (counting from left) is * returned. If count is negative, every to the right of the final delimiter (counting from the * right) is returned. substring_index performs a case-sensitive match when searching for delim. |
overlay(src: Column, replaceString: String, pos: Int, len: Int): Column | Overlay the specified portion of `src` with `replaceString`, * starting from byte position `pos` of `inputString` and proceeding for `len` bytes. |
overlay(src: Column, replaceString: String, pos: Int): Column | Overlay the specified portion of `src` with `replaceString`, * starting from byte position `pos` of `inputString`. |
translate(src: Column, matchingString: String, replaceString: String): Column | * The characters in replaceString correspond to the characters in matchingStrinTranslate any character in the src by a character in replaceString. g. * The translate will happen when any character in the string matches the character * in the `matchingString`. |
trim(e: Column): Column | Trim the spaces from both ends for the specified string column. |
trim(e: Column, trimString: String): Column | Trim the specified character from both ends for the specified string column. |
upper(e: Column): Column | Converts a string column to upper case. |
Conclusion:
In this post, I’ve consolidated the complete list of Spark SQL String functions with a description and example of some commonly used functions. You can find more information about these at the following blog
Happy Learning !!