Spark SQL String Functions Explained

Spark SQL String Functions
Spark SQL String Functions

Spark SQL defines built-in standard String functions in DataFrame API, these String functions come in handy when we need to make operations on Strings. In this article, we will learn the usage of some functions with scala example. You can access the standard functions using the following import statement.


// Import spark sql functions
import  org.apache.spark.sql.functions._

When possible try to leverage standard library functions as they are little bit more compile-time safety, handles null and performs better when compared to user-defined functions. If your application is critical on performance try to avoid using custom UDF functions at all costs as these are not guarantee on performance.

Related: If you are looking for PySpark, please refer to PySpark SQL String Functions

Spark SQL String Functions:

Click on each link from below table for more explanation and working examples of String Function with Scala example.

STRING FUNCTION SIGNATURESTRING FUNCTION DESCRIPTION
ascii(e: Column): ColumnComputes the numeric value of the first character of the string column, and returns the result as an int column.
base64(e: Column): ColumnComputes the BASE64 encoding of a binary column and returns it as a string column.This is the reverse of unbase64.
concat_ws(sep: String, exprs: Column*): ColumnConcatenates multiple input string columns together into a single string column, using the given separator.
decode(value: Column, charset: String): ColumnComputes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
encode(value: Column, charset: String): ColumnComputes the first argument into a binary from a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
format_number(x: Column, d: Int): ColumnFormats numeric column x to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.
format_string(format: String, arguments: Column*): ColumnFormats the arguments in printf-style and returns the result as a string column.
initcap(e: Column): ColumnReturns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, “hello world” will become “Hello World”.
instr(str: Column, substring: String): ColumnLocate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
length(e: Column): ColumnComputes the character length of a given string or number of bytes of a binary string. The length of character strings include the trailing spaces. The length of binary strings includes binary zeros.
lower(e: Column): ColumnConverts a string column to lower case.
levenshtein ( l : Column , r : Column ) : ColumnComputes the Levenshtein distance of the two given string columns.
locate(substr: String, str: Column): ColumnLocate the position of the first occurrence of substr.
locate(substr: String, str: Column, pos: Int): ColumnLocate the position of the first occurrence of substr in a string column, after position pos.
lpad(str: Column, len: Int, pad: String): ColumnLeft-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
ltrim(e: Column): ColumnTrim the spaces from left end for the specified string value.
regexp_extract(e: Column, exp: String, groupIdx: Int): ColumnExtract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
regexp_replace(e: Column, pattern: String, replacement: String): ColumnReplace all substrings of the specified string value that match regexp with rep.
regexp_replace(e: Column, pattern: Column, replacement: Column): ColumnReplace all substrings of the specified string value that match regexp with rep.
unbase64(e: Column): ColumnDecodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.
rpad(str: Column, len: Int, pad: String): ColumnRight-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
repeat(str: Column, n: Int): ColumnRepeats a string column n times, and returns it as a new string column.
rtrim(e: Column): ColumnTrim the spaces from right end for the specified string value.
rtrim(e: Column, trimString: String): ColumnTrim the specified character string from right end for the specified string column.
soundex(e: Column): ColumnReturns the soundex code for the specified expression
split(str: Column, regex: String): ColumnSplits str around matches of the given regex.
split(str: Column, regex: String, limit: Int): ColumnSplits str around matches of the given regex.
substring(str: Column, pos: Int, len: Int): ColumnSubstring starts at `pos` and is of length `len` when str is String type or returns the slice of byte array that starts at `pos` in byte and is of length `len` when str is Binary type
substring_index(str: Column, delim: String, count: Int): ColumnReturns the substring from string str before count occurrences of the delimiter delim.
* If count is positive, everything the left of the final delimiter (counting from left) is
* returned. If count is negative, every to the right of the final delimiter (counting from the
* right) is returned. substring_index performs a case-sensitive match when searching for delim.
overlay(src: Column, replaceString: String, pos: Int, len: Int): ColumnOverlay the specified portion of `src` with `replaceString`,
* starting from byte position `pos` of `inputString` and proceeding for `len` bytes.
overlay(src: Column, replaceString: String, pos: Int): ColumnOverlay the specified portion of `src` with `replaceString`,
* starting from byte position `pos` of `inputString`.
translate(src: Column, matchingString: String, replaceString: String): Column* The characters in replaceString correspond to the characters in matchingStrinTranslate any character in the src by a character in replaceString.
g.
* The translate will happen when any character in the string matches the character
* in the `matchingString`.
trim(e: Column): ColumnTrim the spaces from both ends for the specified string column.
trim(e: Column, trimString: String): ColumnTrim the specified character from both ends for the specified string column.
upper(e: Column): ColumnConverts a string column to upper case.

Conclusion:

In this post, I’ve consolidated the complete list of Spark SQL String functions with a description and example of some commonly used functions. You can find more information about these at the following blog

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply