In Spark, you can use the length() function together with the substring() function to extract a substring whose size depends on the length of a string column. This article covers the length() function, the substring() function, and how to use length() inside substring() in Spark Scala.
1. Quick Example
Here's an example of how to use the length() function together with substring() in Spark Scala:
import org.apache.spark.sql.functions.{length, lit}
import spark.implicits._ // enables toDF and the $"col" syntax (pre-imported in spark-shell)
// create a sample dataframe
val data = Seq("hello world", "foo bar", "spark is great")
val df = data.toDF("text")
// extract a substring whose length equals the length of the string.
// Column.substr accepts Column arguments; the substring() function only takes Int pos/len in older Spark releases
df.select($"text".substr(lit(1), length($"text")).alias("substring")).show()
// the substring spans the whole string, so the output is:
+--------------+
|     substring|
+--------------+
|   hello world|
|       foo bar|
|spark is great|
+--------------+
2. length() Function
In Spark, the length() function returns the number of characters in a string column, or the number of bytes in a binary column. It takes one argument: the input column or expression.
The syntax for the length function is:
// Syntax of length
length(str: Column): Column
Where str is the input column or string expression whose length is calculated.
Here's an example of how to use the length() function in Spark Scala:
import org.apache.spark.sql.functions.length
// create a sample dataframe
val data = Seq(("hello world"), ("foo bar"), ("spark is great"))
val df = data.toDF("text")
// calculate the length of the text column
df.select(length($"text").alias("text_length")).show()
In this example, we create a DataFrame with a single column named text, and then use the length() function from the org.apache.spark.sql.functions package to calculate the length of the text column. The output of the show() method is:
+-----------+
|text_length|
+-----------+
|         11|
|          7|
|         14|
+-----------+
As you can see, the length() function returns the length of each string in the text column.
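Beyond select(), the result of length() can be used like any other column expression. The following is a minimal sketch, assuming a local SparkSession and illustrative sample data and column names (text, text_length) that are not part of the example above. It adds the length as a new column and filters on it; trailing spaces count toward the length, and a null input gives a null length.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.length

// assumption: a local session for experimentation; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("length-sketch").getOrCreate()
import spark.implicits._

// illustrative data: one value with trailing spaces and one null
val samples = Seq("hello world", "foo bar", "pad  ", null).toDF("text")

// add the length as a new column; "pad  " has length 5, the null row has a null length
val withLength = samples.withColumn("text_length", length($"text"))
withLength.show()

// keep only the rows whose text is longer than 6 characters
val longTexts = withLength
  .filter(length($"text") > 6)
  .select("text")
  .collect()
  .map(_.getString(0))
  .toSeq
```

The null row is dropped by the filter because comparing a null length with 6 does not evaluate to true.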
3. substring() Function
In Spark, the substring() function is used to extract a part of a string based on a starting position and a length.
The syntax of the substring() function in Spark Scala is as follows:
// Syntax
substring(str: Column, pos: Int, len: Int): Column
Where str is the input column or string expression, pos is the starting position of the substring (starting from 1), and len is the length of the substring.
Example usage:
import org.apache.spark.sql.functions.substring
val df = Seq(
("Hello"),
("Spark Scala"),
(""),
(null)
).toDF("text")
df.select(substring($"text", 2, 3)).show()
//Output
+---------------------+
|substring(text, 2, 3)|
+---------------------+
|                  ell|
|                  par|
|                     |
|                 null|
+---------------------+
Here, we are using the substring() function to extract a substring of length 3 starting from the second position in the text column of the DataFrame df. The resulting DataFrame contains the extracted substring for each string in the text column.
Note: If the starting position pos is greater than the length of the string, an empty string is returned. A negative pos is counted from the end of the string. If the length len is zero or negative, the result is an empty string rather than an error.
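The boundary behaviour of substring() is easy to check on a single value. The sketch below assumes a local SparkSession and uses made-up alias names. It tries a start position past the end of the string, a negative start position (counted from the end), and a length longer than the remaining characters.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.substring

// assumption: a local session for experimentation; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("substring-edges").getOrCreate()
import spark.implicits._

val df = Seq("Hello").toDF("text")

val edgeCases = df.select(
  substring($"text", 10, 3).alias("pos_past_end"), // "" because position 10 is beyond the 5 characters
  substring($"text", -3, 3).alias("negative_pos"), // "llo" because -3 starts 3 characters from the end
  substring($"text", 4, 10).alias("len_too_long")  // "lo" because the result stops at the end of the string
)
edgeCases.show()
```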
4. Using length() Inside substring() in Spark
We can use the length() function together with the substring() function in Spark Scala to extract a substring whose position or size depends on the length of each value.
Let us look at some examples.
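4.1 Example 1: Extract the last 3 characters of a string
import org.apache.spark.sql.functions.{length, lit}
val df = Seq(
  ("Hello"),
  ("Spark Scala"),
  (""),
  (null)
).toDF("text")
// Column.substr accepts Column arguments for both position and length;
// the substring() function only takes Int values in older Spark releases
df.select($"text".substr(length($"text") - 2, lit(3))).show()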
//Output
+--------------------------------------+
|substring(text, (length(text) - 2), 3)|
+--------------------------------------+
|                                   llo|
|                                   ala|
|                                      |
|                                  null|
+--------------------------------------+
Here, we use the length() function to get the length of each string in the text column and subtract 2 from it, which gives the starting position of the last 3 characters. That value is passed as the start position (pos) of the substring, together with a length of 3, to extract the last 3 characters of the string.
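As a cross-check, the same result can be obtained without length() by passing a negative start position, which substring() counts from the end of the string. The sketch below assumes a local SparkSession; the alias last3 is illustrative. It computes the last 3 characters both ways, using Column.substr for the length-based variant because it accepts Column arguments.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{length, lit, substring}

// assumption: a local session for experimentation; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("last-three").getOrCreate()
import spark.implicits._

val df = Seq("Hello", "Spark Scala", "", null).toDF("text")

// variant 1: start position computed from the length of each value
val viaLength = df.select($"text".substr(length($"text") - 2, lit(3)).alias("last3"))

// variant 2: a negative start position is counted from the end of the string
val viaNegativePos = df.select(substring($"text", -3, 3).alias("last3"))

viaLength.show()
viaNegativePos.show()
```

The length-based form is still the one to reach for when the offset itself varies per row, for example when it comes from another column.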
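4.2 Example 2: Extract all characters after a certain position
import org.apache.spark.sql.functions.{length, lit}
val df = Seq(
  ("Hello"),
  ("Spark Scala"),
  (""),
  (null)
).toDF("text")
// Column.substr accepts Column arguments for both position and length;
// the substring() function only takes Int values in older Spark releases
df.select($"text".substr(lit(4), length($"text"))).show()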
//Output
+--------------------------------+
|substring(text, 4, length(text))|
+--------------------------------+
|                              lo|
|                        rk Scala|
|                                |
|                            null|
+--------------------------------+
Here, we use the length() function to calculate the length of the string in the text column and pass it as the length (len) of the substring, with a start position of 4, to extract all characters after the third position.
Note that len is a number of characters, not an ending position. When fewer than len characters remain after the start position, the substring simply stops at the end of the string, so there is no need to subtract the skipped characters from length().
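To see how the length argument behaves, the sketch below compares length() with an arbitrarily large constant as the substring length, and also shows length() arithmetic in that argument to drop the last 2 characters. It assumes a local SparkSession, and the alias names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{length, lit, substring}

// assumption: a local session for experimentation; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("variable-length").getOrCreate()
import spark.implicits._

val df = Seq("Hello", "Spark Scala").toDF("text")

val result = df.select(
  // from position 4 to the end, using the string length as the substring length
  $"text".substr(lit(4), length($"text")).alias("from_4_len_is_length"),
  // same result with any length that is large enough
  substring($"text", 4, Int.MaxValue).alias("from_4_len_is_max"),
  // everything except the last 2 characters: "Hel" and "Spark Sca"
  $"text".substr(lit(1), length($"text") - 2).alias("without_last_2")
)
result.show()
```

The first two columns are identical, which shows that the length argument only sets an upper bound on how many characters are returned.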
5. Conclusion
In conclusion, combining the length() function with the substring() function in Spark Scala lets you extract substrings of variable length from a string column in a DataFrame. Both functions are useful for manipulating string data in a wide range of applications, from data cleaning and preprocessing to feature engineering and model building.