Use length function in substring in Spark

In Spark, you can combine the length function with the substring function to extract a substring whose position or length depends on the length of a string column. In this article, we discuss the length function, the substring function, and how to use length inside substring in Spark.


1. Quick Example

Here’s an example of how to use the length function in combination with substring in Spark Scala:


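import org.apache.spark.sql.functions.expr
import spark.implicits._ // enables toDF; assumes a SparkSession named spark, as in spark-shell

// create a sample DataFrame
val data = Seq("hello world", "foo bar", "spark is great")
val df = data.toDF("text")

// extract a substring whose length equals the length of the string.
// In Spark 3.x the Scala substring(str, pos, len) takes Int arguments only,
// so the call that passes length(text) is written as a SQL expression.
df.select(expr("substring(text, 1, length(text))").alias("substring")).show()

// Each string is returned unchanged because the substring spans its full length: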
+--------------+
|     substring|
+--------------+
|   hello world|
|       foo bar|
|spark is great|
+--------------+

2. length() function

In Spark, the length() function returns the length of a given string or binary column: the number of characters for a string, or the number of bytes for binary data. It takes one argument, which is the input column or column expression.

The syntax for the length function is:


// Syntax of length
length(str: Column): Column

Where str is the input column or string expression for which the length is to be calculated.

Here’s an example of how to use the length function in Spark Scala:


import org.apache.spark.sql.functions.length
import spark.implicits._ // enables toDF and $"..."; assumes a SparkSession named spark

// create a sample dataframe
val data = Seq(("hello world"), ("foo bar"), ("spark is great"))
val df = data.toDF("text")

// calculate the length of the text column
df.select(length($"text").alias("text_length")).show()

In this example, we create a DataFrame with a single column named text, and then use the length function from the org.apache.spark.sql.functions package to calculate the length of the text column. The output of the show method is:


+-----------+
|text_length|
+-----------+
|         11|
|          7|
|         14|
+-----------+

As you can see, the length function returns the length of each string in the text column.
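Two edge cases are worth seeing alongside this: length() returns null for a null input, and for a binary column it counts bytes rather than characters. The following is a minimal sketch, not part of the original example. The sample values, the app name, and the column aliases chars and bytes are illustrative assumptions, and the code assumes Spark SQL is on the classpath (in spark-shell, getOrCreate() simply reuses the existing session).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.length

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("length-demo").getOrCreate()
import spark.implicits._

// "héllo" has 5 characters but 6 bytes in UTF-8, because é takes two bytes.
val df = Seq("hello", "héllo", null).toDF("text")

val lengths = df.select(
  $"text",
  length($"text").alias("chars"),               // character count for strings
  length($"text".cast("binary")).alias("bytes") // byte count for binary values
)
lengths.show()
```

For the null row, both chars and bytes come back as null rather than 0, which matters when the result feeds into a start position or length for substring().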

3. substring() function

In Spark, the substring() function is used to extract a part of a string based on the starting position and length.

The syntax for using substring() function in Spark Scala is as follows:


// Syntax
substring(str: Column, pos: Int, len: Int): Column

Where str is the input column or string expression, pos is the starting position of the substring (starting from 1), and len is the length of the substring.

Example usage:


import org.apache.spark.sql.functions.substring
import spark.implicits._ // enables toDF and $"..."; assumes a SparkSession named spark

val df = Seq(
  ("Hello"),
  ("Spark Scala"),
  (""),
  (null)
).toDF("text")

df.select(substring($"text", 2, 3)).show()

// Output
+---------------------+
|substring(text, 2, 3)|
+---------------------+
|                  ell|
|                  par|
|                     |
|                 null|
+---------------------+

Here, we are using the substring() function to extract a substring of length 3 starting from the second position in the text column of the DataFrame df. The resulting DataFrame will contain the extracted substrings for each string in the text column.

Note: If the starting position pos is greater than the length of the string, an empty string is returned. If the length len is zero or negative, an empty string is also returned; no exception is thrown. A negative pos counts backwards from the end of the string.
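The edge cases of pos and len are easy to check directly. The sketch below is an illustration under stated assumptions rather than part of the original article: the single-row DataFrame, the app name, and the column aliases are made up for the demonstration, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.substring

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("substring-edge-cases").getOrCreate()
import spark.implicits._

val df = Seq("Hello").toDF("text")

val edgeCases = df.select(
  substring($"text", 10, 3).alias("pos_past_end"), // pos beyond the end of the string
  substring($"text", 2, -1).alias("negative_len"), // negative length
  substring($"text", -3, 3).alias("negative_pos"), // negative pos counts from the end
  substring($"text", 0, 2).alias("zero_pos")       // pos 0 behaves like pos 1
)
edgeCases.show()
```

For "Hello", the first two columns are empty strings, negative_pos is "llo", and zero_pos is "He". None of these calls throws.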

4. Using the length function in substring in Spark

We can use the length() function in conjunction with the substring() function in Spark Scala to extract a substring of variable length.

Let us look at a couple of examples.

4.1 Example 1: Extract the last 3 characters of a string


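import org.apache.spark.sql.functions.expr
import spark.implicits._ // enables toDF; assumes a SparkSession named spark

val df = Seq("Hello", "Spark Scala", "", null).toDF("text")

// In Spark 3.x the Scala substring(str, pos, len) takes Int arguments only,
// so the length-based start position is written as a SQL expression.
df.select(expr("substring(text, length(text) - 2, 3)")).show()

// Output
+--------------------------------------+
|substring(text, (length(text) - 2), 3)|
+--------------------------------------+
|                                   llo|
|                                   ala|
|                                      |
|                                  null|
+--------------------------------------+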

Here, we use the length() function to compute the length of each string in the text column and subtract 2 from it to get the starting position of the last 3 characters. That value is passed as the pos argument of substring(), together with a len of 3, to extract the last 3 characters of the string.
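The same result can be reached through the typed Column API. The sketch below shows two alternatives and is offered as an illustration, not as the article's method: Column.substr accepts Column arguments, so length() can be passed to it directly, and a negative pos in substring() counts from the end of the string. The app name and the column aliases are illustrative assumptions, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{length, lit, substring}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("last-three").getOrCreate()
import spark.implicits._

val df = Seq("Hello", "Spark Scala", "", null).toDF("text")

val lastThree = df.select(
  // Column.substr takes Column arguments, so length() can be used for the start position
  $"text".substr(length($"text") - 2, lit(3)).alias("via_substr"),
  // a negative pos counts backwards from the end of the string
  substring($"text", -3, 3).alias("via_negative_pos")
)
lastThree.show()
```

Both columns give "llo" and "ala" for the first two rows, an empty string for the empty input, and null for the null input. The negative-pos form avoids computing the length at all, while substr is the more general choice when the start position really does depend on another column.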

4.2 Example 2: Extract all characters after a certain position


import org.apache.spark.sql.functions.expr
import spark.implicits._ // enables toDF; assumes a SparkSession named spark

val df = Seq("Hello", "Spark Scala", "", null).toDF("text")

// In Spark 3.x the Scala substring(str, pos, len) takes Int arguments only,
// so the length-based len argument is written as a SQL expression.
df.select(expr("substring(text, 4, length(text))")).show()

// Output
+--------------------------------+
|substring(text, 4, length(text))|
+--------------------------------+
|                              lo|
|                        rk Scala|
|                                |
|                            null|
+--------------------------------+

Here, we use the length() function to compute the length of each string in the text column and pass it as the len argument of substring() to extract all characters after the third position.

Note that len is a number of characters, not an ending position. Passing length(text) asks for more characters than remain from position 4 onward, and Spark simply stops at the end of the string, so there is no need to subtract anything from it.
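To make the role of the len argument concrete, the sketch below compares three ways of taking everything from position 4 onward: passing the full length, passing the exact remaining length (length minus 3), and omitting len altogether, which Spark SQL's substring(str, pos) allows. This is an illustration under stated assumptions, not the article's own code: the app name and column aliases are hypothetical, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("tail-demo").getOrCreate()
import spark.implicits._

val df = Seq("Hello", "Spark Scala", "", null).toDF("text")

val tails = df.select(
  expr("substring(text, 4, length(text))").alias("len_is_full_length"), // more than needed
  expr("substring(text, 4, length(text) - 3)").alias("len_is_exact"),   // exactly what remains
  expr("substring(text, 4)").alias("len_omitted")                       // len left out
)
tails.show()
```

All three columns agree on every row ("lo", "rk Scala", an empty string, and null), which shows that an oversized len is harmless. When the goal is only "everything from position N", the two-argument SQL form is the shortest option.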

5. Conclusion

In conclusion, combining the length() function with the substring() function in Spark Scala lets you extract substrings whose position or length varies from row to row in a DataFrame. Both functions are useful for manipulating string data in a wide range of tasks, from data cleaning and preprocessing to feature engineering.
