Spark regexp_replace() – Replace String Value

Spark org.apache.spark.sql.functions.regexp_replace is a string function that is used to replace part of a string (substring) value with another string on a DataFrame column by using a regular expression (regex). This function returns an org.apache.spark.sql.Column type after replacing the string value.

In this article, I will explain the syntax and usage of the regexp_replace() function, and how to replace a string or part of a string with another string literal or with the value of another column.

For a PySpark example, please refer to PySpark regexp_replace() Usage Example.

1. Spark Replace String Value

1.1 Spark regexp_replace() Syntax

Following is a syntax of regexp_replace() function.


regexp_replace(e : Column, pattern : String, replacement : String) : Column
regexp_replace(e : Column, pattern : Column, replacement : Column) : Column

regexp_replace() has two signatures: one that takes string values for the pattern and replacement, and another that takes DataFrame columns for both.

1.2 Spark regexp_replace() Example

Let’s create a Spark DataFrame with some addresses and states; we will use this DataFrame to explain how to replace part of a string with another string on DataFrame column values.



//Create DataFrame
import spark.implicits._
val address = Seq((1,"14851 Jeffrey Rd","DE"),
    (2,"43421 Margarita St","NY"),
    (3,"13111 Siemon Ave","CA"))
val df = address.toDF("id","address","state")
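
For reference, df.show() displays the data we will be working with:


df.show()
//+---+------------------+-----+
//| id|           address|state|
//+---+------------------+-----+
//|  1|  14851 Jeffrey Rd|   DE|
//|  2|43421 Margarita St|   NY|
//|  3|  13111 Siemon Ave|   CA|
//+---+------------------+-----+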

By using the regexp_replace() Spark function, you can replace a column’s string value with another string or substring. regexp_replace() uses Java regular expressions for matching; if the pattern does not match, the value is returned unchanged. The below example replaces the street suffix Rd with the string Road on the address column.


//Replace part of a string using regexp_replace()
import org.apache.spark.sql.functions.regexp_replace
df.withColumn("address",
  regexp_replace($"address", "Rd", "Road"))
  .show()

//+---+------------------+-----+
//| id|           address|state|
//+---+------------------+-----+
//|  1|14851 Jeffrey Road|   DE|
//|  2|43421 Margarita St|   NY|
//|  3|  13111 Siemon Ave|   CA|
//+---+------------------+-----+
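
Because the pattern argument is a full Java regular expression, you are not limited to literal substrings. As an extra illustration that is not part of the original example (the address_masked column name is used only for this sketch), the following masks the leading house number using a digit pattern:


//Illustrative sketch: use a regex pattern instead of a literal substring
//"^[0-9]+" matches the digits at the start of the address value
df.withColumn("address_masked",
    regexp_replace($"address", "^[0-9]+", "*****"))
  .show(false)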

Above, we replaced only Rd with Road and did not replace the St and Ave values on the address column. Let’s see how to replace column values conditionally in a Spark DataFrame by using the when().otherwise() condition functions.


//Replace string based on condition
import org.apache.spark.sql.functions.when
import spark.implicits._
df.withColumn("address",
   when($"address".endsWith("Rd"),regexp_replace($"address","Rd","Road"))
  .when($"address"endsWith("St"),regexp_replace($"address","St","Street"))
  .when($"address"endsWith("Ave"),regexp_replace($"address","Ave","Avenue"))
  .otherwise("address"))
  .show(false)

//+---+----------------------+-----+
//|id |address               |state|
//+---+----------------------+-----+
//|1  |14851 Jeffrey Road    |DE   |
//|2  |43421 Margarita Street|NY   |
//|3  |13111 Siemon Avenue   |CA   |
//+---+----------------------+-----+
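
As an alternative to when().otherwise(), the same suffix expansion can be written by nesting regexp_replace() calls with an end-of-string anchor ($), so that only a trailing Rd, St, or Ave is rewritten. This is a sketch over the same data, not the approach used above:


//Alternative sketch: nested regexp_replace() calls with end-of-string anchors
df.withColumn("address",
    regexp_replace(
      regexp_replace(
        regexp_replace($"address", "Rd$", "Road"),
        "St$", "Street"),
      "Ave$", "Avenue"))
  .show(false)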

You can also replace column values using a map (key-value pairs). In the below example, we replace the abbreviated value of the state column with the full state name looked up from a map, by using the RDD map() transformation.


//Replace string from a map
val stateDic = Map("CA"->"California","NY"->"New York","DE"->"Delaware")
val df3 = df.rdd.map(x=>{
    (x.getAs[Integer]("id"),
      x.getAs[String]("address"),
      stateDic(x.getAs[String]("state"))) //look up the full state name
}).toDF("id","address","state")

df3.show()
//+---+------------------+----------+
//| id|           address|     state|
//+---+------------------+----------+
//|  1|  14851 Jeffrey Rd|  Delaware|
//|  2|43421 Margarita St|  New York|
//|  3|  13111 Siemon Ave|California|
//+---+------------------+----------+
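
Dropping down to the RDD API works, but the same lookup can stay entirely in the DataFrame API. Below is a sketch (assuming Spark 2.4+ for element_at()) that builds a literal map column with typedLit() and reads the value keyed by the state column:


//Sketch: DataFrame-only alternative using a literal map column (Spark 2.4+)
import org.apache.spark.sql.functions.{typedLit, element_at}
val stateMap = typedLit(Map("CA"->"California","NY"->"New York","DE"->"Delaware"))
df.withColumn("state", element_at(stateMap, $"state")).show()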

The below example uses column values for both the pattern and the replacement: the substring matched by col2 is replaced with the value of col3.


//Replace part of a string with another column
val data= Seq(("ABCDE_XYZ", "XYZ","FGH"))
val df2= data.toDF("col1", "col2","col3")

df2.withColumn("new_column",
    regexp_replace($"col1", $"col2", $"col3"))
  .show()

//+---------+----+----+----------+
//|     col1|col2|col3|new_column|
//+---------+----+----+----------+
//|ABCDE_XYZ| XYZ| FGH| ABCDE_FGH|
//+---------+----+----+----------+

2. Spark SQL Replace String Value

Similarly, let’s see how to replace part of a string with another string using regexp_replace() in a Spark SQL query expression.


//Using with Spark SQL
df.createOrReplaceTempView("TAB")
spark.sql("select id,regexp_replace(address, " +
    "'Rd', 'Road') as address from TAB").show()

Conclusion

In this article, you have learned how to use the regexp_replace() function to replace part of a string with another string and how to replace values conditionally, using both the Scala DataFrame API and Spark SQL query expressions; for the Python equivalent, see the PySpark example linked above.

Happy Learning !!
