Spark regexp_replace() – Replace String Value

Spark org.apache.spark.sql.functions.regexp_replace is a string function that replaces part of a string (substring) value with another string in a DataFrame column, using a regular expression (regex). This function returns an org.apache.spark.sql.Column type after performing the replacement.

In this article, I will explain the syntax and usage of the regexp_replace() function, and how to replace a string or part of a string with another string literal or with the value of another column.

For a PySpark example, please refer to PySpark regexp_replace() Usage Example.

1. Spark Replace String Value

1.1 Spark regexp_replace() Syntax

Following is the syntax of the regexp_replace() function.


regexp_replace(e : Column, pattern : String, replacement : String) : Column
regexp_replace(e : Column, pattern : Column, replacement : Column) : Column

regexp_replace() has two signatures: one that takes string values for the pattern and replacement, and another that takes DataFrame columns for both.
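As a quick illustration of the two forms, the sketch below uses placeholder column names (str, pat, rep) that are not part of this article's example data:


//Both call forms of regexp_replace() (placeholder column names)
import org.apache.spark.sql.functions.{col, regexp_replace}

// Signature 1: literal pattern and replacement strings
val literalForm = regexp_replace(col("str"), "world", "Spark")

// Signature 2: pattern and replacement taken from other columns
val columnForm = regexp_replace(col("str"), col("pat"), col("rep"))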

1.2 Spark regexp_replace() Example

Let’s create a Spark DataFrame with some addresses and states; we will use this DataFrame to explain how to replace part of a string with another string using DataFrame column values.


//Create DataFrame
val address = Seq((1,"14851 Jeffrey Rd","DE"),
(2,"43421 Margarita St","NY"),
(3,"13111 Siemon Ave","CA"))
import spark.implicits._
val df = address.toDF("id","address","state")

By using the regexp_replace() Spark function you can replace a column’s string value with another string/substring. regexp_replace() uses Java regex for matching; if the regex does not match, the value is returned unchanged. The below example replaces the street-name suffix Rd with Road in the address column.


//Replace part of a string using regexp_replace()
import org.apache.spark.sql.functions.regexp_replace
df.withColumn("address",
  regexp_replace($"address", "Rd", "Road"))
  .show()

//+---+------------------+-----+
//| id|           address|state|
//+---+------------------+-----+
//|  1|14851 Jeffrey Road|   DE|
//|  2|43421 Margarita St|   NY|
//|  3|  13111 Siemon Ave|   CA|
//+---+------------------+-----+
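Because the pattern argument is a Java regular expression, you can also anchor the match so that only a trailing Rd is replaced and an Rd appearing in the middle of a street name is left alone; a small sketch:


//Anchor the pattern to the end of the string with $
df.withColumn("address",
    regexp_replace($"address", "Rd$", "Road"))
  .show(false)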

Above, we only replaced Rd with Road and did not replace the St and Ave values in the address column. Let’s see how to replace column values conditionally in a Spark DataFrame by using the when().otherwise() SQL condition function.


//Replace string based on condition
import org.apache.spark.sql.functions.when
import spark.implicits._
df.withColumn("address",
   when($"address".endsWith("Rd"),regexp_replace($"address","Rd","Road"))
  .when($"address".endsWith("St"),regexp_replace($"address","St","Street"))
  .when($"address".endsWith("Ave"),regexp_replace($"address","Ave","Avenue"))
  .otherwise($"address"))
  .show(false)

//+---+----------------------+-----+
//|id |address               |state|
//+---+----------------------+-----+
//|1  |14851 Jeffrey Road    |DE   |
//|2  |43421 Margarita Street|NY   |
//|3  |13111 Siemon Avenue   |CA   |
//+---+----------------------+-----+
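If you prefer to avoid when().otherwise(), nesting regexp_replace() calls with end-of-string anchors gives the same result; a sketch:


//Alternative: nested regexp_replace() calls
df.withColumn("address",
    regexp_replace(
      regexp_replace(
        regexp_replace($"address", "Rd$", "Road"),
        "St$", "Street"),
      "Ave$", "Avenue"))
  .show(false)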

You can also replace column values from a map (key-value pairs). In the below example, we replace the abbreviated value of the state column with the full state name from a map by using the Spark map() transformation.


//Replace string from a map
val stateDic=Map("CA"->"California","NY"->"New York","DE"->"Delaware")
val df3=df.rdd.map(x=>{
    (x.getAs[Integer]("id"),
      x.getAs[String]("address"),
      stateDic(x.getAs[String]("state")))
}).toDF("id","address","state")

df3.show()
//+---+------------------+----------+
//| id|           address|     state|
//+---+------------------+----------+
//|  1|  14851 Jeffrey Rd|  Delaware|
//|  2|43421 Margarita St|  New York|
//|  3|  13111 Siemon Ave|California|
//+---+------------------+----------+
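If you only need to swap whole column values from a lookup map (rather than replace part of a string), you can stay in the DataFrame API and use na.replace() instead of dropping down to the RDD; a sketch:


//Alternative sketch: replace whole values from a map with na.replace()
val df4 = df.na.replace("state",
  Map("CA" -> "California", "NY" -> "New York", "DE" -> "Delaware"))
df4.show()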

The below example replaces part of a string with the value of another string column.


//Replace part of a string with another column
val data= Seq(("ABCDE_XYZ", "XYZ","FGH"))
val df2= data.toDF("col1", "col2","col3")

df2.withColumn("new_column",
    regexp_replace($"col1", $"col2", $"col3"))
  .show()

//+---------+----+----+----------+
//|     col1|col2|col3|new_column|
//+---------+----+----+----------+
//|ABCDE_XYZ| XYZ| FGH| ABCDE_FGH|
//+---------+----+----+----------+
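The same column-based replacement can also be written as a SQL expression with expr(); this sketch assumes your Spark version accepts column references for the pattern and replacement arguments in the SQL form:


//Same replacement expressed as a SQL expression
import org.apache.spark.sql.functions.expr
df2.withColumn("new_column",
    expr("regexp_replace(col1, col2, col3)"))
  .show()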

2. Spark SQL Replace String Value

Similarly, let’s see how to replace part of a string with another string using regexp_replace() in a Spark SQL query expression.


//Using with Spark SQL
df.createOrReplaceTempView("TAB")
spark.sql("select id,regexp_replace(address, " +
    "'Rd', 'Road') as address from TAB").show()

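The conditional replacement from section 1.2 can be expressed in SQL as well using CASE WHEN; a sketch:


//Conditional replace in Spark SQL (sketch)
spark.sql("""SELECT id,
       CASE WHEN address LIKE '%Rd'  THEN regexp_replace(address, 'Rd$', 'Road')
            WHEN address LIKE '%St'  THEN regexp_replace(address, 'St$', 'Street')
            WHEN address LIKE '%Ave' THEN regexp_replace(address, 'Ave$', 'Avenue')
            ELSE address END AS address,
       state
  FROM TAB""").show(false)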
3. PySpark regexp_replace() Example

First, let’s create a PySpark DataFrame.


from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
address = [(1,"14851 Jeffrey Rd","DE"),
    (2,"43421 Margarita St","NY"),
    (3,"13111 Siemon Ave","CA")]
df = spark.createDataFrame(address,["id","address","state"])
df.show()

Now, by using regexp_replace(), replace part of the string value with another string.


#Replace part of string with another string
from pyspark.sql.functions import regexp_replace
df.withColumn('address', regexp_replace('address', 'Rd', 'Road')) \
  .show(truncate=False)

#+---+------------------+-----+
#|id |address           |state|
#+---+------------------+-----+
#|1  |14851 Jeffrey Road|DE   |
#|2  |43421 Margarita St|NY   |
#|3  |13111 Siemon Ave  |CA   |
#+---+------------------+-----+

Next, use a conditional replace with when().otherwise().


#Replace string column value conditionally
from pyspark.sql.functions import when
df.withColumn('address', 
    when(df.address.endswith('Rd'),regexp_replace(df.address,'Rd','Road')) \
   .when(df.address.endswith('St'),regexp_replace(df.address,'St','Street')) \
   .when(df.address.endswith('Ave'),regexp_replace(df.address,'Ave','Avenue')) \
   .otherwise(df.address)) \
   .show(truncate=False)

#+---+----------------------+-----+
#|id |address               |state|
#+---+----------------------+-----+
#|1  |14851 Jeffrey Road    |DE   |
#|2  |43421 Margarita Street|NY   |
#|3  |13111 Siemon Avenue   |CA   |
#+---+----------------------+-----+

Conclusion

In this article, you have learned how to use the regexp_replace() function to replace part of a string with another string, and how to replace column values conditionally, using Scala, Python, and SQL queries.

Happy Learning !!
