Spark org.apache.spark.sql.functions.regexp_replace is a string function used to replace part of a string (a substring) with another string in a DataFrame column, using a regular expression (regex). This function returns an org.apache.spark.sql.Column type after replacing the string value.
In this article, I will explain the syntax and usage of the regexp_replace() function, and how to replace a string or part of a string with another string literal or with the value of another column.
For a PySpark example, please refer to PySpark regexp_replace() Usage Example.
1. Spark Replace String Value
1.1 Spark regexp_replace() Syntax
Following is the syntax of the regexp_replace() function.
regexp_replace(e : Column, pattern : String, replacement :String) : Column
regexp_replace(e : Column, pattern : Column, replacement : Column) : Column
regexp_replace() has two signatures: one that takes string values for the pattern and replacement, and another that takes DataFrame columns.
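As a quick, runnable sketch (the demo DataFrame and its column names str, search, and replace are made up for illustration), the first signature takes the pattern and replacement as string literals, while the second reads them per row from columns:
//Sketch of both signatures (demo data is hypothetical)
import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._

val demo = Seq(("aaa-xyz", "xyz", "123")).toDF("str", "search", "replace")
demo.select(
    regexp_replace($"str", "a+", "b").alias("literal_args"),           // "b-xyz"
    regexp_replace($"str", $"search", $"replace").alias("column_args") // "aaa-123"
  ).show()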
1.2 Spark regexp_replace() Example
Let’s create a Spark DataFrame with some addresses and states; we will use this DataFrame to explain how to replace part of a string with another string in a DataFrame column.
//Create a DataFrame with address and state columns
import spark.implicits._

val address = Seq((1,"14851 Jeffrey Rd","DE"),
  (2,"43421 Margarita St","NY"),
  (3,"13111 Siemon Ave","CA"))
val df = address.toDF("id","address","state")
By using the regexp_replace() Spark function, you can replace a column’s string value with another string or substring. regexp_replace() uses Java regex for matching; if the regex does not match, the value is returned unchanged. The below example replaces the street-name suffix Rd with Road on the address column.
//Replace part of a string using regexp_replace()
import org.apache.spark.sql.functions.regexp_replace
df.withColumn("address",
regexp_replace($"address", "Rd", "Road"))
.show()
//+---+------------------+-----+
//| id| address|state|
//+---+------------------+-----+
//| 1|14851 Jeffrey Road| DE|
//| 2|43421 Margarita St| NY|
//| 3| 13111 Siemon Ave| CA|
//+---+------------------+-----+
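Because regexp_replace() accepts a full Java regular expression, the pattern is not limited to literal text. As a small sketch on the same df, the pattern \d+ below matches one or more digits and masks the house numbers:
//Replace using a regex pattern: \d+ matches one or more digits
df.withColumn("address",
    regexp_replace($"address", "\\d+", "*"))
  .show(false)
//e.g. "14851 Jeffrey Rd" becomes "* Jeffrey Rd"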
Above, we only replaced Rd with Road, but did not replace the St and Ave values on the address column. Let’s see how to replace column values conditionally in a Spark DataFrame by using the when().otherwise() SQL condition function.
//Replace string based on a condition
import org.apache.spark.sql.functions.when
df.withColumn("address",
    when($"address".endsWith("Rd"), regexp_replace($"address","Rd","Road"))
    .when($"address".endsWith("St"), regexp_replace($"address","St","Street"))
    .when($"address".endsWith("Ave"), regexp_replace($"address","Ave","Avenue"))
    .otherwise($"address"))
  .show(false)
//+---+----------------------+-----+
//|id |address |state|
//+---+----------------------+-----+
//|1 |14851 Jeffrey Road |DE |
//|2 |43421 Margarita Street|NY |
//|3 |13111 Siemon Avenue |CA |
//+---+----------------------+-----+
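The same result can also be sketched without when().otherwise() by chaining regexp_replace() calls and anchoring each pattern to the end of the string with $, so that a St or Ave appearing mid-address is left alone:
//Chained regexp_replace() with end-of-string anchors (equivalent on this data)
df.withColumn("address",
    regexp_replace(
      regexp_replace(
        regexp_replace($"address", "Rd$", "Road"),
        "St$", "Street"),
      "Ave$", "Avenue"))
  .show(false)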
You can also replace column values using a map (key-value pairs). In the below example, we replace the abbreviated state value in the state column with its full name from a Scala Map by using the RDD map() transformation.
//Replace state abbreviations from a Map using the RDD map() transformation
val stateDic = Map("CA" -> "California", "NY" -> "New York", "DE" -> "Delaware")
val df3 = df.rdd.map(x => (
    x.getAs[Integer]("id"),
    x.getAs[String]("address"),
    stateDic(x.getAs[String]("state"))
  )).toDF("id","address","state")
df3.show()
df3.show()
//+---+------------------+----------+
//| id| address| state|
//+---+------------------+----------+
//| 1| 14851 Jeffrey Rd| Delaware|
//| 2|43421 Margarita St| New York|
//| 3| 13111 Siemon Ave|California|
//+---+------------------+----------+
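If you would rather stay in the DataFrame API than convert to an RDD, a similar lookup can be sketched with a literal map column; this assumes Spark 2.4+ for element_at(), and coalesce() keeps the original abbreviation if a state code is missing from the map:
//DataFrame-only sketch: look up full state names in a literal map column
import org.apache.spark.sql.functions.{coalesce, element_at, typedLit}

val stateMap = typedLit(Map("CA" -> "California", "NY" -> "New York", "DE" -> "Delaware"))
df.withColumn("state",
    coalesce(element_at(stateMap, $"state"), $"state"))
  .show()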
The below example replaces part of a string with the value of another column. Note that when the pattern comes from a column, its value is still interpreted as a Java regular expression, so regex metacharacters in the data are treated as regex syntax.
//Replace part of a string with the value of another column
val data = Seq(("ABCDE_XYZ", "XYZ", "FGH"))
val df2 = data.toDF("col1", "col2", "col3")
df2.withColumn("new_column",
    regexp_replace($"col1", $"col2", $"col3"))
  .show()
//+---------+----+----+----------+
//| col1|col2|col3|new_column|
//+---------+----+----+----------+
//|ABCDE_XYZ| XYZ| FGH| ABCDE_FGH|
//+---------+----+----+----------+
2. Spark SQL Replace String Value
Similarly, let’s see how to replace part of a string with another string using regexp_replace() in a Spark SQL query expression.
//Using regexp_replace() with Spark SQL
df.createOrReplaceTempView("TAB")
spark.sql("select id, regexp_replace(address, 'Rd', 'Road') as address from TAB")
  .show()
3. PySpark regexp_replace() Example
First, let’s create a PySpark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

address = [(1,"14851 Jeffrey Rd","DE"),
    (2,"43421 Margarita St","NY"),
    (3,"13111 Siemon Ave","CA")]
df = spark.createDataFrame(address, ["id","address","state"])
df.show()
Now, by using regexp_replace(), let’s replace part of a string value with another string.
#Replace part of string with another string
from pyspark.sql.functions import regexp_replace
df.withColumn('address', regexp_replace('address', 'Rd', 'Road')) \
.show(truncate=False)
#+---+------------------+-----+
#|id |address |state|
#+---+------------------+-----+
#|1 |14851 Jeffrey Road|DE |
#|2 |43421 Margarita St|NY |
#|3 |13111 Siemon Ave |CA |
#+---+------------------+-----+
The following example replaces column values conditionally using when().otherwise().
#Replace string column value conditionally
from pyspark.sql.functions import when
df.withColumn('address',
when(df.address.endswith('Rd'),regexp_replace(df.address,'Rd','Road')) \
.when(df.address.endswith('St'),regexp_replace(df.address,'St','Street')) \
.when(df.address.endswith('Ave'),regexp_replace(df.address,'Ave','Avenue')) \
.otherwise(df.address)) \
.show(truncate=False)
#+---+----------------------+-----+
#|id |address |state|
#+---+----------------------+-----+
#|1 |14851 Jeffrey Road |DE |
#|2 |43421 Margarita Street|NY |
#|3 |13111 Siemon Avenue |CA |
#+---+----------------------+-----+
Conclusion
In this article, you have learned how to use the regexp_replace() function to replace part of a string with another string, and how to replace column values conditionally, using Scala, PySpark, and SQL query examples.
Happy Learning !!
Related Articles
- Spark Replace Empty Value With NULL on DataFrame
- Spark Replace NULL Values on DataFrame
- Spark createOrReplaceTempView() Explained
- Spark SQL Explained with Examples
- Spark groupByKey()
- Spark JDBC Parallel Read
- Read JDBC Table to Spark DataFrame