  • Post category: PySpark
  • Post last modified: March 27, 2024

The pyspark.sql.functions module provides string functions for string manipulation and data processing. String functions can be applied to string columns or literals to perform various operations, such as concatenation, substring extraction, padding, case conversion, and pattern matching with regular expressions.

In this article, I will explain the most used string functions I come across in my real-time projects, with examples. When possible, leverage the built-in functions from pyspark.sql.functions: they handle nulls, are validated at query-analysis time, and perform better than UDFs. If your application is performance-critical, avoid custom UDFs altogether, as their performance is not guaranteed.

PySpark String Functions

The following table shows the most used string functions in PySpark.

String Function | Definition
ascii(col) | Returns the numerical ASCII value of the first character of the string column.
base64(col) | Generates the BASE64 encoding of a binary column and outputs it as a string column.
bit_length(col) | Determines the bit length of the specified string column.
btrim(str[, trim]) | Removes leading and trailing ‘trim’ characters (spaces by default) from the string ‘str’.
char(col) | Produces the ASCII character corresponding to the binary representation of the ‘col’ column.
character_length(str) | Provides the character length for string data or the number of bytes for binary data.
char_length(str) | Outputs the character length for string data or the byte count for binary data.
concat_ws(sep, *cols) | Combines multiple input string columns into a single string column using the specified separator.
contains(left, right) | Returns true if ‘right’ is found inside ‘left’, null if either input is null, and false otherwise.
decode(col, charset) | Converts the first argument from binary to string using the specified character set, one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, or ‘UTF-16’.
elt(*inputs) | Returns the n-th input string, e.g., returns input2 when n is 2.
encode(col, charset) | Converts the first argument from string to binary using the specified character set, one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, or ‘UTF-16’.
endswith(str, suffix) | Returns true if ‘str’ ends with ‘suffix’, null if either input is null, and false otherwise.
find_in_set(str, str_array) | Provides the 1-based index of the string ‘str’ in the comma-delimited list ‘str_array’.
format_number(col, d) | Formats the number to a pattern like ‘#,###,###.##’, rounded to ‘d’ decimal places using the HALF_EVEN round mode, and outputs the result as a string.
format_string(format, *cols) | Applies printf-style formatting to the provided arguments and outputs the result as a string column.
ilike(str, pattern[, escapeChar]) | Returns true if ‘str’ matches ‘pattern’ case-insensitively (with optional escape character), null if any argument is null, and false otherwise.
initcap(col) | Capitalizes the initial letter of each word in the string.
instr(str, substr) | Finds the position of the first occurrence of ‘substr’ in the given string.
lcase(str) | Converts all characters in the string ‘str’ to lowercase.
length(col) | Calculates the character length for string data or the byte count for binary data.
like(str, pattern[, escapeChar]) | Returns true if ‘str’ matches ‘pattern’ (with optional escape character), null if any argument is null, and false otherwise.
lower(col) | Converts a string expression to lowercase.
left(str, len) | Outputs the leftmost ‘len’ characters from the string ‘str’; if ‘len’ is less than or equal to 0, the result is an empty string.
levenshtein(left, right[, threshold]) | Calculates the Levenshtein distance between the two provided strings.
locate(substr, str[, pos]) | Finds the position of the first occurrence of ‘substr’ in a string column, starting after the specified position ‘pos’.
lpad(col, len, pad) | Pads the string column on the left with ‘pad’ to reach the width ‘len’.
ltrim(col) | Removes leading spaces from the given string value.
mask(col[, upperChar, lowerChar, digitChar, …]) | Conceals the provided string value by masking its characters.
octet_length(col) | Computes the byte length of the specified string column.
parse_url(url, partToExtract[, key]) | Extracts a part from a URL.
position(substr, str[, start]) | Provides the position of the first occurrence of ‘substr’ in ‘str’ after the specified position ‘start’.
printf(format, *cols) | Applies printf-style formatting to the given arguments and outputs the result as a string column.
rlike(str, regexp) | Returns true if the string ‘str’ matches the Java regex ‘regexp’, or false otherwise.
regexp(str, regexp) | Indicates whether the string ‘str’ matches the Java regex ‘regexp’ (true) or not (false).
regexp_like(str, regexp) | Determines whether the string ‘str’ matches the Java regex ‘regexp’ (true) or not (false).
regexp_count(str, regexp) | Provides the number of times the Java regex ‘regexp’ is matched in the string ‘str’.
regexp_extract(str, pattern, idx) | Retrieves a specific group matched by the Java regex ‘pattern’ from the designated string column.
regexp_extract_all(str, regexp[, idx]) | Retrieves all strings in ‘str’ that match the Java regex ‘regexp’ for the specified group index.
regexp_replace(string, pattern, replacement) | Substitutes all substrings of the given string value that match the regex ‘pattern’ with ‘replacement’.
regexp_substr(str, regexp) | Provides the substring within the string ‘str’ that matches the Java regex ‘regexp’.
regexp_instr(str, regexp[, idx]) | Provides the position within ‘str’ of the first match of the Java regex ‘regexp’ (optionally for the specified group index).
replace(src, search[, replace]) | Replaces all occurrences of ‘search’ with ‘replace’.
right(str, len) | Outputs the rightmost ‘len’ characters from the string ‘str’; if ‘len’ is less than or equal to 0, the result is an empty string.
ucase(str) | Returns ‘str’ with all characters changed to uppercase.
unbase64(col) | Decodes a BASE64-encoded string column and outputs it as a binary column.
rpad(col, len, pad) | Pads the string column on the right with ‘pad’ to reach the width ‘len’.
repeat(col, n) | Repeats a string column ‘n’ times and outputs it as a new string column.
rtrim(col) | Removes trailing spaces from the given string value.
soundex(col) | Produces the Soundex encoding for a given string.
split(str, pattern[, limit]) | Splits the string ‘str’ on occurrences of the specified regex pattern.
split_part(src, delimiter, partNum) | Splits the string ‘src’ by a delimiter and returns the requested part of the split (1-based).
startswith(str, prefix) | Returns true if ‘str’ starts with ‘prefix’, null if either input is null, and false otherwise.
substr(str, pos[, len]) | Provides the substring of ‘str’ starting at position ‘pos’ with length ‘len’, or the byte-array slice starting at ‘pos’ with length ‘len’.
substring(str, pos, len) | Returns the substring starting at position ‘pos’ with length ‘len’ when ‘str’ is of String type, or the byte-array slice starting at ‘pos’ with length ‘len’ when ‘str’ is of Binary type.
substring_index(str, delim, count) | Outputs the substring of ‘str’ before ‘count’ occurrences of the delimiter ‘delim’.
overlay(src, replace, pos[, len]) | Replaces a portion of the source string ‘src’ with ‘replace’, starting at position ‘pos’ and extending for ‘len’.
sentences(string[, language, country]) | Splits a string into an array of sentences, where each sentence is an array of words.
to_binary(col[, format]) | Transforms the input column ‘col’ into a binary value using the provided format.
to_char(col, format) | Transforms the column ‘col’ into a string based on the specified format.
to_number(col, format) | Transforms the string column ‘col’ into a numerical value based on the specified string format ‘format’.
to_varchar(col, format) | Transforms the column ‘col’ into a string based on the specified format.
translate(srcCol, matching, replace) | Replaces each character in ‘srcCol’ that appears in ‘matching’ with the corresponding character in ‘replace’.
trim(col) | Removes leading and trailing spaces from the specified string column.
upper(col) | Converts a string expression to uppercase.
url_decode(str) | Decodes a string ‘str’ in ‘application/x-www-form-urlencoded’ format using a specified encoding scheme.
url_encode(str) | Converts a string into ‘application/x-www-form-urlencoded’ format using a specified encoding scheme.

To use the examples below, make sure you have created a SparkSession object.


# Import
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("SparkByExamples").getOrCreate()

2. String Concatenate Functions

pyspark.sql.functions provides two functions concat() and concat_ws() to concatenate DataFrame columns into a single column. In this section, we will learn the usage of concat() and concat_ws() with examples.

2.1 concat()

The concat() function of PySpark SQL is used to concatenate multiple DataFrame columns into a single column. It supports string, binary, and compatible array columns.

Syntax


# Syntax
pyspark.sql.functions.concat(*cols)

Below is an example of using the PySpark concat() function with the select() function. select() is a transformation function and returns a new DataFrame with the selected columns.


#Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat,col
data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)

# Using select() with concat()
df2=df.select(concat(df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df2.show(truncate=False)

In the above example, using the concat() function of PySpark SQL, I have concatenated three input string columns (firstname, middlename, lastname) into a single string column (FullName). Below is the output from the above example.


# Output
+--------------+----------+------+------+
|FullName      |dob       |gender|salary|
+--------------+----------+------+------+
|JamesSmith    |1991-04-01|M     |3000  |
|MichaelRose   |2000-05-19|M     |4000  |
|RobertWilliams|1978-09-05|M     |4000  |
|MariaAnneJones|1967-12-01|F     |4000  |
|JenMaryBrown  |1980-02-17|F     |-1    |
+--------------+----------+------+------+

2.2 concat_ws()

The concat_ws() function of PySpark SQL concatenates multiple string input columns into a single string column, separating the values with the specified separator. In the example below, the three columns (firstname, middlename, lastname) are combined into a single column (FullName), separated by “_”.

Syntax


# Syntax
pyspark.sql.functions.concat_ws(sep,*cols)

Example


# Imports
from pyspark.sql.functions import concat_ws,col
df3=df.select(concat_ws('_',df.firstname,df.middlename,df.lastname)
              .alias("FullName"),"dob","gender","salary")
df3.show(truncate=False)

Below is the output for the concat_ws() function of PySpark SQL.


# Output
+----------------+----------+------+------+
|FullName        |dob       |gender|salary|
+----------------+----------+------+------+
|James__Smith    |1991-04-01|M     |3000  |
|Michael_Rose_   |2000-05-19|M     |4000  |
|Robert__Williams|1978-09-05|M     |4000  |
|Maria_Anne_Jones|1967-12-01|F     |4000  |
|Jen_Mary_Brown  |1980-02-17|F     |-1    |
+----------------+----------+------+------+

3. Substring Functions

PySpark has different ways to get the substring from a column. In this section, we will explore each function to extract the substring. Below are the functions to get the substring.

  • substr(str, pos[, len]): Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.
  • substring(str, pos, len): Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
  • substring_index(str, delim, count): Returns the substring from string str before count occurrences of the delimiter delim.

3.1 Using substr() to get the substring of a Column

The substr() function is used for substring extraction; it is available as a method on pyspark.sql.Column and, in recent Spark versions, as pyspark.sql.functions.substr(). It extracts a substring from a string column based on the starting position and length.

Syntax


# Syntax
pyspark.sql.functions.substr(str: ColumnOrName, pos: ColumnOrName, len: Optional[ColumnOrName] = None) → pyspark.sql.column.Column

Parameters:

  • str : Column or str : A column of strings to extract from.
  • pos : Column or str : The starting position of the substring within str.
  • len : Column or str, optional : The length of the substring to extract.

Below is an example using substr()


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

data = [("John",), ("Jane",), ("Robert",)]
columns = ["name"]
df = spark.createDataFrame(data, columns)

# Add a new column with the substring extraction
df_with_substr = df.withColumn("substr_example", expr("substr(name, 2, 3)"))

# Show the DataFrame
df_with_substr.show()

In the above example, we used the withColumn method along with the expr function to add a new column called “substr_example” to the DataFrame. The substr function extracts a substring from the “name” column starting at the 2nd position with a length of 3 characters; the extracted values are stored in the new column.


# Output
+------+--------------+
|  name|substr_example|
+------+--------------+
|  John|           ohn|
|  Jane|           ane|
|Robert|           obe|
+------+--------------+

3.2. Using substring() to get the substring of a Column

Using the substring() function of pyspark.sql.functions module, we can extract a substring or slice of a string from the DataFrame column by providing the position and length of the string you want to slice.

Syntax


# Syntax
pyspark.sql.functions.substring(str: ColumnOrName, pos: int, len: int)

Parameters

  • str : Column or str : target column to work on.
  • pos : int : starting position in str.
  • len : int : number of characters to extract.

Note: The position is not zero-based, but 1 based index.

Below is an example of Pyspark substring() using withColumn().


# Imports
import pyspark
from pyspark.sql import SparkSession 
from pyspark.sql.functions import col, substring
spark=SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Create Sample Data
data = [(1,"20200828"),(2,"20180525")]
columns=["id","date"]
df=spark.createDataFrame(data,columns)

# Using substring
df = df.withColumn('year', substring('date', 1,4))\
    .withColumn('month', substring('date', 5,2))\
    .withColumn('day', substring('date', 7,2))
df.printSchema()
df.show(truncate=False)

In the above example, we have created a DataFrame with two columns, id, and date. Here, date is in the form “year month day”. Here I have used substring() on the date column to return sub-strings of date as year, month, and day respectively. Below is the output.


# Output
+---+--------+----+-----+---+
|id |date    |year|month|day|
+---+--------+----+-----+---+
|1  |20200828|2020|08   |28 |
|2  |20180525|2018|05   |25 |
+---+--------+----+-----+---+

4. String Padding Functions

We can use the lpad and rpad functions for left and right padding, respectively. These functions pad a string column with a specified character or characters to a specified length. In certain data formats or systems, fields may need to be of fixed length.

Padding ensures that strings have a consistent length, making data easier to process and load and making matching, comparison, and sorting operations predictable. Padding can also be part of a larger data transformation process; for example, when preparing data for machine learning models, it can be applied as a feature-engineering step.

4.1 lpad() and rpad()

  • pyspark.sql.functions.lpad is used for the left or leading padding of the string.
  • pyspark.sql.functions.rpad is used for the right or trailing padding of the string.

Syntax of lpad


# Syntax
pyspark.sql.functions.lpad(col: ColumnOrName, len: int, pad: str)

Parameters

  • col : Column or str: target column to work on.
  • len : int : length of the final string.
  • pad : str : chars to prepend.

Syntax of rpad


# Syntax 
pyspark.sql.functions.rpad(col: ColumnOrName, len: int, pad: str)

Parameters

  • col : Column or str : target column to work on.
  • len : int : length of the final string.
  • pad : str : chars to append.

Below is the example demonstrating lpad and rpad usage.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import lpad, rpad

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()
data = [("John",), ("Jane",), ("Robert",)]
columns = ["name"]
df = spark.createDataFrame(data, columns)

# Left Padding (lpad)
df_with_left_padding = df.withColumn("left_padded_name", lpad(df["name"], 10, "0"))

# Right Padding (rpad)
df_with_right_padding = df.withColumn("right_padded_name", rpad(df["name"], 10, "X"))

df_with_left_padding.show()
df_with_right_padding.show()

# Output
# Left Padding
+------+----------------+
|  name|left_padded_name|
+------+----------------+
|  John|      000000John|
|  Jane|      000000Jane|
|Robert|      0000Robert|
+------+----------------+

# Right Padding
+------+-----------------+
|  name|right_padded_name|
+------+-----------------+
|  John|       JohnXXXXXX|
|  Jane|       JaneXXXXXX|
|Robert|       RobertXXXX|
+------+-----------------+

5. String Split()

pyspark.sql.functions offers the split() function for breaking down string columns in DataFrames into multiple columns. This section illustrates the process of splitting a single DataFrame column into multiple columns using withColumn() and select(), and shows how to incorporate regular expressions (regex) within the split function for enhanced functionality.

5.1 Split Column using withColumn()

Splitting a column into multiple columns in PySpark is achieved using the split() function along with withColumn(). This method involves specifying a delimiter or pattern and applying split() to the target column. The resulting array is then assigned to new columns using withColumn().

For instance, when breaking a comma-separated string into separate columns for first and last names, the code snippet utilizes split(full_name, ",") and assigns the resulting array elements to new columns. This approach is versatile, allowing customization based on delimiter or pattern, providing a robust mechanism for handling string manipulation in PySpark DataFrames efficiently and flexibly.

Below is the example of split() with withColumn.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Sample DataFrame
data = [("John,Doe",), ("Jane,Smith",), ("Robert,Johnson",)]
columns = ["full_name"]

df = spark.createDataFrame(data, columns)

# Use the split function to split the "full_name" column by comma
split_columns = split(df["full_name"], ",")

# Add the split columns to the DataFrame
df_with_split = df.withColumn("first_name", split_columns[0]).withColumn("last_name", split_columns[1])

# Show the DataFrame
df_with_split.show()

# Output
+--------------+----------+---------+
|     full_name|first_name|last_name|
+--------------+----------+---------+
|      John,Doe|      John|      Doe|
|    Jane,Smith|      Jane|    Smith|
|Robert,Johnson|    Robert|  Johnson|
+--------------+----------+---------+

In this example, the split function is used to split the “full_name” column by the comma (,), resulting in an array of substrings. The split columns are then added to the DataFrame using withColumn().

If you have a dynamic number of split columns, you can use the getItem() function to access elements at specific indices in the array. For example:


# Add dynamic number of split columns to the DataFrame
df_with_split_dynamic = df.withColumn("first_name",split_columns.getItem(0))
                          .withColumn("last_name",split_columns.getItem(1))

5.2 Split Column using Select()

Splitting a column into multiple columns in PySpark can be accomplished using the select() function. By incorporating the split() function within select(), a DataFrame’s column is divided based on a specified delimiter or pattern. The resultant array is then assigned to new columns using alias() to provide meaningful names.

For example, when dealing with a comma-separated string column, select() facilitates the creation of distinct columns for each element in the split array. This method is efficient for organizing and extracting information from strings within PySpark DataFrames, offering a streamlined approach to handle string manipulations while selectively choosing the desired columns.

Example on split() with select():


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Sample DataFrame
data = [("John,Doe",), ("Jane,Smith",), ("Robert,Johnson",)]
columns = ["full_name"]

df = spark.createDataFrame(data, columns)

# Use the split function to split the "full_name" column by comma
df_with_split = df.select("full_name", split(df["full_name"], ",").alias("split_names"))

# Expand the array column into separate columns
df_expanded = df_with_split.select(
    "full_name",
    df_with_split["split_names"].getItem(0).alias("first_name"),
    df_with_split["split_names"].getItem(1).alias("last_name")
)

# Show the DataFrame
df_expanded.show()

# Output
+--------------+----------+---------+
|     full_name|first_name|last_name|
+--------------+----------+---------+
|      John,Doe|      John|      Doe|
|    Jane,Smith|      Jane|    Smith|
|Robert,Johnson|    Robert|  Johnson|
+--------------+----------+---------+

Note: For a detailed explanation of split(), please refer to PySpark split().

6. Other Functions

6.1 contains()

The contains() function in PySpark is used to check whether a DataFrame column contains a specific string; to keep only the matching rows, use contains() together with the filter() operation.

For a more detailed explanation please refer to the contains() article.

  • contains() – Returns a boolean: True if the right value is found inside the left, NULL if either input expression is NULL, and False otherwise. Both left and right must be of STRING or BINARY type.
  • This function is available in Column class.

You can also match by wildcard character using like() & match by regular expression by using rlike() functions.

Syntax


# Syntax
pyspark.sql.functions.contains(left: ColumnOrName, right: ColumnOrName)

Parameters:

  • left: Column or str: The input column or strings to check, may be NULL.
  • right: Column or str: The input column or strings to find, may be NULL.

Below is an example of using contains() with a filter.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Sample DataFrame
data = [("John Doe",), ("Jane Smith",), ("Robert Johnson",)]
columns = ["full_name"]

df = spark.createDataFrame(data, columns)

# Specify the string to check for
substring_to_check = "Smith"

# Use filter and contains to check if the column contains the specified substring
filtered_df = df.filter(col("full_name").contains(substring_to_check))

# Show the DataFrame
filtered_df.show()

# Output
+----------+
| full_name|
+----------+
|Jane Smith|
+----------+

6.2 regexp_extract()

The regexp_extract function retrieves substrings that match a specified regular expression pattern, allowing flexible text parsing, for instance extracting specific elements or patterns from unstructured data.

Syntax


# Syntax
pyspark.sql.functions.regexp_extract(str: ColumnOrName, pattern: str, idx: int) 

Parameters:

  • str: Column or str : A target column to work on.
  • pattern: str: Regex pattern to apply.
  • idx: int: Matched group id.

Below is an example.


# Imports

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Sample DataFrame
data = [("John Doe",), ("Jane Smith",), ("Robert Johnson",)]
columns = ["full_name"]

df = spark.createDataFrame(data, columns)

# Use regexp_extract to extract the first word
df_extracted = df.withColumn("first_name", regexp_extract(df["full_name"], r'(\w+)', 1))

# Show the DataFrame
df_extracted.show()


# Output
+--------------+----------+
|     full_name|first_name|
+--------------+----------+
|      John Doe|      John|
|    Jane Smith|      Jane|
|Robert Johnson|    Robert|
+--------------+----------+

In the above example, the regexp_extract function extracts the first word from the “full_name” column using the regular expression r'(\w+)'. The resulting DataFrame, df_extracted, includes a new column named “first_name” with the extracted values.

6.3 regexp_replace()

regexp_replace in PySpark is a vital function for pattern-based string replacement. It efficiently replaces substrings within a DataFrame column using a specified regular expression, enabling data cleansing and transformation at scale. Regular expressions offer a versatile approach for scenarios requiring intricate string operations, and PySpark’s regexp functions contribute to effective preprocessing of diverse and complex textual data.

Syntax


# Syntax
pyspark.sql.functions.regexp_replace(string: ColumnOrName, pattern: Union[str, pyspark.sql.column.Column], replacement: Union[str, pyspark.sql.column.Column])

Parameters:

  • string : Column or str : Column name or column containing the string value
  • pattern : Column or str: Column object or str containing the regexp pattern
  • replacement:Column or str: Column object or str containing the replacement.

Example


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Sample DataFrame
data = [("John Doe",), ("Jane Smith",), ("Robert Johnson",)]
columns = ["full_name"]

df = spark.createDataFrame(data, columns)

# Use regexp_replace to replace spaces with underscores
df_replaced = df.withColumn("name_with_underscore", regexp_replace(df["full_name"], r' ', '_'))

# Show the DataFrame
df_replaced.show()


# Output
+--------------+--------------------+
|     full_name|name_with_underscore|
+--------------+--------------------+
|      John Doe|            John_Doe|
|    Jane Smith|          Jane_Smith|
|Robert Johnson|      Robert_Johnson|
+--------------+--------------------+

In this example, the regexp_replace function is used to replace spaces with underscores in the “full_name” column based on the regular expression pattern r' '. The resulting DataFrame, df_replaced, includes a new column named “name_with_underscore” with the modified values.

Conclusion

In conclusion, PySpark SQL string functions offer a comprehensive toolkit for efficiently manipulating and transforming string data within DataFrames. Functions like split, regexp_extract, and regexp_replace empower users to parse, extract, and modify textual information while concat, lpad, and rpad facilitate concatenation and padding operations. These functions enhance the versatility of PySpark in handling diverse string-related tasks, from data cleansing to feature engineering. Their flexibility, combined with the distributed processing power of PySpark, makes it a robust choice for scalable and effective string data manipulation in big data environments.

Prabha

Prabha is an accomplished data engineer with a wealth of experience in architecting, developing, and optimizing data pipelines and infrastructure. With a strong foundation in software engineering and a deep understanding of data systems, Prabha excels in building scalable solutions that handle diverse and large datasets efficiently. At SparkbyExamples.com Prabha writes her experience in Spark, PySpark, Python and Pandas.