In PySpark, the rlike() function performs row filtering based on pattern matching using regular expressions (regex). Unlike like() and ilike(), which use SQL-style wildcards (%, _), rlike() supports powerful regex syntax to search for flexible string patterns in DataFrame columns. In this article, I’ll explain how to use the PySpark rlike() function to filter rows effectively, along with practical examples covering various real-world scenarios.
Key Points
- Advanced Pattern Matching: rlike() allows filtering DataFrame rows using powerful regular expressions, offering more flexibility than like() and ilike().
- Case-Sensitive by Default: The rlike() function performs case-sensitive matching unless you explicitly include the (?i) flag.
- Case-Insensitive Matching: To perform case-insensitive matching, add the (?i) flag at the beginning of the regex pattern.
- Start and End Anchors: Use ^ to match the start and $ to match the end of a string.
- Negation (NOT Matching): Use the tilde ~ operator to filter out rows that match a pattern.
- Combining Conditions: Combine multiple regex patterns using | (OR) and & (AND) for complex matching.
- Supports Regex Quantifiers: Quantifiers like *, +, {n}, etc., can be used to match repeating patterns (see the sketch after this list).
- Returns Boolean Column: rlike() returns a Boolean column indicating whether each row satisfies the pattern.
- Empty Result Handling: If no rows match, it returns an empty DataFrame while preserving the original schema.
- Comparison with contains(): Unlike contains(), which only supports simple substring searches, rlike() enables complex regex-based queries.
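As a quick, minimal sketch of the quantifier support noted above (self-contained; the data mirrors the sample DataFrame used later in this article):

# Minimal quantifier sketch (assumes a local SparkSession; data mirrors
# the sample DataFrame used later in this article)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
df = spark.createDataFrame(
    [(1, "James Smith"), (2, "Michael Rose"), (4, "Rames Rose")],
    ["id", "name"])

# * matches zero or more of the preceding token: "Rame" plus any number of "s"
df.filter(col("name").rlike("Rames*")).show()

# {n} matches exactly n repetitions: names whose first word has exactly 7 letters
df.filter(col("name").rlike("^[A-Za-z]{7} ")).show()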
PySpark rlike()
PySpark rlike() function is used to apply regular expressions to string columns for advanced pattern matching. It accepts a regex string as a parameter and returns a Boolean column indicating whether each row matches the expression. By default, rlike() is case-sensitive; however, you can make it case-insensitive by adding the regex flag (?i) at the start of the pattern. Compared to like(), rlike() offers greater power and flexibility.
Syntax of rlike()
Below is the syntax of the rlike() function.
# Syntax of rlike()
col("column_name").rlike("regex_pattern")
Parameters
regex_pattern (str): A valid regular expression pattern.
Return Value
Returns a Boolean column indicating whether the regular expression matches each element of the column.
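Because rlike() evaluates to a Boolean column rather than filtering by itself, you can also use it inside select() or withColumn(). Below is a minimal sketch, assuming the sample df created in the next section:

# Add a Boolean flag column instead of filtering rows
# (df is the sample DataFrame created in the next section)
from pyspark.sql.functions import col

df.withColumn("has_rose", col("name").rlike("(?i)rose")).show()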
PySpark rlike() Case Insensitive
You can use the rlike() function on a specific DataFrame column to filter rows that match a given regular expression pattern. It returns a DataFrame containing only the matching rows.
Let’s create a sample DataFrame and perform a case-insensitive match using regular expressions with (?i) to filter rows from a DataFrame.
# PySpark case-insensitive match
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Create DataFrame
data = [(1, "James Smith"), (2, "Michael Rose"),
        (3, "Robert Williams"), (4, "Rames Rose"), (5, "Rames rose")]
df = spark.createDataFrame(data=data, schema=["id", "name"])
df.show()

# Match all variations of "rose" regardless of case
df.filter(col("name").rlike("(?i)rose")).show()
Yields the output below.
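Since rows 2, 4, and 5 all contain "rose" in some letter case, the case-insensitive filter should produce:

# Output:
+---+------------+
| id|        name|
+---+------------+
|  2|Michael Rose|
|  4|  Rames Rose|
|  5|  Rames rose|
+---+------------+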
PySpark SQL rlike() Case Sensitive (Default Behavior)
You can simply pass a regular expression pattern to the rlike() function. By default, it performs case-sensitive matching and returns the rows where the specified column contains a substring that matches the regular expression.
# Match the pattern in default manner
df.filter(col("name").rlike("rose")).show()
# Output:
+---+----------+
| id|      name|
+---+----------+
|  5|Rames rose|
+---+----------+
In this case, the pattern "rose" will match only lowercase "rose" in the name column.
PySpark rlike() Multiple Conditions
You can combine multiple rlike() conditions using logical operators like | (OR) and & (AND) for more complex matching. Let’s use the | (OR) operator to combine two rlike() conditions and filter rows that match either pattern; the equivalent single-pattern form using regex alternation is shown after the output.
# Match names containing either "Smith" or "Rose"
df.filter((col("name").rlike("Rose")) | (col("name").rlike("Smith"))).show()
# Output:
+---+------------+
| id|        name|
+---+------------+
|  1| James Smith|
|  2|Michael Rose|
|  4|  Rames Rose|
|  5|  Rames rose|
+---+------------+
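As referenced above, the same result can be expressed with regex alternation, using | inside a single pattern rather than combining two column conditions:

# Same result using regex alternation within one pattern
df.filter(col("name").rlike("Rose|Smith")).show()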
PySpark rlike() Opposite
To exclude rows that match a specific regex pattern, you can use the tilde ~ (NOT) operator before the condition.
# Exclude matched rows by pattern
df.filter(~col("name").rlike("Rose")).show()
# Output:
+---+---------------+
| id|           name|
+---+---------------+
|  1|    James Smith|
|  3|Robert Williams|
|  5|     Rames rose|
+---+---------------+
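Note that row 5 ("Rames rose") survives because the pattern "Rose" is case-sensitive. To exclude every case variant, you can combine ~ with the (?i) flag:

# Exclude "rose" in any letter case
df.filter(~col("name").rlike("(?i)rose")).show()

# Output:
+---+---------------+
| id|           name|
+---+---------------+
|  1|    James Smith|
|  3|Robert Williams|
+---+---------------+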
PySpark rlike() vs contains()
The table below highlights the key differences between the rlike() function and the contains() function in PySpark.
| Feature | rlike() | contains() |
|---|---|---|
| Pattern Type | Regex | Plain substring |
| Case Sensitivity | Yes (use (?i) to ignore case) | Yes (always) |
| Flexibility | High (full regex) | Low (literal substrings only) |
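The following short sketch illustrates the difference on the sample DataFrame:

# contains(): literal substring match, always case-sensitive
df.filter(col("name").contains("Rose")).show()   # misses "Rames rose"

# rlike(): full regex support, case-insensitive with (?i)
df.filter(col("name").rlike("(?i)rose")).show()  # matches every case variant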
PySpark rlike() Wildcard
So far, we have used rlike() to filter rows where a specified column matches a simple string-based regex pattern. In this example, we’ll explore how to use rlike() with regex metacharacters (such as .*, ^, and $) to filter rows that match more complex patterns.
# Match names that contain any characters before "Smith"
df.filter(col("name").rlike(".*Smith")).show()
# Output:
+---+-----------+
| id|       name|
+---+-----------+
|  1|James Smith|
+---+-----------+
In this case, the regex ".*Smith" matches any name containing "Smith", where .* acts as a wildcard for any sequence of characters preceding "Smith". Note that rlike() matches anywhere in the string, so the leading .* is optional here; to require that the name ends with "Smith", anchor the pattern as "Smith$".
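For strict positional matching, anchor the pattern: ^ pins it to the start of the string and $ to the end.

# Names that start with "James"
df.filter(col("name").rlike("^James")).show()

# Names that end with "Smith"
df.filter(col("name").rlike("Smith$")).show()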
FAQs of PySpark SQL rlike() Function
What does the rlike() function do in PySpark?
The rlike() function filters DataFrame rows by checking whether a column’s value matches a specified regular expression (regex) pattern. Example: df.filter(col("name").rlike("Rose")).show()

Is rlike() case-sensitive?
Yes, by default, rlike() is case-sensitive. To perform case-insensitive matching, use the (?i) flag inside the regex pattern. For example: df.filter(col("name").rlike("(?i)rose")).show()

How do I make rlike() ignore case?
Add (?i) at the beginning of the regex pattern to ignore case sensitivity. For example: df.filter(col("name").rlike("(?i)^rames")).show()

How can I combine multiple rlike() conditions?
You can use logical operators like | (OR) or & (AND) to combine multiple rlike() conditions. For example: df.filter((col("name").rlike("Rose")) | (col("name").rlike("Smith"))).show()

How do I exclude rows that match a pattern?
Use the tilde ~ (NOT) operator before the rlike() condition. For example: df.filter(~col("name").rlike("Rose")).show()

What is the difference between like() and rlike()?
like() uses SQL-style wildcards (%, _) for simple pattern matching, while rlike() supports full regular expressions for advanced pattern matching.
like() example: df.filter(col("name").like("%Rose%")).show()
rlike() example: df.filter(col("name").rlike("Rose$")).show()

Can I use regex quantifiers with rlike()?
Yes. Since rlike() uses standard regex, you can use quantifiers to match repeating characters or patterns.
Example (zero or more "s"): df.filter(col("name").rlike("Rames*")).show()

How do I match the start or end of a string with rlike()?
Use ^ to match the start and $ to match the end of the string. For example:
df.filter(col("name").rlike("^James")).show()
df.filter(col("name").rlike("Rose$")).show()

What happens if no rows match the pattern?
If no rows match, PySpark returns an empty DataFrame with the same schema (columns remain, but no data), as shown in the sketch below.
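For instance, filtering the sample data with a pattern that matches nothing returns an empty DataFrame with the schema intact:

# No rows match "xyz"; the (id, name) schema is preserved
df.filter(col("name").rlike("xyz")).show()

# Output:
+---+----+
| id|name|
+---+----+
+---+----+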
Conclusion
In this article, I have explained how to use PySpark’s rlike() function to filter rows based on regex pattern matching in string columns. I also covered handling more advanced scenarios, such as using regex metacharacters, combining multiple conditions, and performing case-insensitive matches. Compared to functions like like(), ilike(), or contains(), rlike() offers much greater flexibility, whether you’re dealing with case-sensitive searches, building complex match patterns, or excluding specific text.