In PySpark, understanding the difference between like(), rlike(), and ilike() is essential, especially when working with text data. You can use these functions to filter rows based on specific patterns, such as checking whether a name contains both uppercase and lowercase letters or ends with a certain keyword. PySpark makes it easy to handle such cases with its powerful set of string functions.
In this article, I will explain the like(), rlike(), and ilike() functions in PySpark, how they are used for processing text data, and the key differences between them.
- like() – Performs SQL-style pattern matching
- rlike() – Uses regular expressions for pattern matching
- ilike() – Performs case-insensitive SQL-style pattern matching
By the end of this article, you will understand:
- How each function works
- The main differences between them
- Practical examples for real-world use cases
- Best practices for effective usage
- Answers to frequently asked questions to clear up common doubts
Key points:
- Use like() for SQL-style wildcard matching with % and _. like() is case-sensitive, so mind the letter casing.
- Use rlike() for powerful regex-based pattern matching. rlike() supports expressions like (?i) to ignore case.
- Use ilike() for case-insensitive SQL-style matching (Spark 3.0+). ilike() simplifies matching without regex when case doesn’t matter.
- Use filter() or where() with these functions to apply conditions.
- Cast numeric columns to strings before pattern matching.
- rlike() is best for complex or conditional text patterns.
- Always choose the function that best fits your case sensitivity and pattern complexity (see the quick side-by-side sketch below).
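To make these points concrete, here is a minimal side-by-side sketch. It assumes a Spark version where ilike() is available; the app name and sample data are placeholders chosen only for illustration.
# Minimal side-by-side sketch of like(), rlike(), and ilike()
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("LikeRlikeIlike").getOrCreate()
df = spark.createDataFrame([(1, "Alice Rose"), (2, "bob rose")], ["id", "name"])

# like(): SQL wildcards, case-sensitive -> returns only id 1
df.filter(col("name").like("%Rose%")).show()

# rlike(): regex; the (?i) inline flag ignores case -> returns ids 1 and 2
df.filter(col("name").rlike("(?i)rose$")).show()

# ilike(): SQL wildcards, case-insensitive -> returns ids 1 and 2
df.filter(col("name").ilike("%rose%")).show()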
like() – SQL-style Pattern Matching
The like()
function is used to perform SQL-style pattern matching in PySpark. It checks if a string column matches a specified pattern using wildcard characters.
Syntax
Following is the syntax of the SQL like()
.
# Syntax of the like() function
col("column_name").like("pattern")
Parameters
pattern (str): SQL-style string pattern using the % (any number of characters) and _ (single character) wildcards.
Return Value
It returns a column of boolean values. Each value is True if the corresponding row’s value matches the pattern, otherwise False.
Usage of like()
You can use the like()
function in PySpark to filter rows in a DataFrame by applying SQL-style pattern matching to specific columns. Let’s create a DataFrame and apply the like()
function to the target column (where you want to match string patterns using SQL expressions) to retrieve the filtered rows.
# Use the like() function to filter DataFrame rows
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Create DataFrame
data = [(1, "James Smith"), (2, "Michael Rose"),
        (3, "Robert Williams"), (4, "Rames Rose"), (5, "Rames rose")]
df = spark.createDataFrame(data=data, schema=["id", "name"])
df.show()

# Filter rows where name contains the string 'Rose' (case-sensitive)
# Import col from pyspark.sql.functions
from pyspark.sql.functions import col
df.filter(col("name").like("%Rose%")).show()
This yields the output below.
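# Expected output (based on the sample data above):
+---+------------+
| id|        name|
+---+------------+
|  2|Michael Rose|
|  4|  Rames Rose|
+---+------------+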
As shown above, like() filters the DataFrame rows using a case-sensitive match, so the row containing "Rames rose" is not returned.
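Because like() returns a boolean column, it can also be used outside of filter(), for example with select() to add a flag column instead of dropping rows. A minimal sketch using the df created above (the has_rose column name is just illustrative):
# Add a boolean flag column instead of filtering rows
from pyspark.sql.functions import col
df.select("id", "name", col("name").like("%Rose%").alias("has_rose")).show()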
rlike() – Regex Matching
In PySpark, the rlike()
function is used to apply regular expressions to string columns for advanced pattern matching. It accepts a regex string as a parameter and returns a Boolean column indicating whether each row matches the expression. By default, rlike()
is case-sensitive. To make it case-insensitive, you must modify the regex pattern (e.g., using (?i)
inline flag). Compared to like()
, rlike()
offers greater power and flexibility.
rlike() Syntax
Following is the syntax of the rlike() function.
# Syntax of the rlike() function
col("column_name").rlike("regex_pattern")
Parameters
regex_pattern
(str): A valid regular expression pattern.
Return Value
Returns a boolean column indicating whether each value in the column matches the given regular expression.
Usage
Let’s see an example of using rlike() to evaluate a regular expression. In the examples below, I use the rlike() function to filter PySpark DataFrame rows by matching patterns with regular expressions, including case-insensitive matching.
# Match names containing 'rose', ignoring case
from pyspark.sql.functions import col
# (?i) is an inline flag that makes the regex case-insensitive
df.filter(col("name").rlike("(?i)rose")).show()
This yields the output below.
# Output:
+---+------------+
| id|        name|
+---+------------+
|  2|Michael Rose|
|  4|  Rames Rose|
|  5|  Rames rose|
+---+------------+
If you don’t use a case-insensitive flag, rlike()
matches patterns in a case-sensitive manner by default.
# Match names containing 'rose', case-sensitive
df.filter(col("name").rlike("rose")).show()
This yields the output below.
# Output:
+---+----------+
| id|      name|
+---+----------+
|  5|Rames rose|
+---+----------+
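Since rlike() accepts full regular expressions, you can also combine anchors and alternation in a single pattern. A minimal sketch using the same df (the pattern is just an example):
# Names that start with 'R' and end with 'Rose' or 'rose' -> matches ids 4 and 5
df.filter(col("name").rlike("^R.*(Rose|rose)$")).show()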
ilike() – Case-insensitive SQL-style Pattern Matching
The ilike()
function is similar to like()
but performs case-insensitive matching. This is useful when you want simple wildcard matching without worrying about letter casing.
ilike() Syntax
Following is the syntax of ilike().
# Syntax of the ilike() function
col("column_name").ilike("pattern")
Parameters
pattern (str): SQL-style string pattern using the % and _ wildcards, matched case-insensitively.
Return Value
Returns a column of boolean values.
Usage of ilike() Function
Let’s apply the ilike()
function on PySpark DataFrame columns to perform case-insensitive SQL-style pattern matching. This function filters rows based on case-insensitive string patterns using SQL expressions.
# Case-insensitive match using SQL-style wildcard
from pyspark.sql.functions import col
df.filter(col("name").ilike("%rose%")).show()
This yields the output below.
# Output:
+---+------------+
| id|        name|
+---+------------+
|  2|Michael Rose|
|  4|  Rames Rose|
|  5|  Rames rose|
+---+------------+
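ilike() also supports the _ wildcard, which matches exactly one character, again ignoring case. A small sketch on the same df (the pattern is only illustrative):
# '_' matches a single character; case is ignored -> matches 'Rames Rose' and 'Rames rose'
df.filter(col("name").ilike("rames r_se")).show()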
Differences Between PySpark like(), rlike(), and ilike()
The table below highlights the key differences among these functions to help you understand when to use each one effectively.
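| | like() | rlike() | ilike() |
|---|---|---|---|
| Pattern type | SQL wildcards (% and _) | Regular expressions | SQL wildcards (% and _) |
| Case sensitivity | Case-sensitive | Case-sensitive by default; use (?i) to ignore case | Case-insensitive |
| Availability | All Spark versions | All Spark versions | Spark 3.0+ |
| Best suited for | Simple wildcard matching when case matters | Complex or flexible pattern matching | Simple matching when case doesn’t matter |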
Best Practices
- Always cast numeric columns to string before using these functions.
- Use like() for simple wildcard matching when case matters.
- Use ilike() for case-insensitive SQL-style matching (Spark 3.0+).
- Use rlike() for advanced or flexible pattern matching with regex.
Example:
You can convert a numeric column to a string before filtering rows with pattern-matching functions such as like(), rlike(), or ilike().
# Filter where id ends with 5
from pyspark.sql.functions import col
df.filter(col("id").cast("string").like("%5")).show()
This yields the output below.
# Output:
+---+----------+
| id|      name|
+---+----------+
|  5|Rames rose|
+---+----------+
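The same casting approach works with rlike() when you need regex features on a numeric column. A small sketch on the same df (the pattern is just an example):
# Filter rows whose id is 1, 2, or 3 using a regex character class
df.filter(col("id").cast("string").rlike("^[1-3]$")).show()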
Frequently Asked Questions on PySpark like() vs rlike() vs ilike()
What is the difference between like() and rlike()?
like() uses SQL-style pattern matching with % and _. rlike() uses regular expressions, which are more powerful and flexible.

Is like() case-sensitive?
Yes, like() is case-sensitive. For example, %Rose% will not match rose.

How do I perform case-insensitive matching?
Use either ilike() for SQL-style patterns (Spark 3.0+), or rlike("(?i)pattern") for regex-based matching.

What is ilike()?
ilike() is the case-insensitive version of like(), introduced in Spark 3.0+. It uses SQL-style wildcards.

What do the % and _ wildcards mean?
% matches any sequence of characters, and _ matches a single character.

Can I use regular expressions with like() or ilike()?
No, like() and ilike() only support SQL wildcards. Use rlike() for regex.

How do I match strings ending with a word, ignoring case?
Use col("name").rlike("(?i)rose$"). This matches strings ending with "rose", case-insensitively.

Which function should I use for complex patterns?
Use rlike() – it supports full regular expressions and can handle complex patterns.

Is ilike() available in all Spark versions?
No, ilike() is available only in Spark 3.0 and above.
Conclusion
In this article, I have explained the concept of PySpark like()
vs rlike()
vs ilike()
for filtering DataFrame rows based on string patterns. Choosing the appropriate function is crucial when working with textual data, as each one handles pattern matching differently. While all three are used to match string patterns, they vary in syntax, pattern type (SQL-style vs. regex), and case sensitivity, making it important to select the one that best fits your specific use case.
Happy Learning!!
Related Articles
- PySpark Filter using contains() Examples
- PySpark How to Filter Rows with NULL Values
- PySpark startsWith() and endsWith() Functions