The like() function in PySpark is used to filter rows based on pattern matching with wildcard characters, similar to SQL’s LIKE operator. It is especially useful when you want to match strings using wildcards such as % (any sequence of characters) and _ (a single character). You can use it to filter DataFrame rows by single or multiple conditions, or to derive a new boolean column.
In this article, you’ll learn how to use the like() function in PySpark to filter DataFrame rows using wildcard characters, with practical examples.
Key Points
- The like() function is used to filter DataFrame rows based on pattern matching.
- It works similarly to SQL’s LIKE operator.
- Supports % to match zero or more characters.
- Supports _ to match exactly one character.
- Returns a boolean column indicating whether the pattern is matched.
- Multiple like() conditions can be combined using | (OR) or & (AND).
- The like() function is case-sensitive by default.
- For case-insensitive matching, use ilike() (available in PySpark 3.3+).
- Numeric columns can be cast to strings before using like().
- If no rows match the pattern, it returns an empty DataFrame.
PySpark SQL like() Function
The like() function in PySpark is used to filter or transform DataFrame rows by checking whether a specified column’s string value matches a given pattern, similar to SQL’s LIKE operator. It supports the wildcard characters _ (for a single character) and % (for multiple characters) to perform pattern matching, and returns a boolean column indicating whether each row meets the condition.
Syntax of like()
Below is the syntax of the like() function.
column_name.like("pattern")
Parameters
- %: Matches zero or more characters.
- _: Matches exactly one character.
Return Value
It returns a boolean column indicating whether each row meets the specified pattern-matching condition.
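To see that boolean column itself rather than use it for filtering, you can select the result of like() directly. Below is a minimal sketch, assuming a DataFrame df with a string name column (one is created in the next section); the alias starts_with_J is just an illustrative name.
# Select the boolean result of like() as a new column
# (assumes df has a string 'name' column; the alias is illustrative)
df.select("name", col("name").like("J%").alias("starts_with_J")).show()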
Create a Sample DataFrame
Let’s start by creating a simple DataFrame to demonstrate examples of the like() function.
# Import SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Create DataFrame
data = [(1,"James Smith"), (2,"Michael Rose"),
        (3,"Robert Williams"), (4,"Rames Rose"), (5,"Rames rose")]
df = spark.createDataFrame(data=data, schema=["id","name"])
df.show()
Yields the output below.
# Output:
+---+---------------+
| id| name|
+---+---------------+
| 1| James Smith|
| 2| Michael Rose|
| 3|Robert Williams|
| 4| Rames Rose|
| 5| Rames rose|
+---+---------------+
PySpark like() with Wildcards
You can use the wildcard characters _ and % with the like() function to filter rows in a DataFrame. Let’s apply these wildcards to a specific column of the DataFrame above to filter rows based on single-character (_) or multi-character (%) pattern matches.
# Use like() to filter rows by _ wildcard match
df.filter(col("name").like("_a%")).show()
# Use like() to filter rows by % wildcard match
df.filter(col("name").like("%rose%")).show()
Yields the outputs below.
# Output:
+---+-----------+
| id| name|
+---+-----------+
| 1|James Smith|
| 4| Rames Rose|
| 5| Rames rose|
+---+-----------+
+---+----------+
| id| name|
+---+----------+
| 5|Rames rose|
+---+----------+
From the above examples, "_a%" matches any string whose second character is 'a', followed by any number of characters (including none). On the other hand, "%rose%" matches any string that contains the substring 'rose' anywhere within it; the match is case-sensitive.
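As mentioned earlier, like() can also derive a new column instead of filtering. Here is a short sketch using the same DataFrame; the column name contains_rose is just an illustrative choice.
# Add a boolean column flagging rows whose name contains 'rose'
df.withColumn("contains_rose", col("name").like("%rose%")).show()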
PySpark like() with Multiple Values
You can combine multiple like() expressions using logical operators such as | (OR) and & (AND) to filter rows in a DataFrame. In this example, I will apply multiple patterns with the like() function and the OR (|) operator.
# Filter rows by using multiple patterns
df.filter((col("name").like("Rames%")) | (col("name").like("%Rose%"))).show()
Yields the output below.
# Output:
+---+------------+
| id| name|
+---+------------+
| 2|Michael Rose|
| 4| Rames Rose|
| 5| Rames rose|
+---+------------+
From the above code, "Rames%" matches any string that begins with 'Rames' and is followed by zero or more characters. Similarly, "%Rose%" matches any string that contains 'Rose' anywhere within it. The | operator combines both conditions with a logical OR, meaning rows that match either pattern are included in the result.
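If you prefer SQL syntax, filter() also accepts a SQL expression string, so the same OR condition can be written as in the sketch below (equivalent to the example above, and just as case-sensitive).
# Same filter expressed as a SQL string
df.filter("name LIKE 'Rames%' OR name LIKE '%Rose%'").show()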
PySpark like() Case Insensitive
By default, the like() function is case-sensitive in PySpark. To perform case-insensitive pattern matching, you can use the ilike() function (introduced in PySpark 3.3), which filters DataFrame rows without considering letter casing.
# Use PySpark ilike() to filter rows case-insensitively
df.filter(col("name").ilike("%rose")).show()
Yields the output below.
# Output:
+---+------------+
| id| name|
+---+------------+
| 2|Michael Rose|
| 4| Rames Rose|
| 5| Rames rose|
+---+------------+
From the above code, "%rose" matches any string that ends with 'rose', allowing any number of characters (including none) to precede it, regardless of case.
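If you are on a Spark version without ilike(), a common workaround is to lower-case the column before matching. A brief sketch using the built-in lower() function:
# Case-insensitive match without ilike():
# lower-case the column, then match a lower-case pattern
from pyspark.sql.functions import lower
df.filter(lower(col("name")).like("%rose")).show()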
PySpark Filter with Multiple like() Conditions
To filter rows using multiple like() conditions in PySpark, you can combine them with the logical AND (&) operator. This allows you to apply multiple pattern checks at once; the resulting DataFrame includes only the rows where both conditions are true.
# Filter rows matching both like() patterns
df.filter((col("name").like("Rames%")) & (col("name").like("%Rose"))).show()
Yields the output below.
# Output:
+---+----------+
| id|      name|
+---+----------+
|  4|Rames Rose|
+---+----------+
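Since like() mirrors SQL’s LIKE, you can also run the same filter through spark.sql() on a temporary view. A sketch follows; the view name people is just an illustrative choice.
# Register the DataFrame as a temp view and filter with SQL LIKE
df.createOrReplaceTempView("people")
spark.sql("SELECT id, name FROM people WHERE name LIKE 'Rames%' AND name LIKE '%Rose'").show()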
Frequently Asked Questions of PySpark like() Function
What is the like() function in PySpark?
The like() function in PySpark is used to filter rows by checking whether a column’s string value matches a specified pattern, similar to SQL’s LIKE operator.
What wildcard characters can I use with like()?
You can use % to match any sequence of characters (including none) and _ to match exactly one character.
How do I filter rows that start with a specific word?
Use the % wildcard after the word. For example:
df.filter(col("name").like("John%"))
How do I filter rows that contain a substring?
Surround the substring with %. For example:
df.filter(col("name").like("%smith%"))
How do I combine multiple like() conditions?
Combine them using | (OR) or & (AND). For example:
df.filter((col("name").like("Ram%")) | (col("name").like("%Rose")))
Can I use like() on numeric columns?
Yes. You can convert a numeric column to a string using casting, and then apply the like() function for pattern matching. For example:
df.filter(col("id").cast("string").like("123%"))
What is the difference between like() and rlike()?
like() uses SQL-style wildcards (% and _), while rlike() supports regular expressions, giving you more powerful matching.
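To make the difference concrete, here is a small sketch: the regex alternation used with rlike() matches either casing in a single expression, which like() alone cannot do.
# like(): SQL wildcards only, case-sensitive
df.filter(col("name").like("%Rose%")).show()
# rlike(): full regular expressions, e.g. alternation for both casings
df.filter(col("name").rlike("Rose|rose")).show()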
What happens if no rows match the pattern?
The filter returns an empty DataFrame with the same schema, just no rows.
Conclusion
In this article, I explained how the like() function in PySpark enables SQL-style pattern matching using wildcards: % for matching multiple characters and _ for a single character. I also covered how to filter substrings, combine multiple patterns, and perform case-insensitive matching using ilike(). Together, like() and ilike() offer powerful and flexible options for string filtering in PySpark workflows.
Happy Learning!!
Related Articles
- Difference between PySpark like(), rlike(), and ilike() functions.
- Explain the PySpark SQL rlike() function with examples.