
The like() function in PySpark is used to filter rows based on pattern matching using wildcard characters, similar to SQL’s LIKE operator. This is especially useful when you want to match strings using wildcards such as % (any sequence of characters) and _ (a single character). You can use it to filter DataFrame rows on a single condition or multiple conditions, or to derive a new boolean column.


In this article, you’ll learn how to use the like() function in PySpark to filter DataFrame rows with the help of wildcard characters with practical examples.

Key Points

  • The like() function is used to filter DataFrame rows based on pattern matching.
  • It works similarly to SQL’s LIKE operator.
  • Supports % to match zero or more characters.
  • Supports _ to match exactly one character.
  • Returns a boolean column indicating if the pattern is matched.
  • Multiple like() conditions can be combined using | (OR) or & (AND).
  • The like() function is case-sensitive by default.
  • For case-insensitive matching, use ilike() (available in PySpark 3.3+).
  • Numeric columns can be cast to strings before using like().
  • If no rows match the pattern, it returns an empty DataFrame.

PySpark SQL like() Function

The like() function in PySpark is used to filter or transform DataFrame rows by checking if a specified column’s string value matches a given pattern, similar to SQL’s LIKE operator. It supports wildcard characters such as _ (for a single character) and % (for multiple characters) to perform pattern matching and returns a boolean column indicating whether each row meets the condition.

Syntax of like()

Below is the syntax of the like() function.


# Syntax of like()
column_name.like("pattern")

Parameters

The like() function takes a single argument, pattern, which is the string pattern to match. The pattern supports two wildcard characters:

  • %: Matches zero or more characters.
  • _: Matches exactly one character.

Return Value

It returns a boolean column indicating whether each row meets the specified pattern-matching condition.
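Since like() returns a boolean column rather than a filtered DataFrame, you can also pass it to withColumn() to derive a new flag column instead of dropping rows. Below is a minimal sketch, assuming a DataFrame df with a string column name (such as the sample DataFrame created below); the column name starts_with_r is just an illustrative choice.


# Derive a boolean column from like() instead of filtering
from pyspark.sql.functions import col

df.withColumn("starts_with_r", col("name").like("R%")).show()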

Create a Sample DataFrame

Let’s start with a simple DataFrame for implementing examples of the like() function.


# Create DataFrame 
# Import SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [(1,"James Smith"), (2,"Michael Rose"),
  (3,"Robert Williams"), (4,"Rames Rose"),(5,"Rames rose")]
df = spark.createDataFrame(data=data,schema=["id","name"])
df.show()

Yields the output below.


# Output:
+---+---------------+
| id|           name|
+---+---------------+
|  1|    James Smith|
|  2|   Michael Rose|
|  3|Robert Williams|
|  4|     Rames Rose|
|  5|     Rames rose|
+---+---------------+

PySpark like() with Wildcards

You can use wildcard characters like _ and % with the like() function to filter rows in a DataFrame. Let’s apply these wildcards to a specific column of the DataFrame above to filter rows based on single-character (_) or multi-character (%) pattern matches.


# Use like() to filter rows by _ wildcard match
df.filter(col("name").like("_a%")).show()

# Use like() to filter rows by % wildcard match
df.filter(col("name").like("%rose%")).show()

Yields the output below.


# Output:

+---+-----------+
| id|       name|
+---+-----------+
|  1|James Smith|
|  4| Rames Rose|
|  5| Rames rose|
+---+-----------+

+---+----------+
| id|      name|
+---+----------+
|  5|Rames rose|
+---+----------+

From the above examples, "_a%" matches any string where the second character is 'a', followed by any number of characters (including none). On the other hand, "%rose%" matches any string that contains the substring 'rose' anywhere within it, and the match is case-sensitive.
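Because like() mirrors SQL’s LIKE operator, the same filters can be written in Spark SQL after registering the DataFrame as a temporary view. A rough equivalent of the examples above (the view name people is just an illustrative choice):


# Equivalent filters expressed in Spark SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE name LIKE '_a%'").show()
spark.sql("SELECT * FROM people WHERE name LIKE '%rose%'").show()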

PySpark like() with Multiple Values

You can combine multiple like() expressions using logical operators such as | (OR) and & (AND) to filter rows in a DataFrame. In this example, I combine multiple patterns with the like() function and the OR (|) operator.


# Filter rows using multiple patterns
df.filter((col("name").like("Rames%")) | (col("name").like("%Rose%"))).show()

Yields the output below.


# Output:
+---+------------+
| id|        name|
+---+------------+
|  2|Michael Rose|
|  4|  Rames Rose|
|  5|  Rames rose|
+---+------------+

From the above code, "Rames%" matches any string that begins with 'Rames' and is followed by zero or more characters. Similarly, "%Rose%" matches any string that contains 'Rose' anywhere within it. The | operator is used to combine both conditions with a logical OR, meaning rows that match either of the patterns will be included in the result.
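When you have more than two patterns to OR together, building the condition in a loop keeps the code readable. Here is a small sketch using Python’s functools.reduce; the list of patterns is only an example:


# Combine a list of like() patterns with a logical OR
from functools import reduce
from operator import or_

patterns = ["Rames%", "%Rose%", "James%"]
condition = reduce(or_, [col("name").like(p) for p in patterns])
df.filter(condition).show()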

PySpark like() Case Insensitive

By default, the like() function is case-sensitive in PySpark. To perform case-insensitive pattern matching, you can use the ilike() function (introduced in PySpark 3.3), which allows filtering DataFrame rows without considering letter casing.


# Use PySpark ilike() to filter rows case-insensitively
df.filter(col("name").ilike("%rose")).show()

Yields the output below.


# Output:
+---+------------+
| id|        name|
+---+------------+
|  2|Michael Rose|
|  4|  Rames Rose|
|  5|  Rames rose|
+---+------------+

From the above code, "%rose" matches any string that ends with 'rose', allowing for any number of characters (including none) to precede it.
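If your PySpark version does not include ilike(), a common workaround is to lowercase the column with lower() and match against a lowercase pattern. A minimal sketch of that approach:


# Case-insensitive match without ilike(): lowercase the column first
from pyspark.sql.functions import lower

df.filter(lower(col("name")).like("%rose")).show()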

PySpark Filter with Multiple like() Conditions

To filter rows using multiple like() conditions in PySpark, you can combine them with the logical AND (&) operator. This allows you to apply multiple pattern checks at once. The resulting DataFrame will include only those rows where both conditions are true.


# Filter rows using multiple like() conditions
df.filter((col("name").like("Rames%")) & (col("name").like("%Rose"))).show()

Yields the output below.


# Output:
+---+----------+
| id|      name|
+---+----------+
|  4|Rames Rose|
+---+----------+

Frequently Asked Questions of PySpark like() Function

What does the like() function do in PySpark?

The like() function in PySpark is used to filter rows by checking if a column’s string value matches a specified pattern, similar to SQL’s LIKE operator.

What wildcard characters can be used with like()?

You can use:
%: Matches any sequence of characters (including none).
_: Matches a single character.

How do I filter rows where a column starts with a specific word?

Use the % wildcard after the word. For example:
df.filter(col("name").like("John%"))

How do I check if a column contains a certain substring?

Surround the substring with %. For example:
df.filter(col("name").like("%smith%"))

How do I use multiple like() conditions?

Combine them using | (OR) or & (AND). For example:
df.filter((col("name").like("Ram%")) | (col("name").like("%Rose")))

How can I use like() with numeric columns?

You can convert a numeric column to a string using casting, and then apply the like() function for pattern matching. For example:
df.filter(col("id").cast("string").like("123%"))

What’s the difference between like() and rlike()?

like() uses SQL-style wildcards (%, _).
rlike() supports regular expressions, giving you more powerful matching.
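For example, the prefix match shown earlier could be written either way; the regular expression below is just an illustrative equivalent:

# like() wildcard vs. rlike() regular expression
df.filter(col("name").like("Rames%")).show()   # SQL-style wildcard
df.filter(col("name").rlike("^Rames")).show()  # equivalent regex anchor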

What happens if the pattern doesn’t match any rows?

The filter returns an empty DataFrame with the same schema, just no rows.
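You can confirm this by counting the rows of the result; a quick sketch with a pattern that matches nothing (the pattern "Zzz%" is only an example):

# A non-matching pattern returns an empty DataFrame
empty_df = df.filter(col("name").like("Zzz%"))
print(empty_df.count())  # 0
empty_df.show()          # prints only the header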

Conclusion

In this article, I explained how the like() function in PySpark enables SQL-style pattern matching using wildcards: % for matching multiple characters and _ for a single character. I also covered how to filter substrings, combine multiple patterns, and perform case-insensitive matching using ilike(). Together, like() and ilike() offer powerful and flexible options for string filtering in PySpark workflows.

Happy Learning!!
