  • Post category:PySpark
  • Post last modified:March 27, 2024

The PySpark SQL contains() function checks whether a column value contains a literal string (it matches on part of the string) and is most often used to filter rows of a DataFrame. contains() works in conjunction with the filter() operation and provides an effective way to select rows based on the presence of a substring within a string column.

Key Points on PySpark contains()

  • Substring Containment Check: The contains() function in PySpark is used to perform substring containment checks. It evaluates whether one string (column) contains another as a substring.
  • Boolean Result: The result of the contains() function is a boolean value (True or False). It indicates whether the substring is present in the source string for each row.
  • Case-Sensitive: By default, the contains() function is case-sensitive. It distinguishes between uppercase and lowercase characters. If case-insensitive matching is needed, additional functions like lower() or upper() can be applied to the columns.
  • Column-to-Column Comparison: The contains() function is often used for comparing values between two columns within a PySpark DataFrame, checking if the substring in one column is present in the other column.
  • Integration with DataFrame Operations: The contains() function is seamlessly integrated into PySpark DataFrame operations, allowing for efficient and expressive manipulation of data by adding a new boolean column indicating substring containment.

PySpark SQL contains() Syntax and Usage

Syntax:


# Syntax
pyspark.sql.functions.contains(left: ColumnOrName, right: ColumnOrName)

Parameters:

  • left: Column or str: The input column or string to search within; may be NULL.
  • right: Column or str: The input column or string to search for; may be NULL.

Returns a boolean: True if right is found inside left, NULL if either input expression is NULL, and False otherwise. Both left and right must be of STRING or BINARY type. Note that this standalone function was introduced in PySpark 3.5.0; on earlier versions, use the Column.contains() method shown in the examples below.

To explain contains() with filter() examples, first, let’s create a DataFrame with some test data.


# Imports
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Sample DataFrame
data = [("John Doe",), ("Jane Smith",), ("Robert Johnson",)]
columns = ["full_name"]
df = spark.createDataFrame(data, columns)
df.show()

# Output
+---------------+
|      full_name|
+---------------+
|       John Doe|
|     Jane Smith|
| Robert Johnson|
+---------------+

1. Filter DataFrame Rows using contains() in a String

The PySpark contains() method checks whether a DataFrame string column contains the string specified as an argument (it matches on part of the string), returning True if the substring exists and False if not. The example below returns all rows from the DataFrame whose full_name column contains the string Smith.


# Import
from pyspark.sql.functions import col

# Specify the string to check for
substring_to_check = "Smith"

# Use filter and contains to check if the column contains the specified substring
filtered_df = df.filter(col("full_name").contains(substring_to_check))

# Show the DataFrame
filtered_df.show()

Yields below output.


# Output
+----------+
| full_name|
+----------+
|Jane Smith|
+----------+

2. PySpark SQL Query using CONTAINS Example

Alternatively, we can use a SQL query with the contains() function to check whether a DataFrame column contains a string. This returns a BOOLEAN. If expr or subExpr is NULL, the result is NULL. If subExpr is an empty string or empty binary, the result is true.


# SQL Syntax
contains(expr, subExpr)

Below are some scenarios and the results.

contains() example          Result
contains(NULL, 'Smith')     NULL
contains('Smith', NULL)     NULL
contains('Smith', 'mith')   True
contains('Smith', 'Mith')   False
contains('Smith', '')       True
PySpark SQL contains() Truth Table

Below is an example.


# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("people")

# Use SQL to filter rows based on substring containment
result_df = spark.sql("SELECT * FROM people WHERE contains(full_name, 'Smith')")

# Show the result DataFrame
result_df.show()

In this example, the contains() is used in a PySpark SQL query to filter rows where the “full_name” column contains the specified substring (“Smith”).


# Output
+----------+
| full_name|
+----------+
|Jane Smith|
+----------+

3. PySpark contains() with Case-Insensitive Matching

To filter rows by checking whether a column contains a string while ignoring case (case-insensitive), the lower() function can be used to transform the column to lowercase before applying contains(). This ensures that the check is insensitive to letter case, allowing for a more flexible and inclusive search.


# Syntax
from pyspark.sql.functions import lower
lower(col("column_name")).contains("value")

Where “column_name” represents the targeted column and “value” is the substring of interest.


# Import
from pyspark.sql.functions import col
from pyspark.sql.functions import lower

# Case-insensitive containment check for the substring "smith"
filtered_df = df.filter(lower(col("full_name")).contains("smith"))

# Show the DataFrame
filtered_df.show()

Yields below output.


# Output
+----------+
| full_name|
+----------+
|Jane Smith|
+----------+

4. Filter Rows Not Containing a Substring

The contains() function can be negated to filter rows not containing a specific substring by utilizing the tilde (~) operator before contains().


# Syntax
~col("column_name").contains("value")

This filters rows where “column_name” does not contain the specified substring. This negation functionality is valuable for excluding specific patterns and tailoring DataFrames to meet particular criteria, providing a concise and efficient means of refining data based on the absence of certain substrings.


# Negation - Filter rows where "full_name" does not contain the substring "Smith"
filtered_df = df.filter(~col("full_name").contains("Smith"))
filtered_df.show()

Yields below output.


# Output
+--------------+
|     full_name|
+--------------+
|      John Doe|
|Robert Johnson|
+--------------+

5. Multiple Conditions with & (AND) and | (OR)

The PySpark SQL contains() function can be combined with logical operators & (AND) and | (OR) to create complex filtering conditions based on substring containment.


# Syntax
col("column_name").contains("value1") & col("other_column").contains("value2")

The above code filters rows where both specified substrings are present. Whereas the below snippet includes rows where at least one of the substrings is found.


# Syntax
col("column_name").contains("value1") | col("other_column").contains("value2")

Example:


# Sample DataFrame
data = [("John Doe", "Engineer"), ("Jane Smith", "Manager"), ("Robert Johnson", "Analyst")]
columns = ["full_name", "job_title"]
df = spark.createDataFrame(data, columns)

# Filter rows where "full_name" contains "Smith" AND "job_title" contains "Manager"
filtered_df_and = df.filter(col("full_name").contains("Smith") & col("job_title").contains("Manager"))

# Filter rows where "full_name" contains "Johnson" OR "job_title" contains "Analyst"
filtered_df_or = df.filter(col("full_name").contains("Johnson") | col("job_title").contains("Analyst"))

# Show the DataFrames
filtered_df_and.show()
filtered_df_or.show()

Yields below output.


# Output
+----------+---------+
| full_name|job_title|
+----------+---------+
|Jane Smith|  Manager|
+----------+---------+

+--------------+---------+
|     full_name|job_title|
+--------------+---------+
|Robert Johnson|  Analyst|
|    Jane Smith|  Manager|
+--------------+---------+

In the above example, first, rows are filtered where “full_name” contains “Smith” AND “job_title” contains “Manager“. And second, rows are filtered where “full_name” contains “Johnson” OR “job_title” contains “Analyst“.

6. Column Contains in Another Column

You can check if a column contains a value from another column by passing a Column (created with col() from the pyspark.sql.functions module) to the contains() method.

For example, suppose you have a DataFrame named df with two columns, FullStreet and Street, and you want to check whether the value in Street is contained in the value of FullStreet and filter on that.


# Import
from pyspark.sql.functions import col

# Sample data
data = [("Orange Street", "Street"), ("Main Street", "Blvd"), ("Alton Blvd", "Blvd")]

# Define the DataFrame schema
columns = ["FullStreet", "Street"]
df = spark.createDataFrame(data, columns)

# Use contains to check if values in Street are contained in FullStreet
result_df = df.filter(col("FullStreet").contains(col("Street")))

# Show the result
result_df.show()

Yields below output. The contains function returns a boolean value (true or false) for each row based on the containment check, results with false are ignored and results with true are returned as a new DataFrame.


# Output
+-------------+------+
|   FullStreet|Street|
+-------------+------+
|Orange Street|Street|
|   Alton Blvd|  Blvd|
+-------------+------+

Conclusion

In summary, the contains() function in PySpark is utilized for substring containment checks within DataFrame columns and it can be used to derive a new column or filter data by checking string contains in another string. It evaluates whether one string contains another, providing a boolean result for each row. The function is case-sensitive by default, and its integration into DataFrame operations facilitates efficient and expressive substring-related data manipulations.

Happy Learning !!

Prabha

Prabha is an accomplished data engineer with a wealth of experience in architecting, developing, and optimizing data pipelines and infrastructure. With a strong foundation in software engineering and a deep understanding of data systems, Prabha excels in building scalable solutions that handle diverse and large datasets efficiently. At SparkbyExamples.com, Prabha shares her experience in Spark, PySpark, Python and Pandas.