• Post author:
  • Post category:PySpark
  • Post last modified:May 5, 2024
  • Reading time:13 mins read
You are currently viewing PySpark Filter using contains() Examples

PySpark SQL contains() function is used to match a column value contains in a literal string (matches on part of the string), this is mostly used to filter rows on DataFrame. contains() function works in conjunction with the filter() operation and provides an effective way to select rows based on substring presence within a string column.

Advertisements

Key Points on PySpark contains()

  • Substring Containment Check: The contains() function in PySpark is used to perform substring containment checks. It evaluates whether one string (column) contains another as a substring.
  • Boolean Result: The result of the contains() function is a boolean value (True or False). It indicates whether the substring is present in the source string for each row.
  • Case-Sensitive: By default, the contains() function is case-sensitive. It distinguishes between uppercase and lowercase characters. If case-insensitive matching is needed, additional functions like lower() or upper() can be applied to the columns.
  • Column-to-Column Comparison: The contains() function is often used for comparing values between two columns within a PySpark DataFrame, checking if the substring in one column is present in the other column.
  • Integration with DataFrame Operations: The contains() function is seamlessly integrated into PySpark DataFrame operations, allowing for efficient and expressive manipulation of data by adding a new boolean column indicating substring containment.

PySpark SQL contains() Syntax and Usage

Syntax:


# Syntax
pyspark.sql.functions.contains(left: ColumnOrName, right: ColumnOrName)

Parameters:

  • left: Column or str: The input column or strings to check, may be NULL.
  • right: Column or str: The input column or strings to find, may be NULL.

Returns a boolean. The value is True if the right is found inside the left. Returns NULL if either input expression is NULL. Otherwise, it returns False. Both left or right must be of STRING or BINARY type.

To explain contains() with filter() examples, first, let’s create a DataFrame with some test data.


# Imports
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("sparkbyexamples.com").getOrCreate()

# Sample DataFrame
data = [("John Doe",), ("Jane Smith",), ("Robert Johnson",)]
columns = ["full_name"]
df = spark.createDataFrame(data, columns)
df.show()

# Output
+---------------+
|      full_name|
+---------------+
|       John Doe|
|     Jane Smith|
| Robert Johnson|
+---------------+

1. Filter DataFrame Rows using contains() in a String

The PySpark contains() method checks whether a DataFrame column string contains a string specified as an argument (matches on part of the string). This returns true if the string exists and false if not. Below example returns, all rows from DataFrame that contain string Smith on the full_name column.


# Import
from pyspark.sql.functions import col

# Specify the string to check for
substring_to_check = "Smith"

# Use filter and contains to check if the column contains the specified substring
filtered_df = df.filter(col("full_name").contains(substring_to_check))

# Show the DataFrame
filtered_df.show()

Yields below output.


# Output
+-----------+
|  full_name|
+-----------+
|Jane Smith |
+-----------+

2. PySpark SQL Query using CONTAINS Example

Alternatively, we can use the SQL Query along with the contains() to perform the DataFrmae column contained in the string. This returns a BOOLEAN. If expr or subExpr are NULL, the result is NULL. If subExpr is the empty string or empty binary, the result is true.


# SQL Syntax
contains(expr, subExpr)

Below are some scenarios and the results.

contains() exampleResult
contains(NULL, ‘Smith’)NULL
contains(‘Smith’, NULL)NULL
contains(‘Smith’, ‘mith’)True
contains(‘Smith’, ‘Mith’)False
contains(‘Smith’, ”);True
PySpark SQL contain() Truth Table

Below is an example.


# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("people")

# Use SQL to filter rows based on substring containment
result_df = spark.sql("SELECT * FROM people WHERE full_name contains(full_name,'Smith')")

# Show the result DataFrame
result_df.show()

In this example, the contains() is used in a PySpark SQL query to filter rows where the “full_name” column contains the specified substring (“Smith”).


# Output
+-----------+
|  full_name|
+-----------+
|Jane Smith |
+-----------+

3. PySpark contains() with Case-Insensitive

To filter rows by checking column contains in string by ignoring case (case-insesnitive), the lower() function can be used to transform the column to lowercase before applying contains(). This ensures that the check is insensitive to letter case, allowing for a more flexible and inclusive search.


# Syntax
from pyspark.sql.functions import lower
lower(col("column_name")).contains("value")

Where “column_name” represents the targeted column and “value” is the substring of interest.


# Import
from pyspark.sql.functions import col
from pyspark.sql.functions import lower

# Case-insensitive containment check for the substring "smith"
filtered_df = df.filter(lower(col("full_name")).contains("smith"))

# Show the DataFrame
filtered_df.show()

Yields below output.


# Output
+-----------+
|  full_name|
+-----------+
|Jane Smith |
+-----------+

4. Filter Rows Not Containing a Substring

The contains() function can be negated to filter rows not containing a specific substring by utilizing the tilde (~) operator before contains().


# Syntax
~col("column_name").contains("value")

filters rows where the “column_name” does not contain the specified substring. This negation functionality is valuable for excluding specific patterns and tailoring DataFrames to meet particular criteria, providing a concise and efficient means of refining data based on the absence of certain substrings.


# Negation - Filter rows where "full_name" does not contain the substring "Smith"
filtered_df = df.filter(~col("full_name").contains("Smith"))
filtered_df.show()

Yields below output.


# Output
+---------------+
|      full_name|
+--------------+
|       John Doe|
|Robert Johnson |
+---------------+

5. Multiple Conditions with & (AND) and | (OR)

The PySpark SQL contains() function can be combined with logical operators & (AND) and | (OR) to create complex filtering conditions based on substring containment.


# Syntax
col("column_name").contains("value1") & col("other_column").contains("value2")

The above code filters rows where both specified substrings are present. Whereas the below snippet includes rows where at least one of the substrings is found.


# Syntax
col("column_name").contains("value1") | col("other_column").contains("value2")

Example:


# Sample DataFrame
data = [("John Doe", "Engineer"), ("Jane Smith", "Manager"), ("Robert Johnson", "Analyst")]
columns = ["full_name", "job_title"]
df = spark.createDataFrame(data, columns)

# Filter rows where "full_name" contains "Smith" AND "job_title" contains "Manager"
filtered_df_and = df.filter(col("full_name").contains("Smith") & col("job_title").contains("Manager"))

# Filter rows where "full_name" contains "Johnson" OR "job_title" contains "Analyst"
filtered_df_or = df.filter(col("full_name").contains("Johnson") | col("job_title").contains("Analyst"))

# Show the DataFrames
filtered_df_and.show()
filtered_df_or.show()

Yields below output.


# Output
+-----------+---------+
|  full_name|job_title|
+-----------+---------+
|Jane Smith |  Manager|
+-----------+---------+

+--------------+-----------+
|     full_name|job_title  |
+--------------+-----------+
|Robert Johnson|   Analyst |
|    Jane Smith|    Manager|
+--------------+-----------+

In the above example, first, rows are filtered where “full_name” contains “Smith” AND “job_title” contains “Manager“. And second, rows are filtered where “full_name” contains “Johnson” OR “job_title” contains “Analyst“.

6. Column Contains in Another Column

You can check if a column contains a specific value from another column using the contains function provided by the pyspark.sql.functions module.

For example, you have a DataFrame named df with two columns, column1 and column2, and you want to check if the values in column2 are contained in the values of column1 and filter it.


# Import
from pyspark.sql.functions import col

# Sample data
data = [("Orange Street", "Street"), ("Main Street", "Blvd"), ("Alton Blvd", "Blvd")]

# Define the DataFrame schema
columns = ["FullStreet", "Street"]
df = spark.createDataFrame(data, columns)

# Use contains to check if values in Street are contained in FullStreet
result_df = df.filter(col("FullStreet").contains(col("Street")))

# Show the result
result_df.show()

Yields below output. The contains function returns a boolean value (true or false) for each row based on the containment check, results with false are ignored and results with true are returned as a new DataFrame.


# Output
+-------------+------+
|   FullStreet|Street|
+-------------+------+
|Orange Street|Street|
|   Alton Blvd|  Blvd|
+-------------+------+

Conclusion

In summary, the contains() function in PySpark is utilized for substring containment checks within DataFrame columns and it can be used to derive a new column or filter data by checking string contains in another string. It evaluates whether one string contains another, providing a boolean result for each row. The function is case-sensitive by default, and its integration into DataFrame operations facilitates efficient and expressive substring-related data manipulations.

Happy Learning !!