
PySpark Filter by Case-Insensitive Examples


In PySpark, to filter the rows of a DataFrame case-insensitively (ignoring case), you can use the lower() or upper() function to convert the column values to lowercase or uppercase, respectively, and then apply the filter() or where() condition. These functions are particularly useful when you want to standardize the case of string data for comparison purposes, allowing you to perform case-insensitive searches, comparisons, and filters in PySpark.

Alternatively, you can use the PySpark ilike() function directly for case-insensitive matching. The ilike() function performs case-insensitive pattern matching on string columns; it is similar to the like() function but ignores case.

Key Points on Case-Insensitive Filtering

- Use lower() or upper() to normalize the case of a string column before comparing it with a value of the same case.
- Use ilike() for case-insensitive pattern matching; it works like like() but ignores case.
- In Spark SQL, use LOWER()/UPPER() with the = or LIKE operator, or use the ILIKE operator directly.

To begin with an example, let's prepare some sample data and create a DataFrame.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a Spark session
spark = SparkSession.builder.appName("SparkByExamples").getOrCreate()

# Create a sample DataFrame
data = [("Ricky", 10), ("Bunny", 15), ("Coco", 8), ("Pippy", 20)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
df.show()

Output.


# Output
+-----+---+
| name|age|
+-----+---+
|Ricky| 10|
|Bunny| 15|
| Coco|  8|
|Pippy| 20|
+-----+---+

1. Filtering Rows Case-Insensitively (Ignoring Case)

Case-insensitive filtering in PySpark is achieved by using functions like lower() or upper() in conjunction with string comparisons. Let's look at these functions and see how they enable case-insensitive filtering, which lets you perform string operations and filters without worrying whether characters are upper or lower case.

1.1 Using lower()

The lower() function takes a column containing strings as input and returns a new column where all the characters in each string are converted to lowercase. This transformation is valuable when you want to standardize the case of string data, allowing for case-insensitive comparisons, sorting, or filtering in subsequent DataFrame operations.

The code snippet below demonstrates applying lower() inside filter() to match strings case-insensitively (ignoring case).


# Use lower() with filter() for a case-insensitive match
from pyspark.sql.functions import lower
df_filtered = df.filter(lower("name") == "ricky")

# Display the DataFrame
df_filtered.show()

In the above code, we applied a filter on the “name” column using the lower() function. As a result, the df_filtered DataFrame contains the row whose lowercased “name” equals “ricky”, i.e., the “Ricky” row.

Below is the output.


# Output
+-----+---+
| name|age|
+-----+---+
|Ricky| 10|  
+-----+---+
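
The lower() function also composes with PySpark's pattern-matching helpers. As a minimal sketch using the same sample DataFrame (the variable name df_sub is just for illustration), lowercasing the column before a contains() check gives a case-insensitive substring match:


# Use lower() with contains() for a case-insensitive substring match
from pyspark.sql.functions import col, lower
df_sub = df.filter(lower(col("name")).contains("co"))

# Returns the row with name "Coco"
df_sub.show()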

1.2 Using upper()

The PySpark upper() function is used to convert the characters in a string column to uppercase. It takes a column containing strings as input and returns a new column where all the characters are converted to uppercase. This allows for uniform representation of string data, facilitating case-insensitive operations in analytical and processing tasks within a PySpark application.

The code snippet below demonstrates applying upper() inside filter() to match strings case-insensitively (ignoring case).


# Use upper() for a case-insensitive match
from pyspark.sql.functions import upper
df2 = df.filter(upper("name") == "RICKY")
df2.show()

Yields below output.


# Output
+-----+---+
| name|age|
+-----+---+
|Ricky| 10|  
+-----+---+
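
The same idea extends to matching several values at once. The sketch below (an illustrative extension, not part of the original example set) lowercases the column and compares it against lowercase values with isin():


# Use lower() with isin() for a case-insensitive match on multiple values
from pyspark.sql.functions import col, lower
df_multi = df.filter(lower(col("name")).isin("ricky", "coco"))

# Returns the rows with names "Ricky" and "Coco"
df_multi.show()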

1.3 Using ilike() with wildcard characters

The PySpark ilike() function can also be used to filter DataFrame rows case-insensitively based on a string or pattern match.


# Use where() with ilike() to get rows where the 'name' column contains 'pip' (any case)
df_filtered = df.where(col("name").ilike("%pip%"))
df_filtered.show()

The above code returns the rows where the “name” column matches the pattern “pip” regardless of case. Since ilike() is case-insensitive, it returns the row with “Pippy” in the “name” column, as shown below.


# Output
+-----+---+
| name|age|
+-----+---+
|Pippy| 20|
+-----+---+

Using ilike() with filter() yields the same result as using it with where(). Below is the code to achieve the same.


# Use filter() and ilike() to get rows where the 'name' column contains 'pip' (any case)
df_filtered = df.filter(col("name").ilike("%pip%"))
df_filtered.show()
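
For contrast, the plain like() operator is case-sensitive, so the same lowercase pattern finds nothing in this data, while ilike() still matches “Pippy”:


# like() is case-sensitive: 'pip' (lowercase) does not occur in "Pippy",
# so this returns an empty result
df.filter(col("name").like("%pip%")).show()

# ilike() ignores case and returns the "Pippy" row
df.filter(col("name").ilike("%pip%")).show()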

2. PySpark SQL Query to Filter Rows Case-Insensitively

If you want to use a SQL query in PySpark to filter the DataFrame rows case-insensitively, you can use the LOWER() or UPPER() function as described above with the equal (=) or LIKE operator to compare.


# Register the DataFrame as a temporary table
df.createOrReplaceTempView("names_table")

# Use Spark SQL to filter rows where the 'name' column contains 'pippy' (case-insensitive)
sql_query = """
   SELECT * FROM names_table WHERE LOWER(name) = 'pippy'
"""

df_filtered = spark.sql(sql_query)
df_filtered.show()

Yields below output.


# Output
+-----+---+
| name|age|
+-----+---+
|Pippy| 20|
+-----+---+

Similarly, you can also use the LIKE or ILIKE operators.


sql_query = """
   SELECT * FROM names_table WHERE LOWER(name) LIKE 'pippy'
"""

# or

sql_query = """
   SELECT * FROM names_table WHERE name ILIKE 'pippy'
"""

Conclusion

In this PySpark article, you have learned how to filter DataFrame rows case-insensitively (ignoring case) by converting the column values to lowercase or uppercase using the lower() and upper() functions, respectively, and comparing against a value of the same case, as well as by using the ilike() function and the SQL ILIKE operator for case-insensitive pattern matching.

Happy Learning !!
