Pandas Filter DataFrame by Substring Criteria

  • Post author:
  • Post category:Pandas
  • Post last modified:January 28, 2024
  • Reading time:26 mins read

You can filter pandas DataFrame by substring criteria using Series.isin(), Series.str.contains()DataFrame.query() and DataFrame.apply() with Lambda function. You can also use the | and ! symbols, and the tilde (~) to negate a statement.

And also using numpy methods np.char.find(), np.vectorize(), DataFrame.query() methods.

Key Points –

  • Use the str.contains() a method in Pandas to filter a DataFrame based on substring criteria within a specific column.
  • The str.contains() method creates a boolean mask, where each element in the specified column is checked for the presence of the given substring.
  • The case parameter can be set to False within str.contains() for a case-insensitive substring search, ensuring matches regardless of letter case.
  • Filtering a DataFrame using substring criteria can have performance implications, especially on large datasets. Consider the efficiency of your chosen substring search, and be mindful of potential computational costs.
  • Pandas’ str.contains() method supports regular expressions for advanced string matching. Explore regular expression patterns for more complex and flexible substring filtering options when needed.

1. Quick Examples of Filter Pandas DataFrame by SubString Criteria

If you are in a hurry below are some quick examples of how to select Pandas DataFrame by substring.


# Below are some quick examples.

# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]

# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]

# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]

# Filter rows that match a given string in a column by isin().
df2=df[df['Courses'].isin(["Spark"])]

# Mix of upper and lowercase letters.
print(df['Courses'].str.lower().isin(['Spark']))

# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]

# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)

# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]

# Using re.escape() function to get multiple columns.
import re
df2=df[df['Courses'].str.contains('|'.join(map(re.escape, terms)))]

# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])

# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]

# Using DataFrame.filter() method.
df2=df.filter(like='Courses', axis=1)

# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]

# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)

# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]

# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]

# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]

# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])

# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')

Let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate results. Our DataFrame contains column names CoursesFee and Duration.


# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)

Yields below output.


# Output:
   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
3   Python  24000     None
4  PySpark  26000      NaN

2. Using Series.str.contains() to Filter Rows by Substring

Series.str.contains() method in pandas allows you to search a column for a specific substring. The contains() method returns boolean values for the series with True when the original Series value contains the substring and False if not.

Note that str.contains() is a case sensitive, meaning that ‘spark‘ (all in lowercase) and ‘SPARK’ are considered different strings.


# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
4  PySpark  26000      NaN

3. Get Rows that Contain One SubString OR Another Substring

To get all rows that contain one substring or another substring you need to use pipe symbol(|). Use str.contains() and pass the two strings separated by a pipe (|). You can have many values into the data using the symbol (|). This returns True when any of the values match.


# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration
1  PySpark  25000   50days
3   Python  24000     None
4  PySpark  26000      NaN

Note: If your column contains NA/NaN this returns an error.

4. Filter Rows by Substring From Non-String Column

pandas.series.astype() function is used to cast a pandas object to a specified data type and then use the contains() method. Note that contains() method operates only on String type hence the type cast is required.


# Filter rows of pandas DataFrame by series.astype()
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]
print(df2)

Yields the same output as above

5. Filter Rows that Match a Given String in a Column by isin()

Series.isin() function is used to check whether values are contained in Series. Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly. Note that you must always include the value(s) in square brackets even if it is just one.

We then apply this series.isin() to the whole DataFrame to return the rows where the condition was True.


# Filter rows that match a given string in a column by isin()
df2=df[df['Courses'].isin(["Spark"])]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration
0   Spark  22000   30days
2   Spark  23000   30days

NOTE: If you do not want to deal with a mix of upper and lowercase letters, first convert all the column’s elements into lowercase.


# Mix of upper and lowercase letters.
df['Courses'].str.lower().isin(['spark'])

6. Filter Pandas DataFrame of Multiple Columns

We can check for rows where a sub-string is present in two or more given columns by DataFrame.apply() and lambda functions. The apply() method allows you to apply a function along one of the axis of the DataFrame. Lambda functions are defined using the keyword lambda. They can have any number of arguments but only one expression.


# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)
print(df2)

Yields below output.


# Output:
   Courses    Fee  Duration
0     True  False     False
1     True  False     False
2     True  False     False
3     True  False     False
4     True  False     False

NOTE: To filter pandas DataFrame of multiple columns, create a list of terms, then join them. For example-


# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
4  PySpark  26000      NaN

We can provide a list of strings like isin([‘Spark’,’Python']) and check if a column’s elements match any of them.


# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]
print(df2)

# Output:
#  Courses    Fee Duration
# 0   Spark  22000   30days
# 2   Spark  23000   30days
# 3  Python  24000     None

7. Using IN Operator to Get SubString of Pandas DataFrame

The IN operator is used to check if a value exists in a sequence or not. Evaluate to true if it finds a variable in the specified sequence and false otherwise.


# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])
print(df2)

8. Get Multiple Whole word of Pandas DataFrame

To Get Multiple whole word of pandas DataFrame  we need to add a word boundary (\b) to the join the pattern.


# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
4  PySpark  26000      NaN
\b(?:Spark|PySpark)\b

9. By using DataFrame.loc[] and str.contains()

We can also use df.loc[] where we display all the rows but only the columns with the given sub-string.


# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]
print(df2)

10. Using DataFrame.apply() & IN Operator With lambda Function

Use df.apply()lambda, and the IN keyword to check if a certain value is present in a given sequence of strings.


# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)
print(df2)

Yields below output.


# Output:
0     True
1    False
dtype: bool

11. Get all Rows That Not Contain Given SubStrings by df.loc[]

You can filter DataFrame, where rows of Courses column don’t contain Spark by using a tilde (~) to negate the statement.


# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration
3  Python  24000     None

12. Other Examples

Using not equal operator to negate the condition.


# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
4  PySpark  26000      NaN

Example-2: Using np.char.find() Function

The numpy.char.find() function returns the lowest index in the string for each element where substring sub is found.


# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]
print(df2)

Example-3: Using np.vectorize() Function

np.vectorized() function which takes a nested sequence of objects or numpy arrays as inputs and returns a single numpy array or a tuple of numpy arrays.


# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])
print(df2)

Example-4: Using DataFrame.query() Method

The DataFrame.query() function is used to filter the columns of a DataFrame with a boolean expression.


# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')
print(df2)

13. Complete Examples


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]

# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]

# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]

# Filter rows that match a given string in a column by isin()
df2=df[df['Courses'].isin(["Spark"])]

# Mix of upper and lowercase letters
print(df['Courses'].str.lower().isin(['Spark']))

# Using isin() to deal with two strings
df2=df[df['Courses'].isin(["Spark","Python"])]

# Filter Pandas DataFrame of Multiple columns
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)

# Join multiple terms
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]

# Using re.escape() function to get multiple columns.
import re
df2=df[df['Courses'].str.contains('|'.join(map(re.escape, terms)))]

# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])

# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]

# Using DataFrame.filter() method.
df2=df.filter(like='Courses', axis=1)

# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]

# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)

# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]

# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]

# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]

# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])

# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')

Frequently Asked Questions on Filter DataFrame by Substring Criteria

How can I filter a Pandas DataFrame based on a substring in a specific column?

Use the str.contains() method on the desired column. This method creates a boolean mask, and you can then apply it to the DataFrame to filter rows based on the substring criterion.

Is the substring search case-sensitive?

By default, the substring search using str.contains() in Pandas is case-sensitive. This means that if you’re searching for a specific substring, the method will distinguish between uppercase and lowercase letters. If you want to perform a case-insensitive search, you can use the case parameter and set it to False.

Can I filter based on multiple substrings?

You can use logical operators like & (AND) or | (OR) to combine multiple substring conditions. This allows you to create more complex filters based on various substring criteria.

What if I want to use advanced string patterns for filtering?

If you want to use advanced string patterns for filtering in Pandas, you can leverage regular expressions. The str.contains() method in Pandas supports regular expressions for more intricate and flexible substring matching. Regular expressions allow you to define complex patterns for text matching.

How do I handle missing values during substring filtering?

Missing values (NaN) can be handled using the na parameter within the str.contains() method. Set na=False to treat missing values as False during the substring search, and set na=True to consider them as True.

What if I want to filter based on exact matches, not substrings?

For exact matches, consider using the == equality operator rather than str.contains(). This ensures that only rows with the exact specified substring are included in the filtered DataFrame.

Conclusion

In this article, you have learned how to filter pandas DataFrame by substring criteria using <code>series.isin()Series.str.contains() , DataFrame.apply() and Lambda functions. We also used the | and OR symbols, and the tilde (~) to negate a statement.

References

Leave a Reply