You can filter pandas DataFrame by substring criteria using Series.isin()
, Series.str.contains()
, DataFrame.query()
and DataFrame.apply()
with Lambda
function. You can also use the | and ! symbols, and the tilde (~) to negate a statement.
And also using numpy methods np.char.find()
, np.vectorize()
, DataFrame.query()
methods.
Key Points –
- Use the
str.contains()
a method in Pandas to filter a DataFrame based on substring criteria within a specific column. - The
str.contains()
method creates a boolean mask, where each element in the specified column is checked for the presence of the given substring. - The
case
parameter can be set toFalse
withinstr.contains()
for a case-insensitive substring search, ensuring matches regardless of letter case. - Filtering a DataFrame using substring criteria can have performance implications, especially on large datasets. Consider the efficiency of your chosen substring search, and be mindful of potential computational costs.
- Pandas’
str.contains()
method supports regular expressions for advanced string matching. Explore regular expression patterns for more complex and flexible substring filtering options when needed.
1. Quick Examples of Filter Pandas DataFrame by SubString Criteria
If you are in a hurry below are some quick examples of how to select Pandas DataFrame by substring.
# Below are some quick examples.
# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]
# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]
# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]
# Filter rows that match a given string in a column by isin().
df2=df[df['Courses'].isin(["Spark"])]
# Mix of upper and lowercase letters.
print(df['Courses'].str.lower().isin(['Spark']))
# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]
# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)
# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]
# Using re.escape() function to get multiple columns.
import re
df2=df[df['Courses'].str.contains('|'.join(map(re.escape, terms)))]
# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])
# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]
# Using DataFrame.filter() method.
df2=df.filter(like='Courses', axis=1)
# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]
# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)
# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]
# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]
# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]
# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])
# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')
Let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate results. Our DataFrame contains column names Courses
, Fee
and Duration
.
# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
3 Python 24000 None
4 PySpark 26000 NaN
2. Using Series.str.contains() to Filter Rows by Substring
Series.str.contains()
method in pandas allows you to search a column for a specific substring. The contains()
method returns boolean values for the series with True when the original Series value contains the substring and False if not.
Note that str.contains()
is a case sensitive, meaning that ‘spark‘ (all in lowercase) and ‘SPARK’ are considered different strings.
# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]
print(df2)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
4 PySpark 26000 NaN
3. Get Rows that Contain One SubString OR Another Substring
To get all rows that contain one substring or
another substring you need to use pipe
symbol(|)
. Use str.contains()
and pass the two strings separated by a pipe (|)
. You can have many values into the data using the symbol (|
). This returns True
when any of the values match.
# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]
print(df2)
Yields below output.
# Output:
Courses Fee Duration
1 PySpark 25000 50days
3 Python 24000 None
4 PySpark 26000 NaN
Note: If your column contains NA/NaN this returns an error.
4. Filter Rows by Substring From Non-String Column
pandas.series.astype() function is used to cast a pandas object to a specified data type and then use the contains()
method. Note that contains()
method operates only on String type hence the type cast is required.
# Filter rows of pandas DataFrame by series.astype()
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]
print(df2)
Yields the same output as above
5. Filter Rows that Match a Given String in a Column by isin()
Series.isin()
function is used to check whether values are contained in Series
. Return a boolean Series
showing whether each element in the Series matches an element in the passed sequence of values exactly. Note that you must always include the value(s) in square brackets even if it is just one.
We then apply this series.isin()
to the whole DataFrame to return the rows where the condition was True
.
# Filter rows that match a given string in a column by isin()
df2=df[df['Courses'].isin(["Spark"])]
print(df2)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 22000 30days
2 Spark 23000 30days
NOTE: If you do not want to deal with a mix of upper and lowercase letters, first convert all the column’s elements into lowercase.
# Mix of upper and lowercase letters.
df['Courses'].str.lower().isin(['spark'])
6. Filter Pandas DataFrame of Multiple Columns
We can check for rows where a sub-string is present in two or more given columns by DataFrame.apply()
and lambda
functions. The apply()
method allows you to apply a function along one of the axis of the DataFrame. Lambda
functions are defined using the keyword lambda. They can have any number of arguments but only one expression.
# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)
print(df2)
Yields below output.
# Output:
Courses Fee Duration
0 True False False
1 True False False
2 True False False
3 True False False
4 True False False
NOTE: To filter pandas DataFrame of multiple columns, create a list of terms, then join
them. For example-
# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]
print(df2)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
4 PySpark 26000 NaN
We can provide a list of strings like isin([‘Spark’,’Python'])
and check if a column’s elements match any of them.
# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]
print(df2)
# Output:
# Courses Fee Duration
# 0 Spark 22000 30days
# 2 Spark 23000 30days
# 3 Python 24000 None
7. Using IN Operator to Get SubString of Pandas DataFrame
The IN
operator is used to check if a value exists in a sequence or not. Evaluate to true if it finds a variable in the specified sequence and false otherwise.
# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])
print(df2)
8. Get Multiple Whole word of Pandas DataFrame
To Get Multiple whole word of pandas DataFrame we need to add a word boundary (\b
) to the join the pattern.
# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]
print(df2)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
4 PySpark 26000 NaN
\b(?:Spark|PySpark)\b
9. By using DataFrame.loc[] and str.contains()
We can also use df.loc[]
where we display all the rows but only the columns with the given sub-string.
# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]
print(df2)
10. Using DataFrame.apply() & IN Operator With lambda Function
Use df.apply()
, lambda
, and the IN
keyword to check if a certain value is present in a given sequence of strings.
# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)
print(df2)
Yields below output.
# Output:
0 True
1 False
dtype: bool
11. Get all Rows That Not Contain Given SubStrings by df.loc[]
You can filter DataFrame, where rows of Courses column don’t contain Spark
by using a tilde (~
) to negate the statement.
# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]
print(df2)
Yields below output.
# Output:
Courses Fee Duration
3 Python 24000 None
12. Other Examples
Using not equal operator to negate the condition.
# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]
print(df2)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
4 PySpark 26000 NaN
Example-2: Using np.char.find() Function
The numpy.char.find()
function returns the lowest index
in the string for each element where substring sub is found.
# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]
print(df2)
Example-3: Using np.vectorize() Function
np.vectorized()
function which takes a nested sequence of objects or numpy
arrays as inputs and returns a single numpy array or a tuple of numpy arrays.
# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])
print(df2)
Example-4: Using DataFrame.query() Method
The DataFrame.query() function is used to filter the columns of a DataFrame with a boolean expression.
# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')
print(df2)
13. Complete Examples
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]
# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]
# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]
# Filter rows that match a given string in a column by isin()
df2=df[df['Courses'].isin(["Spark"])]
# Mix of upper and lowercase letters
print(df['Courses'].str.lower().isin(['Spark']))
# Using isin() to deal with two strings
df2=df[df['Courses'].isin(["Spark","Python"])]
# Filter Pandas DataFrame of Multiple columns
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)
# Join multiple terms
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]
# Using re.escape() function to get multiple columns.
import re
df2=df[df['Courses'].str.contains('|'.join(map(re.escape, terms)))]
# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])
# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]
# Using DataFrame.filter() method.
df2=df.filter(like='Courses', axis=1)
# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]
# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)
# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]
# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]
# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]
# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])
# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')
Frequently Asked Questions on Filter DataFrame by Substring Criteria
Use the str.contains()
method on the desired column. This method creates a boolean mask, and you can then apply it to the DataFrame to filter rows based on the substring criterion.
By default, the substring search using str.contains()
in Pandas is case-sensitive. This means that if you’re searching for a specific substring, the method will distinguish between uppercase and lowercase letters. If you want to perform a case-insensitive search, you can use the case
parameter and set it to False
.
You can use logical operators like &
(AND) or |
(OR) to combine multiple substring conditions. This allows you to create more complex filters based on various substring criteria.
If you want to use advanced string patterns for filtering in Pandas, you can leverage regular expressions. The str.contains()
method in Pandas supports regular expressions for more intricate and flexible substring matching. Regular expressions allow you to define complex patterns for text matching.
Missing values (NaN) can be handled using the na
parameter within the str.contains()
method. Set na=False
to treat missing values as False
during the substring search, and set na=True
to consider them as True
.
For exact matches, consider using the ==
equality operator rather than str.contains()
. This ensures that only rows with the exact specified substring are included in the filtered DataFrame.
Conclusion
In this article, you have learned how to filter pandas DataFrame by substring criteria using <code>series.isin()
, Series.str.contains()
, DataFrame.apply()
and Lambda
functions. We also used the |
and OR
symbols, and the tilde (~
) to negate a statement.
Related Articles
- Pandas Filter by Index
- Pandas Series filter() Function
- Pandas filter by column value
- How to Use NOT IN Filter in Pandas
- Pandas Filter DataFrame Rows on Dates
- Pandas Filter Rows Using IN Like SQL
- Pandas Filter DataFrame Rows on Dates
- Pandas Filter DataFrame by Multiple Conditions
- pandas DataFrame filter() – Usage & Examples
- Apply Multiple Filters to Pandas DataFrame or Series
- Pandas Operator Chaining to Filter DataFrame Rows
- How to Union Pandas DataFrames using Concat?
- Pandas Filter Rows with NAN Value from DataFrame Column