You can filter pandas DataFrame by substring criteria using Series.isin()
, Series.str.contains()
, DataFrame.query()
and DataFrame.apply()
with Lambda
function. You can also use the | and ! symbols, and the tilde (~) to negate a statement.
And also using numpy methods np.char.find()
, np.vectorize()
, DataFrame.query()
methods.
1. Quick Examples of Filter pandas DataFrame by SubString Critera
If you are in a hurry below are some quick examples of how to select pandas DataFrame by substring.
# Below are some quick examples.
# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]
# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]
# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]
# Filter rows that match a given string in a column by isin().
df2=df[df['Courses'].isin(["Spark"])]
# Mix of upper and lowercase letters.
print(df['Courses'].str.lower().isin(['Spark']))
# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]
# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)
# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]
# Using re.escape() function to get multiple columns.
import re
df2=df[df['Courses'].str.contains('|'.join(map(re.escape, terms)))]
# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])
# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]
# Using DataFrame.filter() method.
df2=df.filter(like='Courses', axis=1)
# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]
# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)
# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]
# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]
# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]
# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])
# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')
Let’s create a pandas DataFrame with a few rows and columns, execute these examples and validate results. Our DataFrame contains column names Courses
, Fee
and Duration
.
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
Yields below output.
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
3 Python 24000 None
4 PySpark 26000 NaN
2. Using Series.str.contains() to Filter Rows by Substring
Series.str.contains()
method in pandas allows you to search a column for a specific substring. The contains()
method returns boolean values for the series with True when the original Series value contains the substring and False if not.
Note that str.contains()
is a case sensitive, meaning that ‘spark‘ (all in lowercase) and ‘SPARK’ are considered different strings.
# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]
print(df2)
Yields below output.
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
4 PySpark 26000 NaN
3. Get Rows that Contain One SubString OR Another Substring
To get all rows that contain one substring or
another substring you need to use pipe
symbol(|)
. Use str.contains()
and pass the two strings separated by a pipe (|)
. You can have many values into the data using the symbol (|). This return True
when any of the values match.
# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]
print(df2)
Yields below output.
Courses Fee Duration
1 PySpark 25000 50days
3 Python 24000 None
4 PySpark 26000 NaN
Note: If your column contains NA/NaN this returns an error.
4. Filter Rows by Substring From Non String Column
pandas.series.astype() function is used to cast a pandas object to a specified data type and then use the contains()
method. Note that contains() method operates only on String type hence the type cast is required.
# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]
print(df2)
Yields same output as above
5. Filter Rows that Match a Given String in a Column by isin()
Series.isin( )
function is used to check whether values are contained in Series
. Return a boolean Series
showing whether each element in the Series matches an element in the passed sequence of values exactly. Note that you must always include the value(s) in square brackets even if it is just one.
We then apply this series.isin()
to the whole DataFrame to return the rows where the condition was True
.
# Filter rows that match a given string in a column by isin().
df2=df[df['Courses'].isin(["Spark"])]
print(df2)
Yields below output.
Courses Fee Duration
0 Spark 22000 30days
2 Spark 23000 30days
NOTE: If you do not want to deal with a mix of upper and lowercase letters, first convert all the column’s elements into lowercase.
# Mix of upper and lowercase letters.
df['Courses'].str.lower().isin(['spark'])
6. Filter Pandas DataFrame of Multiple Columns
We can check for rows where a sub-string is present in two or more given columns by DataFrame.apply()
and lambda
functions. The apply()
method allows you to apply a function along one of the axis of the DataFrame. Lambda
functions are defined using the keyword lambda. They can have any number of arguments but only one expression.
# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)
print(df2)
Yields below output.
Courses Fee Duration
0 True False False
1 True False False
2 True False False
3 True False False
4 True False False
NOTE: To filter pandas DataFrame of multiple columns, create a list of terms, then join
them. For example-
# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]
print(df2)
Yields below output.
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
4 PySpark 26000 NaN
We can provide a list of strings like isin([‘Spark’,’Python'])
and check if a column’s elements match any of them.
# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]
print(df2)
# OutPut:
Courses Fee Duration
0 Spark 22000 30days
2 Spark 23000 30days
3 Python 24000 None
7. Using IN Operator to Get SubString of Pandas DataFrame
The IN
operator is used to check if a value exists in a sequence or not. Evaluate to true if it finds a variable in the specified sequence and false otherwise.
# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])
print(df2)
8. Get Multiple Whole word of Pandas DataFrame
To Get Multiple whole word of pandas DataFrame we need to add a word boundary (\b
) to the join the pattern.
# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]
print(df2)
Yields below output.
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
4 PySpark 26000 NaN
\b(?:Spark|PySpark)\b
9. By using DataFrame.loc[] and str.contains()
We can also use df.loc[]
where we display all the rows but only the columns with the given sub-string.
# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]
print(df2)
10. Using DataFrame.apply() & IN Operator With lambda Function
Use df.apply()
, lambda
, and the IN
keyword to check if a certain value is present in a given sequence of strings.
# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)
print(df2)
Yields below output.
0 True
1 False
dtype: bool
11. Get all Rows That Not Contain Given SubStrings by df.loc[]
You can filter DataFrame, where rows of Courses column don’t contain Spark
by using a tilde (~) to negate the statement.
# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]
print(df2)
Yields below output.
Courses Fee Duration
3 Python 24000 None
12. Other Examples
Using not equal operator to negate the condition.
# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]
print(df2)
Yields below output.
Courses Fee Duration
0 Spark 22000 30days
1 PySpark 25000 50days
2 Spark 23000 30days
4 PySpark 26000 NaN
Example-2: Using np.char.find() Function
The numpy.char.find()
function returns the lowest index
in the string for each element where substring sub is found.
# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]
print(df2)
Example-3: Using np.vectorize() Function
np.vectorized()
function which takes a nested sequence of objects or numpy
arrays as inputs and returns a single numpy array or a tuple of numpy arrays.
# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])
print(df2)
Example-4: Using DataFrame.query() Method
The DataFrame.query() function is used to filter the columns of a DataFrame with a boolean expression.
# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')
print(df2)
13. Complete Examples to Filter Pandas DataFrame by SubString Critera
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]
# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]
# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]
# Filter rows that match a given string in a column by isin().
df2=df[df['Courses'].isin(["Spark"])]
# Mix of upper and lowercase letters.
print(df['Courses'].str.lower().isin(['Spark']))
# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]
# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)
# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]
# Using re.escape() function to get multiple columns.
import re
df2=df[df['Courses'].str.contains('|'.join(map(re.escape, terms)))]
# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])
# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]
# Using DataFrame.filter() method.
df2=df.filter(like='Courses', axis=1)
# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]
# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)
# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]
# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]
# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]
# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])
# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')
Conclusion
In this article, you have learned how to filter pandas DataFrame by substring using series.isin()
and Series.str.contains()
, DataFrame.apply()
and Lambda
functions. We also used the | and OR symbols, and the tilde (~) to negate a statement.
Related Articles
- Pandas Find Unique Values From Columns
- Convert Row to Column Header in Pandas DataFrame
- Retrieve Number of Rows From Pandas DataFrame
- Replace Column value in Pandas DataFrame
- Rename Specific Columns in Pandas
- Pandas Filter DataFrame by Multiple Conditions
- Get First N Rows of Pandas DataFrame
- Pandas Extract Month and Year from Datetime
- Pandas Extract Column Value Based on Another Column