Pandas Filter DataFrame by Substring criteria

You can filter pandas DataFrame by substring criteria using Series.isin(), Series.str.contains()DataFrame.query() and DataFrame.apply() with Lambda function. You can also use the | and ! symbols, and the tilde (~) to negate a statement.

And also using numpy methods np.char.find(), np.vectorize(), DataFrame.query() methods.

1. Quick Examples of Filter pandas DataFrame by SubString Critera

If you are in a hurry below are some quick examples of how to select pandas DataFrame by substring.


# Below are some quick examples.
# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]

# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]

# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]

# Filter rows that match a given string in a column by isin().
df2=df[df['Courses'].isin(["Spark"])]

# Mix of upper and lowercase letters.
print(df['Courses'].str.lower().isin(['Spark']))

# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]

# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)

# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]

# Using re.escape() function to get multiple columns.
import re
df2=df[df['Courses'].str.contains('|'.join(map(re.escape, terms)))]

# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])

# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]

# Using DataFrame.filter() method.
df2=df.filter(like='Courses', axis=1)

# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]

# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)

# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]

# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]

# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]

# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])

# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')

Let’s create a pandas DataFrame with a few rows and columns, execute these examples and validate results. Our DataFrame contains column names CoursesFee and Duration.


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)

Yields below output.


   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
3   Python  24000     None
4  PySpark  26000      NaN

2. Using Series.str.contains() to Filter Rows by Substring

Series.str.contains() method in pandas allows you to search a column for a specific substring. The contains() method returns boolean values for the series with True when the original Series value contains the substring and False if not.

Note that str.contains() is a case sensitive, meaning that ‘spark‘ (all in lowercase) and ‘SPARK’ are considered different strings.


# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]
print(df2)

Yields below output.


   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
4  PySpark  26000      NaN

3. Get Rows that Contain One SubString OR Another Substring

To get all rows that contain one substring or another substring you need to use pipe symbol(|). Use str.contains() and pass the two strings separated by a pipe (|). You can have many values into the data using the symbol (|). This return True when any of the values match.


# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]
print(df2)

Yields below output.


   Courses    Fee Duration
1  PySpark  25000   50days
3   Python  24000     None
4  PySpark  26000      NaN

Note: If your column contains NA/NaN this returns an error.

4. Filter Rows by Substring From Non String Column

pandas.series.astype() function is used to cast a pandas object to a specified data type and then use the contains() method. Note that contains() method operates only on String type hence the type cast is required.


# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]
print(df2)

Yields same output as above

5. Filter Rows that Match a Given String in a Column by isin()

Series.isin( ) function is used to check whether values are contained in Series. Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly. Note that you must always include the value(s) in square brackets even if it is just one.

We then apply this series.isin() to the whole DataFrame to return the rows where the condition was True.


# Filter rows that match a given string in a column by isin().
df2=df[df['Courses'].isin(["Spark"])]
print(df2)

Yields below output.


  Courses    Fee Duration
0   Spark  22000   30days
2   Spark  23000   30days

NOTE: If you do not want to deal with a mix of upper and lowercase letters, first convert all the column’s elements into lowercase.


# Mix of upper and lowercase letters.
df['Courses'].str.lower().isin(['spark'])

6. Filter Pandas DataFrame of Multiple Columns

We can check for rows where a sub-string is present in two or more given columns by DataFrame.apply() and lambda functions. The apply() method allows you to apply a function along one of the axis of the DataFrame. Lambda functions are defined using the keyword lambda. They can have any number of arguments but only one expression.


# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)
print(df2)

Yields below output.


   Courses    Fee  Duration
0     True  False     False
1     True  False     False
2     True  False     False
3     True  False     False
4     True  False     False

NOTE: To filter pandas DataFrame of multiple columns, create a list of terms, then join them. For example-


# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]
print(df2)

Yields below output.


   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
4  PySpark  26000      NaN

We can provide a list of strings like isin([‘Spark’,’Python']) and check if a column’s elements match any of them.


# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]
print(df2)

# OutPut:
  Courses    Fee Duration
0   Spark  22000   30days
2   Spark  23000   30days
3  Python  24000     None

7. Using IN Operator to Get SubString of Pandas DataFrame

The IN operator is used to check if a value exists in a sequence or not. Evaluate to true if it finds a variable in the specified sequence and false otherwise.


# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])
print(df2)

8. Get Multiple Whole word of Pandas DataFrame

To Get Multiple whole word of pandas DataFrame  we need to add a word boundary (\b) to the join the pattern.


# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]
print(df2)

Yields below output.


   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
4  PySpark  26000      NaN
\b(?:Spark|PySpark)\b

9. By using DataFrame.loc[] and str.contains()

We can also use df.loc[] where we display all the rows but only the columns with the given sub-string.


# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]
print(df2)

10. Using DataFrame.apply() & IN Operator With lambda Function

Use df.apply()lambda, and the IN keyword to check if a certain value is present in a given sequence of strings.


# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)
print(df2)

Yields below output.


0     True
1    False
dtype: bool

11. Get all Rows That Not Contain Given SubStrings by df.loc[]

You can filter DataFrame, where rows of Courses column don’t contain Spark by using a tilde (~) to negate the statement.


# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]
print(df2)

Yields below output.


  Courses    Fee Duration
3  Python  24000     None

12. Other Examples

Using not equal operator to negate the condition.


# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]
print(df2)

Yields below output.


   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
4  PySpark  26000      NaN

Example-2: Using np.char.find() Function

The numpy.char.find() function returns the lowest index in the string for each element where substring sub is found.


# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]
print(df2)

Example-3: Using np.vectorize() Function

np.vectorized() function which takes a nested sequence of objects or numpy arrays as inputs and returns a single numpy array or a tuple of numpy arrays.


# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])
print(df2)

Example-4: Using DataFrame.query() Method

The DataFrame.query() function is used to filter the columns of a DataFrame with a boolean expression.


# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')
print(df2)

13. Complete Examples to Filter Pandas DataFrame by SubString Critera


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

# Using str.contains() method.
df2 = df[df['Courses'].str.contains("Spark")]

# Get all rows that contain one substring or another substring.
df2=df[df['Courses'].str.contains("PySpark|Python")==True]

# Filter rows of pandas DataFrame by series.astype().
df2=df[df['Courses'].astype(str).str.contains("PySpark|Python")]

# Filter rows that match a given string in a column by isin().
df2=df[df['Courses'].isin(["Spark"])]

# Mix of upper and lowercase letters.
print(df['Courses'].str.lower().isin(['Spark']))

# Using isin() to deal with two strings.
df2=df[df['Courses'].isin(["Spark","Python"])]

# Filter Pandas DataFrame of Multiple columns.
df2=df.apply(lambda col: col.str.contains('Spark|Python', na=False), axis=1)

# Join multiple terms.
terms = ['Spark', 'PySpark']
df2=df[df['Courses'].str.contains('|'.join(terms))]

# Using re.escape() function to get multiple columns.
import re
df2=df[df['Courses'].str.contains('|'.join(map(re.escape, terms)))]

# Using IN operator to get substring of Pandas DataFrame.
df2=(df[['Spark' in x for x in df['Courses']]])

# Get multiple whole word of pandas DataFrame.
terms = ['Spark', 'PySpark']
df2= r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['Courses'].str.contains(df2)]

# Using DataFrame.filter() method.
df2=df.filter(like='Courses', axis=1)

# Using df.loc method.
df2=df.loc[:, df.columns.str.contains('Courses')]

# Using DataFrame.apply() use IN operator with lambda function.
df2=df.apply(lambda x: x['Spark'] in x['Python'], axis=1)

# Get all rows that not contain given substring by df.loc[]
df2= df[~df['Courses'].str.contains('Spark|PySpark')]

# String with Special characters.
df2=df[df['Courses'].str.find("Spark") != -1]

# Using np.char.find() function.
df2=df[np.char.find(df['Courses'].values.astype(str), 'Spark') > -1]

# Using np.vectorize() function.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df['Courses'], 'Spark')
df2=(df[f(df['Courses'], 'Spark')])

# Using DataFrame.query() method.
df2=df.query('Courses.str.contains("Spark")', engine='python')

Conclusion

In this article, you have learned how to filter pandas DataFrame by substring using series.isin() and Series.str.contains() , DataFrame.apply() and Lambda functions. We also used the | and OR symbols, and the tilde (~) to negate a statement.

Related Articles

References

Leave a Reply

You are currently viewing Pandas Filter DataFrame by Substring criteria