Pandas Filter Rows Using IN and NOT IN Like SQL

You can filter/select rows from Pandas DataFrame using IN (ISIN) and NOT IN operators like SQL by using pandas.Series.isin(), DataFrame.query() methods. In this article, I will explain how to filter a single column, how to filter multiple columns, how to filter based on conditions, and lambda functions using IN and Not IN operators with examples in pandas DataFrame.

1. Quick Examples of Using IN and NOT IN Like SQL

If you are in a hurry, below are some quick examples of how to use IN and NOT IN operators in pandas DataFrame.


# Below are some Quick examples.
# Filtering a single column with pandas isin.
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]

# To return a Boolean array.
df2=df.Courses.isin(Courses_to_keep)
print(df2)

# Filtering Multiple Columns With Pandas Isin.
df2 = df[df[['Courses', 'Fee']].isin(['Spark', 20000]).any(axis=1)]

# Filtering Using Pandas isin Not matching condition.
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]

# Filtering Pandas DataFrame by df.query() method.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")

# Filtering Pandas DataFrame by query() method perform NOT IN.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")

Now, let’s create a pandas DataFrame from Dict object and execute these examples and validate results. Our DataFrame contains column names CoursesFeeDuration, and Discount.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
    'Discount':[1000,2300,1200,2000]
              }
index_labels=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

Yields below output.


    Courses    Fee Duration  Discount
r1    Spark  20000   30days      1000
r2  PySpark  25000   40days      2300
r3   Python  22000   35days      1200
r4   pandas  30000   50days      2000

2. Filter Using isin() similar to IN in SQL

Similar to SQL IN and NOT IN operators, you can filter rows from pandas DataFrame by checking column values in a list. pandas.Series.isin() function is used to check whether the elements in Series contain specified values. It also returns a boolean Series whether each element in the Series matches an element in the passed sequence of values exactly.

For example, let’s say you wanted to select rows that contain a list (Courses_to_keep) of values in a column Courses. To accomplish this, you could write:


# Filtering a single column with pandas Isin.
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]
print(df2)

Yields below output.


   Courses    Fee Duration  Discount
r1   Spark  20000   30days      1000
r3  Python  22000   35days      1200

For a better understanding of what’s actually going on here. Let’s run df.series.isin() which returns a boolean array.


# To return a Boolean array.
df2=df.Courses.isin(Courses_to_keep)
print(df2)

# OutPut:
r1     True
r2    False
r3     True
r4    False
Name: Courses, dtype: bool

3. Filter Row By Multiple Columns With Pandas isin()

Let’s take a look at filter on multiple columns by .isin() method. It adds a little bit of complexity to the equation.


# Filtering Multiple Columns With Pandas isin().
df2 = df[df[['Courses', 'Fee']].isin(['Spark',30000]).any(axis=1)]
print(df2)

# Output:
   Courses    Fee Duration  Discount
r1   Spark  20000   30days      1000
r4  pandas  30000   50days      2000

4. Filter Using isin() Not Matching Condition

Similar to .isin() method to SQL’s IN statement, we can use the Pandas unary operator (~) to perform a NOT IN selection.


# Filtering Using pandas isin not matching condition.
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]
print(df2)

Yields below output.


    Courses    Fee Duration  Discount
r2  PySpark  25000   40days      2300
r4   pandas  30000   50days      2000

5. Filter Using IN operator and DataFrame.query()

The query() function is used to query the columns of a DataFrame with a boolean expression. In this method you can use in operator similar to SQL expression.


# Filtering Pandas DataFrame by df.query() method.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")
print(df2)

# Output:
   Courses    Fee Duration  Discount
r1   Spark  20000   30days      1000
r3  Python  22000   35days      1200

NOTE: The query() method to SQL’s query statement, we can use the Pandas unary operator (~) to perform a NOT IN selection.


# Filtering Pandas DataFrame by query() method perform NOT IN.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")
print(df2)

# Output:
    Courses    Fee Duration  Discount
r2  PySpark  25000   40days      2300
r4   pandas  30000   50days      2000

6. Filter Pandas DataFrame by Lambda Function

A lambda function is a small function containing a single expression. It can also act as anonymous functions where they don’t require any name.


# Filtering pandas DataFrame by lambda function.
Courses=["Spark","Python"]
criterion=lambda row:row['Courses']not in Courses
df=df[df.apply(criterion,axis=1)]
print(df)

Yields output same as above.

7. Complete Examples of Filter DataFrame Using IN and NOT IN


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
    'Discount':[1000,2300,1200,2000]
              }
index_labels=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

# Filtering a single column with pandas Isin.
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]

# To return a Boolean array.
df2=df.Courses.isin(Courses_to_keep)
print(df2)

# Filtering Multiple Columns With Pandas Isin.
df2 = df[df[['Courses', 'Fee']].isin(['Spark', '20000']).any(axis=1)]

# Filtering Using Pandas isin Not matching condition.
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]

# Filtering Pandas DataFrame by df.query() method.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")

# Filtering Pandas DataFrame by query() method perform NOT IN.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")

Conclusion

In this article, you have learned about pd.Series.isin(), DataFrame.query() methods to filter/select rows using IN and NOT IN operators similar to SQL. Also, you have learned how to filter using a single column, how to filter multiple columns, and how to filter based on conditions not being true and lambda functions.

You May Also Like

References

Leave a Reply

Pandas Filter Rows Using IN and NOT IN Like SQL