• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:10 mins read
You are currently viewing Pandas Filter Rows Using IN Like SQL

You can filter/select rows from Pandas DataFrame using IN (ISIN) operator like SQL by using pandas.Series.isin(), DataFrame.query() methods. In this article, I will explain how to filter a single column, how to filter multiple columns, how to filter based on conditions, and lambda functions using IN operator with examples in Pandas DataFrame.

Advertisements

1. Quick Examples of Using IN Like SQL

If you are in a hurry, below are some quick examples of how to use IN operator in pandas DataFrame.


# Below are some Quick examples.

# Filtering a single column with pandas isin.
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]

# To return a Boolean array.
df2=df.Courses.isin(Courses_to_keep)
print(df2)

# Filtering Multiple Columns With Pandas Isin.
df2 = df[df[['Courses', 'Fee']].isin(['Spark', 20000]).any(axis=1)]

# Filtering Using Pandas isin Not matching condition.
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]

# Filtering Pandas DataFrame by df.query() method.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")

# Filtering Pandas DataFrame by query() method perform NOT IN.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")

Now, let’s create a pandas DataFrame from Dict object and execute these examples and validate the result. Our DataFrame contains column names CoursesFeeDuration, and Discount.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
    'Discount':[1000,2300,1200,2000]
              }
index_labels=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

Yields below output.


# Output:
    Courses    Fee Duration  Discount
r1    Spark  20000   30days      1000
r2  PySpark  25000   40days      2300
r3   Python  22000   35days      1200
r4   pandas  30000   50days      2000

2. Filter Using isin() similar to IN in SQL

Similar to SQL IN operators, you can filter rows from pandas DataFrame by checking column values in a list. pandas.Series.isin() function is used to check whether the elements in Series contain specified values. It also returns a boolean Series whether each element in the Series matches an element in the passed sequence of values exactly.

For example, let’s say you wanted to select rows that contain a list (Courses_to_keep) of values in a column Courses. To accomplish this, you could write:


# Filtering a single column with pandas Isin.
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
r1   Spark  20000   30days      1000
r3  Python  22000   35days      1200

For a better understanding of what’s actually going on here. Let’s run df.series.isin() which returns a boolean array.


# To return a Boolean array.
df2=df.Courses.isin(Courses_to_keep)
print(df2)

# Output:
# r1     True
# r2    False
# r3     True
# r4    False
# Name: Courses, dtype: bool

3. Filter Row By Multiple Columns With Pandas isin()

Let’s take a look at filter on multiple columns by .isin() method. It adds a little bit of complexity to the equation.


# Filtering Multiple Columns With Pandas isin().
df2 = df[df[['Courses', 'Fee']].isin(['Spark',30000]).any(axis=1)]
print(df2)

# Output:
#   Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
# r4  pandas  30000   50days      2000

4. Filter Using isin() Not Matching Condition

Similar to .isin() method to SQL’s IN statement, we can use the Pandas unary operator (~) to perform a NOT IN selection.


# Filtering Using pandas isin not matching condition.
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]
print(df2)

Yields below output.


# Output:
    Courses    Fee Duration  Discount
r2  PySpark  25000   40days      2300
r4   pandas  30000   50days      2000

5. Filter Using IN operator and DataFrame.query()

The Pandas DataFrame.query() function is used to query the columns of a DataFrame with a boolean expression. In this method you can use in operator similar to SQL expression.


# Filtering Pandas DataFrame by df.query() method.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")
print(df2)

# Output:
#    Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
# r3  Python  22000   35days      1200

NOTE: The query() method is similar to SQL’s query statement, we can also use the Pandas unary operator (~) to perform a NOT IN selection.


# Filtering Pandas DataFrame by query() method perform NOT IN.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")
print(df2)

# Output:
#     Courses    Fee Duration  Discount
# r2  PySpark  25000   40days      2300
# r4   pandas  30000   50days      2000

6. Filter Pandas DataFrame by Lambda Function

A lambda function is a small function containing a single expression. It can also act as anonymous functions where they don’t require any name.


# Filtering pandas DataFrame by lambda function.
Courses=["Spark","Python"]
criterion=lambda row:row['Courses']not in Courses
df=df[df.apply(criterion,axis=1)]
print(df)

Yields output same as above.

7. Complete Examples of Filter DataFrame Using IN


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
    'Discount':[1000,2300,1200,2000]
              }
index_labels=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

# Filtering a single column with pandas Isin.
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]

# To return a Boolean array.
df2=df.Courses.isin(Courses_to_keep)
print(df2)

# Filtering Multiple Columns With Pandas Isin.
df2 = df[df[['Courses', 'Fee']].isin(['Spark', '20000']).any(axis=1)]

# Filtering Using Pandas isin Not matching condition.
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]

# Filtering Pandas DataFrame by df.query() method.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")

# Filtering Pandas DataFrame by query() method perform NOT IN.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")

Conclusion

In this article, you have learned about pd.Series.isin(), DataFrame.query() methods to filter/select rows using IN operator similar to SQL. Also, you have learned how to filter using a single column, how to filter multiple columns, and how to filter based on conditions not being true and lambda functions.

References

Leave a Reply