• Post author:
  • Post category:Pandas
  • Post last modified:December 11, 2024
  • Reading time:17 mins read
You are currently viewing Pandas Filter Rows Using IN Like SQL

You can filter/select rows from Pandas DataFrame using IN (ISIN) operator like SQL by using pandas.Series.isin(), DataFrame.query() methods. In this article, I will explain how to filter a single column, how to filter multiple columns, how to filter based on conditions, and lambda functions using IN operator with examples in Pandas DataFrame.

Advertisements

Key Points –

  • Pandas provides the isin() method to filter rows based on whether the values in a column are part of a specified list or array, mimicking the SQL IN clause.
  • The isin() method can be applied directly to a single column to filter rows where the column’s values match any value in a given list.
  • You can filter rows based on multiple columns using isin() on each column and combining them with logical conditions.
  • The query() method allows you to use IN-like filtering within string expressions, supporting dynamic and complex conditions.
  • For filtering rows across multiple columns, you can combine isin() with .any(axis=1) or .all(axis=1) to check if any or all columns match the condition.
  • In query(), you can reference external Python variables in the query string by prefixing them with the @ symbol, making the query more dynamic and flexible.

Quick Examples of Using IN Like SQL

If you are in a hurry, below are some quick examples of how to use IN operator in pandas DataFrame.


# Quick examples of using iN like SQL

# Filtering a single column with pandas isin
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]

# To return a boolean array
df2=df.Courses.isin(Courses_to_keep)

# Filtering Multiple Columns With Pandas Isin
df2 = df[df[['Courses', 'Fee']].isin(['Spark', 20000]).any(axis=1)]

# Filtering Using Pandas isin Not matching condition
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]

# Filtering Pandas DataFrame by df.query() method
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")

# Filtering Pandas DataFrame by query() method perform NOT IN
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")

Now, let’s create a pandas DataFrame from Dict object and execute these examples and validate the result. Our DataFrame contains column names CoursesFeeDuration, and Discount.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
    'Discount':[1000,2300,1200,2000]
              }
index_labels=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

Yields below output.


# Output:
    Courses    Fee Duration  Discount
r1    Spark  20000   30days      1000
r2  PySpark  25000   40days      2300
r3   Python  22000   35days      1200
r4   pandas  30000   50days      2000

Filter Using isin() similar to IN in SQL

Similar to SQL IN operators, you can filter rows from pandas DataFrame by checking column values in a list. pandas.Series.isin() function is used to check whether the elements in Series contain specified values. It also returns a boolean Series whether each element in the Series matches an element in the passed sequence of values exactly.

For example, let’s say you wanted to select rows that contain a list (Courses_to_keep) of values in a column Courses. To accomplish this, you could write:


# Filtering a single column with pandas Isin.
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
r1   Spark  20000   30days      1000
r3  Python  22000   35days      1200

For a better understanding of what’s actually going on here. Let’s run df.series.isin() which returns a boolean array.


# To return a Boolean array
df2=df.Courses.isin(Courses_to_keep)
print(df2)

# Output:
# r1     True
# r2    False
# r3     True
# r4    False
# Name: Courses, dtype: bool

Filter Row By Multiple Columns With Pandas isin()

Let’s take a look at filter on multiple columns by .isin() method. It adds a little bit of complexity to the equation.


# Filtering Multiple Columns With Pandas isin().
df2 = df[df[['Courses', 'Fee']].isin(['Spark',30000]).any(axis=1)]
print(df2)

# Output:
#   Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
# r4  pandas  30000   50days      2000

Filter Using isin() Not Matching Condition

Similar to .isin() method to SQL’s IN statement, we can use the Pandas unary operator (~) to perform a NOT IN selection.


# Filtering Using pandas isin not matching condition.
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]
print(df2)

Yields below output.


# Output:
    Courses    Fee Duration  Discount
r2  PySpark  25000   40days      2300
r4   pandas  30000   50days      2000

Filter Using IN operator and DataFrame.query()

The Pandas DataFrame.query() function is used to query the columns of a DataFrame with a boolean expression. In this method you can use in operator similar to SQL expression.


# Filtering Pandas DataFrame by df.query() method.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")
print(df2)

# Output:
#    Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
# r3  Python  22000   35days      1200

NOTE: The query() method is similar to SQL’s query statement, we can also use the Pandas unary operator (~) to perform a NOT IN selection.


# Filtering Pandas DataFrame by query() method perform NOT IN
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")
print(df2)

# Output:
#     Courses    Fee Duration  Discount
# r2  PySpark  25000   40days      2300
# r4   pandas  30000   50days      2000

Filter Pandas DataFrame by Lambda Function

A lambda function is a small function containing a single expression. It can also act as anonymous functions where they don’t require any name.


# Filtering pandas DataFrame by lambda function.
Courses=["Spark","Python"]
criterion=lambda row:row['Courses']not in Courses
df=df[df.apply(criterion,axis=1)]
print(df)

Yields output same as above.

Complete Examples of Filter DataFrame Using IN


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
    'Discount':[1000,2300,1200,2000]
              }
index_labels=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

# Filtering a single column with pandas Isin.
Courses_to_keep=["Spark","Python"]
df2=df[df.Courses.isin(Courses_to_keep)]

# To return a Boolean array.
df2=df.Courses.isin(Courses_to_keep)
print(df2)

# Filtering Multiple Columns With Pandas Isin.
df2 = df[df[['Courses', 'Fee']].isin(['Spark', '20000']).any(axis=1)]

# Filtering Using Pandas isin Not matching condition.
Courses_to_keep=["Spark","Python"]
df2=df[~df.Courses.isin(Courses_to_keep)]

# Filtering Pandas DataFrame by df.query() method.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses in @Courses_to_keep")

# Filtering Pandas DataFrame by query() method perform NOT IN.
Courses_to_keep=["Spark","Python"]
df2=df.query("Courses not in @Courses_to_keep")

FAQ on Pandas Filter Rows Using IN Like SQL

How can I filter rows based on a single column using IN?

To filter rows in a Pandas DataFrame based on a single column using IN logic (like SQL), you can use the .isin() method.

Can I get a Boolean array instead of filtered rows?

You can get a Boolean array instead of filtered rows by applying the .isin() method directly to the column. This array indicates whether each row satisfies the condition.

How do I filter rows based on multiple columns?

To filter rows based on multiple columns in Pandas, you can use the .isin() method in combination with .any(axis=1) or .all(axis=1) depending on whether any or all conditions must be satisfied.

How do I filter rows with a NOT IN condition?

To filter rows with a NOT IN condition in Pandas, you can use the ~ operator (bitwise NOT) with the .isin() method. This negates the condition, returning rows that do not match the values in the list.

Which method is faster?

Performance depends on the size of the DataFrame and the filtering logic. For simple conditions, .isin() is usually faster, but query() provides better readability for complex filters.

Conclusion

In this article, you have learned about pd.Series.isin(), DataFrame.query() methods to filter/select rows using IN operator similar to SQL. Also, you have learned how to filter using a single column, how to filter multiple columns, and how to filter based on conditions not being true and lambda functions.

References

Leave a Reply