Pandas Filter Rows using DataFrame.query() Method

Pandas DataFrame.query() method is used to filter rows based on an expression (a single- or multiple-column condition) and returns a new DataFrame containing the matching rows. If you want to update the existing DataFrame instead, use the inplace=True argument.

In this article, I will explain the syntax of the Pandas DataFrame query() method and walk through several working examples of filtering rows.

1. Quick Examples of Filtering DataFrame Rows Based on Column Values

If you are in a hurry, below are some quick examples you can try out.


# Filter Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

# Using a variable
value='Spark'
df2=df.query("Courses == @value")

# Update in place
df.query("Courses == 'Spark'", inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

# Other ways to Filter Rows
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Courses Fee'] >= 23000)]

df[df["Courses"] == 'Spark'] 
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]

df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
df.dropna()

If you are a learner, create the sample data below, run through these examples, and explore the output to understand them better.

2. Create a Pandas DataFrame With Sample Data

To explain several examples of how to select rows from a DataFrame based on multiple conditions, first create a DataFrame with sample data.


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Courses Fee':[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days',None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
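
Yields below output.


   Courses  Courses Fee Duration  Discount
0    Spark        22000   30days      1000
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500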

Note that the above DataFrame also contains None and NaN values in the Duration column. I will use these in the examples below to select rows that have None or NaN values, or to select rows while ignoring those values.

3. Using DataFrame.query() to Filter Pandas DataFrame Rows

Below is the syntax of the DataFrame.query() method.


# query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)
  • expr – the expression (condition) used to filter rows
  • inplace – defaults to False. When set to True, it updates the referring DataFrame and query() returns None.
  • **kwargs – keyword arguments that are passed on to eval(); see the example below
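
The keyword arguments are forwarded to eval(). For example, you can pick the evaluation engine; the snippet below is a minimal sketch using the sample DataFrame from above, and engine is a standard eval() keyword.


# Pass an eval() keyword through query()
# engine='python' evaluates the expression with plain Python
# instead of the default numexpr engine (used when numexpr is installed)
df2=df.query("Courses == 'Spark'", engine='python')
print(df2)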

DataFrame.query() takes a condition expression and selects the matching rows from a DataFrame. The expression can contain one or multiple conditions.


# Filter all rows with Courses equals 'Spark'
df2=df.query("Courses == 'Spark'")
print(df2)

Yields below output.


  Courses  Courses Fee Duration  Discount
0   Spark        22000   30days      1000

If you want to use a variable in the expression, prefix it with the @ character.


# Filter Rows by using Python variable
value='Spark'
df2=df.query("Courses == @value")
print(df2)
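
Variables work with numeric conditions as well. Below is a quick sketch against the same sample DataFrame.


# Filter Rows by using a numeric variable
fee=23000
df2=df.query("`Courses Fee` >= @fee")
print(df2)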

Notice that the above examples return a new DataFrame after filtering the rows. If you want to update the existing DataFrame, use inplace=True.


# Replace the existing DataFrame
df.query("Courses == 'Spark'", inplace=True)
print(df)

Note that with inplace=True, query() returns None and df is updated to contain only the matching rows, so recreate the sample DataFrame before running the examples that follow.

To select rows where a column value does not equal a given value, use the != operator.


# Not equals condition
df2=df.query("Courses != 'Spark'")

Yields below output.


   Courses  Courses Fee Duration  Discount
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

4. Select Rows Based on List of Column Values

If you have values in a Python list and want to select rows based on that list of values, use the in operator.


# Filter Rows by list of values
print(df.query("Courses in ('Spark','PySpark')"))

Yields below output.


   Courses  Courses Fee Duration  Discount
0    Spark        22000   30days      1000
1  PySpark        25000   50days      2300

You can also supply the list of values through a Python variable.


# Filter Rows by list of values
values=['Spark','PySpark']
print(df.query("Courses in @values"))

Selecting rows that are not in a list of column values can be done using the not in operator.


# Filter Rows not in list of values
values=['Spark','PySpark']
print(df.query("Courses not in @values"))

If a column name contains special characters such as spaces, surround it with backtick (`) characters.


# Using columns with special characters
print(df.query("`Courses Fee` >= 23000"))

5. Using Multiple Conditions to Filter Rows from Pandas DataFrame

In Pandas, as in any table-like structure, we often need to filter rows based on multiple conditions across multiple columns. You can do that in a Pandas DataFrame as shown below.


# Filter by multiple conditions
print(df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000"))

Yields below output.


  Courses  Courses Fee Duration  Discount
2  Hadoop        23000   30days      1000
3  Python        24000     None      1200
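
Besides and, query() expressions also support or and not (the & and | operators work too). Below is a quick sketch against the same sample DataFrame.


# Combine conditions with 'or' and 'not'
print(df.query("`Courses Fee` >= 25000 or Discount < 1200"))
print(df.query("not (Courses == 'Spark')"))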

6. Filter Rows Using Pandas DataFrame.apply() & Lambda Function

DataFrame.apply() can also be used to filter rows. In the example below, the lambda receives each column (apply's default, axis=0) and indexes it with a boolean mask built from Series.isin(), so only the matching rows are returned.


# By using lambda function
print(df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])]))

Yields below output.


   Courses  Courses Fee Duration  Discount
0    Spark        22000   30days      1000
1  PySpark        25000   50days      2300
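
If you specifically want a row-by-row filter, a more direct variation uses apply() with axis=1, which calls the lambda once per row. This is a sketch against the same sample DataFrame.


# Build a boolean mask row by row, then filter with it
mask=df.apply(lambda row: row['Courses'] in ['Spark','PySpark'], axis=1)
print(df[mask])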

7. Filter Rows that have None & NaN using dropna() Method

If you want to filter out rows that have None or NaN in their column values, use the DataFrame.dropna() method.


# Filter out rows that have None or NaN values
print(df.dropna())

Yields below output.


   Courses  Courses Fee Duration  Discount
0    Spark        22000   30days      1000
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
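
If you want to check only particular columns for None & NaN, dropna() also accepts a subset parameter. A minimal sketch, assuming the sample DataFrame above:


# Drop rows only when the Duration column is None/NaN
print(df.dropna(subset=['Duration']))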

To drop entire columns that contain None or NaN values, pass axis='columns' to dropna(). I have also covered several other examples of how to drop Pandas DataFrame columns.


# Drop all columns that have None or NaN values
print(df.dropna(axis='columns'))

Yields below output.


   Courses  Courses Fee  Discount
0    Spark        22000      1000
1  PySpark        25000      2300
2   Hadoop        23000      1000
3   Python        24000      1200
4   Pandas        26000      2500

8. Other Examples of Filtering Rows Based on Column Values


# Other examples you can try to filter rows
df[df["Courses"] == 'Spark'] 
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Courses Fee'] >= 23000)]

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select startswith
print(df[df['Courses'].str.startswith("P")])
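
Depending on your pandas version, string methods can also be used inside query() itself when evaluated with the Python engine; this may not work on very old pandas releases.


# String match inside query(); needs engine='python'
print(df.query("Courses.str.contains('Spark')", engine='python'))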

Conclusion

In this article, I have explained multiple examples of how to filter Pandas DataFrame rows based on single and multiple conditions, from a list of values, and more. Remember that querying DataFrame rows always returns a new DataFrame with the selected rows; to update the existing DataFrame, you have to use inplace=True. I hope this article helps you learn Pandas.

Happy Learning !!
