Pandas.DataFrame.query() by Examples

The pandas.DataFrame.query() method is used to query rows based on the provided expression (single or multiple column conditions) and returns a new DataFrame. If you want to modify the existing DataFrame in place, you can set the inplace=True argument. This allows for efficient filtering and manipulation of DataFrame data without creating additional copies.

Quick Examples of Pandas query()

Following are quick examples of the Pandas DataFrame query() method.


# Quick examples of pandas query()

# Query Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

# Using variable
value='Spark'
df2=df.query("Courses == @value")

# Inpace
df.query("Courses == 'Spark'",inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

First, let’s create a Pandas DataFrame.


# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

Note that the DataFrame may contain None and NaN values in the Duration column, which will be taken into account in the examples below for selecting rows with None & NaN values or selecting while disregarding these values

Using DataFrame.query()

Following is the syntax of the DataFrame.query() method.


# Query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)

expr – This parameter specifies the query expression string, which follows Python’s syntax for conditional expressions.
inplace – Defaults to False. When it is set to True, it updates the existing DataFrame, and query() method returns None.
**kwargs – This parameter allows passing additional keyword arguments to the query expression. It is optional. Keyword arguments that work with eval()

DataFrame.query() is used to filter rows based on one or multiple conditions in an expression.


# Query all rows with Courses equals 'Spark'
df2 = df.query("Courses == 'Spark'")
print("After filtering the rows based on condition:\n", df2)

Yields below output.

You can use the @ character followed by the variable name. This allows you to reference Python variables directly within the query expression.


# Query Rows by using Python variable
value='Spark'
df2 = df.query("Courses == @value")
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

In the above example, the variable value is referenced within the query expression "Courses == @value", enabling dynamic filtering based on the value stored in the Python variable.

To filter and update the existing DataFrame in place using the query() method, you can use the inplace=True parameter. This will modify the original DataFrame directly without needing to reassign it to a new variable.


# Replace current esisting DataFrame
df.query("Courses == 'Spark'",inplace=True)
print("After filtering the rows based on condition:\n", df)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

In the above example, the DataFrame df is modified in place using the query() method. The expression "Courses == 'Spark'" filters rows where the Courses column equals Spark. By setting inplace=True, the original DataFrame df is updated with the filtered result.

The != operator in a DataFrame query expression allows you to select rows where a specific column’s value does not equal a given value.


# Not equals condition
df2 = df.query("Courses != 'Spark'")
print("After filtering the rows based on condition:\n", df2)

# Output:
#    Courses  Courses Fee Duration  Discount
# 1  PySpark        25000   50days      2300
# 2   Hadoop        23000   30days      1000
# 3   Python        24000     None      1200
# 4   Pandas        26000      NaN      2500

In the above example, the DataFrame df is filtered to create a new DataFrame df2, where the Courses column does not equal Spark. This expression ensures that only rows with Courses values different from Spark are included in the resulting DataFrame df2.

Query Rows by the List of Values

Using the in operator in a DataFrame query expression allows you to filter rows based on whether a specific column’s value is present in a Python list of values.


# Query rows by list of values
df2 = df.query("Courses in ('Spark','PySpark')")
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300

Similarly, you can define a Python variable to hold a list of values and then use that variable in your query. This approach allows for more dynamic filtering based on the contents of the list variable.


# Query rows by list of values
values=['Spark','PySpark']
df2 = df.query("Courses in @values")
print("After filtering the rows based on condition:\n", df2)

Using the not-in operator in a DataFrame query expression allows you to filter rows based on values that are not present in a specified list.


# Query rows not in list of values
values=['Spark','PySpark']
df2 = df.query("Courses not in @values")
print("After filtering the rows based on condition:\n", df)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 2  Hadoop  23000   30days      1000
# 3  Python  24000     None      1200
# 4  Pandas  26000      NaN      2500

When dealing with column names containing special characters, such as spaces, you can enclose the column name within backticks (`) to ensure it is recognized properly in a query expression.


import pandas as pd
import numpy as np

# Create DataFrame
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Courses Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

# Using columns with special characters
df2 = df.query("`Courses Fee` >= 23000")
print("After filtering the rows based on condition:\n", df2)

Yields below output.


# Output:
# Create DataFrame:
    Courses  Courses Fee Duration  Discount
0    Spark        22000   30days      1000
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

# After filtering the rows based on condition:
    Courses  Courses Fee Duration  Discount
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

Query with Multiple Conditions

Querying with multiple conditions involves filtering data in a DataFrame based on more than one criterion simultaneously. Each condition typically involves one or more columns of the DataFrame and specifies a logical relationship that must be satisfied for a row to be included in the filtered result.


# Query by multiple conditions
df2 = df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
print("After filtering the rows based on multiple conditions:\n", df2)

# Output:
# After filtering the rows based on multiple conditions:
#   Courses  Courses Fee Duration  Discount
# 2  Hadoop        23000   30days      1000
# 3  Python        24000     None      1200

Query Rows using apply()

If you want to filter rows using apply() along with a lambda function, you can do so, but the lambda function needs to return a boolean indicating whether each row should be included or not.


# By using lambda function
df2 = df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300

Other Examples using df[] and loc[]


# Other examples you can try to query rows
df[df["Courses"] == 'Spark'] 
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select startswith
print(df[df['Courses'].str.startswith("P")])

FAQ on Pandas DataFrame Query

How do I filter rows based on a single condition in a Pandas DataFrame?

To filter rows based on a single condition in a Pandas DataFrame, you can use the DataFrame.query() method or boolean indexing.

How do I use the query() method for DataFrame querying?

In Pandas, the query() method is used for DataFrame querying based on a boolean expression. This method allows you to filter rows from a DataFrame that meet a specified condition.

Does DataFrame.query() return a new DataFrame or modify the existing one?

By default, DataFrame.query() returns a new DataFrame containing the selected rows. To modify the existing DataFrame in place, you can use the inplace=True parameter.

What are some common use cases for DataFrame.query()?

Common use cases include filtering rows based on single or multiple conditions, selecting rows from a list of values, and handling complex data selection tasks efficiently.

Conclusion

In this article, you have learned to query Pandas DataFrame rows based on single and multiple conditions, from a list of values (checking column value exists in the list of string values), etc.

Happy Learning !!

This Post Has One Comment

Anonymous October 23, 2022

Hi there, I liked this article. One point…

Perhaps the implementation has changed, but I don’t think the pandas query likes spaces in the column names. I had a column named “Return Date” and I had to change it to “Return_Date” to avoid getting a key error.

Quick Examples of Pandas query()

Using DataFrame.query()

Query Rows by the List of Values

Query with Multiple Conditions

Query Rows using apply()

Other Examples using df[] and loc[]

FAQ on Pandas DataFrame Query

Conclusion

Related Articles

Leave a Reply

This Post Has One Comment