The pandas.DataFrame.query()
method is used to query rows based on the provided expression (single or multiple column conditions) and returns a new DataFrame. If you want to modify the existing DataFrame in place, you can set the inplace=True
argument. This allows for efficient filtering and manipulation of DataFrame data without creating additional copies.
In this article, I will explain the syntax of the Pandas DataFrame query()
method and several working examples like a query with multiple conditions and a query with a string containing to new few.
Key Points –
- Pandas.DataFrame.query() function filters rows from a DataFrame based on a specified condition.
- Pandas.DataFrame.query() offers a powerful and concise syntax for filtering DataFrame rows, resembling SQL queries, enhancing code readability and maintainability.
- The method supports a wide range of logical and comparison operators, including ==, !=, >, <, >=, <=, and logical operators like and, or, and not.
Quick Examples of Pandas query()
Following are quick examples of the Pandas DataFrame query() method.
# Quick examples of pandas query()
# Query Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")
# Using variable
value='Spark'
df2=df.query("Courses == @value")
# Inpace
df.query("Courses == 'Spark'",inplace=True)
# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
First, let’s create a Pandas DataFrame.
# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
Yields below output.
Note that the DataFrame may contain None
and NaN
values in the Duration
column, which will be taken into account in the examples below for selecting rows with None
& NaN
values or selecting while disregarding these values
Related: You can drop the Pandas rows with NaN values.
Using DataFrame.query()
Following is the syntax of the DataFrame.query()
method.
# Query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)
expr
– This parameter specifies the query expression string, which follows Python’s syntax for conditional expressions.inplace
– Defaults toFalse
. When it is set toTrue
, it updates the existing DataFrame, andquery()
method returnsNone
.**kwargs
– This parameter allows passing additional keyword arguments to the query expression. It is optional. Keyword arguments that work with eval()
DataFrame.query() is used to filter rows based on one or multiple conditions in an expression.
# Query all rows with Courses equals 'Spark'
df2 = df.query("Courses == 'Spark'")
print("After filtering the rows based on condition:\n", df2)
Yields below output.
You can use the @
character followed by the variable name. This allows you to reference Python variables directly within the query expression.
# Query Rows by using Python variable
value='Spark'
df2 = df.query("Courses == @value")
print("After filtering the rows based on condition:\n", df2)
# Output:
# After filtering the rows based on condition:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
In the above example, the variable value
is referenced within the query expression "Courses == @value"
, enabling dynamic filtering based on the value stored in the Python variable.
To filter and update the existing DataFrame in place using the query()
method, you can use the inplace=True
parameter. This will modify the original DataFrame directly without needing to reassign it to a new variable.
# Replace current esisting DataFrame
df.query("Courses == 'Spark'",inplace=True)
print("After filtering the rows based on condition:\n", df)
# Output:
# After filtering the rows based on condition:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
In the above example, the DataFrame df
is modified in place using the query()
method. The expression "Courses=='Spark'"
filters rows where the Courses
column equals Spark
. By setting inplace=True
, the original DataFrame df
is updated with the filtered result.
The !=
operator in a DataFrame query expression allows you to select rows where a specific column’s value does not equal a given value.
# Not equals condition
df2 = df.query("Courses != 'Spark'")
print("After filtering the rows based on condition:\n", df2)
# Output:
# Courses Courses Fee Duration Discount
# 1 PySpark 25000 50days 2300
# 2 Hadoop 23000 30days 1000
# 3 Python 24000 None 1200
# 4 Pandas 26000 NaN 2500
In the above example, the DataFrame df
is filtered to create a new DataFrame df2
, where the Courses
column does not equal Spark
. This expression ensures that only rows with Courses
values different from Spark
are included in the resulting DataFrame df2
.
Query Rows by the List of Values
Using the in
operator in a DataFrame query expression allows you to filter rows based on whether a specific column’s value is present in a Python list of values.
# Query rows by list of values
df2 = df.query("Courses in ('Spark','PySpark')")
print("After filtering the rows based on condition:\n", df2)
# Output:
# After filtering the rows based on condition:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
# 1 PySpark 25000 50days 2300
Similarly, you can define a Python variable to hold a list of values and then use that variable in your query. This approach allows for more dynamic filtering based on the contents of the list variable.
# Query rows by list of values
values=['Spark','PySpark']
df2 = df.query("Courses in @values")
print("After filtering the rows based on condition:\n", df2)
Using the not-in
operator in a DataFrame query expression allows you to filter rows based on values that are not present in a specified list.
# Query rows not in list of values
values=['Spark','PySpark']
df2 = df.query("Courses not in @values")
print("After filtering the rows based on condition:\n", df)
# Output:
# After filtering the rows based on condition:
# Courses Fee Duration Discount
# 2 Hadoop 23000 30days 1000
# 3 Python 24000 None 1200
# 4 Pandas 26000 NaN 2500
When dealing with column names containing special characters, such as spaces, you can enclose the column name within backticks (`
) to ensure it is recognized properly in a query expression.
import pandas as pd
import numpy as np
# Create DataFrame
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Courses Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
# Using columns with special characters
df2 = df.query("`Courses Fee` >= 23000")
print("After filtering the rows based on condition:\n", df2)
Yields below output.
# Output:
# Create DataFrame:
Courses Courses Fee Duration Discount
0 Spark 22000 30days 1000
1 PySpark 25000 50days 2300
2 Hadoop 23000 30days 1000
3 Python 24000 None 1200
4 Pandas 26000 NaN 2500
# After filtering the rows based on condition:
Courses Courses Fee Duration Discount
1 PySpark 25000 50days 2300
2 Hadoop 23000 30days 1000
3 Python 24000 None 1200
4 Pandas 26000 NaN 2500
Query with Multiple Conditions
Querying with multiple conditions involves filtering data in a DataFrame based on more than one criterion simultaneously. Each condition typically involves one or more columns of the DataFrame and specifies a logical relationship that must be satisfied for a row to be included in the filtered result.
# Query by multiple conditions
df2 = df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
print("After filtering the rows based on multiple conditions:\n", df2)
# Output:
# After filtering the rows based on multiple conditions:
# Courses Courses Fee Duration Discount
# 2 Hadoop 23000 30days 1000
# 3 Python 24000 None 1200
Query Rows using apply()
If you want to filter rows using apply() along with a lambda function, you can do so, but the lambda function needs to return a boolean indicating whether each row should be included or not.
# By using lambda function
df2 = df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
print("After filtering the rows based on condition:\n", df2)
# Output:
# After filtering the rows based on condition:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
# 1 PySpark 25000 50days 2300
Other Examples using df[] and loc[]
# Other examples you can try to query rows
df[df["Courses"] == 'Spark']
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])
# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])
# Select startswith
print(df[df['Courses'].str.startswith("P")])
FAQ on Pandas DataFrame Query
To filter rows based on a single condition in a Pandas DataFrame, you can use the DataFrame.query()
method or boolean indexing.
In Pandas, the query()
method is used for DataFrame querying based on a boolean expression. This method allows you to filter rows from a DataFrame that meet a specified condition.
By default, DataFrame.query()
returns a new DataFrame containing the selected rows. To modify the existing DataFrame in place, you can use the inplace=True
parameter.
Common use cases include filtering rows based on single or multiple conditions, selecting rows from a list of values, and handling complex data selection tasks efficiently.
Conclusion
In this article, you have learned to query Pandas DataFrame rows based on single and multiple conditions, from a list of values (checking column value exists in the list of string values), etc.
Happy Learning !!
Related Articles
- Pandas Filter by Column Value
- Pandas Filter Rows by Conditions
- Pandas DataFrame fillna() function
- Pandas apply map (applymap()) Explained
- How to Drop Column From Pandas DataFrame
- Pandas Filter DataFrame by Multiple Conditions
- Pandas Read SQL Query or Table with Examples
- Different Ways to Rename Pandas DataFrame Column
- Apply Multiple Filters to Pandas DataFrame or Series
- Pandas- How to get a Specific Cell Value from DataFrame
Hi there, I liked this article. One point…
Perhaps the implementation has changed, but I don’t think the pandas query likes spaces in the column names. I had a column named “Return Date” and I had to change it to “Return_Date” to avoid getting a key error.