Pandas DataFrame.query()
method is used to query the rows based on the expression (single or multiple column conditions) provided and returns a new DataFrame. In case you wanted to update the existing referring DataFrame use inplace=True
argument.
In this article, I will explain the syntax of the Pandas DataFrame query() method and several working examples like query with multiple conditions and query with string contains to new few.
Related:
- pandas.DataFrame.filter() – To filter rows by index and columns by name.
- pandas.DataFrame.loc[] – To select rows by indices label and column by name.
- pandas.DataFrame.iloc[] – To select rows by index and column by position.
- pandas.DataFrame.apply() – To custom select using lambda function.
1. Quick Examples of pandas query()
If you are in hurry, below are quick examples of how to use pandas.DataFrame.query() method.
# Below are the quick examples.
# Query Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")
# Using variable
value='Spark'
df2=df.query("Courses == @value")
# Inpace
df.query("Courses == 'Spark'",inplace=True)
# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
If you are a learner, Let’s see with sample data and run through these examples and explore the output to understand better. First, let’s create a pandas DataFrame from Dict.
# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
Note that the above DataFrame also contains None
and Nan
values on Duration
column that I would be using in my examples below to select rows that has None
& Nan
values or select ignoring these values.
3. Using DataFrame.query()
Following is the syntax of DataFrame.query()
method.
# Query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)
expr
– expression takes conditions to query rowsinplace
– Defaults toFalse
. When set toTrue
, it updates the referring DataFrame andquery()
method returnsNone
.**kwargs
– Keyword arguments that works with eval()
DataFrame.query() takes condition in expression to select rows from a DataFrame. This expression can have one or multiple conditions.
# Query all rows with Courses equals 'Spark'
df2=df.query("Courses == 'Spark'")
print(df2)
Yields below output.
# Output:
Courses Fee Duration Discount
0 Spark 22000 30days 1000
In case you wanted to use a variable in the expression, use @ character
.
# Query Rows by using Python variable
value='Spark'
df2=df.query("Courses == @value")
print(df2)
If you notice the above examples return a new DataFrame after filtering the rows. if you wanted to update the existing DataFrame use inplace=True
# Replace current esisting DataFrame
df.query("Courses == 'Spark'",inplace=True)
print(df)
If you wanted to select based on column value not equals then use != operator
.
# Not equals condition
df2=df.query("Courses != 'Spark'")
Yields below output.
# Output:
Courses Courses Fee Duration Discount
1 PySpark 25000 50days 2300
2 Hadoop 23000 30days 1000
3 Python 24000 None 1200
4 Pandas 26000 NaN 2500
4. Select Rows Based on List of Column Values
If you have values in a python list and wanted to select the rows based on the list of values, use in operator
, it’s like checking a value contains in a list of string values.
# Query Rows by list of values
print(df.query("Courses in ('Spark','PySpark')"))
Yields below output.
# Output:
Courses Fee Duration Discount
0 Spark 22000 30days 1000
1 PySpark 25000 50days 2300
You can also write with a list of values in a python variable.
# Query Rows by list of values
values=['Spark','PySpark']
print(df.query("Courses in @values"))
To select rows that are not in a list of column values can be done using not in operator
.
# Query Rows not in list of values
values=['Spark','PySpark']
print(df.query("Courses not in @values"))
If you have column names with special characters using column name surrounded by tick ` character
.
# Using columns with special characters
print(df.query("`Courses Fee` >= 23000"))
5. Query with Multiple Conditions
In Pandas or any table-like structures, most of the time we would need to select the rows based on multiple conditions by using multiple columns, you can do that in Pandas DataFrame as below.
# Query by multiple conditions
print(df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000"))
Yields below output. Alternatively, you can also use pandas loc with multiple conditions.
# Output:
Courses Courses Fee Duration Discount
2 Hadoop 23000 30days 1000
3 Python 24000 None 1200
6. Query Rows using apply()
pandas.DataFrame.apply() method is used to apply the expression row-by-row and return the rows that matched the values. The below example returns every match when Courses contains a list of specified string values.
# By using lambda function
print(df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])]))
Yields below output. A lambda expression is used with pandas to apply the function for each row.
# Output:
Courses Fee Duration Discount
0 Spark 22000 30days 1000
1 PySpark 25000 50days 2300
8. Other Examples using df[] and loc[]
# Other examples you can try to query rows
df[df["Courses"] == 'Spark']
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])
# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])
# Select startswith
print(df[df['Courses'].str.startswith("P")])
Conclusion
In this article, I have explained multiple examples of how to query Pandas DataFrame Rows based on single and multiple conditions, from a list of values (checking column value exists in list of string values) e.t.c. Remember when you query DataFrame Rows, it always returns a new DataFrame with selected rows, in order to update existing df you have to use inplace=True
. I hope this article helps you learn Pandas.
Happy Learning !!
Related Articles
- Different Ways to Rename Pandas DataFrame Column
- How to Drop Column From Pandas DataFrame
- Pandas- How to get a Specific Cell Value from DataFrame
- Pandas Filter DataFrame by Multiple Conditions
- Pandas apply map (applymap()) Explained
- Apply Multiple Filters to Pandas DataFrame or Series
- Pandas Filter Rows by Conditions
- Pandas Filter by Column Value
References
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
- https://pandas.pydata.org/docs/reference/api/pandas.eval.html#pandas.eval
Hi there, I liked this article. One point…
Perhaps the implementation has changed, but I don’t think the pandas query likes spaces in the column names. I had a column named “Return Date” and I had to change it to “Return_Date” to avoid getting a key error.