• Post author:
  • Post category:Pandas
  • Post last modified:April 11, 2024
  • Reading time:19 mins read
You are currently viewing Pandas.DataFrame.query() by Examples

The pandas.DataFrame.query() method is used to query rows based on the provided expression (single or multiple column conditions) and returns a new DataFrame. If you want to update the existing DataFrame in place, you can use the inplace=True argument.

In this article, I will explain the syntax of the Pandas DataFrame query() method and several working examples like a query with multiple conditions and a query with a string containing to new few.

Key Points –

  • Pandas.DataFrame.query() function filters rows from a DataFrame based on a specified condition.
  • Pandas.DataFrame.query() offers a powerful and concise syntax for filtering DataFrame rows, resembling SQL queries, enhancing code readability and maintainability.
  • The method supports a wide range of logical and comparison operators, including ==, !=, >, <, >=, <=, and logical operators like and, or, and not.
  • This function provides a convenient way to perform complex data selections and is particularly useful for large datasets where performance is crucial.

Quick Examples of Pandas query()

If you are in a hurry, below are quick examples of how to use Pandas DataFrame query() method.


# Quick examples of pandas query()

# Query Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

# Using variable
value='Spark'
df2=df.query("Courses == @value")

# Inpace
df.query("Courses == 'Spark'",inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

To run some examples of the pandas DataFrame query() method, let’s create a DataFrame with a few rows and columns, execute these examples, and validate results. Our DataFrame contains column names Courses, Fee, Duration, and Discount.


# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

pandas dataframe query

Note that the above DataFrame may contain None and NaN values in the Duration column, which will be considered in the examples below for selecting rows with None & NaN values or selecting while ignoring these values.

Related: You can drop the Pandas rows with NaN values.

Using DataFrame.query()

Following is the syntax of the DataFrame.query() method.


# Query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)
  • expr – This parameter specifies the query expression string, which follows Python’s syntax for conditional expressions.
  • inplace – Defaults to False. When it is set to True, it updates the existing DataFrame, and query() method returns None.
  • **kwargs –  This parameter allows passing additional keyword arguments to the query expression. It is optional. Keyword arguments that work with eval()

DataFrame.query() is used to filter rows based on one or multiple conditions in an expression.


# Query all rows with Courses equals 'Spark'
df2 = df.query("Courses == 'Spark'")
print("After filtering the rows based on condition:\n", df2)

Yields below output.

pandas dataframe query

In case you want to use a variable in the expression, you can use the @ character followed by the variable name. This allows you to reference Python variables directly within the query expression.


# Query Rows by using Python variable
value='Spark'
df2 = df.query("Courses == @value")
print("After filtering the rows based on condition:\n", df2)

In the above example, the variable value is referenced within the query expression "Courses == @value", enabling dynamic filtering based on the value stored in the Python variable. Yields the same output as above.

If you want to update the existing DataFrame directly without creating a new DataFrame, you can use the inplace=True parameter in the DataFrame.query() method.


# Replace current esisting DataFrame
df.query("Courses == 'Spark'",inplace=True)
print("After filtering the rows based on condition:\n", df)

In the above example, the DataFrame df is modified in place using the query() method. The expression "Courses == 'Spark'" filters rows where the Courses column equals Spark. By setting inplace=True, the original DataFrame df is updated with the filtered result.

The != operator in a DataFrame query expression allows you to select rows where a specific column’s value does not equal a given value.


# Not equals condition
df2 = df.query("Courses != 'Spark'")
print("After filtering the rows based on condition:\n", df2)

In the above example, the DataFrame df is filtered to create a new DataFrame df2, where the Courses column does not equal Spark. This expression ensures that only rows with Courses values different from Spark are included in the resulting DataFrame df2. This example yields the below output.


# Output:
   Courses  Courses Fee Duration  Discount
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

Select Rows Based on List of Column Values

Using the in operator in a DataFrame query expression allows you to filter rows based on whether a specific column’s value is present in a Python list of values. It’s akin to checking if a value is contained within a list of string values.


# Query Rows by list of values
df2 = df.query("Courses in ('Spark','PySpark')")
print("After filtering the rows based on condition:\n", df2)

Yields below output.


# Output:
# After filtering the rows based on condition:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300

You can also write with a list of values in a Python variable.


# Query Rows by list of values
values=['Spark','PySpark']
df2 = df.query("Courses in @values")
print("After filtering the rows based on condition:\n", df2)

Yields the same output as above.

Using the not-in operator in a DataFrame query expression allows you to filter rows based on values that are not present in a specified list.


# Query Rows not in list of values
values=['Spark','PySpark']
df2 = df.query("Courses not in @values")
print("After filtering the rows based on condition:\n", df)

Yields below output.


# Output:
# After filtering the rows based on condition:
   Courses    Fee Duration  Discount
2  Hadoop  23000   30days      1000
3  Python  24000     None      1200
4  Pandas  26000      NaN      2500

When dealing with column names containing special characters, such as spaces, you can enclose the column name within backticks (`) to ensure it is recognized properly in a query expression.


import pandas as pd
import numpy as np

# Create DataFrame
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Courses Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

# Using columns with special characters
df2 = df.query("`Courses Fee` >= 23000")
print("After filtering the rows based on condition:\n", df2)

Yields below output.


# Output:
# Create DataFrame:
    Courses  Courses Fee Duration  Discount
0    Spark        22000   30days      1000
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

# After filtering the rows based on condition:
    Courses  Courses Fee Duration  Discount
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

Query with Multiple Conditions

In Pandas or any table-like structures, most of the time we would need to select the rows based on multiple conditions by using multiple columns, you can do that in Pandas DataFrame as below.


# Query by multiple conditions
df2 = df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
print("After filtering the rows based on multiple conditions:\n", df2)

Yields below output. Alternatively, you can also use pandas loc with multiple conditions.


# Output:
# After filtering the rows based on multiple conditions:
  Courses  Courses Fee Duration  Discount
2  Hadoop        23000   30days      1000
3  Python        24000     None      1200

Query Rows using apply()

pandas.DataFrame.apply() method is used to apply the expression row-by-row and return the rows that match the values. The below example returns every match when Courses contains a list of specified string values.


# By using lambda function
df2 = df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
print("After filtering the rows based on condition:\n", df2)

Yields below output. A lambda expression is used with pandas to apply the function for each row.


# Output:
# After filtering the rows based on condition:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300

Other Examples using df[] and loc[]


# Other examples you can try to query rows
df[df["Courses"] == 'Spark'] 
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select startswith
print(df[df['Courses'].str.startswith("P")])

FAQ on Pandas DataFrame Query

How do I filter rows based on a single condition in a Pandas DataFrame?

To filter rows based on a single condition in a Pandas DataFrame, you can use the DataFrame.query() method or boolean indexing.

How can I filter rows based on multiple conditions?

To filter rows based on multiple conditions in a Pandas DataFrame, you can use the DataFrame.query() method or boolean indexing.

How do I use the query() method for DataFrame querying?

In Pandas, the query() method is used for DataFrame querying based on a boolean expression. This method allows you to filter rows from a DataFrame that meet a specified condition.

Does DataFrame.query() return a new DataFrame or modify the existing one?

By default, DataFrame.query() returns a new DataFrame containing the selected rows. To modify the existing DataFrame in place, you can use the inplace=True parameter.

What are some common use cases for DataFrame.query()?

Common use cases include filtering rows based on single or multiple conditions, selecting rows from a list of values, and handling complex data selection tasks efficiently.

Conclusion

In this article, I have explained multiple examples of how to query Pandas DataFrame Rows based on single and multiple conditions, from a list of values (checking column value exists in the list of string values), etc.

Happy Learning !!

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply

This Post Has One Comment

  1. Anonymous

    Hi there, I liked this article. One point…

    Perhaps the implementation has changed, but I don’t think the pandas query likes spaces in the column names. I had a column named “Return Date” and I had to change it to “Return_Date” to avoid getting a key error.