Pandas.DataFrame.query() by Examples

Spread the love

Pandas DataFrame.query() method is used to query the rows based on the expression (single or multiple column conditions) provided and returns a new DataFrame. In case you wanted to update the existing referring DataFrame use inplace=True argument.

In this article, I will explain the syntax of the Pandas DataFrame query() method and several working examples like query with multiple conditions and query with string contains to new few.

Related:

1. Quick Examples of pandas query()

If you are in hurry, below are quick examples of how to use pandas.DataFrame.query() method.


# Query Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

#Using variable
value='Spark'
df2=df.query("Courses == @value")

#inpace
df.query("Courses == 'Spark'",inplace=True)

#Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

If you are a learner, Let’s see with sample data and run through these examples and explore the output to understand better. First, let’s create a pandas DataFrame from Dict.


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print(df)

Note that the above DataFrame also contains None and Nan values on Duration column that I would be using in my examples below to select rows that has None & Nan values or select ignoring these values.

3. Using DataFrame.query()

Following is the syntax of DataFrame.query() method.


# query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)
  • expr – expression takes conditions to query rows
  • inplace – Defaults to False. When set toTrue, it updates the referring DataFrame and query() method returns None.
  • **kwargs –  Keyword arguments that works with eval()

DataFrame.query() takes condition in expression to select rows from a DataFrame. This expression can have one or multiple conditions.


# Query all rows with Courses equals 'Spark'
df2=df.query("Courses == 'Spark'")
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
0   Spark  22000   30days      1000

In case you wanted to use a variable in the expression, use @ character.


# Query Rows by using Python variable
value='Spark'
df2=df.query("Courses == @value")
print(df2)

If you notice the above examples return a new DataFrame after filtering the rows. if you wanted to update the existing DataFrame use inplace=True


# Replace current esisting DataFrame
df.query("Courses == 'Spark'",inplace=True)
print(df)

If you wanted to select based on column value not equals then use != operator.


# not equals condition
df2=df.query("Courses != 'Spark'")

Yields below output.


   Courses  Courses Fee Duration  Discount
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

4. Select Rows Based on List of Column Values

If you have values in a python list and wanted to select the rows based on the list of values, use in operator, it’s like checking a value contains in a list of string values.


# Query Rows by list of values
print(df.query("Courses in ('Spark','PySpark')"))

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300

You can also write with a list of values in a python variable.


# Query Rows by list of values
values=['Spark','PySpark']
print(df.query("Courses in @values"))

To select rows that are not in a list of column values can be done using not in operator.


# Query Rows not in list of values
values=['Spark','PySpark']
print(df.query("Courses not in @values"))

If you have column names with special characters using column name surrounded by tick ` character .


# Using columns with special characters
print(df.query("`Courses Fee` >= 23000"))

5. Query with Multiple Conditions

In Pandas or any table-like structures, most of the time we would need to select the rows based on multiple conditions by using multiple columns, you can do that in Pandas DataFrame as below.


# Query by multiple conditions
print(df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000"))

Yields below output. Alternatively, you can also use pandas loc with multiple conditions.


  Courses  Courses Fee Duration  Discount
2  Hadoop        23000   30days      1000
3  Python        24000     None      1200

6. Query Rows using apply()

pandas.DataFrame.apply() method is used to apply the expression row-by-row and return the rows that matched the values. The below example returns every match when Courses contains a list of specified string values.


# By using lambda function
print(df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])]))

Yields below output. A lambda expression is used with pandas to apply the function for each row.


   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300

8. Other Examples using df[] and loc[]


# Other examples you can try to query rows
df[df["Courses"] == 'Spark'] 
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

#Select startswith
print(df[df['Courses'].str.startswith("P")])

Conclusion

In this article, I have explained multiple examples of how to query Pandas DataFrame Rows based on single and multiple conditions, from a list of values (checking column value exists in list of string values) e.t.c. Remember when you query DataFrame Rows, it always returns a new DataFrame with selected rows, in order to update existing df you have to use inplace=True. I hope this article helps you learn Pandas.

Happy Learning !!

References

Naveen (NNK)

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

This Post Has One Comment

  1. Anonymous

    Hi there, I liked this article. One point…

    Perhaps the implementation has changed, but I don’t think the pandas query likes spaces in the column names. I had a column named “Return Date” and I had to change it to “Return_Date” to avoid getting a key error.

You are currently viewing Pandas.DataFrame.query() by Examples