• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:9 mins read
You are currently viewing Pandas Operator Chaining to Filter DataFrame Rows

pandas support operator chaining (df.query(condition).query(condition)) by calling methods on objects (DataFrame object) sequentially one after another in order to filter rows. It is a programming style programmers prefers to reduce the number of variables and lines. Like any other framework or programming language, pandas supports operator chaining where you can use this to filter rows of DataFrame.

Traditionally operator chaining is used with groupby & aggregate in pandas, In this article, I will explain different ways of using operator chaining in pandas, for example how to filter rows on the output of another filter, using a boolean operator to apply multiple conditions e.t.c.

1. Quick Examples of Operator Chaining to Filter Rows in pandas

If you are in a hurry, below are some quick examples of how to filter rows of pandas DataFrame with operator chaining.


# Below are some quick examples.

# Operator Chaining to Filter Rows in pandas
df2=df[(df.Fee >= 23000) & (df.Fee <= 24000)]

df2=df[(df.Fee == 23000) & (df.Discount == 1000)]

def mask(df, key, value):
   return df[df[key] == value]

pd.DataFrame.mask = mask
df2=df.mask('Fee',23000).mask('Discount',1000)   
print(df2)

# Using DataFrame.query() to chain filter rows
df2 = df.query("Fee < 25000").query("Fee > 22000")

# Using query() method to filter rows operator chaining
df2 = df.query("Fee < 25000 and Fee > 22000")

df2=df.loc[lambda x: x.Fee >= 25000].loc[lambda x: x.Discount >= 2000]

df2 = df.loc[lambda x: x['Fee'] == 25000]

df2 = df.pipe(lambda x: x['Fee'] == 25000)

Now, let’s create a DataFrame with a few rows and columns and execute some examples to learn using operator chaining. Our DataFrame contains column names Courses, Fee, Duration, and Discount.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","PySpark","Python","pandas"],
    'Fee' :[22000,25000,23000,24000,26000,25000,20000,24000],
    'Duration':['30day','50days','55days', '40days','60days','50days','55days','50days'],
    'Discount':[1000,2300,1000,1200,2500,2000,2000,1500]
          }
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  22000    30day      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   55days      1000
3   Python  24000   40days      1200
4   pandas  26000   60days      2500
5  PySpark  25000   50days      2000
6   Python  20000   55days      2000
7   pandas  24000   50days      1500

2. Filter Rows by Specific Values using Operator Chaining


# Filter Rows by Specific Values using Operator Chaining
courses=["Spark","Hadoop","Python","pandas"]
df2=df[df.Courses.isin(courses) & df.Duration.str.contains('5')]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
2  Hadoop  23000   55days      1000
6  Python  20000   55days      2000
7  pandas  24000   50days      1500

Below is another example.


# Another example
df = pd.DataFrame(technologies)
df2=df[(df.Fee >= 23000) & (df.Fee <= 24000)]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
2  Hadoop  23000   55days      1000
3  Python  24000   40days      1200
7  pandas  24000   50days      1500

3. Chaining DataFrame.query() to Filter Rows in pandas

pandas.DataFrame.query() method is recommended way to filter rows and you can chain these operators to apply multiple conditions. For example df.query(“Fee >= 23000”).query(“Fee <= 24000”) , you can also write the same statement as df.query("Fee >= 23000 and Fee <= 24000")


# Using DataFrame.query() to chain filter rows
df = pd.DataFrame(technologies)
df2 = df.query("Fee >= 23000").query("Fee <= 24000")
print(df2)

# Using query() method to filter rows operator chaining
df = pd.DataFrame(technologies)
df2 = df.query("Fee >= 23000 and Fee <= 24000")
print(df2)

Yields same output as above.

4. Using mask() to Filter Rows by Operator Chaining

Chaining mask() methods to filter rows in pandas DataFrame. The mask method works similar to if-then. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.


# Using mask() to Filter Rows by Operator Chaining
def mask(df, key, value):
   return df[df[key] == value]

pd.DataFrame.mask = mask
df = pd.DataFrame(technologies)
df2=df.mask('Fee',23000).mask('Discount',1000)   
print(df2)

Yields below output..


# Output:
  Courses    Fee Duration  Discount
2  Hadoop  23000   55days      1000

5. Using mask() and Lambda Function with Operator Chaining

With mask() method, you can also use lambda function chain operators.


# Using mask and lambda function to filter
df = pd.DataFrame(technologies)
df2 = df.mask(lambda x: x['Courses'] == "pandas")
print(df2)

Yields below output.


# Output:
   Courses      Fee Duration  Discount
a    Spark  22000.0    30day    1000.0
b  PySpark  25000.0   50days    2300.0
c   Hadoop  23000.0   55days    1000.0
d   Python  24000.0   40days    1200.0
e      NaN      NaN      NaN       NaN
f  PySpark  25000.0   50days    2000.0
g   Python  20000.0   55days    2000.0
h      NaN      NaN      NaN       NaN

6. Using Operator Chaining with DataFrame.loc[]

You can also use loc[] to filter rows with operator chaining. Below are some examples.


# Chaining loc[] operator to filter rows
df = pd.DataFrame(technologies)
df2 = df.loc[lambda x: x['Fee'] == 25000]
print(df2)

df = pd.DataFrame(technologies)
df2=df.loc[lambda x: x.Fee >= 23000].loc[lambda x: x.Discount <= 24000]
print(df2)

I will leave this example to you to run and explore the output.

7. Using pipe() Method

Below is another way but this returns ‘TRUE/FALSE’ in the result.


# Using pipe() method
df = pd.DataFrame(technologies)
df2 = df.pipe(lambda x: x['Fee'] == 25000)
print(df2)

Conclusion

In this article, you have learned how to use operator chaining on DataFrame to filter rows by using DataFrame.query(), DataFrame.loc[], and DataFrame.mask() method. Operator chaining is a programmer’s choice to reduce multiple lines and combining multiple statements together.

Happy Learning !!

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply