  • Post category:Pandas
  • Post last modified:April 23, 2024

Pandas supports several ways to filter rows by column value. The DataFrame.query() method is the most commonly used: it filters rows based on an expression and returns a new DataFrame with the column filter applied. If you want to update the existing DataFrame in place, pass the inplace=True argument. Alternatively, you can use DataFrame[] with loc[], or DataFrame.apply().


Key Points –

  • Pandas provides efficient methods for filtering DataFrame rows based on column values.
  • The DataFrame.loc[] method is commonly used to filter rows based on column values.
  • Conditions can be specified within the square brackets DataFrame.loc[] to filter rows meeting specific criteria.
  • Boolean indexing using conditions allows for flexible filtering based on column values.
  • Filtering can be performed using comparison operators (e.g., ==, >, <) or logical operators (e.g., &, |) to create complex filtering conditions.
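The comparison and logical operators from the points above can be combined into a boolean mask. Here is a minimal sketch with an illustrative two-column DataFrame (the column names are only for demonstration):

```python
import pandas as pd

# Small illustrative DataFrame
df = pd.DataFrame({'Courses': ['Spark', 'Hadoop', 'Python'],
                   'Fee': [22000, 23000, 24000]})

# A comparison operator builds a boolean mask; | / & combine masks.
# Each condition must be parenthesized because & and | bind tighter than == and >.
mask = (df['Courses'] == 'Spark') | (df['Fee'] > 23000)
filtered = df[mask]
print(filtered)
```

Boolean indexing like this is the foundation that loc[], query(), and the other methods below build on.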

Quick Examples of Pandas Filter by Column Value

If you are in a hurry, below are quick examples of filtering by column value in Pandas DataFrame.


# Quick examples of pandas filter by column value

# Filter Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

# Using variable
value='Spark'
df2=df.query("Courses == @value")

# In place
df.query("Courses == 'Spark'", inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

# Other ways to Filter Rows
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

df[df["Courses"] == 'Spark'] 
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]

df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
df.dropna()

To run the examples of filtering by column value, let's first create a sample pandas DataFrame from a dict, then work through the examples and explore the output to understand them better.


import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

# Output:
# Create DataFrame:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300
# 2   Hadoop  23000   30days      1000
# 3   Python  24000     None      1200
# 4   Pandas  26000      NaN      2500

Note that the sample data above includes both None and NaN values in the Duration column. In the examples below, I'll demonstrate both selecting rows with None and NaN values and disregarding them.
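For instance, to select only the rows where Duration is missing (or, conversely, present), you can use the isna()/notna() methods; a small sketch using the same sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
})

# Rows where Duration is missing (None and NaN both count as missing)
missing = df[df['Duration'].isna()]
# Rows where Duration has a value
present = df[df['Duration'].notna()]
print(missing)
```

Both None and np.nan are treated as missing by isna(), so a single mask covers them.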

Using query() to Filter by Column Value in pandas

The DataFrame.query() function filters rows based on column values in pandas. After applying the expression, it returns a new DataFrame. If you want to update the existing DataFrame, use the inplace=True param.


# Filter all rows with Courses equals 'Spark'
df2=df.query("Courses == 'Spark'")
print("After filtering get a new DataFrame:\n", df2)

Yields below output.

# Output:
# After filtering get a new DataFrame:
#   Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

In case you want to use a variable in the expression, prefix it with the @ character.


# Filter Rows by using Python variable
value='Spark'
df2=df.query("Courses == @value")
print("After filtering get a new DataFrame:\n", df2)

# Output:
# After filtering get a new DataFrame:
#    Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

Notice that the above examples return a new DataFrame after filtering the rows. If you want to update the existing DataFrame, use inplace=True.


# Replace current existing DataFrame
df.query("Courses == 'Spark'", inplace=True)
print("Modifying the existing DataFrame:\n", df)

# Output:
# Modifying the existing DataFrame:
#    Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

You can also use the df.query() method with the not equal (!=) operator to filter out rows matching a value. For example, df.query("Courses != 'Spark'") selects rows where the Courses column is not equal to 'Spark'.


# Filter rows based on condition
df2=df.query("Courses != 'Spark'")
print("After filtering get a new DataFrame:\n", df2)

Yields below output.


# Output:
# After filtering get a new DataFrame:
#    Courses    Fee Duration  Discount
# 1  PySpark  25000   50days      2300
# 2   Hadoop  23000   30days      1000
# 3   Python  24000     None      1200
# 4   Pandas  26000      NaN      2500

Filter Rows Based on List of Column Values

If you have values in a list and want to filter rows based on them, use the in operator with the df.query() method. This returns the rows whose column value matches any entry in the list.


# Filter Rows by list of values
print(df.query("Courses in ('Spark','PySpark')"))

Yields below output.


# Output:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300

Alternatively, you can create a list of values and reference it in the expression as a Python variable.


# Filter Rows by list of values
values=['Spark','PySpark']
print(df.query("Courses in @values"))

Use the not in operator to select rows whose column value is not in a list of values.


# Filter Rows not in list of values
values=['Spark','PySpark']
print(df.query("Courses not in @values"))

# Output:
#   Courses    Fee Duration  Discount
# 2  Hadoop  23000   30days      1000
# 3  Python  24000     None      1200
# 4  Pandas  26000      NaN      2500

If you have column names with special characters (such as spaces), surround the column names with backtick (`) characters in the expression.


# Create DataFrame
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Courses Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

# Using columns with special characters
print(df.query("`Courses Fee` >= 23000"))

# Output:
#    Courses  Courses Fee Duration  Discount
# 1  PySpark        25000   50days      2300
# 2   Hadoop        23000   30days      1000
# 3   Python        24000     None      1200
# 4   Pandas        26000      NaN      2500

Pandas Filter by Multiple Columns

In pandas, as with any table-like structure, you often need to filter rows on multiple conditions across multiple columns. You can do that in a Pandas DataFrame as shown below.


# Filter by multiple conditions
print(df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000"))

Yields below output. Alternatively, you can also use pandas loc with multiple conditions.


# Output:
#   Courses  Courses Fee Duration  Discount
# 2  Hadoop        23000   30days      1000
# 3  Python        24000     None      1200
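The same multi-column filter can be written with DataFrame.loc[] and a combined boolean mask; a sketch using the DataFrame with the Courses Fee column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Courses Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
})

# Equivalent of the query() expression: each condition in parentheses, joined with &
df2 = df.loc[(df['Courses Fee'] >= 23000) & (df['Courses Fee'] <= 24000)]
print(df2)
```

With loc[] there is no need to backtick-quote column names containing spaces, since the column is referenced directly rather than parsed from an expression string.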

Using DataFrame.apply() & Lambda Function

The pandas.DataFrame.apply() method can also be used for filtering. In the example below, the lambda receives each column in turn (the default axis=0) and indexes it with a boolean mask, which effectively returns only the rows that match the given values.


# By using lambda function
print(df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])]))

Yields below output.


# Output:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300
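A more literal row-by-row version passes axis=1 so the lambda sees one row at a time and returns a boolean per row; note this is a sketch, and plain boolean indexing with isin() is usually faster than apply():

```python
import pandas as pd

df = pd.DataFrame({'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
                   'Fee': [22000, 25000, 23000, 24000, 26000]})

# Build a boolean mask row by row with axis=1, then index the DataFrame with it
mask = df.apply(lambda row: row['Courses'] in ['Spark', 'PySpark'], axis=1)
df2 = df[mask]
print(df2)
```

Reserve apply() for conditions that genuinely need per-row Python logic; vectorized masks cover the common cases.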

Filter Rows with NaN using dropna() Method

In case you want to filter out rows that have None or NaN in any column, use the DataFrame.dropna() method.


# Drop rows that have None or NaN values in any column
print(df.dropna())

Yields below output.


# Output:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300
# 2   Hadoop  23000   30days      1000

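If you only want missing values in specific columns to trigger the drop, dropna() also accepts a subset parameter; a small sketch using the sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
})

# Only consider the Duration column when deciding which rows to drop
df2 = df.dropna(subset=['Duration'])
print(df2)
```

Here the result is the same as a plain dropna() because Duration is the only column with missing values, but subset matters when other columns also contain NaN.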
Similarly, you can drop entire columns that contain None or NaN values by passing axis='columns'. For more on deleting columns, I have covered examples of how to drop Pandas DataFrame columns separately.


# Drop all columns that have None or NaN values
print(df.dropna(axis='columns'))

Yields below output.


# Output:
#    Courses    Fee  Discount
# 0    Spark  22000      1000
# 1  PySpark  25000      2300
# 2   Hadoop  23000      1000
# 3   Python  24000      1200
# 4   Pandas  26000      2500

Using DataFrame.loc[] and df[]


# Other examples you can try to filter rows
df[df["Courses"] == 'Spark'] 
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select startswith
print(df[df['Courses'].str.startswith("P")])
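As an alternative to the str.lower() step above, str.contains() accepts case=False for case-insensitive matching, and na= to control how missing values in the column are treated; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
                   'Fee': [22000, 25000, 23000, 24000, 26000]})

# Case-insensitive substring match; na=False treats missing values as non-matches
df2 = df[df['Courses'].str.contains("spark", case=False, na=False)]
print(df2)
```

Passing na=False is useful when the column contains NaN, since str.contains() would otherwise propagate NaN into the mask and raise an error during indexing.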

Frequently Asked Questions on Pandas Filter by Column Value

What is Pandas filtering by column value?

Pandas filtering by column value involves selecting specific rows from a DataFrame based on the values present in one or more columns.

How can I filter DataFrame rows based on column values in Pandas?

You can filter DataFrame rows in Pandas using methods like boolean indexing, the DataFrame.loc[] accessor, or the DataFrame.query() method.

What is boolean indexing in Pandas?

Boolean indexing in Pandas involves creating boolean masks based on conditions and using these masks to select rows from a DataFrame.

How does the DataFrame.loc[] accessor work for filtering?

The DataFrame.loc[] accessor allows you to specify conditions within square brackets to filter rows based on column values.

What is the purpose of the DataFrame.query() method?

The DataFrame.query() method enables filtering rows based on expressions specified as strings, providing a convenient way to filter DataFrame data.

Conclusion

In this article, I have explained multiple ways to filter a Pandas DataFrame by column value (rows and columns). Remember that querying DataFrame rows always returns a new DataFrame with the selected rows; to update the existing DataFrame, use inplace=True, or use df[] with df.loc[] and assign the result back. I hope this article helps you learn about Pandas.

Happy Learning !!


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium.