Pandas supports several ways to filter rows by column value. The DataFrame.query() function is the most commonly used: it evaluates a specified expression and returns a new DataFrame with the column filter applied. To update the existing DataFrame in place, pass the inplace=True
argument. Alternatively, you can use DataFrame[] with loc[], or DataFrame.apply().
Key Points –
- Pandas provides efficient methods for filtering DataFrame rows based on column values.
- The `DataFrame.loc[]` accessor is commonly used to filter rows based on column values.
- Conditions can be specified within the square brackets of `DataFrame.loc[]` to filter rows meeting specific criteria.
- Boolean indexing using conditions allows for flexible filtering based on column values.
- Filtering can be performed using comparison operators (e.g., `==`, `>`, `<`) or logical operators (e.g., `&`, `|`) to create complex filtering conditions.
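As a quick illustration of the last point, the sketch below (using small hypothetical sample data, not the DataFrame from this article) combines comparison and logical operators into a boolean mask:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'Courses': ['Spark', 'PySpark', 'Hadoop'],
    'Fee': [22000, 25000, 23000],
})

# Each condition must be parenthesized when combined with & (and) or | (or)
mask = (df['Fee'] > 22000) & (df['Courses'] != 'Hadoop')
print(df[mask])
```

Note that pandas requires `&`/`|` rather than Python's `and`/`or` here, and the parentheses around each condition are mandatory because `&` binds tighter than comparison operators.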
Quick Examples of Pandas Filter by Column Value
If you are in a hurry, below are quick examples of filtering by column value in Pandas DataFrame.
# Quick examples of pandas filter by column value
# Filter Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")
# Using variable
value='Spark'
df2=df.query("Courses == @value")
# Inplace (modifies df)
df.query("Courses == 'Spark'",inplace=True)
# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
# Other ways to Filter Rows
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
df[df.apply(lambda row: row['Courses'] in ['Spark','PySpark'], axis=1)]
df.dropna()
To run some examples of filtering by column value in a Pandas DataFrame, let’s first create a sample DataFrame.
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
Yields below output.
# Output:
# Create DataFrame:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
# 1 PySpark 25000 50days 2300
# 2 Hadoop 23000 30days 1000
# 3 Python 24000 None 1200
# 4 Pandas 26000 NaN 2500
Note: The filtering methods above also handle `None` and `NaN` values in the `Duration` column. In the examples below, I’ll show how to select rows containing `None` and `NaN` values, as well as how to exclude them.
Using query() to Filter by Column Value in Pandas
The `DataFrame.query()` function filters rows based on column values and returns a new DataFrame with the expression applied. If you want to update the existing DataFrame, use the `inplace=True` param.
# Filter all rows with Courses equals 'Spark'
df2=df.query("Courses == 'Spark'")
print("After filtering get a new DataFrame:\n", df2)
Yields below output.
# Output:
# After filtering get a new DataFrame:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
In case you want to use a Python variable in the expression, prefix it with the `@` character.
# Filter Rows by using Python variable
value='Spark'
df2=df.query("Courses == @value")
print("After filtering get a new DataFrame:\n", df2)
# Output:
# After filtering get a new DataFrame:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
Notice that the above examples return a new DataFrame after filtering the rows. If you want to update the existing DataFrame, use `inplace=True`.
# Replace current existing DataFrame
df.query("Courses == 'Spark'",inplace=True)
print("Modifying the existing DataFrame:\n", df)
# Output:
# Modifying the existing DataFrame:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
You can also use the `df.query()` method with the not-equal (`!=`) operator. For example, `df.query("Courses != 'Spark'")` selects rows where the ‘Courses’ column is not equal to ‘Spark’.
# Filter rows based on condition
df2=df.query("Courses != 'Spark'")
print("After filtering get a new DataFrame:\n", df2)
# Output:
# After filtering get a new DataFrame:
# Courses Fee Duration Discount
# 1 PySpark 25000 50days 2300
# 2 Hadoop 23000 30days 1000
# 3 Python 24000 None 1200
# 4 Pandas 26000 NaN 2500
Filter Rows Based on List of Column Values
If you have values in a list and want to filter rows based on that list, use the `in` operator with the `df.query()` method. This keeps only the rows whose column value appears in the specified list.
# Filter Rows by list of values
print(df.query("Courses in ('Spark','PySpark')"))
# Output:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
# 1 PySpark 25000 50days 2300
You can also write the above code another way: create a list of values and reference it as a Python variable.
# Filter Rows by list of values
values=['Spark','PySpark']
print(df.query("Courses in @values"))
Use the `not in` operator to select rows whose column value is not in a list of values.
# Filter Rows not in list of values
values=['Spark','PySpark']
print(df.query("Courses not in @values"))
# Output:
# Courses Fee Duration Discount
# 2 Hadoop 23000 30days 1000
# 3 Python 24000 None 1200
# 4 Pandas 26000 NaN 2500
If you have column names with special characters or spaces, surround them with backtick (`) characters.
# Create DataFrame
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Courses Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
# Using columns with special characters
print(df.query("`Courses Fee` >= 23000"))
# Output:
# Courses Courses Fee Duration Discount
# 1 PySpark 25000 50days 2300
# 2 Hadoop 23000 30days 1000
# 3 Python 24000 None 1200
# 4 Pandas 26000 NaN 2500
Pandas Filter by Multiple Columns
In Pandas, as in any table-like structure, you often need to filter rows on multiple conditions across multiple columns. You can do that in a Pandas DataFrame as shown below.
# Filter by multiple conditions
print(df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000"))
# Output:
# Courses Courses Fee Duration Discount
# 2 Hadoop 23000 30days 1000
# 3 Python 24000 None 1200
Alternatively, you can also use pandas loc with multiple conditions.
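As a sketch of that alternative, the same fee-range filter can be written with `DataFrame.loc[]` and two parenthesized conditions joined by `&` (using the `Courses Fee` column defined earlier in this section):

```python
import pandas as pd

# Same sample data as the 'Courses Fee' DataFrame above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Courses Fee': [22000, 25000, 23000, 24000, 26000],
})

# loc[] with multiple conditions; each condition needs its own parentheses
df2 = df.loc[(df['Courses Fee'] >= 23000) & (df['Courses Fee'] <= 24000)]
print(df2)
```

Unlike `query()`, column names with spaces need no backticks here, since you reference the column directly with `df['Courses Fee']`.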
Using Lambda Function
To filter rows using a lambda function, apply the lambda to each row with the `apply()` method and use the resulting boolean Series as a mask to filter the DataFrame.
# By using lambda function (axis=1 applies it row-wise)
print(df[df.apply(lambda row: row['Courses'] in ['Spark','PySpark'], axis=1)])
# Output:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
# 1 PySpark 25000 50days 2300
Passing `axis=1` to `apply()` makes the lambda receive each row as a Series, instead of each column.
Filter Rows with NaN using dropna() Method
To filter rows containing NaN values in any column, you can use the dropna() method without specifying a subset. This will drop all rows where any column contains NaN values and return a DataFrame without those rows.
# Drop rows where any column contains None or NaN values
print(df.dropna())
# Output:
# Courses Fee Duration Discount
# 0 Spark 22000 30days 1000
# 1 PySpark 25000 50days 2300
# 2 Hadoop 23000 30days 1000
To drop columns that contain `NaN` or `None` values, use the `dropna()` function with the `axis='columns'` parameter. This drops all columns containing `NaN` or `None` values and returns a DataFrame without those columns.
# Filter all column that have None or NaN
print(df.dropna(axis='columns'))
# Output:
# Courses Fee Discount
# 0 Spark 22000 1000
# 1 PySpark 25000 2300
# 2 Hadoop 23000 1000
# 3 Python 24000 1200
# 4 Pandas 26000 2500
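If you want to drop rows only when a specific column is missing, rather than when any column is, `dropna()` also accepts a `subset` parameter. A minimal sketch using the `Duration` column from the example DataFrame:

```python
import pandas as pd
import numpy as np

# Sample data matching the article's Duration column
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Duration': ['30days', '50days', '30days', None, np.nan],
})

# Drop rows only where the Duration column is None/NaN
df2 = df.dropna(subset=['Duration'])
print(df2)
```

Here rows with missing values in other columns would survive; only missing `Duration` disqualifies a row.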
Using DataFrame.loc[] and df[]
# Other examples you can try to filter rows
df[df["Courses"] == 'Spark']
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])
# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])
# Select startswith
print(df[df['Courses'].str.startswith("P")])
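One caveat with these string matches: if the column contains `NaN` values, `str.contains()` propagates them into the mask and boolean indexing fails. Passing `na=False` treats missing values as non-matches. A sketch with hypothetical data containing a missing course name:

```python
import pandas as pd
import numpy as np

# Hypothetical column with a missing value
df = pd.DataFrame({'Courses': ["Spark", np.nan, "PySpark"]})

# na=False makes NaN entries count as "does not contain"
df2 = df[df['Courses'].str.contains("Spark", na=False)]
print(df2)
```

Without `na=False`, the mask would contain a `NaN`, and pandas raises an error when such a mask is used for boolean indexing.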
Frequently Asked Questions
What does filtering by column value mean in Pandas?
Pandas filtering by column value involves selecting specific rows from a DataFrame based on the values present in one or more columns.
What is boolean indexing?
Boolean indexing in Pandas involves creating boolean masks based on conditions and using these masks to select rows from a DataFrame.
How does DataFrame.loc[] filter rows?
The `DataFrame.loc[]` accessor allows you to specify conditions within square brackets to filter rows based on column values.
How does DataFrame.query() filter rows?
The `DataFrame.query()` method enables filtering rows based on expressions specified as strings, providing a convenient way to filter DataFrame data.
Conclusion
In this article, I’ve provided various examples demonstrating how to filter a Pandas DataFrame based on column values, for both rows and columns. It’s important to note that querying DataFrame rows returns a new DataFrame containing the selected rows. To modify the existing DataFrame, use the `inplace=True` parameter, or assign the filtered result back using `df[]` or `df.loc[]`.
Happy Learning !!
Related Articles
- Pandas filter by index
- How to Use NOT IN Filter in Pandas
- Pandas Filter DataFrame Rows on Dates
- Pandas Filter Rows Using IN Like SQL
- DataFrame filter() – Usage & Examples
- Pandas Filter DataFrame by Substring criteria
- Pandas Operator Chaining to Filter DataFrame Rows
- Pandas Filter Rows with NAN Value from DataFrame Column