Use the drop() method to delete rows based on column value in a pandas DataFrame. As part of data cleansing, you may need to drop rows from a DataFrame when a column value matches a static value or the value of another column. In this article, I will explain how to drop rows based on column value.
Key Points –
- Use boolean indexing to filter rows based on specific conditions in a DataFrame column.
- The condition inside the boolean indexing can involve any comparison or logical operation.
- Use boolean indexing or conditional statements to create a mask identifying the rows to be dropped.
- Apply the mask to the DataFrame using the .loc[] indexer or the DataFrame.drop() method (see the short sketch after this list).
- Always create a new DataFrame, or use the inplace=True parameter to modify the original DataFrame in place, when dropping rows to avoid unintended consequences.
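As a quick illustration of these points, here is a minimal sketch, assuming a DataFrame df with a numeric Fee column like the one created below. It builds a boolean mask and then removes the matching rows either with drop() or with .loc[].

# Minimal sketch of the key points above, assuming a DataFrame df
# with a numeric 'Fee' column
mask = df['Fee'] >= 24000            # boolean mask: True marks rows to drop

# Option 1: drop the matching rows by index and reassign
df_clean = df.drop(df[mask].index)

# Option 2: keep only the rows where the mask is False, via .loc[]
df_clean = df.loc[~mask]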
Create DataFrame
To run some examples of dropping rows based on column value, let's create a pandas DataFrame.
# Create pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python"],
'Fee' :[22000,25000,np.nan,24000],
'Duration':['30days',None,'55days',np.nan],
'Discount':[1000,2300,1000,np.nan]
}
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)
Yields below output.
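# Output:
# DataFrame:
#    Courses      Fee Duration  Discount
# 0    Spark  22000.0   30days    1000.0
# 1  PySpark  25000.0     None    2300.0
# 2   Hadoop      NaN   55days    1000.0
# 3   Python  24000.0      NaN       NaN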
Delete Rows Using drop()
To delete rows based on specific column values in a pandas DataFrame, you typically filter the DataFrame using boolean indexing and then reassign the filtered DataFrame back to the original variable, or use the drop() method to remove those rows.
# Delete rows using drop()
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print("Drop rows based on column value:\n", df)
Yields below output.
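# Output:
# Drop rows based on column value:
#   Courses      Fee Duration  Discount
# 0   Spark  22000.0   30days    1000.0
# 2  Hadoop      NaN   55days    1000.0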
In the above example, we use the drop() method to remove the rows where the Fee column is greater than or equal to 24000 (the row with a NaN Fee is kept, because the comparison evaluates to False for NaN). We used inplace=True to modify the original DataFrame df.
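If you prefer not to modify df in place, the filter-and-reassign approach mentioned above works too. Here is a minimal sketch using the same technologies data; the negated condition keeps exactly the rows that drop() would leave behind.

# Alternative: keep only the rows you want and reassign (no inplace)
df = pd.DataFrame(technologies)
df = df[~(df['Fee'] >= 24000)]
print("Drop rows based on column value:\n", df)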
Using loc[]
Using loc[] to drop rows based on a column value means using the loc[] accessor to select only the rows that satisfy a condition on a specific column, effectively filtering out the rows that do not meet the condition.
# Remove rows by keeping only the matching rows (recreate df first)
df = pd.DataFrame(technologies)
df2 = df[df.Fee >= 24000]
print("Drop rows based on column value:\n", df2)

# Using loc[]
df2 = df.loc[df["Fee"] >= 24000]
print("Drop rows based on column value:\n", df2)
# Output:
# Drop rows based on column value:
# Courses Fee Duration Discount
# 1 PySpark 25000.0 None 2300.0
# 3 Python 24000.0 NaN NaN
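If the goal is to delete the matching rows rather than keep them, you can pass the index selected with loc[] to drop(). A minimal sketch:

# To actually delete the matching rows, drop the index selected with loc[]
df2 = df.drop(df.loc[df["Fee"] >= 24000].index)
print("Drop rows based on column value:\n", df2)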
Delete Rows Based on Multiple Column Values
To delete rows from a DataFrame based on multiple column values in pandas, you can use the drop()
function along with boolean indexing.
# Delete rows based on multiple column values
df = pd.DataFrame(technologies)
df = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print("Drop rows based on multiple column values:\n", df)
# Output:
# Drop rows based on multiple column values:
# Courses Fee Duration Discount
# 1 PySpark 25000.0 None 2300.0
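Conversely, if you want to remove the rows that match several conditions instead of keeping them, combine the conditions and pass the resulting index to drop(). For example:

# Drop (rather than keep) rows matching multiple conditions
df = pd.DataFrame(technologies)
df.drop(df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)].index, inplace=True)
print("Drop rows based on multiple column values:\n", df)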
Delete Rows Based on None or NaN Column Values
When you have None or NaN values in columns, you may need to remove them before applying calculations. You can do this using the notnull() function.
Note: You cannot reliably filter None or NaN values with the == or != operators; use notnull()/isnull() instead.
# Drop rows with None/NaN values (recreate df first)
df = pd.DataFrame(technologies)
df2 = df[df.Discount.notnull()]
print("Drop rows based on column value:\n", df2)
# Output:
# Drop rows based on column value:
# Courses Fee Duration Discount
# 0 Spark 22000.0 30days 1000.0
# 1 PySpark 25000.0 None 2300.0
# 2 Hadoop NaN 55days 1000.0
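An equivalent way to remove rows with missing values in a specific column is DataFrame.dropna() with the subset parameter; a short sketch:

# Equivalent: drop rows where 'Discount' is NaN using dropna()
df2 = df.dropna(subset=['Discount'])
print("Drop rows with NaN in Discount:\n", df2)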
Using query()
The DataFrame.query() function is primarily used for filtering rows based on a condition rather than directly deleting rows. However, you can filter rows using query() and then assign the filtered DataFrame back to the original variable, effectively removing the rows that do not meet the specified condition.
# Delete rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")
# Using variable
value='Spark'
df2=df.query("Courses == @value")
# Inplace
df.query("Courses == 'Spark'", inplace=True)
# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
# Other ways to Delete Rows
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
values = ['Spark','PySpark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
# Using lambda
df[df['Courses'].apply(lambda x: x in ['Spark','PySpark'])]
df.dropna()
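Putting query() into the delete-rows pattern described above, a minimal sketch that removes rows where Fee is 24000 or more looks like this (recreating the sample DataFrame first):

# Delete rows via query(): keep only the rows that should remain
df = pd.DataFrame(technologies)
df = df.query("Fee < 24000")
# note: rows where Fee is NaN are also removed, since the condition is False for NaN
print("Drop rows based on column value:\n", df)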
Based on the Inverse of Column Values
To delete rows from a DataFrame where the value in the Courses column is not equal to PySpark, invert the boolean condition with the tilde (~) operator and pass the matching index to drop().
# Delete rows based on the inverse of column values
df = pd.DataFrame(technologies)
df1 = df[~(df['Courses'] == "PySpark")].index
df.drop(df1, inplace = True)
print("Drop rows based on column value:\n", df)

# Output:
# Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 1  PySpark  25000.0     None    2300.0
The above code drops rows from the DataFrame df where the value in the Courses column is not equal to PySpark. It first finds the index of such rows using the boolean condition and then drops those rows using the drop() method.
Complete Example
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python"],
'Fee' :[22000,25000,np.nan,24000],
'Duration':['30days',None,'55days',np.nan],
'Discount':[1000,2300,1000,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
# Using drop() to remove rows
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print(df)
# Remove rows
df = pd.DataFrame(technologies)
df2 = df[df.Fee >= 24000]
print(df2)
# Reset index after deleting rows
df2 = df[df.Fee >= 24000].reset_index()
print(df2)
# If a column name contains spaces, use bracket notation with quotes
# instead of attribute access, for example:
# df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
print(df2)
# Delete rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print(df2)
# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]
print(df2)
FAQs on Dropping Rows Based on Column Value
What does dropping rows based on column values mean?
Dropping rows based on column values refers to removing rows from a dataset where certain conditions are met within specific columns. This is a common operation in data cleaning and filtering tasks.
How do I drop rows where a column is not equal to a value?
You can use the ~ (tilde) operator to negate a condition. For example, df.drop(df[~(df['Column'] == 'Value')].index) will drop rows where 'Column' is not equal to 'Value'.
How do I drop rows based on multiple conditions?
You can drop rows based on multiple conditions by combining them using logical operators like & (AND) or | (OR).
Can I drop rows based on NULL or missing values?
Yes, it's entirely possible to drop rows based on NULL or missing values in a DataFrame. In pandas, you can use the dropna() method.
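As a small illustration of these answers, here is a sketch that applies each pattern to the sample technologies DataFrame used throughout this article:

# Illustrating the FAQ answers on the sample DataFrame
df = pd.DataFrame(technologies)

# Drop rows where 'Courses' is not equal to 'Spark' (negation with ~)
df1 = df.drop(df[~(df['Courses'] == 'Spark')].index)

# Drop rows matching multiple conditions combined with & / |
df2 = df.drop(df[(df['Fee'] >= 22000) | (df['Discount'] >= 2000)].index)

# Drop rows containing any NaN/None values
df3 = df.dropna()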
Conclusion
In this article, I have explained various methods in pandas for deleting DataFrame rows based on column values. These techniques, including boolean indexing, the drop() method, and query(), offer flexibility in managing and filtering DataFrame data according to specific conditions.
Happy Learning !!
Related Articles
- How to Drop Rows From Pandas DataFrame
- Pandas Drop Rows by Index
- How to drop duplicate rows from DataFrame?
- Pandas Drop Rows with NaN Values in DataFrame
- Pandas Drop List of Rows From DataFrame
- Pandas Drop Last N Rows From DataFrame
- Pandas Drop First N Rows From DataFrame
- How to drop first row from the Pandas DataFrame