Use drop()
method to delete rows based on column value in pandas DataFrame, as part of the data cleansing, you would be required to drop rows from the DataFrame when a column value matches with a static value or on another column value. In this article, I will explain dropping rows based on column value with examples.
Key Points –
- Use boolean indexing to filter rows based on specific conditions in a DataFrame column.
- The condition inside the boolean indexing can involve any comparison or logical operation.
- Apply the mask to the DataFrame using the
.loc[]
indexer orDataFrame.drop()
method. - Use boolean indexing or conditional statements to create a mask identifying rows to be dropped.
- Always ensure to create a new DataFrame or use
inplace=True
parameter to modify the original DataFrame when dropping rows to avoid unintended consequences.
Create DataFrame
Now, let’s create a DataFrame with some rows and columns and then execute the examples provided to validate the results.
# Create pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python"],
'Fee' :[22000,25000,np.nan,24000],
'Duration':['30days',None,'55days',np.nan],
'Discount':[1000,2300,1000,np.nan]
}
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)
Yields below output.
Delete Rows Based on Column Values
To delete rows based on specific column values in a Pandas DataFrame, you typically filter the DataFrame using boolean indexing and then reassign the filtered DataFrame back to the original variable or use the drop()
method to remove those rows.
# Delete rows using drop()
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print("Drop rows based on column value:\n", df)
Yields below output.
In the above example, use the drop() method to remove the rows where the ‘Fee’ column is greater than or equal to 24000. We used inplace=True
to modify the original DataFrame df
.
Using loc[]
Alternatively, another commonly used approach to drop rows based on column values involves using boolean indexing with loc[]
and boolean indexing with df[]
.
# Remove row
df2 = df[df.Fee >= 24000]
print("Drop rows based on column value:\n", df)
# Using loc[]
df2 = df.loc[df["Fee"] >= 24000 ]
print("Drop rows based on column value:\n", df)
Yields below output.
# Output:
# Drop rows based on column value:
Courses Fee Duration Discount
1 PySpark 25000.0 None 2300.0
3 Python 24000.0 NaN NaN
Delete Rows Based on Multiple Column Values
Sometimes it may require you to delete the rows based on matching values of multiple columns.
# Delect rows based on multiple column value
df = pd.DataFrame(technologies)
df = df[ (df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print("Drop rows based on multiple column values:\n", df)
Yields below output.
# Output:
# Drop rows based on multiple column values:
Courses Fee Duration Discount
1 PySpark 25000.0 None 2300.0
Delete Rows Based on None or NaN Column Values
When you have None or NaN values on columns, you may need to remove NaN values before you apply some calculations. you can do this using notnull()
function.
Note: With None or NaN values you cannot use == or != operators.
# Drop rows with None/NaN values
df2 = df[df.Discount.notnull()]
print("Drop rows based on column value:\n", df)
Yields below output
# Output:
# Drop rows based on column value:
Courses Fee Duration Discount
0 Spark 22000.0 30days 1000.0
1 PySpark 25000.0 None 2300.0
2 Hadoop NaN 55days 1000.0
Using query()
To use the query() method in Pandas DataFrame to filter rows where the value in the Courses
column is equal to Spark
. However, please note that the query()
method in Pandas DataFrame is used for filtering rows based on a condition, not for deleting rows directly. Instead, you can use the drop()
method to remove rows based on the condition
# Delete rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")
# Using variable
value='Spark'
df2=df.query("Courses == @value")
# Inpace
df.query("Courses == 'Spark'",inplace=True)
# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
# Other ways to Delete Rows
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
# Using lambda
df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
df.dropna()
Based on the Inverse of Column Values
To delete rows from a DataFrame where the value in the Courses
column is not equal to PySpark
. The tilde ~
operator is used to invert the boolean condition.
# Delect rows based on inverse of column values
df1 = df[~(df['Courses'] == "PySpark")].index
df.drop(df1, inplace = True)
print("Drop rows based on column value:\n", df)
Yields below output.
# Output:
# Drop rows based on column value:
Courses Fee Duration Discount
b PySpark 25000 50days 2300
f PySpark 25000 50days 2000
The above code will drop rows from the DataFrame df
where the value in the Courses
column is not equal to PySpark
. It first finds the index of such rows using the boolean condition and then drops those rows using the drop()
method.
Complete Example
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python"],
'Fee' :[22000,25000,np.nan,24000],
'Duration':['30days',None,'55days',np.nan],
'Discount':[1000,2300,1000,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
# Using drop() to remove rows
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print(df)
# Remove rows
df = pd.DataFrame(technologies)
df2 = df[df.Fee >= 24000]
print(df2)
# Reset index after deleting rows
df2 = df[df.Fee >= 24000].reset_index()
print(df2)
# If you have space in column name.
# Surround the column name with single quote
df2 = df[df['column name']]
# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
print(df2)
# Delect rows based on multiple column value
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print(df2)
# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]
print(df2)
FAQs on Drop Rows Based on Column Value
To drop rows based on a specific column value in a Pandas DataFrame, you can use boolean indexing.
You can drop rows based on multiple conditions using boolean indexing. For example, to drop rows where the value in the “Courses” column is not “PySpark” and the value in the “Score” column is less than 80.
drop()
method? The drop()
method in Pandas DataFrame is used to drop columns or rows by labels. It’s not typically used to drop rows based on column values. Instead, you would use boolean indexing or the query()
method
You can use the query()
method in Pandas DataFrame to drop rows based on column values by specifying the condition you want to filter.
Conclusion
In this article, you have learned how to delete DataFrame rows based on column value using different ways of Pandas.
Happy Learning !!
Related Articles
- How to Drop Rows From Pandas DataFrame Examples
- Pandas Drop Rows by Index
- Delete Last Row From Pandas DataFrame
- Pandas Drop Rows with NaN Values in DataFrame
- Pandas Drop List of Rows From DataFrame
- Pandas Drop Last N Rows From DataFrame
- Pandas Drop First N Rows From DataFrame
- How to drop first row from the Pandas DataFrame
- Pandas Drop First Three Rows From DataFrame
- How to drop duplicate rows from DataFrame?