Use the drop() method to delete rows based on column value in a pandas DataFrame. As part of data cleansing, you may need to drop rows from a DataFrame when a column value matches a static value or the value of another column. In this article, I will explain how to drop rows based on column value.
Key Points –
- Use boolean indexing to filter rows based on specific conditions in a DataFrame column.
- The condition inside the boolean indexing can involve any comparison or logical operation.
- Use boolean indexing or conditional statements to create a mask identifying the rows to be dropped.
- Apply the mask to the DataFrame using the .loc[] indexer or the DataFrame.drop() method (see the short sketch after this list).
- Always create a new DataFrame, or use the inplace=True parameter to modify the original DataFrame in place, when dropping rows to avoid unintended consequences.
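As a quick illustration of these points, here is a minimal sketch, assuming a DataFrame df with a numeric Fee column like the one created below. It builds a boolean mask and then removes the matching rows either with drop() or with .loc[].

# Minimal sketch of the key points above, assuming a DataFrame df
# with a numeric 'Fee' column
mask = df['Fee'] >= 24000            # boolean mask: True marks rows to drop

# Option 1: drop the matching rows by index and reassign
df_clean = df.drop(df[mask].index)

# Option 2: keep only the rows where the mask is False, via .loc[]
df_clean = df.loc[~mask]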
Create DataFrame
To run some examples of dropping rows based on column value, let's create a pandas DataFrame.
# Create pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python"],
'Fee' :[22000,25000,np.nan,24000],
'Duration':['30days',None,'55days',np.nan],
'Discount':[1000,2300,1000,np.nan]
}
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)
Yields below output.
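# Output:
# DataFrame:
#    Courses      Fee Duration  Discount
# 0    Spark  22000.0   30days    1000.0
# 1  PySpark  25000.0     None    2300.0
# 2   Hadoop      NaN   55days    1000.0
# 3   Python  24000.0      NaN       NaN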
Delete Rows Using drop()
To delete rows based on specific column values in a pandas DataFrame, you typically filter the DataFrame using boolean indexing and then reassign the filtered DataFrame back to the original variable, or use the drop() method to remove those rows.
# Delete rows using drop()
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print("Drop rows based on column value:\n", df)
Yields below output.
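# Output:
# Drop rows based on column value:
#   Courses      Fee Duration  Discount
# 0   Spark  22000.0   30days    1000.0
# 2  Hadoop      NaN   55days    1000.0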
In the above example, we use the drop() method to remove the rows where the Fee column is greater than or equal to 24000 (the row with a NaN Fee is kept, because the comparison evaluates to False for NaN). We used inplace=True to modify the original DataFrame df.
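If you prefer not to modify df in place, the filter-and-reassign approach mentioned above works too. Here is a minimal sketch using the same technologies data; the negated condition keeps exactly the rows that drop() would leave behind.

# Alternative: keep only the rows you want and reassign (no inplace)
df = pd.DataFrame(technologies)
df = df[~(df['Fee'] >= 24000)]
print("Drop rows based on column value:\n", df)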
Using loc[]
Using loc[] to drop rows based on a column value means using the loc[] accessor to select only the rows that satisfy a condition on a specific column, effectively filtering out the rows that do not meet the condition.
# Remove rows by keeping only the matching rows (recreate df first)
df = pd.DataFrame(technologies)
df2 = df[df.Fee >= 24000]
print("Drop rows based on column value:\n", df2)

# Using loc[]
df2 = df.loc[df["Fee"] >= 24000]
print("Drop rows based on column value:\n", df2)
# Output:
# Drop rows based on column value:
# Courses Fee Duration Discount
# 1 PySpark 25000.0 None 2300.0
# 3 Python 24000.0 NaN NaN
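If the goal is to delete the matching rows rather than keep them, you can pass the index selected with loc[] to drop(). A minimal sketch:

# To actually delete the matching rows, drop the index selected with loc[]
df2 = df.drop(df.loc[df["Fee"] >= 24000].index)
print("Drop rows based on column value:\n", df2)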
Delete Rows Based on Multiple Column Values
To delete rows from a DataFrame based on multiple column values in pandas, you can use the drop()
function along with boolean indexing.
# Delete rows based on multiple column values
df = pd.DataFrame(technologies)
df = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print("Drop rows based on multiple column values:\n", df)
# Output:
# Drop rows based on multiple column values:
# Courses Fee Duration Discount
# 1 PySpark 25000.0 None 2300.0
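Conversely, if you want to remove the rows that match several conditions instead of keeping them, combine the conditions and pass the resulting index to drop(). For example:

# Drop (rather than keep) rows matching multiple conditions
df = pd.DataFrame(technologies)
df.drop(df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)].index, inplace=True)
print("Drop rows based on multiple column values:\n", df)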
Delete Rows Based on None or NaN Column Values
When you have None or NaN values in columns, you may need to remove them before applying calculations. You can do this using the notnull() function.
Note: You cannot reliably filter None or NaN values with the == or != operators; use notnull()/isnull() instead.
# Drop rows with None/NaN values (recreate df first)
df = pd.DataFrame(technologies)
df2 = df[df.Discount.notnull()]
print("Drop rows based on column value:\n", df2)
# Output:
# Drop rows based on column value:
# Courses Fee Duration Discount
# 0 Spark 22000.0 30days 1000.0
# 1 PySpark 25000.0 None 2300.0
# 2 Hadoop NaN 55days 1000.0
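An equivalent way to remove rows with missing values in a specific column is DataFrame.dropna() with the subset parameter; a short sketch:

# Equivalent: drop rows where 'Discount' is NaN using dropna()
df2 = df.dropna(subset=['Discount'])
print("Drop rows with NaN in Discount:\n", df2)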
Using query()
The DataFrame.query() function is primarily used for filtering rows based on a condition rather than directly deleting rows. However, you can filter rows using query() and then assign the filtered DataFrame back to the original variable, effectively removing the rows that do not meet the specified condition.
# Delete rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")
# Using variable
value='Spark'
df2=df.query("Courses == @value")
# Inplace
df.query("Courses == 'Spark'", inplace=True)
# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
# Other ways to Delete Rows
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
values = ['Spark','PySpark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
# Using lambda
df[df['Courses'].apply(lambda x: x in ['Spark','PySpark'])]
df.dropna()
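Putting query() into the delete-rows pattern described above, a minimal sketch that removes rows where Fee is 24000 or more looks like this (recreating the sample DataFrame first):

# Delete rows via query(): keep only the rows that should remain
df = pd.DataFrame(technologies)
df = df.query("Fee < 24000")
# note: rows where Fee is NaN are also removed, since the condition is False for NaN
print("Drop rows based on column value:\n", df)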
Based on the Inverse of Column Values
To delete rows from a DataFrame where the value in the Courses column is not equal to PySpark, invert the boolean condition with the tilde (~) operator and pass the matching index to drop().
# Delete rows based on the inverse of column values
df = pd.DataFrame(technologies)
df1 = df[~(df['Courses'] == "PySpark")].index
df.drop(df1, inplace = True)
print("Drop rows based on column value:\n", df)

# Output:
# Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 1  PySpark  25000.0     None    2300.0
The above code drops rows from the DataFrame df where the value in the Courses column is not equal to PySpark. It first finds the index of such rows using the boolean condition and then drops those rows using the drop() method.
Complete Example
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python"],
'Fee' :[22000,25000,np.nan,24000],
'Duration':['30days',None,'55days',np.nan],
'Discount':[1000,2300,1000,np.nan]
}
df = pd.DataFrame(technologies)
print(df)
# Using drop() to remove rows
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print(df)
# Remove rows
df = pd.DataFrame(technologies)
df2 = df[df.Fee >= 24000]
print(df2)
# Reset index after deleting rows
df2 = df[df.Fee >= 24000].reset_index()
print(df2)
# If a column name contains spaces, use bracket notation with quotes
# instead of attribute access, for example:
# df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
print(df2)
# Delete rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print(df2)
# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]
print(df2)
FAQs on Dropping Rows Based on Column Value
What does dropping rows based on column values mean?
Dropping rows based on column values refers to removing rows from a dataset where certain conditions are met within specific columns. This is a common operation in data cleaning and filtering tasks.
How do I drop rows where a column is not equal to a value?
You can use the ~ (tilde) operator to negate a condition. For example, df.drop(df[~(df['Column'] == 'Value')].index) will drop rows where 'Column' is not equal to 'Value'.
How do I drop rows based on multiple conditions?
You can drop rows based on multiple conditions by combining them using logical operators like & (AND) or | (OR).
Can I drop rows based on NULL or missing values?
Yes, it's entirely possible to drop rows based on NULL or missing values in a DataFrame. In pandas, you can use the dropna() method.
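As a small illustration of these answers, here is a sketch that applies each pattern to the sample technologies DataFrame used throughout this article:

# Illustrating the FAQ answers on the sample DataFrame
df = pd.DataFrame(technologies)

# Drop rows where 'Courses' is not equal to 'Spark' (negation with ~)
df1 = df.drop(df[~(df['Courses'] == 'Spark')].index)

# Drop rows matching multiple conditions combined with & / |
df2 = df.drop(df[(df['Fee'] >= 22000) | (df['Discount'] >= 2000)].index)

# Drop rows containing any NaN/None values
df3 = df.dropna()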
Conclusion
In this article, I have explained various methods in pandas for deleting DataFrame rows based on column values. These techniques, including boolean indexing, the drop() method, and query(), offer flexibility in managing and filtering DataFrame data according to specific conditions.
Happy Learning !!
Related Articles
- How to Drop Rows From Pandas DataFrame
- Pandas Drop Rows by Index
- How to drop duplicate rows from DataFrame?
- Pandas Drop Rows with NaN Values in DataFrame
- Pandas Drop List of Rows From DataFrame
- Pandas Drop Last N Rows From DataFrame
- Pandas Drop First N Rows From DataFrame
- How to drop first row from the Pandas DataFrame