Pandas Drop Rows Based on Column Value

Use the drop() method to delete rows based on a column value in a pandas DataFrame. As part of data cleansing, you may be required to drop rows from a DataFrame when a column value matches a static value or the value of another column. In this article, I will explain dropping rows based on column values with examples.

Key Points –

  • Use boolean indexing to filter rows based on specific conditions in a DataFrame column; the condition can involve any comparison or logical operation.
  • Create a boolean mask identifying the rows to be dropped, then apply it with the .loc[] indexer or pass the matching index to DataFrame.drop().
  • Either assign the result to a new DataFrame or pass inplace=True to modify the original DataFrame; be deliberate about which you choose to avoid unintended consequences. A minimal sketch of both patterns follows below.
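
As a minimal sketch of these two core patterns, assuming a DataFrame df with a Courses column:


# Keep only the rows you want (boolean indexing)
df = df[df['Courses'] != 'Spark']

# Or drop the matching rows by index
df.drop(df[df['Courses'] == 'Spark'].index, inplace=True)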

Create DataFrame

Now, let’s create a DataFrame with some rows and columns and then execute the examples provided to validate the results.


# Create pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [22000, 25000, np.nan, 24000],
    'Duration': ['30days', None, '55days', np.nan],
    'Discount': [1000, 2300, 1000, np.nan]
}
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)

Yields below output.

# Output:
# DataFrame:
   Courses      Fee Duration  Discount
0    Spark  22000.0   30days    1000.0
1  PySpark  25000.0     None    2300.0
2   Hadoop      NaN   55days    1000.0
3   Python  24000.0      NaN       NaN

Delete Rows Based on Column Values

To delete rows based on specific column values in a Pandas DataFrame, you typically filter the DataFrame using boolean indexing and then reassign the filtered DataFrame back to the original variable or use the drop() method to remove those rows.


# Delete rows using drop()
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print("Drop rows based on column value:\n", df)

Yields below output.

# Output:
# Drop rows based on column value:
   Courses      Fee Duration  Discount
0    Spark  22000.0   30days    1000.0
2   Hadoop      NaN   55days    1000.0

In the above example, the drop() method removes the rows where the 'Fee' column is greater than or equal to 24000, and inplace=True modifies the original DataFrame df. Note that the row with NaN in 'Fee' is kept, because NaN compared with 24000 evaluates to False.
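
If you prefer not to modify df in place, drop() also returns a new DataFrame; a minimal sketch with the same condition:


# Non-inplace variant: returns a new DataFrame, df is unchanged
df2 = df.drop(df[df['Fee'] >= 24000].index)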

Using loc[]

Alternatively, another commonly used approach is boolean indexing, either directly with df[] or with the loc[] indexer. Note that these examples keep the rows that match the condition; to drop them, invert the condition.


# Recreate the DataFrame
df = pd.DataFrame(technologies)

# Keep rows using boolean indexing with df[]
df2 = df[df.Fee >= 24000]
print("Drop rows based on column value:\n", df2)

# Using loc[]
df2 = df.loc[df["Fee"] >= 24000]
print("Drop rows based on column value:\n", df2)

Yields below output.


# Output:
# Drop rows based on column value:
   Courses      Fee Duration  Discount
1  PySpark  25000.0     None    2300.0
3   Python  24000.0      NaN       NaN

Delete Rows Based on Multiple Column Values

Sometimes you may need to delete rows based on matching values across multiple columns.


# Delete rows based on multiple column values
df = pd.DataFrame(technologies)
df = df[ (df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print("Drop rows based on multiple column values:\n", df)

Yields below output.


# Output:
# Drop rows based on multiple column values:
   Courses      Fee Duration  Discount
1  PySpark  25000.0     None    2300.0
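
The example above keeps the matching rows. To drop the rows that match both conditions instead, one sketch is to pass the matching index to drop():


# Drop (rather than keep) rows matching both conditions
df = pd.DataFrame(technologies)
df.drop(df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)].index, inplace=True)
print(df)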

Delete Rows Based on None or NaN Column Values

When a column contains None or NaN values, you may need to remove those rows before applying calculations. You can do this using the notnull() function.

Note: You cannot match None or NaN values with the == or != operators, because comparisons with NaN always evaluate to False; use notnull() or isnull() instead.


# Recreate the DataFrame
df = pd.DataFrame(technologies)

# Drop rows with None/NaN values in the Discount column
df2 = df[df.Discount.notnull()]
print("Drop rows based on column value:\n", df2)

Yields below output.


# Output:
# Drop rows based on column value:
   Courses      Fee Duration  Discount
0    Spark  22000.0   30days    1000.0
1  PySpark  25000.0     None    2300.0
2   Hadoop      NaN   55days    1000.0
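
Alternatively, the dropna() method with its subset parameter drops rows that have NaN in specific columns and yields the same result here:


# Equivalent using dropna() on the Discount column
df2 = df.dropna(subset=['Discount'])
print(df2)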

Using query()

You can use the query() method in a Pandas DataFrame to filter rows, for example where the value in the Courses column equals Spark. Note that query() filters rows based on a condition rather than deleting them directly; to actually remove the matching rows, pass their index to the drop() method, as shown at the end of the examples below.


# Delete rows using DataFrame.query()
df2 = df.query("Courses == 'Spark'")

# Using a variable in the expression
value = 'Spark'
df2 = df.query("Courses == @value")

# In place
df.query("Courses == 'Spark'", inplace=True)
df = pd.DataFrame(technologies)   # recreate, since inplace modified df

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")

# Use backticks for column names that contain spaces
# (these assume a hypothetical 'Courses Fee' column)
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

# Other ways to select rows (keep the inverse to delete them)
value = 'Spark'
values = ['Spark', 'PySpark']
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000)]

df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]

# Using a lambda with apply() to filter row by row
df[df.apply(lambda row: row['Courses'] in values, axis=1)]

# Drop all rows that contain any NaN values
df.dropna()

Based on the Inverse of Column Values

To delete rows where the value in the Courses column is not equal to PySpark, use the tilde (~) operator to invert the boolean condition.


# Delete rows based on inverse of column values
df = pd.DataFrame(technologies)
df1 = df[~(df['Courses'] == "PySpark")].index
df.drop(df1, inplace=True)
print("Drop rows based on column value:\n", df)

Yields below output.


# Output:
# Drop rows based on column value:
   Courses      Fee Duration  Discount
1  PySpark  25000.0     None    2300.0

The above code drops rows from the DataFrame df where the value in the Courses column is not equal to PySpark. It first finds the index of those rows using the boolean condition, then drops them with the drop() method.
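
Note that inverting a condition and dropping is equivalent to simply keeping the matching rows, which is often the simpler sketch:


# Equivalent: keep only the PySpark rows
df2 = df[df['Courses'] == "PySpark"]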

Complete Example


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[22000,25000,np.nan,24000],
    'Duration':['30days',None,'55days',np.nan],
    'Discount':[1000,2300,1000,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

# Using drop() to remove rows
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print(df)

# Remove rows
df = pd.DataFrame(technologies)
df2 = df[df.Fee >= 24000]
print(df2)

# Reset index after deleting rows
df2 = df[df.Fee >= 24000].reset_index()
print(df2)

# If a column name contains spaces, surround it with
# quotes inside the brackets when building the condition
# (assumes a hypothetical 'column name' column)
# df2 = df[df['column name'] >= 23000]

# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
print(df2)

# Delete rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print(df2)

# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]
print(df2)

FAQs on Drop Rows Based on Column Value

How can I drop rows based on a specific column value?

To drop rows based on a specific column value in a Pandas DataFrame, you can use boolean indexing.
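
For example, a minimal sketch using the DataFrame from this article:


# Keep every row whose Courses value is not 'Spark'
df = df[df['Courses'] != 'Spark']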

Can I drop rows based on multiple conditions?

You can drop rows based on multiple conditions using boolean indexing. For example, to drop rows where the value in the "Courses" column is not "PySpark" and the value in the "Score" column is less than 80, combine the two conditions with & and invert the resulting mask, as sketched below.
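
A sketch of that example, assuming a hypothetical Score column alongside Courses:


# Drop rows where Courses is not 'PySpark' and Score is below 80
# ('Score' is a hypothetical column used for illustration)
df = df[~((df['Courses'] != 'PySpark') & (df['Score'] < 80))]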

Is there a way to drop rows based on column values using the drop() method?

The drop() method in Pandas DataFrame drops columns or rows by label, so it is not typically used to drop rows based on column values directly. Instead, use boolean indexing or the query() method, or pass the index of the matching rows to drop().

Can I use the query() method to drop rows based on column values?

You can use the query() method in Pandas DataFrame to drop rows based on column values by specifying the condition you want to filter.

Conclusion

In this article, you have learned how to delete DataFrame rows based on column values using several approaches in pandas, including boolean indexing, loc[], drop(), and query().

Happy Learning !!


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium