Spark By {Examples}

Pandas Drop Rows Based on Column Value


As part of data cleansing, you are often required to drop rows from a pandas DataFrame when a column value matches a static value or a value in another column. You can do this with the drop() method or with boolean indexing. In this article, I will explain dropping rows based on column values with examples.


Create DataFrame

To run some examples of drop rows based on column value, let’s create Pandas DataFrame using data from a dictionary.


# Create pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[22000,25000,np.nan,24000],
    'Duration':['30days',None,'55days',np.nan],
    'Discount':[1000,2300,1000,np.nan]
          }
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)

Yields the below output.

# Output:
# DataFrame:
#    Courses      Fee Duration  Discount
# 0    Spark  22000.0   30days    1000.0
# 1  PySpark  25000.0     None    2300.0
# 2   Hadoop      NaN   55days    1000.0
# 3   Python  24000.0      NaN       NaN

Delete Rows Using drop()

To delete rows based on specific column values, filter the DataFrame using boolean indexing and reassign the result back to the original variable, or pass the index of the matching rows to the drop() method.


# Delete rows using drop()
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print("Drop rows based on column value:\n", df)

Yields the below output.

# Output:
# Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 0    Spark  22000.0   30days    1000.0
# 2   Hadoop      NaN   55days    1000.0

In the above example, we use the drop() method to remove the rows where the Fee column value is greater than or equal to 24000. Passing inplace=True modifies the original DataFrame df rather than returning a new one. Note that comparisons against NaN evaluate to False, so the Hadoop row (with a NaN Fee) is kept.
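As mentioned at the start, the same index-based drop() pattern also works when the condition compares one column against another column rather than a static value. Here is a minimal sketch; the 5% discount threshold is an arbitrary assumption for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [22000, 25000, np.nan, 24000],
    'Discount': [1000, 2300, 1000, np.nan],
})

# Drop rows where Discount exceeds 5% of Fee
# (a column-to-column comparison; NaN comparisons are False, so
# rows with a NaN Fee or Discount are kept)
df.drop(df[df['Discount'] > df['Fee'] * 0.05].index, inplace=True)
print(df)
```

Only the PySpark row (2300 > 25000 * 0.05) matches the condition and is dropped.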

Using loc[]

The loc[] accessor selects rows from a DataFrame by label or boolean condition. By applying a condition to a specific column, you keep only the rows that satisfy it, effectively dropping the rest.


# Recreate the DataFrame since drop() above modified it in place
df = pd.DataFrame(technologies)

# Remove rows using boolean indexing
df2 = df[df.Fee >= 24000]
print("Drop rows based on column value:\n", df2)

# Using loc[]
df2 = df.loc[df["Fee"] >= 24000]
print("Drop rows based on column value:\n", df2)

# Output:
#  Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 1  PySpark  25000.0     None    2300.0
# 3   Python  24000.0      NaN       NaN
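The examples above keep the rows where Fee >= 24000. To instead drop those rows with loc[], invert the condition with the tilde (~) operator. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [22000, 25000, np.nan, 24000],
})

# Invert the condition with ~ so matching rows are dropped.
# Note: NaN >= 24000 evaluates to False, so ~ keeps the NaN row,
# whereas df['Fee'] < 24000 would drop it.
df2 = df.loc[~(df['Fee'] >= 24000)]
print(df2)
```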

Delete Rows Based on Multiple Column Values

To delete rows from a DataFrame based on multiple column values, combine the conditions with & (and) or | (or) in a boolean index, wrapping each condition in parentheses.


# Delete rows based on multiple column values
df = pd.DataFrame(technologies)
df = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print("Drop rows based on multiple column values:\n", df)

# Output:
# Drop rows based on multiple column values:
#    Courses      Fee Duration  Discount
# 1  PySpark  25000.0     None    2300.0
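Conditions can also be combined with | (or), which keeps rows matching either condition. A minimal sketch (the thresholds here are arbitrary, chosen for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [22000, 25000, np.nan, 24000],
    'Discount': [1000, 2300, 1000, np.nan],
})

# Keep rows matching EITHER condition using | (or);
# each condition must be wrapped in parentheses
df2 = df[(df['Fee'] >= 25000) | (df['Discount'] <= 1000)]
print(df2)
```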

Delete Rows Based on None or NaN Column Values

When a column contains None or NaN values, you may need to remove those rows before applying calculations. You can do this using the notnull() function.

Note: NaN never compares equal to anything (including itself), so you cannot reliably use the == or != operators to test for None or NaN values; use notnull()/isnull() instead.


# Recreate the DataFrame
df = pd.DataFrame(technologies)

# Drop rows with None/NaN values in Discount
df2 = df[df.Discount.notnull()]
print("Drop rows based on column value:\n", df2)

# Output:
#  Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 0    Spark  22000.0   30days    1000.0
# 1  PySpark  25000.0     None    2300.0
# 2   Hadoop      NaN   55days    1000.0
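An equivalent and often more readable alternative is DataFrame.dropna() with the subset parameter, which drops rows that have missing values only in the listed columns. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [22000, 25000, np.nan, 24000],
    'Discount': [1000, 2300, 1000, np.nan],
})

# Drop rows with NaN only in the Discount column;
# NaN in other columns (e.g. Hadoop's Fee) does not matter
df2 = df.dropna(subset=['Discount'])
print(df2)
```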

Using query()

The DataFrame.query() method is primarily used for filtering rows based on a condition rather than directly deleting them. However, you can filter rows with query() and assign the result back to the original variable (or pass inplace=True), effectively removing the rows that do not meet the specified condition.


# Delete rows using DataFrame.query()
df2 = df.query("Courses == 'Spark'")

# Using a variable in the expression with @
value = 'Spark'
df2 = df.query("Courses == @value")

# Inplace
df.query("Courses == 'Spark'", inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")

# Backtick-quote column names that contain spaces
# (assumes a column named 'Courses Fee' exists)
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

# Other ways to delete rows
values = ['Spark', 'PySpark']
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000)]

df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]

# Using a lambda with apply()
df[df['Courses'].apply(lambda x: x in values)]

# Drop rows containing any NaN values
df.dropna()
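To make the deletion concrete, here is a minimal runnable sketch using query() with inplace=True, which mutates the DataFrame and effectively deletes the matching rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [22000.0, 25000.0, None, 24000.0],
})

# Keep only rows whose Courses value is NOT 'Spark';
# inplace=True mutates df in place instead of returning a copy
df.query("Courses != 'Spark'", inplace=True)
print(df)
```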

Based on the Inverse of Column Values

To delete rows from a DataFrame where the value in the Courses column is not equal to PySpark, invert the boolean condition with the tilde (~) operator.


# Delete rows based on inverse of column values
df = pd.DataFrame(technologies)
df1 = df[~(df['Courses'] == "PySpark")].index
df.drop(df1, inplace=True)
print("Drop rows based on column value:\n", df)

# Output:
# Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 1  PySpark  25000.0     None    2300.0

The above code will drop rows from the DataFrame df where the value in the Courses column is not equal to PySpark. It first finds the index of such rows using the boolean condition and then drops those rows using the drop() method.
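The two steps above can also be collapsed into a single expression. As a sketch on a minimal frame:

```python
import pandas as pd

df = pd.DataFrame({'Courses': ["Spark", "PySpark", "Hadoop", "Python"]})

# Equivalent one-liner: directly keep only the matching rows
df2 = df[df['Courses'] == "PySpark"]

# Or fold the inverted mask and drop() into one expression
df3 = df.drop(df[df['Courses'] != "PySpark"].index)
print(df3)
```

Both produce the same result; the boolean-indexing form returns a new DataFrame, while drop() can also work in place.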

Complete Example


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[22000,25000,np.nan,24000],
    'Duration':['30days',None,'55days',np.nan],
    'Discount':[1000,2300,1000,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

# Using drop() to remove rows
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print(df)

# Remove rows
df = pd.DataFrame(technologies)
df2 = df[df.Fee >= 24000]
print(df2)

# Reset index after deleting rows
df2 = df[df.Fee >= 24000].reset_index()
print(df2)

# If a column name contains spaces, quote it inside
# the brackets, e.g. (assuming such a column exists):
# df2 = df[df['column name'] >= 24000]

# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
print(df2)

# Delete rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print(df2)

# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]
print(df2)

FAQ on Drop Rows Based on Column Value

How can I drop rows from a DataFrame based on a condition in a specific column?

You can use boolean indexing to create a mask based on the condition you want to apply to the column, then pass the index of the matching rows (df[mask].index) to the DataFrame's drop() method to remove them. Alternatively, keep the inverse of the mask with df = df[~mask].
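A minimal sketch of the mask-then-drop pattern on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4]})

# Build a boolean mask, then drop the rows whose index it selects
mask = df['A'] > 2
df = df.drop(df[mask].index)
print(df)
```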

Can I use the query() method to drop rows based on column values?

The query() method filters rows based on a condition rather than deleting them directly. You can assign the filtered result back to the original variable (or pass inplace=True), which effectively removes the rows that do not meet the condition.

Can I drop rows based on the inverse of a condition?

You can use the ~ (tilde) operator to negate a condition. For example, df.drop(df[~(df['Column'] == 'Value')].index) will drop rows where 'Column' is not equal to 'Value'.

Conclusion

In this article, I have explained various methods in Pandas for deleting DataFrame rows based on column values. These techniques, including boolean indexing, the drop() method, and query(), offer flexibility in managing and filtering DataFrame data according to specific conditions.

Happy Learning !!
