  • Post category:Pandas
  • Post last modified:November 4, 2024
Pandas Drop Duplicate Rows in DataFrame

To drop duplicate rows in a pandas DataFrame, use the drop_duplicates() method. It can identify duplicates by comparing all columns or only a selected subset of columns.

In this article, we’ll explain several ways of dropping duplicate rows from Pandas DataFrame with examples by using functions like DataFrame.drop_duplicates(), DataFrame.apply() and lambda function.

Key Points –

  • The drop_duplicates() method in Pandas is used to remove duplicate rows from a DataFrame.
  • By default, drop_duplicates() keeps the first occurrence of each duplicated row and removes subsequent ones.
  • The subset parameter allows specifying specific columns to check for duplicates instead of the entire DataFrame.
  • The keep parameter controls which duplicates to keep: “first” for the first occurrence, “last” for the last, or False to remove all duplicates.
  • Setting inplace=True modifies the original DataFrame directly and returns None.
  • Without inplace=True, drop_duplicates() returns a new DataFrame and leaves the original unchanged.

Quick Examples of Dropping Duplicate Rows

Below are quick examples of dropping duplicate rows in Pandas DataFrame.


# Quick examples of drop duplicate rows

# Example 1: Keep first duplicate row
df2 = df.drop_duplicates()

# Example 2: Using DataFrame.drop_duplicates() 
# To keep first duplicate row
df2 = df.drop_duplicates(keep='first')

# Example 3: Keep last duplicate row
df2 = df.drop_duplicates(keep='last')

# Example 4: Remove all duplicate rows 
df2 = df.drop_duplicates(keep=False)

# Example 5: Delete duplicate rows based on specific columns 
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)

# Example 6: Drop duplicate rows in place
df.drop_duplicates(inplace=True)

# Example 7: Using DataFrame.apply() and lambda function 
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')

drop_duplicates() Syntax & Examples

Following is the syntax of the DataFrame.drop_duplicates() method, which removes duplicate rows from a pandas DataFrame.


# Syntax of drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
  • subset – Column label or list of labels to consider when identifying duplicates. If None (default), all columns are considered.
  • keep – Allowed values are {‘first’, ‘last’, False}, default ‘first’.
    • ‘first’ – Drops all duplicate rows except the first occurrence.
    • ‘last’ – Drops all duplicate rows except the last occurrence.
    • False – Drops every row that has a duplicate, keeping none of them.
  • inplace – Boolean, default False. If True, duplicates are removed from the existing DataFrame and the method returns None.
  • ignore_index – Boolean, default False. If True, the resulting rows are relabeled 0, 1, …, n – 1.

To run some examples of dropping duplicate rows in Pandas DataFrame, let’s create Pandas DataFrame using data from a dictionary.


# Create a DataFrame that contains duplicate rows
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','35days','30days','50days'],
    'Discount':[1000,2300,1200,2000,1200,1000,2000]
}
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)

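Yields below output.

# Output:
# DataFrame:
#     Courses    Fee Duration  Discount
# 0    Spark  20000   30days      1000
# 1  PySpark  25000   40days      2300
# 2   Python  22000   35days      1200
# 3   pandas  30000   50days      2000
# 4   Python  22000   35days      1200
# 5    Spark  20000   30days      1000
# 6   pandas  30000   50days      2000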

Pandas Drop Duplicate Rows

To drop duplicate rows in a Pandas DataFrame, you can use the drop_duplicates() method. By default, this method removes rows with identical values across all columns while keeping the first occurrence.


# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print("After dropping duplicate rows:\n", df2)

In the above example, drop_duplicates() is used without any arguments, which implies subset=None and keep='first'. Therefore, it removes duplicate rows based on all columns while keeping the first occurrence, resulting in a DataFrame with only unique rows. This example yields the below output.

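# Output:
# After dropping duplicate rows:
#     Courses    Fee Duration  Discount
# 0    Spark  20000   30days      1000
# 1  PySpark  25000   40days      2300
# 2   Python  22000   35days      1200
# 3   pandas  30000   50days      2000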
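Before dropping anything, it can help to see which rows pandas considers duplicates. The sketch below uses DataFrame.duplicated(), which is not covered elsewhere in this article, on the same sample data; the names mask and df_unique are only illustrative.

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    'Duration': ['30days', '40days', '35days', '50days', '35days', '30days', '50days'],
    'Discount': [1000, 2300, 1200, 2000, 1200, 1000, 2000],
})

# duplicated() marks every repeat of an earlier row as True (keep='first' by default)
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, False, True, True, True]

# Selecting the unmarked rows gives the same result as drop_duplicates()
df_unique = df[~mask]
print(df_unique.equals(df.drop_duplicates()))  # True

# The original DataFrame is left untouched
print(len(df), len(df_unique))  # 7 4
```

Rows 4, 5 and 6 are flagged because they repeat rows 2, 0 and 3 respectively, which is why only four rows remain.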

Drop Duplicate Rows and Keep the Last Row

To keep the last occurrence of each duplicated row and drop the earlier ones, pass keep='last'. For instance, df.drop_duplicates(keep='last').


# Keep last duplicate row
df2 = df.drop_duplicates(keep='last')
print(df2)

# Output:
#    Courses    Fee Duration  Discount
# 1  PySpark  25000   40days      2300
# 4   Python  22000   35days      1200
# 5    Spark  20000   30days      1000
# 6   pandas  30000   50days      2000

Remove All Duplicate Rows from Pandas DataFrame

Set keep=False in the drop_duplicates() method to remove every row that has a duplicate, including its first occurrence. For example, df.drop_duplicates(keep=False).


# Remove all duplicate rows 
df2 = df.drop_duplicates(keep=False)
print(df2)

# Output:
#    Courses    Fee Duration  Discount
# 1  PySpark  25000   40days      2300

Delete Duplicate Rows based on Specific Columns

To delete duplicate rows based on specific columns, pass the column names as a list to the subset parameter. Rows are then compared on those columns only. Combining subset with keep=False removes every row whose values in those columns appear more than once.


# Delete duplicate rows based on specific columns 
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
print(df2)

# Output:
#    Courses    Fee Duration  Discount
# 1  PySpark  25000   40days      2300
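In the sample data every duplicate is a full-row duplicate, so subset does not change which rows match. The sketch below uses a small hypothetical DataFrame (df_sub, not part of the article's data) in which two rows agree on Courses and Fee but differ in Discount, to show how subset and keep interact.

```python
import pandas as pd

# Hypothetical data: two 'Spark' rows share Courses and Fee but differ in Discount
df_sub = pd.DataFrame({
    'Courses': ["Spark", "Spark", "Python"],
    'Fee': [20000, 20000, 22000],
    'Discount': [1000, 1500, 1200],
})

# No row is a full duplicate, so comparing all columns drops nothing
print(len(df_sub.drop_duplicates()))  # 3

# Comparing only Courses and Fee treats the two 'Spark' rows as duplicates
first = df_sub.drop_duplicates(subset=["Courses", "Fee"], keep='first')
last = df_sub.drop_duplicates(subset=["Courses", "Fee"], keep='last')
print(first['Discount'].tolist())  # [1000, 1200]
print(last['Discount'].tolist())   # [1500, 1200]
```

The keep argument decides which of the matching rows survives, so it matters most when the columns outside subset carry different values.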

Drop Duplicate Rows In Place

To drop duplicate rows in place in a Pandas DataFrame, you can use the drop_duplicates() method with the inplace=True parameter. This will modify the original DataFrame instead of returning a new DataFrame.


# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)

# Output:
#    Courses    Fee Duration  Discount
# 0    Spark  20000   30days      1000
# 1  PySpark  25000   40days      2300
# 2   Python  22000   35days      1200
# 3   pandas  30000   50days      2000

Keep in mind that using inplace=True directly modifies the original DataFrame and does not return a new DataFrame. Therefore, there’s no need to assign the result to a new variable. If you want to keep the original DataFrame unchanged and create a new one with duplicate rows dropped, you can omit inplace=True.
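A common mistake is to assign the result of an in-place call back to a variable. The following minimal sketch, using a hypothetical df_small, shows what that assignment actually holds in current pandas releases.

```python
import pandas as pd

# Hypothetical data with one repeated row
df_small = pd.DataFrame({
    'Courses': ["Spark", "Spark", "Python"],
    'Fee': [20000, 20000, 22000],
})

# With inplace=True the method changes df_small and returns None
result = df_small.drop_duplicates(inplace=True)
print(result)         # None
print(len(df_small))  # 2
```

Writing df_small = df_small.drop_duplicates(inplace=True) would therefore replace the DataFrame with None, so use either the in-place form or the assignment form, not both.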

Remove Duplicate Rows Using apply() and Lambda Function

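To treat rows that differ only in letter case as duplicates, use DataFrame.apply() with a lambda function to convert every column to a lowercase string, and then call drop_duplicates() on the result. Note that the returned DataFrame contains the lowercased string values, not the original ones.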


# Using DataFrame.apply() and lambda function 
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
print(df2)

# Output:
#    Courses    Fee Duration Discount
# 0    spark  20000   30days     1000
# 1  pyspark  25000   40days     2300
# 2   python  22000   35days     1200
# 3   pandas  30000   50days     2000
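The apply() approach above converts every value, including the numeric columns, to a lowercase string. If the original casing and dtypes should survive, one possible alternative is to build a lowercase comparison key and use it only to decide which rows to keep. This is a sketch under that assumption; df_case, key and df_ci are illustrative names.

```python
import pandas as pd

# Hypothetical data: 'Spark' and 'spark' differ only by letter case
df_case = pd.DataFrame({
    'Courses': ["Spark", "spark", "Python"],
    'Fee': [20000, 20000, 22000],
})

# Build a lowercase comparison key without modifying df_case
key = df_case.assign(Courses=df_case['Courses'].str.lower())

# Mark repeats on the key, then filter the original rows with that mask
mask = key.duplicated(subset=['Courses', 'Fee'], keep='first')
df_ci = df_case[~mask]
print(df_ci['Courses'].tolist())  # ['Spark', 'Python']
print(df_ci['Fee'].tolist())      # [20000, 22000]
```

Because the mask is applied to df_case itself, the kept rows retain their original spelling and Fee stays numeric.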

Complete Example For Drop Duplicate Rows in DataFrame


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','35days','30days','50days'],
    'Discount':[1000,2300,1200,2000,1200,1000,2000]
}
df = pd.DataFrame(technologies)
print(df)

# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print(df2)

# keep last duplicate row
df2 = df.drop_duplicates(keep='last')
print(df2)

# Remove all duplicate rows 
df2 = df.drop_duplicates(keep=False)
print(df2)

# Delete duplicate rows based on specific columns 
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
print(df2)

# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)

# Using DataFrame.apply() and lambda function 
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
print(df2)

FAQ on Drop Duplicate Rows in DataFrame

How do I drop duplicate rows in a Pandas DataFrame?

You can use the drop_duplicates() method. By default, it removes duplicate rows based on all columns, keeping the first occurrence.

What does dropping duplicate rows mean in Pandas?

Dropping duplicate rows means removing rows from a DataFrame that have the same values across all columns or a specified subset of columns.

Can I drop duplicates based on specific columns in a DataFrame?

You can drop duplicates based on specific columns by specifying those columns in the subset parameter of drop_duplicates().

Can I reset the index after dropping duplicates?

You can reset the index after dropping duplicates by using the ignore_index=True parameter within the drop_duplicates() method. This will reindex the resulting DataFrame, providing a new sequential index starting from 0.
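As a minimal illustration of that answer, the sketch below compares the index labels with and without ignore_index=True. The DataFrame df_idx is hypothetical, and the ignore_index parameter assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical single-column data with repeats at positions 1 and 3
df_idx = pd.DataFrame({'Courses': ["Spark", "Spark", "Python", "Spark", "pandas"]})

# By default the surviving rows keep their original index labels
kept = df_idx.drop_duplicates()
print(kept.index.tolist())        # [0, 2, 4]

# ignore_index=True relabels the result 0, 1, 2, ...
renumbered = df_idx.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```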

Conclusion

In this article, you learned various methods to drop or remove duplicate rows from a Pandas DataFrame. We explored using pandas.DataFrame.drop_duplicates() with different parameters to customize the behavior, as well as leveraging DataFrame.apply() and lambda functions for more advanced scenarios.

Happy Learning !!
