Pandas Drop Duplicate Rows in DataFrame

  • Post last modified: January 9, 2024

The pandas.DataFrame.drop_duplicates() method removes duplicate rows from a DataFrame. You can compare rows on all columns or on a selected subset of columns. In this article, we’ll explain several ways of dropping duplicate rows from a Pandas DataFrame, using DataFrame.drop_duplicates() on its own and in combination with DataFrame.apply() and a lambda function.

1. Quick Examples of Drop Duplicate Rows

If you are in a hurry, below are some quick examples of how to drop duplicate rows in Pandas DataFrame.


# Below are the quick examples

# Example 1: Keep first duplicate row
df2 = df.drop_duplicates()

# Example 2: Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')

# Example 3: Keep last duplicate row
df2 = df.drop_duplicates(keep='last')

# Example 4: Remove all duplicate rows 
df2 = df.drop_duplicates(keep=False)

# Example 5: Delete duplicate rows based on specific columns 
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)

# Example 6: Drop duplicate rows in place
df.drop_duplicates(inplace=True)

# Example 7: Using DataFrame.apply() and lambda function 
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')

2. drop_duplicates() Syntax & Examples

Below is the syntax of the DataFrame.drop_duplicates() function that removes duplicate rows from the pandas DataFrame.


# Syntax of drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
  • subset – Column label or sequence of labels to consider when identifying duplicate rows. The default is None, which uses all columns.
  • keep – Allowed values are {‘first’, ‘last’, False}, default ‘first’.
    • ‘first’ – Duplicate rows except for the first one are dropped.
    • ‘last’ – Duplicate rows except for the last one are dropped.
    • False – All duplicate rows are dropped.
  • inplace – Boolean value, by default False. When True, duplicates are removed from the existing DataFrame and the method returns None.
  • ignore_index – Boolean value, by default False. When True, the resulting rows are relabeled 0, 1, …, n – 1.
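
The ignore_index parameter is not used elsewhere in this article, so here is a minimal sketch of its effect. The small DataFrame is made up for illustration and is not the one used in the rest of the article.

```python
import pandas as pd

# Hypothetical frame for illustration: rows 0 and 2 are identical
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "pandas"],
    "Fee": [20000, 25000, 20000, 30000],
})

# Default: surviving rows keep their original index labels
kept = df.drop_duplicates(keep="last")
print(kept.index.tolist())        # [1, 2, 3]

# ignore_index=True relabels the result 0, 1, ..., n-1
relabelled = df.drop_duplicates(keep="last", ignore_index=True)
print(relabelled.index.tolist())  # [0, 1, 2]
```

Relabeling is mostly a convenience; the same result can be obtained by calling reset_index(drop=True) on the returned DataFrame.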

Now, let’s create a DataFrame with a few duplicate rows. Our DataFrame contains the columns Courses, Fee, Duration, and Discount.


# Create a DataFrame with duplicate rows
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','35days','30days','50days'],
    'Discount':[1000,2300,1200,2000,1200,1000,2000]
              }
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)

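Yields below output.


# Output:
DataFrame:
    Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
4   Python  22000   35days      1200
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000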

3. Pandas Drop Duplicate Rows

You can use DataFrame.drop_duplicates() without any arguments to drop rows with the same values on all columns. It takes default values subset=None and keep=‘first’. The below example returns four rows after removing duplicate rows in our DataFrame.


# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print("After dropping duplicate rows:\n", df2)

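Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000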
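
Before dropping anything, it can help to see which rows pandas considers duplicates. The sketch below uses DataFrame.duplicated(), a companion method this article does not otherwise cover, on the same data as above. It shows that filtering with the inverted mask selects the same rows as drop_duplicates().

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    'Duration': ['30days', '40days', '35days', '50days', '35days', '30days', '50days'],
    'Discount': [1000, 2300, 1200, 2000, 1200, 1000, 2000],
})

# True marks a row that repeats an earlier row (keep='first' is the default)
mask = df.duplicated()
print(mask.tolist())
# [False, False, False, False, True, True, True]

# Filtering with the inverted mask gives the same rows as drop_duplicates()
same = df[~mask].equals(df.drop_duplicates())
print(same)  # True
```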

4. Drop Duplicate Rows and Keep the Last Row

To keep the last occurrence of each duplicated row instead of the first, pass keep='last'. For instance, df.drop_duplicates(keep='last').


# Keep last duplicate row
df2 = df.drop_duplicates(keep='last')
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
1  PySpark  25000   40days      2300
4   Python  22000   35days      1200
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000
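
A common use of keep='last' is to keep the most recent record per key after sorting. The following is a sketch under stated assumptions: the Updated column and its dates are hypothetical and not part of this article's DataFrame.

```python
import pandas as pd

# Hypothetical data: the 'Updated' column is invented for this illustration
df = pd.DataFrame({
    "Courses": ["Spark", "Python", "Spark", "Python"],
    "Fee": [20000, 22000, 21000, 23000],
    "Updated": ["2023-01-10", "2023-02-01", "2023-03-15", "2023-01-20"],
})
df["Updated"] = pd.to_datetime(df["Updated"])

# Sort oldest to newest, then keep the last (newest) row for each course
latest = (
    df.sort_values("Updated")
      .drop_duplicates(subset="Courses", keep="last")
)
print(latest.index.tolist())   # [1, 2]
print(latest["Fee"].tolist())  # [22000, 21000]
```

Sorting first matters here, because keep='last' refers to row position, not to any notion of recency.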

5. Remove All Duplicate Rows from Pandas DataFrame

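You can set keep=False in the drop_duplicates() function to remove all the duplicate rows, including the first occurrence of each. For example, df.drop_duplicates(keep=False). In our DataFrame only the PySpark row appears once, so it is the only row that remains.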


# Remove all duplicate rows 
df2 = df.drop_duplicates(keep=False)
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
1  PySpark  25000   40days      2300
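
Since keep=False discards every member of a duplicate group, you may want to review those rows before removing them. This sketch uses DataFrame.duplicated(keep=False), which the article does not otherwise cover, on the same data to list the rows that would be removed.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    'Duration': ['30days', '40days', '35days', '50days', '35days', '30days', '50days'],
    'Discount': [1000, 2300, 1200, 2000, 1200, 1000, 2000],
})

# keep=False flags every member of a duplicate group, including the first occurrence
in_group = df.duplicated(keep=False)

# Rows that drop_duplicates(keep=False) would remove
to_remove = df[in_group]
print(len(to_remove))                   # 6

# Rows that appear exactly once in the data
unique_only = df[~in_group]
print(unique_only["Courses"].tolist())  # ['PySpark']
```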

6. Delete Duplicate Rows based on Specific Columns

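To delete duplicate rows based on specific columns, pass the column names as a list to the subset parameter. Only those columns are compared when identifying duplicates. The example below also sets keep=False, so every row whose Courses and Fee combination occurs more than once is removed.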


# Delete duplicate rows based on specific columns 
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
print(df2)

Yields the same output as above.
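
In the article's DataFrame, the duplicate rows match on every column, so subset does not change the result. The sketch below uses a small made-up DataFrame in which the choice of subset does make a difference.

```python
import pandas as pd

# Hypothetical frame: "Spark" appears twice but with different fees
df = pd.DataFrame({
    "Courses": ["Spark", "Spark", "Python", "Python"],
    "Fee": [20000, 25000, 22000, 22000],
})

# All columns: only the exact repeat (row 3) is dropped
all_cols = df.drop_duplicates()
print(all_cols.index.tolist())     # [0, 1, 2]

# Courses only: the second "Spark" row is also treated as a duplicate
by_course = df.drop_duplicates(subset=["Courses"])
print(by_course.index.tolist())    # [0, 2]
```

A narrower subset treats more rows as duplicates, so choose the columns that define a unique record in your data.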

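7. Drop Duplicate Rows In Place

Pass inplace=True to remove duplicate rows from the existing DataFrame instead of returning a new one.
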

# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
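
One detail worth illustrating is that drop_duplicates(inplace=True) returns None, so its result should not be assigned back to a variable. A minimal sketch with a small made-up DataFrame:

```python
import pandas as pd

# Hypothetical frame for illustration: rows 0 and 1 are identical
df = pd.DataFrame({
    "Courses": ["Spark", "Spark", "Python"],
    "Fee": [20000, 20000, 22000],
})

# With inplace=True the DataFrame is modified and the call returns None
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2

# Equivalent without inplace: assign the returned DataFrame to a name
df_b = pd.DataFrame({
    "Courses": ["Spark", "Spark", "Python"],
    "Fee": [20000, 20000, 22000],
})
df_b = df_b.drop_duplicates()
print(len(df_b))  # 2
```

Writing df = df.drop_duplicates(inplace=True) would therefore replace df with None.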

8. Remove Duplicate Rows Using DataFrame.apply() and Lambda Function

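Duplicates that differ only in letter case are not detected by default. You can use DataFrame.apply() with a lambda function to convert every column to a lowercase string and then call drop_duplicates() on the result. Note that the returned DataFrame contains the lowercased string values, not the original ones.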


# Using DataFrame.apply() and lambda function 
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
print(df2)

Yields the same four rows as above, except that every value is now a lowercase string (for example, Spark becomes spark).
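
If you want case-insensitive matching but need to keep the original values and dtypes, one possible alternative is to build a lowercase comparison key and filter the original DataFrame with it. This is a sketch, not the approach used above; the mixed-case data is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with inconsistent casing in the Courses column
df = pd.DataFrame({
    "Courses": ["Spark", "spark", "Python", "PYTHON", "pandas"],
    "Fee": [20000, 20000, 22000, 22000, 30000],
})

# Lowercase copy of the key column, used only for comparison
key = df["Courses"].str.lower()

# Flag rows whose (lowercased course, fee) pair was already seen
is_repeat = pd.concat([key, df["Fee"]], axis=1).duplicated(keep="first")

# Filter the original frame so surviving rows keep their casing and dtypes
deduped = df[~is_repeat]
print(deduped["Courses"].tolist())  # ['Spark', 'Python', 'pandas']
print(deduped["Fee"].sum())         # 72000
```

Because only the comparison key is lowercased, Fee stays numeric and Courses keeps its original spelling.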

9. Complete Example For Drop Duplicate Rows in DataFrame


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','35days','30days','50days'],
    'Discount':[1000,2300,1200,2000,1200,1000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)

# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print(df2)

# keep last duplicate row
df2 = df.drop_duplicates(keep='last')
print(df2)

# Remove all duplicate rows 
df2 = df.drop_duplicates(keep=False)
print(df2)

# Delete duplicate rows based on specific columns 
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
print(df2)

# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)

# Using DataFrame.apply() and lambda function 
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
print(df2)

Frequently Asked Questions

1. How do I drop duplicate rows in a Pandas DataFrame?

A. You can drop duplicate rows in a Pandas DataFrame using the drop_duplicates() method.

2. What does the keep parameter in drop_duplicates() do?

A. The keep parameter determines which duplicates to keep. It can take three values:
  • 'first' (default): Keeps the first occurrence of each duplicated row.
  • 'last': Keeps the last occurrence of each duplicated row.
  • False: Removes all duplicated rows.

3. Can I drop duplicates based on specific columns in a DataFrame?

A. Yes, you can drop duplicates based on specific columns by specifying those columns in the subset parameter of drop_duplicates().

4. How can I drop duplicates in place without creating a new DataFrame?

A. You can use the inplace=True argument to drop duplicates in place.

Conclusion

In this article, you have learned how to drop duplicate rows using pandas.DataFrame.drop_duplicates(), on its own and combined with DataFrame.apply() and a lambda function.

Happy Learning !!

Naveen (NNK)