  • Post last modified:March 27, 2024
pandas.DataFrame.drop_duplicates() – Examples

The pandas DataFrame.drop_duplicates() function removes duplicate rows from a DataFrame. During data preprocessing and analysis, data scientists often need to check whether duplicate data is present and, if so, remove it.

Syntax of DataFrame.drop_duplicates()

Following is the syntax of the drop_duplicates() function. It takes subset, keep, inplace and ignore_index as params and returns DataFrame with duplicate rows removed based on the parameters passed. If inplace=True is used, it updates the existing DataFrame object and returns None.


# Syntax of DataFrame.drop_duplicates()
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
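The difference between the default call and inplace=True can be sketched as follows. The DataFrame named sample and its values are hypothetical and used only for this illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
sample = pd.DataFrame({
    "Courses": ["Spark", "Spark", "Pandas"],
    "Fee": [20000, 20000, 30000],
})

# Default (inplace=False): a new DataFrame is returned, sample is unchanged
deduped = sample.drop_duplicates()
print(len(sample), len(deduped))   # 3 2

# inplace=True: sample itself is modified and the call returns None
result = sample.drop_duplicates(inplace=True)
print(result, len(sample))         # None 2
```

Because the in-place call returns None, assigning its result back to the same variable would discard the data.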

Following are the parameters of drop_duplicates.

  • subset – Column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default, all of the columns are used.
  • keep – {‘first’, ‘last’, False}, default ‘first’. Determines which duplicates (if any) to keep.
    • first : Drop duplicates except for the first occurrence.
    • last : Drop duplicates except for the last occurrence.
    • False : Drop all duplicates.
  • inplace – bool, default False. Whether to drop duplicates in place or to return a copy.
  • ignore_index – bool, default False. If True, the resulting axis will be labeled 0, 1, …, n – 1.

Indexes, including time indexes, are ignored when identifying duplicates.
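The three keep options can be compared side by side. The following is a minimal sketch that uses the same data as the examples later in this article; the variable names keep_first, keep_last, and keep_none are chosen here for illustration.

```python
import pandas as pd

# Same data as the article's example: rows 1 and 2 are identical
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
    "Duration": ["30days", "35days", "35days", "50days"],
})

keep_first = df.drop_duplicates(keep="first")  # keeps the first PySpark row
keep_last = df.drop_duplicates(keep="last")    # keeps the last PySpark row
keep_none = df.drop_duplicates(keep=False)     # drops both PySpark rows

print(list(keep_first.index))  # [0, 1, 3]
print(list(keep_last.index))   # [0, 2, 3]
print(list(keep_none.index))   # [0, 3]
```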

1. Drop Duplicates in DataFrame


import pandas as pd

# Sample data; rows 1 and 2 are identical
technologies = {
    'Courses': ["Spark", "PySpark", "PySpark", "Pandas"],
    'Fee': [20000, 22000, 22000, 30000],
    'Duration': ['30days', '35days', '35days', '50days'],
}

# Create DataFrame
df = pd.DataFrame(technologies)
print(df)

Below is the DataFrame with duplicates; the rows at index 1 and 2 are identical.


# Output:
   Courses    Fee Duration
0    Spark  20000   30days
1  PySpark  22000   35days
2  PySpark  22000   35days
3   Pandas  30000   50days
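Before dropping anything, it can help to confirm which rows pandas treats as duplicates. The following is a minimal sketch using DataFrame.duplicated(), a related method that this article does not otherwise cover; the variable name mask is chosen here for illustration.

```python
import pandas as pd

# Same data as the example above
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
    "Duration": ["30days", "35days", "35days", "50days"],
})

# Boolean Series: True marks a row that repeats an earlier row
mask = df.duplicated()
print(mask.tolist())    # [False, False, True, False]

# Number of rows that drop_duplicates() would remove with default settings
print(int(mask.sum()))  # 1
```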

Applying the drop_duplicates() function to the DataFrame, as shown below, drops the duplicate rows.


# Drop duplicates
df1 = df.drop_duplicates()
print(df1)

Following is the output.


# Output:
   Courses    Fee Duration
0    Spark  20000   30days
1  PySpark  22000   35days
3   Pandas  30000   50days
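Note that the result above keeps the original index labels, so label 2 is missing. As a sketch, passing ignore_index=True relabels the remaining rows from 0; the variable name df2 is chosen here for illustration.

```python
import pandas as pd

# Same data as the example above
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
    "Duration": ["30days", "35days", "35days", "50days"],
})

# Drop duplicate rows and renumber the index as 0, 1, 2
df2 = df.drop_duplicates(ignore_index=True)
print(list(df2.index))  # [0, 1, 2]
print(df2)
```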

2. Drop Duplicates on Selected Columns

Use the subset param to drop duplicates based on selected columns only. This param is optional. By default it is None, which means all of the columns are used to identify duplicates.


# Using subset option 
df3 = df.drop_duplicates(subset=['Courses'])
print(df3)

# Output:
   Courses    Fee Duration
0    Spark  20000   30days
1  PySpark  22000   35days
3   Pandas  30000   50days
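In this data the repeated course also has identical Fee and Duration values, so the output matches the one in section 1. The effect of subset is easier to see when the other columns differ. The following sketch uses a hypothetical variation of the data (the 25000 fee and 40days duration are made up for illustration, as are the variable names).

```python
import pandas as pd

# Hypothetical variation: the two PySpark rows differ in Fee and Duration
varied = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 25000, 30000],
    "Duration": ["30days", "35days", "40days", "50days"],
})

# No two full rows are identical, so nothing is dropped
print(len(varied.drop_duplicates()))  # 4

# Only Courses is compared; the first PySpark row (Fee 22000) is kept
by_course = varied.drop_duplicates(subset=["Courses"])
print(by_course)

# subset also accepts several labels; rows must match on all of them
by_course_fee = varied.drop_duplicates(subset=["Courses", "Fee"])
print(len(by_course_fee))             # 4
```

With the default keep='first', the values kept for the other columns come from the first matching row.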

Conclusion

In this article, you have learned how to drop/remove/delete duplicate rows using pandas.DataFrame.drop_duplicates(), and how to use the subset option.

Reference

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.