Pandas.Index.drop_duplicates() Explained

  • Post category: Pandas
  • Post last modified: March 27, 2024

The Pandas Index.drop_duplicates() function removes duplicate values from an Index. Removing duplicate data is a common step in data analysis.

The Index.drop_duplicates() function returns an Index object with the duplicate values removed. It lets you choose which duplicate value is retained: you can keep the first or the last occurrence of each duplicated value, or drop every value that is duplicated.

Key Points

  • The primary purpose of drop_duplicates() is to return a new Index object with only the unique values from the original Index. It eliminates any duplicate entries, leaving behind a distinct set of values.
  • The drop_duplicates() operation does not modify the original Index in-place. Instead, it creates a new Index with the unique values. This immutability ensures that the original Index remains unchanged after the operation.
  • The order of appearance in the resulting Index is preserved from the original Index. With the default keep='first', the first occurrence of each value is retained and subsequent duplicates are dropped.
  • The Index.drop_duplicates() method takes a single parameter, keep, which determines which duplicates to retain (‘first’, ‘last’, or False). It has no inplace parameter; that option exists on DataFrame.drop_duplicates() and Series.drop_duplicates().
  • If the original Index contains NaN values, they are treated as duplicates of each other. With the default keep='first', the method retains one occurrence of NaN and removes the remaining NaN entries.
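
As a minimal sketch of the points above (the sample values are illustrative assumptions, not taken from the article's examples), the following shows that drop_duplicates() returns a new Index, leaves the original untouched, and keeps a single NaN:

```python
import numpy as np
import pandas as pd

# Illustrative Index with repeated values and repeated NaN entries
idx = pd.Index([1.0, np.nan, 2.0, np.nan, 1.0])

# Returns a new Index; the default keep='first' retains first occurrences
unique_idx = idx.drop_duplicates()

print(unique_idx.tolist())  # [1.0, nan, 2.0]
print(idx.tolist())         # original is unchanged: [1.0, nan, 2.0, nan, 1.0]
```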

Syntax of Index.drop_duplicates()

Following is the syntax of Index.drop_duplicates(). The keep parameter takes one of the values ‘first’, ‘last’, or False; the default is ‘first’.


# Syntax of Index.drop_duplicates()
Index.drop_duplicates(keep='first')

Parameters of the Index.drop_duplicates()

Following are the parameters of Index.drop_duplicates().

  • keep – Determines which duplicates to keep. It accepts ‘first’ (default), ‘last’, or False.
    • ‘first’ – Drop duplicates except for the first occurrence.
    • ‘last’ – Drop duplicates except for the last occurrence.
    • False – Drop every value that occurs more than once, keeping none of its occurrences.

Return Value

It returns a new Index with duplicate values removed. The keep parameter controls which duplicate values are removed. The default value ‘first’ keeps the first occurrence in each set of duplicated entries.
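
To make the three keep options concrete, here is a small side-by-side sketch. The labels are hypothetical values chosen for illustration; only Index.drop_duplicates() and its keep parameter come from the description above.

```python
import pandas as pd

# Hypothetical labels chosen for illustration
labels = pd.Index(["a", "b", "a", "c", "b"])

# Collect the result of each keep option as a plain list
results = {
    keep: labels.drop_duplicates(keep=keep).tolist()
    for keep in ("first", "last", False)
}

for keep, values in results.items():
    print(f"keep={keep!r}: {values}")

# keep='first': ['a', 'b', 'c']
# keep='last': ['a', 'c', 'b']
# keep=False: ['c']
```

Note that keep='last' changes the order of the result, because each surviving label sits at the position of its last occurrence.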

1. Drop All Duplicates in Pandas Index

A Pandas Index is an immutable sequence used for indexing and alignment. It stores the axis labels for all pandas objects. An Index can contain duplicates, and you can drop them using Index.drop_duplicates(). To explain this with an example, let’s first create an Index that contains duplicate values, as shown below.


# Importing pandas as pd
import pandas as pd
  
# Creating the Index
idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])
  
# Print the Index
print(idx)

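This yields the output below. The outputs in this article show Int64Index, which is what pandas versions before 2.0 print; pandas 2.0 and later print Index instead.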


# Output:
Int64Index([15, 21, 4, 4, 22, 4, 3, 21], dtype='int64')

Now, let’s drop all occurrences of duplicated values in a Pandas Index by calling the drop_duplicates() method with keep=False.

In the below example, idx2 contains only the values that appear exactly once in the original Index. The keep=False parameter ensures that every occurrence of a duplicated value is dropped, including the first one.


# Drop all duplicate occurrences of the index
idx2 = idx.drop_duplicates(keep=False)
print(idx2)

Following is the output for the above example. Every value that appeared more than once (4 and 21) has been removed entirely.


# Output:
Int64Index([15, 22, 3], dtype='int64')
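
One way to see why only 15, 22, and 3 survive is to look at the boolean mask returned by Index.duplicated(keep=False). The sketch below reuses the article's values; filtering with the inverted mask is shown as an equivalent approach, not as how pandas implements drop_duplicates() internally.

```python
import pandas as pd

idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])

# True for every element whose value occurs more than once
mask = idx.duplicated(keep=False)
print(mask.tolist())  # [False, True, True, True, False, True, False, True]

# Filtering with the inverted mask gives the same labels as keep=False
filtered = idx[~mask]
print(filtered.tolist())  # [15, 22, 3]
```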

2. Drop Duplicates Except the First Occurrence

Now let’s drop the duplicates in the Index but keep the first occurrence of each value. ‘first’ is the default value of the keep parameter, so passing it explicitly is optional. Below is the example code.


# Drop duplicates except the first occurrence
idx2 = idx.drop_duplicates(keep='first')
print(idx2)

After applying drop_duplicates(keep='first') to the Index object idx, all duplicates have been dropped and only the first occurrence of each value remains. Below is the output.


# Output:
Int64Index([15, 21, 4, 22, 3], dtype='int64')
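
For the default keep='first' case, the result matches what Index.unique() returns, since both keep values in order of first appearance. The following sketch, which reuses the article's values, checks this and also uses the is_unique attribute to confirm that duplicates are gone:

```python
import pandas as pd

idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])

first_kept = idx.drop_duplicates(keep="first")
uniques = idx.unique()

print(idx.is_unique)               # False
print(first_kept.is_unique)        # True
print(first_kept.equals(uniques))  # True
```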

3. Drop Duplicates Except the Last Occurrence

To drop duplicates in a Pandas Index and retain only the last occurrence of each value, use the drop_duplicates() method with the keep parameter set to ‘last’. In the example below, idx_last contains each value once, at the position of its last occurrence, which is why 4 and 21 appear later than they do with keep='first'.


# Using drop_duplicates() 
# On the Index to keep only the last occurrence
idx_last = idx.drop_duplicates(keep='last')
print(idx_last)

# Output:
# Int64Index([15, 22, 4, 3, 21], dtype='int64')
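
Index.drop_duplicates() returns labels only. If the Index belongs to a DataFrame and you want to drop the matching rows as well, a common approach is to filter with Index.duplicated(). The DataFrame below is a hypothetical example, not data from this article.

```python
import pandas as pd

# Hypothetical DataFrame whose index has a repeated label
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "a", "c"])

# Index.drop_duplicates() returns only the de-duplicated labels
print(df.index.drop_duplicates(keep="last").tolist())  # ['b', 'a', 'c']

# To drop the rows too, filter with the inverted duplicated() mask
deduped = df[~df.index.duplicated(keep="last")]
print(deduped["value"].tolist())  # [20, 30, 40]
```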

4. Drop All Occurrences of Duplicates Using keep=False

If you want to drop all occurrences of duplicates in a Pandas Index, use the drop_duplicates() method with the keep parameter set to False. This removes every instance of a duplicated value, leaving only the values that appear exactly once.

In the below example, idx_false contains only the values that occur once in the original Index; all occurrences of duplicated values are removed.


# Using drop_duplicates() 
# On the Index to drop all occurrences of duplicates
idx_false = idx.drop_duplicates(keep=False)
print(idx_false)

# Output:
# Int64Index([15, 22, 3], dtype='int64')
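
A quick way to check this result is to count occurrences with Index.value_counts(). This sketch reuses the article's values: the labels whose count is 1 are the same ones that keep=False retains.

```python
import pandas as pd

idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])

# Number of times each value occurs in the Index
counts = idx.value_counts()
print(counts.loc[4], counts.loc[21])  # 3 2

# Values that occur exactly once
singles = sorted(counts[counts == 1].index.tolist())
print(singles)  # [3, 15, 22]
```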

Frequently Asked Questions on Pandas.Index.drop_duplicates()

What does drop_duplicates() do in Pandas?

drop_duplicates() is available on DataFrame, Series, and Index objects. On a DataFrame it removes duplicate rows, considering all columns by default, though you can specify a subset of columns to identify duplicates. On an Index it removes duplicate labels.

How do I use drop_duplicates()?

Call it on the object you want to de-duplicate, for example df.drop_duplicates() on a DataFrame or idx.drop_duplicates() on an Index. The DataFrame version accepts the subset, keep, inplace, and ignore_index parameters, while the Index version accepts only keep.

Can I specify columns to consider when dropping duplicates?

Yes, for a DataFrame. DataFrame.drop_duplicates() allows you to specify a subset of columns to consider when identifying and removing duplicates. By default, it considers all columns, but you can focus on specific columns by using the subset parameter. Index.drop_duplicates() has no such parameter.
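
A minimal sketch of the subset parameter on a DataFrame follows. The column names and rows are hypothetical and chosen only to show the difference between comparing all columns and comparing one.

```python
import pandas as pd

# Hypothetical data: one fully repeated row and one partial repeat
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "city": ["Oslo", "Rome", "Oslo", "Paris"],
})

# Default: rows are duplicates only if every column matches
all_cols = df.drop_duplicates()
print(len(all_cols))  # 3

# subset: rows are duplicates if the listed columns match
by_name = df.drop_duplicates(subset=["name"])
print(by_name["city"].tolist())  # ['Oslo', 'Rome']
```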

How does drop_duplicates() decide which rows to keep when there are duplicates?

The drop_duplicates() method in Pandas has a keep parameter that determines which duplicates to keep and which to remove. It can take three values: 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops every row that has a duplicate.

Does drop_duplicates() modify the original DataFrame?

By default, the drop_duplicates() method in Pandas does not modify the original DataFrame. Instead, it returns a new DataFrame with duplicate rows removed based on the specified criteria, and the original DataFrame remains unchanged. Passing inplace=True changes this: the DataFrame is modified directly and the method returns None.
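
The difference between the default behavior and inplace=True on a DataFrame can be sketched as follows. The single-column DataFrame is a hypothetical example.

```python
import pandas as pd

# Hypothetical DataFrame with one repeated row
df = pd.DataFrame({"x": [1, 1, 2]})

# Default: a new DataFrame is returned and df keeps all three rows
result = df.drop_duplicates()
print(len(df), len(result))  # 3 2

# inplace=True: df itself is modified and the call returns None
returned = df.drop_duplicates(inplace=True)
print(returned, len(df))  # None 2
```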

Conclusion

In this article, I have explained how to drop duplicates from an Index using the Index.drop_duplicates() function. I have also explained how the keep parameter, which takes ‘first’, ‘last’, or False, controls which duplicate values are removed.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium