Pandas.Index.drop_duplicates() Explained

The pandas Index.drop_duplicates() function removes duplicate values from an Index. Removing duplicate data is a common step in data analysis.

<strong>Index.drop_duplicates()</strong> returns a new Index object with the duplicate values removed. It lets you choose which duplicate value to retain: you can keep the first or the last occurrence of each duplicated value, or drop every value that appears more than once.

1. Syntax of Index.drop_duplicates()

Following is the syntax of Index.drop_duplicates(). The keep parameter takes one of the following values: ‘first’, ‘last’, or False. The default is ‘first’.


# Syntax of Index.drop_duplicates()
Index.drop_duplicates(keep='first')
  • ‘first’ : Drop duplicates except for the first occurrence.
  • ‘last’ : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.

This returns an Index with duplicate values removed. The keep parameter controls which occurrences are retained. The value ‘first’ keeps the first occurrence in each set of duplicated entries, ‘last’ keeps the last occurrence, and False drops every entry that has a duplicate.
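To compare the three keep options side by side, here is a minimal sketch. The labels Index is a hypothetical sample made up for illustration, and the results are printed as plain lists so the output does not depend on how a given pandas version displays an Index.

```python
import pandas as pd

# Hypothetical sample Index with two duplicated labels ('a' and 'b')
labels = pd.Index(['a', 'b', 'a', 'c', 'b'])

keep_first = labels.drop_duplicates(keep='first')
keep_last = labels.drop_duplicates(keep='last')
keep_none = labels.drop_duplicates(keep=False)

print(keep_first.tolist())  # ['a', 'b', 'c']
print(keep_last.tolist())   # ['a', 'c', 'b']
print(keep_none.tolist())   # ['c']
```

Note that with keep=‘last’ the surviving values appear in the order of their last occurrences, so the result is not simply a reordering of the keep=‘first’ result.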

2. Drop All Duplicates in pandas Index

A pandas Index is an immutable sequence used for indexing and alignment. It stores the axis labels for all pandas objects. Sometimes an Index contains duplicates, and you can drop them using Index.drop_duplicates(). To explain this with an example, first let’s create an Index that contains duplicate values, as shown below.


# Import pandas as pd
import pandas as pd
  
# Creating the Index
idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])
  
# Print the Index
print(idx)

Below is the output.


# Output:
Int64Index([15, 21, 4, 4, 22, 4, 3, 21], dtype='int64')
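Before dropping anything, it can help to confirm that the Index really contains duplicates. The sketch below is not part of the original walkthrough; it uses the has_duplicates and is_unique properties and the duplicated() method on the same idx, and the variable names mask and repeated are chosen only for illustration.

```python
import pandas as pd

idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])

print(idx.has_duplicates)  # True
print(idx.is_unique)       # False

# duplicated() flags every repeat after the first occurrence (keep='first' by default)
mask = idx.duplicated()
print(mask.tolist())
# [False, False, False, True, False, True, False, True]

# Distinct values that occur more than once
repeated = idx[mask].unique()
print(repeated.tolist())  # [4, 21]
```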

Now, let’s drop all occurrences of duplicate values in the Index by using drop_duplicates() as shown below. I am using keep=False because I want to remove every occurrence of a duplicated value.


# Drop all duplicate occurrences of the index
idx2 = idx.drop_duplicates(keep=False)
print(idx2)

Following is the output for the above example, where you can see that every value that occurred more than once has been removed.


# Output:
Int64Index([15, 22, 3], dtype='int64')
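Because an Index is immutable, drop_duplicates() returns a new Index and leaves the original untouched. The sketch below checks this, and also shows one way to get the same result with a boolean mask from duplicated(keep=False). The mask approach is an equivalent alternative added here for illustration, not something the example above relies on.

```python
import pandas as pd

idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])
idx2 = idx.drop_duplicates(keep=False)

# The original Index still holds all 8 values; only idx2 is shortened
print(len(idx), len(idx2))  # 8 3

# duplicated(keep=False) marks every member of a duplicated group,
# so negating the mask keeps only values that occur exactly once
same = idx[~idx.duplicated(keep=False)]
print(same.equals(idx2))  # True
```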

3. Drop Duplicates Except the First Occurrence

Now let’s drop all occurrences of duplicates in the Index except the first occurrence. ‘first’ is the default value of the keep parameter, so calling drop_duplicates() with no arguments gives the same result. Below is the example code.


# Drop Duplicates Except the First Occurrence
idx2 = idx.drop_duplicates(keep='first')
print(idx2)

After applying drop_duplicates(keep=’first’) to the Index object idx, all the duplicates in the Index have been dropped and only the first occurrence of each value is kept. Below is the output.


# Output:
Int64Index([15, 21, 4, 22, 3], dtype='int64')
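The remaining option, keep=‘last’, works the same way but retains the last occurrence of each duplicated value. Here is a short sketch on the same idx; the variable name idx3 is chosen only for illustration, and the result is printed as a plain list so the output does not depend on the pandas version.

```python
import pandas as pd

idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])

# Drop duplicates except the last occurrence
idx3 = idx.drop_duplicates(keep='last')
print(idx3.tolist())  # [15, 22, 4, 3, 21]
```

Here 21 and 4 move toward the end compared with the keep=‘first’ result, because their last occurrences sit at positions 7 and 5 of the original Index.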

Conclusion

In this article, I have explained how to drop duplicate values from an Index using the Index.drop_duplicates() function. I have also explained how the keep parameter, which takes ‘first’, ‘last’, or False, controls which duplicate values are dropped.

Naveen (NNK)
