The pandas Index.drop_duplicates() method is used to drop/remove duplicates from an Index. Removing duplicate data is often required as part of data analysis. Index.drop_duplicates() returns an Index object with the duplicate values removed. It also lets you choose which duplicate value is retained: you can drop every value that is duplicated, or keep only the first or last occurrence of each duplicated value.
Key Points
- The primary purpose of drop_duplicates() is to return a new Index containing only the unique values from the original Index.
- The drop_duplicates() operation does not modify the original Index in place. An Index is immutable, so the method always returns a new Index and leaves the original unchanged.
- The order of appearance in the resulting Index is preserved from the original Index. With the default keep=‘first’, the first occurrence of each value is retained and subsequent duplicates are dropped.
- The drop_duplicates() method accepts a single parameter, keep, which determines which duplicates to retain (‘first’, ‘last’, or False). Unlike DataFrame.drop_duplicates(), it has no inplace parameter.
- If the original Index contains NaN values, they are treated as duplicates of one another. With the default keep=‘first’, the method retains one occurrence of NaN and removes the remaining NaN entries.
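The following is a minimal sketch of two of the key points above: the original Index is left untouched, and repeated NaN entries collapse to a single NaN. The variable names original and deduped are illustrative only.

```python
import numpy as np
import pandas as pd

# An Index with a repeated value and two NaN entries
original = pd.Index([1.0, np.nan, 2.0, np.nan, 1.0])

# drop_duplicates() returns a new Index; it does not change `original`
deduped = original.drop_duplicates()

print(len(original))     # the original still has all five entries
print(deduped.tolist())  # one 1.0, one NaN, and 2.0 remain, in original order
```

Because the result is a new object, assign it to a variable (or back to the same name) if you want to keep it.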
Syntax of Index.drop_duplicates()
Following is the syntax of Index.drop_duplicates(). The keep parameter takes one of the values ‘first’, ‘last’, or False; the default is ‘first’.
# Syntax of Index.drop_duplicates()
Index.drop_duplicates(keep='first')
Parameters of the Index.drop_duplicates()
Following are the parameters of Index.drop_duplicates().
keep – Determines which duplicates to keep. It accepts ‘first’ (default), ‘last’, or False.
- first – Drop duplicates except for the first occurrence.
- last – Drop duplicates except for the last occurrence.
- False – Drop all duplicates.
Return Value
It returns a new Index with duplicate values removed. The keep parameter controls which duplicate values are removed; the default value ‘first’ keeps the first occurrence in each set of duplicated entries.
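One way to see what keep controls is to compare drop_duplicates() with the boolean mask returned by Index.duplicated(), which accepts the same keep values. This is a small sketch; the labels Index and the variable names are made up for illustration.

```python
import pandas as pd

labels = pd.Index(["a", "b", "a", "c", "b"])

# duplicated() flags the entries that drop_duplicates() would remove
mask = labels.duplicated(keep="last")

# Selecting the unflagged entries gives the same result as drop_duplicates()
via_mask = labels[~mask]
via_method = labels.drop_duplicates(keep="last")

print(via_mask.tolist())            # ['a', 'c', 'b']
print(via_mask.equals(via_method))  # True
```

The mask form is handy when you want to inspect which positions are considered duplicates before removing them.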
1. Drop All Duplicates in Pandas Index
A pandas Index is an immutable sequence used for indexing and alignment. It stores the axis labels for all pandas objects. Sometimes an Index contains duplicates, and you can drop them using Index.drop_duplicates(). To explain this with an example, let’s first create an Index that contains duplicate values, as shown below.
# Importing pandas as pd
import pandas as pd
# Creating the Index
idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])
# Print the Index
print(idx)
Yields below output. Note that pandas 2.0 and later display Index rather than Int64Index.
# Output:
Int64Index([15, 21, 4, 4, 22, 4, 3, 21], dtype='int64')
Now, let’s drop all occurrences of duplicates in a pandas Index and retain only the values that appear exactly once. To do this, use the drop_duplicates() method with keep=False.
In the below example, idx2 contains only the values that were not duplicated in the original Index. The keep=False parameter ensures that every occurrence of a duplicated value is dropped.
# Drop all duplicate occurrences of the index
idx2 = idx.drop_duplicates(keep=False)
print(idx2)
Following is the output for the above example, where you see all the duplicates are removed.
# Output:
Int64Index([15, 22, 3], dtype='int64')
2. Drop Duplicates Except the First Occurrence
Now drop all occurrences of duplicates in the Index except the first occurrence. By default, ‘first’ is the value of the keep parameter. Below is the example code.
# Drop duplicates except the first occurrence
idx2 = idx.drop_duplicates(keep='first')
print(idx2)
After applying drop_duplicates(keep='first') on the Index object idx, all the duplicates in the Index have been dropped, keeping only the first occurrences. Below is the output for the same.
# Output:
Int64Index([15, 21, 4, 22, 3], dtype='int64')
3. Drop Duplicates Except the Last Occurrence
To drop duplicates in a pandas Index and retain only the last occurrence of each value, use the drop_duplicates() method with the keep parameter set to ‘last’. For instance, idx_last contains each value from the original Index once, at the position of its last occurrence.
# Using drop_duplicates()
# On the Index to keep only the last occurrence
idx_last = idx.drop_duplicates(keep='last')
print(idx_last)
# Output:
# Int64Index([15, 22, 4, 3, 21], dtype='int64')
4. Drop All Occurrences of Duplicates Using keep=False
If you want to drop all occurrences of duplicates in a pandas Index, use the drop_duplicates() method with the keep parameter set to False. This removes every instance of a duplicated value, leaving only the values that appear exactly once.
In the below example, idx_false contains only the values that were not duplicated in the original Index.
# Using drop_duplicates()
# On the Index to drop all occurrences of duplicates
idx_false = idx.drop_duplicates(keep=False)
print(idx_false)
# Output:
# Int64Index([15, 22, 3], dtype='int64')
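To compare the three keep options side by side, the sketch below runs each one against the same Index used in the examples above and collects the results as plain lists. The results dictionary is just a convenience for this illustration.

```python
import pandas as pd

idx = pd.Index([15, 21, 4, 4, 22, 4, 3, 21])

# Apply each supported keep value and store the result as a list
results = {
    keep: idx.drop_duplicates(keep=keep).tolist()
    for keep in ("first", "last", False)
}

for keep, values in results.items():
    print(f"keep={keep!r}: {values}")

# keep='first': [15, 21, 4, 22, 3]
# keep='last': [15, 22, 4, 3, 21]
# keep=False: [15, 22, 3]
```

Converting to a list with tolist() avoids differences in how pandas versions display an Index.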
Frequently Asked Questions on Pandas.Index.drop_duplicates()
DataFrame.drop_duplicates() is the related method for removing duplicate rows from a DataFrame. It considers all columns by default, and it accepts several parameters to suit your requirements.
The subset parameter of DataFrame.drop_duplicates() lets you specify the columns to consider when identifying and removing duplicates. Index.drop_duplicates() has no subset parameter.
Both methods have a keep parameter that determines which duplicates to keep and which to remove. The keep parameter can take three values: 'first', 'last', and False.
By default, DataFrame.drop_duplicates() does not modify the original DataFrame. Instead, it returns a new DataFrame with duplicate rows removed based on the specified criteria. The original DataFrame remains unchanged.
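The sketch below contrasts the DataFrame method with index-based deduplication. DataFrame.drop_duplicates() compares row values, whereas filtering with df.index.duplicated() removes rows whose index labels repeat. The DataFrame contents and the names by_values, by_city, and by_index are invented for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Oslo", "Rome"], "sales": [10, 20, 10, 30]},
    index=["r1", "r2", "r3", "r2"],
)

# Compare full rows: the third row repeats the values of the first
by_values = df.drop_duplicates()

# Compare only the 'city' column
by_city = df.drop_duplicates(subset=["city"])

# Compare index labels instead of row values: the second 'r2' is dropped
by_index = df[~df.index.duplicated(keep="first")]

print(by_values.index.tolist())  # ['r1', 'r2', 'r2']
print(by_city.index.tolist())    # ['r1', 'r2']
print(by_index.index.tolist())   # ['r1', 'r2', 'r3']
```

Note that the original df still has four rows after all three operations.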
Conclusion
In this article, I have explained how to drop duplicates from an Index using the Index.drop_duplicates() method. I have also explained how to use the keep parameter, which takes ‘first’, ‘last’, or False and controls which duplicate values are removed.
Related Articles
- Pandas Drop Duplicate Rows in DataFrame
- Get the Row Count From Pandas DataFrame
- Pandas Drop Index Column Explained
- Change Column Data Type On Pandas DataFrame
- Pandas apply() Function to Single & Multiple Column(s)
- pandas.DataFrame.drop_duplicates() – Examples
- How to Drop Duplicate Columns in pandas DataFrame
- How to Get Size of Pandas DataFrame?
- Pandas Drop Last Column From DataFrame
- Pandas – Drop Infinite Values From DataFrame
- How to Drop Rows From Pandas DataFrame Examples
- Drop Single & Multiple Columns From Pandas DataFrame