In pandas, drop_duplicates() removes duplicate values from a Series. In this article, I’ll explain how to use the Series.drop_duplicates() function and walk through examples. The method returns a new Series in which each value appears only once. If the original Series has no repeated values, the result contains the same values as the original.
Key Points –
- drop_duplicates() is a method available on pandas Series objects that allows for the removal of duplicate values.
- It operates on a Series and returns a new Series with unique values after removing duplicates according to the keep parameter.
- The method supports parameters such as keep, to determine which duplicates to retain (‘first’, ‘last’, or False for removing all duplicates), and inplace, to operate on the original Series if True.
- It is commonly used in data preprocessing tasks to ensure data quality by eliminating redundant or duplicated information.
- The method’s flexibility and efficiency make it a valuable tool in data cleaning, analysis, and preparation workflows, contributing to more accurate and reliable results.
Syntax of Pandas Series.drop_duplicates() Function
Following is the syntax of the pandas Series.drop_duplicates() function.
# Syntax of Series.drop_duplicates() function
Series.drop_duplicates(keep='first', inplace=False)
Parameters of the Series.drop_duplicates()
Following are the parameters of the Series.drop_duplicates() function.
- keep – {‘first’, ‘last’, False}, default ‘first’. Determines which duplicates (if any) to keep.
  - first – Keep the first occurrence of duplicated entries.
  - last – Keep the last occurrence of duplicated entries.
  - False – Remove all duplicates.
- inplace – bool, default False. An optional parameter that specifies whether to operate in place (modify the original Series) or return a new Series with duplicates removed. The default is False, which means the operation does not modify the original Series.
Return Value
It returns a new Series with duplicate values removed. When inplace=True, the Series is modified in place and the method returns None.
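The difference between the two return behaviours is easy to miss, so here is a minimal sketch of both. The variable names (deduped, working_copy, returned) are my own and are only for illustration.

```python
import pandas as pd

series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])

# Default call: a new Series comes back and the original is left untouched
deduped = series.drop_duplicates()
print(len(series), len(deduped))

# inplace=True: the call returns None and modifies the Series it is called on
working_copy = series.copy()
returned = working_copy.drop_duplicates(inplace=True)
print(returned)
print(working_copy.tolist())
```

Because the in-place call returns None, assigning its result back to the same variable would throw the data away, so use one style or the other.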
Pandas Series drop_duplicates() Function
To drop duplicates from a Series of integers, you can use the drop_duplicates() method in pandas.
First, let’s create a Pandas Series from a list.
import pandas as pd
# Create a Series with duplicate integers
series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])
print("Original Series:\n",series)
The original Series contains duplicate integers (5, 10, and 20). After applying drop_duplicates(), those duplicates are removed, and the resulting Series contains only unique integers.
# Drop duplicates
result = series.drop_duplicates()
print("Series after dropping duplicates:\n", result)
In the above example, drop_duplicates() removes the duplicate integers from the Series series, resulting in a new Series with only unique integers.
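To see which entries are treated as duplicates, one option is the companion method Series.duplicated(), which returns a boolean mask. The sketch below assumes the same integer Series as above and shows that filtering with the inverted mask gives the same result as drop_duplicates(). The names mask and filtered are only illustrative.

```python
import pandas as pd

series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])

# duplicated() marks every repeat of an earlier value as True
mask = series.duplicated()
print(mask.tolist())

# Keeping the entries where the mask is False matches drop_duplicates()
filtered = series[~mask]
print(filtered.equals(series.drop_duplicates()))
```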
Drop Duplicates and Keep the Last Occurrence
Alternatively, to drop duplicates from a Series while keeping the last occurrence, you can use the keep parameter of the drop_duplicates() method and set it to 'last'.
# Drop duplicates keeping the last occurrence
result = series.drop_duplicates(keep='last')
print("Dropping duplicates and keeping the last occurrence:\n", result)
# Output:
# Dropping duplicates and keeping the last occurrence:
# 2 15
# 3 5
# 4 10
# 6 30
# 7 20
# dtype: int64
In the above example, drop_duplicates(keep='last') removes the duplicate integers from the Series series and retains only the last occurrence of each unique value.
Removing all Duplicate Values
When you set the keep parameter to False in the drop_duplicates() method, it discards every value that occurs more than once, so only the values that appear exactly once remain in the Series.
# Drop all duplicates
result = series.drop_duplicates(keep=False)
print("Series after dropping all duplicates:\n", result)
# Output:
# Series after dropping all duplicates:
# 2 15
# 6 30
# dtype: int64
In the above example, drop_duplicates(keep=False) removes every occurrence of the repeated values (5, 10, and 20) from the Series series, resulting in a new Series that contains only the values that appeared once (15 and 30).
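One way to check what keep=False leaves behind is to count the occurrences of each value. The sketch below uses value_counts() for that and assumes the same integer Series. The names counts and singles are only illustrative.

```python
import pandas as pd

series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])

# Count how often each value occurs
counts = series.value_counts()

# Values that occur exactly once are the ones keep=False leaves behind
singles = sorted(counts[counts == 1].index.tolist())
result = series.drop_duplicates(keep=False)

print(singles)
print(result.tolist())
```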
Resetting the Index After Dropping Duplicates
After dropping duplicates from a Series, you may want to reset the index to maintain a clean, sequential index without any gaps. You can achieve this using the reset_index() method after dropping duplicates.
# Resetting index after dropping duplicates
result = series.drop_duplicates()
result = result.reset_index(drop=True)
print("After dropping duplicates and resetting index:\n", result)
# Output:
# After dropping duplicates and resetting index:
# 0 5
# 1 10
# 2 15
# 3 20
# 4 30
# dtype: int64
In the above example, reset_index(drop=True) is used to reset the index of the resulting Series after dropping duplicates. The parameter drop=True is used to discard the old index and create a new sequential index starting from 0.
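The two calls can also be chained into a single expression. The sketch below additionally shows what happens if drop=True is left out. The name with_old_index is only illustrative.

```python
import pandas as pd

series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])

# Chain both steps: remove duplicates, then renumber the index from 0
result = series.drop_duplicates().reset_index(drop=True)
print(result.index.tolist())
print(result.tolist())

# Without drop=True, the old index is kept as a column and a DataFrame is returned
with_old_index = series.drop_duplicates().reset_index()
print(type(with_old_index).__name__)
```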
Drop Duplicates from a Series with NaN Values
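Similarly, you can drop duplicates from a Series containing NaN values using the drop_duplicates() method. For the purpose of duplicate detection, pandas treats NaN values as equal to one another, so repeated NaN values are reduced to a single NaN. That remaining NaN stays in the result unless you remove it explicitly, for example with dropna().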
import pandas as pd
import numpy as np
# Create a Series with duplicate integers including NaN values
series = pd.Series([5, 10, np.nan, 5, 10, 20, 30, 20])
# Drop duplicates
result = series.drop_duplicates()
print("Series after dropping duplicates:\n", result)
# Output:
# Series after dropping duplicates:
# 0 5.0
# 1 10.0
# 2 NaN
# 5 20.0
# 6 30.0
# dtype: float64
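In the above example, drop_duplicates() removes the duplicate values (5, 10, and 20) from the Series series and returns a new Series with only unique values. The NaN value appears only once in the original Series, so it is kept. Because NaN is a float, the result has dtype float64.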
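The example above contains only one NaN, so it does not show how repeated NaN values behave. Here is a small sketch with a Series I made up for that purpose, containing three NaN values. The name no_nan is only illustrative.

```python
import numpy as np
import pandas as pd

# A Series where NaN appears three times and 5 appears twice
series = pd.Series([5, np.nan, 10, np.nan, 5, np.nan])

# Repeated NaN values are treated as duplicates of each other
result = series.drop_duplicates()
print(result)

# To discard NaN altogether, chain dropna()
no_nan = series.drop_duplicates().dropna()
print(no_nan.tolist())
```

Only the first NaN (at index 1) remains after drop_duplicates(), and dropna() then removes it.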
Drop Duplicates from a Series of Strings
You can drop duplicates from a Series of strings using the drop_duplicates() method in pandas. For instance, drop_duplicates() removes the duplicate strings from the Series series, resulting in a new Series with only unique strings.
import pandas as pd
# Create a Series with duplicate strings
series = pd.Series(['Spark', 'Pandas', 'Python', 'Pandas', 'PySpark'])
# Drop duplicates
result = series.drop_duplicates()
print("Dropping duplicates strings:\n", result)
# Output:
# Dropping duplicates strings:
# 0 Spark
# 1 Pandas
# 2 Python
# 4 PySpark
# dtype: object
Frequently Asked Questions on Pandas Series drop_duplicates() Function
The purpose of the drop_duplicates() function is to remove duplicate values from a pandas Series, ensuring that each unique value appears only once in the resulting Series.
drop_duplicates() removes duplicate values from the Series based on the keep parameter: keeping the first occurrence, keeping the last occurrence, or removing all duplicates.
drop_duplicates() can handle NaN values. NaN values are treated as equal to one another, so repeated NaN values are reduced to a single NaN, which is retained unless you remove it explicitly using the dropna() function.
You can remove duplicates and reset the index in one step by chaining the drop_duplicates() and reset_index() methods together.
By default, drop_duplicates() returns a new Series with duplicates removed without modifying the original Series. However, you can use the inplace=True parameter to perform the operation in place and modify the original Series.
The keep parameter does not accept a custom function; it takes only 'first', 'last', or False. To apply custom criteria, build a boolean mask yourself, for example by calling duplicated() on a transformed copy of the Series, and use that mask to filter the original Series.
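As an illustration of custom criteria, here is a minimal sketch of case-insensitive de-duplication. It does not use any special option of drop_duplicates(). Instead, it lowercases the values to build a comparison key and filters with duplicated(). The sample data and the name key are my own assumptions for this example.

```python
import pandas as pd

series = pd.Series(['Spark', 'pandas', 'Python', 'Pandas', 'SPARK'])

# Build a normalized key so that 'Spark' and 'SPARK' compare as equal
key = series.str.lower()

# Flag repeats on the key, then keep the original values that are not flagged
result = series[~key.duplicated(keep='first')]
print(result.tolist())
```

The original spelling of the first occurrence is preserved, because the mask is computed on the lowercase key but applied to the untouched Series.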
Conclusion
In this article, I have explained the Series drop_duplicates() function in pandas, which provides a convenient way to remove duplicate values from a Series object. It offers flexibility in handling various types of data and allows for customization through optional parameters.
Happy Learning!!