  • Post category:Pandas
  • Post last modified:August 6, 2024
Pandas Series drop_duplicates() Function

In pandas, drop_duplicates() removes repeated values from a Series. In this article, I’ll explain how to use the Series.drop_duplicates() function step by step. The function returns a new Series in which each value appears only once. If the original Series has no repeated values, the result contains the same values as the original.

Key Points –

  • drop_duplicates() is a pandas Series method that removes duplicate values.
  • It returns a new Series in which, by default, only the first occurrence of each value is kept; the kept values retain their original index labels.
  • The keep parameter controls which occurrence to retain (‘first’, ‘last’, or False to drop every value that is repeated), and inplace=True modifies the original Series instead of returning a new one.
  • It is commonly used in data preprocessing to eliminate redundant or duplicated values.
  • It works on any values a Series can hold, including integers, floats with NaN, and strings, which makes it useful across data cleaning, analysis, and preparation workflows.

Syntax of Pandas Series.drop_duplicates() Function

Following is the syntax of the pandas Series.drop_duplicates() function.


# Syntax of Series.drop_duplicates() function
Series.drop_duplicates(keep='first', inplace=False)

Parameters of the Series.drop_duplicates()

Following are the parameters of the Series.drop_duplicates() function.

  • keep – {‘first’, ‘last’, False}, default ‘first’
    • Determines which duplicates (if any) to keep.
    • first – Keep the first occurrence of duplicated entries.
    • last – Keep the last occurrence of duplicated entries.
    • False – Drop every value that occurs more than once, keeping none of its occurrences.
  • inplace – bool, default False
    • Whether to modify the original Series in place instead of returning a new one. With the default False, the original Series is left unchanged and a new Series is returned; with True, the original is modified and the method returns None.

Return Value

It returns a new Series with the duplicate values removed, or None when inplace=True.
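
As a quick check of the return value, the sketch below (the variable names and sample data are illustrative, not from the article) shows that the default call leaves the original Series untouched, while inplace=True modifies it and returns None.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3])

# Default: a new Series comes back and the original is unchanged
deduped = s.drop_duplicates()
print(len(s), len(deduped))

# inplace=True: the original is modified and the call returns None
returned = s.drop_duplicates(inplace=True)
print(returned, len(s))
```

Because the in-place call returns None, assigning its result back to the same variable would discard the data, so it is usually safer to pick one style and stick with it.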

Pandas Series drop_duplicates() Function

To drop duplicates from a Series of integers, you can use the drop_duplicates() method in pandas.

First, let’s create a Pandas Series from a list.


import pandas as pd

# Create a Series with duplicate integers
series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])
print("Original Series:\n",series)

The Series contains three repeated integers (5, 10, and 20). Calling drop_duplicates() removes the repeats, and the resulting Series contains only unique integers.


# Drop duplicates
result = series.drop_duplicates()
print("Series after dropping duplicates:\n", result)

In the above example, drop_duplicates() keeps the first occurrence of each integer and returns a new Series. The result holds 5, 10, 15, 20, and 30 under their original index labels 0, 1, 2, 5, and 6.
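
One way to see which entries drop_duplicates() discards is to inspect the boolean mask from Series.duplicated(), which marks every repeat after the first occurrence. The sketch below rebuilds the example Series and filters it with that mask; under the default keep='first', this should give the same result as drop_duplicates().

```python
import pandas as pd

series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])

# True marks a value that has already appeared earlier in the Series
mask = series.duplicated()
print(mask.tolist())

# Keeping only the unmarked entries matches drop_duplicates()
filtered = series[~mask]
print(filtered.equals(series.drop_duplicates()))
```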

Drop Duplicates and Keep the Last Occurrence

Alternatively, to drop duplicates from a Series while keeping the last occurrence, you can use the keep parameter of the drop_duplicates() method and set it to 'last'.


# Drop duplicates keeping the last occurrence
result = series.drop_duplicates(keep='last')
print("Dropping duplicates and keeping the last occurrence:\n", result)

# Output:
# Dropping duplicates and keeping the last occurrence:
# 2    15
# 3     5
# 4    10
# 6    30
# 7    20
# dtype: int64

In the above example, drop_duplicates(keep='last') removes the duplicate integers from the Series series and retains only the last occurrence of each unique value.

Removing all Duplicate Values

When you set the keep parameter to False in the drop_duplicates() method, it discards all sets of duplicated entries, effectively removing all duplicates from the Series.


# Drop all duplicates
result = series.drop_duplicates(keep=False)
print("Series after dropping all duplicates:\n", result)

# Output:
# Series after dropping all duplicates:
# 2    15
# 6    30
# dtype: int64

In the above example, drop_duplicates(keep=False) removes every value that occurs more than once in the Series series, so the new Series contains only 15 and 30, the values that appeared exactly once.
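
To compare the three keep settings side by side, this sketch runs each one on the same example Series and prints the index labels that survive. It also cross-checks keep=False against value_counts(); that cross-check is an illustration added here, not something the method does internally.

```python
import pandas as pd

series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])

# Index labels that survive under each keep setting
for keep in ("first", "last", False):
    kept = series.drop_duplicates(keep=keep)
    print(repr(keep), kept.index.tolist())

# keep=False leaves only the values that occur exactly once
counts = series.value_counts()
once = sorted(counts[counts == 1].index.tolist())
print(once)
```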

Resetting the Index After Dropping Duplicates

After dropping duplicates from a Series, you may want to reset the index to maintain a clean, sequential index without any gaps. You can achieve this using the reset_index() method after dropping duplicates.


# Resetting index after dropping duplicates
result = series.drop_duplicates()
result = result.reset_index(drop=True)
print("After dropping duplicates and resetting index:\n", result)

# Output:
# After dropping duplicates and resetting index:
# 0     5
# 1    10
# 2    15
# 3    20
# 4    30
# dtype: int64

In the above example, reset_index(drop=True) is used to reset the index of the resulting Series after dropping duplicates. The parameter drop=True is used to discard the old index and create a new sequential index starting from 0.
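
If you are on pandas 2.0 or later, drop_duplicates() also accepts ignore_index=True, which relabels the result from 0 to n-1 in the same call. The sketch below assumes such a version is installed and checks that both approaches agree.

```python
import pandas as pd

series = pd.Series([5, 10, 15, 5, 10, 20, 30, 20])

# Two-step approach: drop duplicates, then discard the old index
two_step = series.drop_duplicates().reset_index(drop=True)

# One-call approach (assumes pandas >= 2.0, where ignore_index was added)
one_call = series.drop_duplicates(ignore_index=True)

print(one_call.index.tolist())
print(two_step.equals(one_call))
```

On older pandas versions the ignore_index argument is not available for Series, so the reset_index(drop=True) chain remains the portable choice.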

Drop Duplicates from a Series with NaN Values

Similarly, you can drop duplicates from a Series containing NaN values using the drop_duplicates() method. For duplicate detection, pandas treats NaN values as equal to one another, so repeated NaN values are reduced to a single NaN. That remaining NaN stays in the result unless you remove it explicitly, for example with dropna().


import pandas as pd
import numpy as np

# Create a Series with duplicate values and one NaN (stored as float64)
series = pd.Series([5, 10, np.nan, 5, 10, 20, 30, 20])

# Drop duplicates
result = series.drop_duplicates()
print("Series after dropping duplicates:\n", result)

# Output:
# Series after dropping duplicates:
# 0     5.0
# 1    10.0
# 2     NaN
# 5    20.0
# 6    30.0
# dtype: float64

In the above example, drop_duplicates() removes the repeated 5, 10, and 20 and keeps the single NaN at index 2. Because the Series contains NaN, its dtype is float64 rather than int64.
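
The example above contains only one NaN, so it does not show how repeated NaN values behave. In the sketch below (the sample data is made up for illustration), two NaN entries are treated as duplicates of each other, leaving a single NaN; chaining dropna() removes that one as well.

```python
import numpy as np
import pandas as pd

series = pd.Series([5, np.nan, 10, np.nan, 5])

# Repeated NaN values collapse to a single NaN
deduped = series.drop_duplicates()
print(deduped.index.tolist())

# Chain dropna() to discard the remaining NaN as well
no_nan = series.drop_duplicates().dropna()
print(no_nan.tolist())
```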

Drop Duplicates from a Series of Strings

You can drop duplicates from a Series of strings using the drop_duplicates() method in pandas. For instance, drop_duplicates() removes the duplicate strings from the Series series, resulting in a new Series with only unique strings.


import pandas as pd

# Create a Series with duplicate strings
series = pd.Series(['Spark', 'Pandas', 'Python', 'Pandas', 'PySpark'])

# Drop duplicates
result = series.drop_duplicates()
print("Dropping duplicates strings:\n", result)

# Output:
# Dropping duplicates strings:
# 0      Spark
# 1     Pandas
# 2     Python
# 4    PySpark
# dtype: object

Frequently Asked Questions on Pandas Series drop_duplicates() Function

What is the purpose of the drop_duplicates() function in pandas Series?

The purpose of the drop_duplicates() function is to remove duplicate values from a pandas Series, ensuring that each unique value appears only once in the resulting Series.

How does drop_duplicates() handle duplicate values?

drop_duplicates() removes duplicate values from the Series based on specified criteria, such as keeping the first occurrence, the last occurrence, or removing all duplicates.

Can drop_duplicates() handle NaN values?

drop_duplicates() can handle NaN values. Repeated NaN values are treated as duplicates of each other, so only one NaN is kept (which one depends on the keep parameter). To remove NaN entirely, use the dropna() function.

How can I remove duplicates and reset the index in one step?

You can remove duplicates and reset the index in one step by chaining the drop_duplicates() and reset_index() methods together.

Does drop_duplicates() modify the original Series?

By default, drop_duplicates() returns a new Series with duplicates removed without modifying the original Series. However, you can pass inplace=True to modify the original Series in place, in which case the method returns None.

Can I specify custom criteria for dropping duplicates?

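Not through the keep parameter, which accepts only ‘first’, ‘last’, or False and does not take a function. For custom criteria, build a boolean mask yourself, for example by calling duplicated() on a transformed copy of the Series (such as lower-cased strings), and use that mask to filter the original Series.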
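
The keep parameter accepts only 'first', 'last', or False, so criteria beyond those are usually expressed as a boolean mask. The sketch below treats strings as duplicates regardless of case; the sample data and the choice of str.lower() as the comparison key are illustrative assumptions, not part of the drop_duplicates() API.

```python
import pandas as pd

series = pd.Series(["Spark", "pandas", "Pandas", "SPARK", "Python"])

# Build the comparison key, then mark repeats of that key
key = series.str.lower()
mask = key.duplicated(keep="first")

# Filter the original Series so the original spelling is preserved
result = series[~mask]
print(result.tolist())
```

Filtering the original Series with the mask, rather than calling drop_duplicates() on the lower-cased copy, keeps the values exactly as they were entered.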

Conclusion

In this article, I have explained the Series drop_duplicates() function in pandas that provides a convenient way to remove duplicate values from a Series object. It offers flexibility in handling various types of data and allows for customization through optional parameters.

Happy Learning!!
