The pandas DataFrame.drop_duplicates() function removes duplicate rows from a DataFrame. During data preprocessing and analysis, data scientists often need to check whether duplicate data is present and, if so, decide how to remove it.
Key Points –
- drop_duplicates() is used to remove duplicate rows from a DataFrame.
- You can specify which columns to check for duplicates using the subset parameter.
- By default, drop_duplicates() keeps the first occurrence of each duplicate row, but you can change this behavior with the keep parameter ('last' keeps the last occurrence, and False drops all duplicates).
- The index is ignored when identifying duplicates, and subset accepts column labels only. To deduplicate on the index, first move it into columns with reset_index(), or filter the rows with df.index.duplicated().
- The data types of the remaining rows are preserved after dropping duplicates.
- drop_duplicates() works on DataFrames that have a MultiIndex, but the duplicate check still applies to column values only, not to index levels.
Syntax of DataFrame.drop_duplicates()
Following is the syntax of the drop_duplicates() function. It takes subset, keep, inplace, and ignore_index as parameters and returns a DataFrame with duplicate rows removed based on the parameters passed. If inplace=True is used, it updates the existing DataFrame object and returns None.
# Syntax of DataFrame.drop_duplicates()
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
Following are the parameters of drop_duplicates().
subset – Column label or sequence of labels, optional. Default is None. Specifies the column(s) to consider for identifying duplicates. If None, all columns are considered.
keep – {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep. 'first' drops duplicates except for the first occurrence, 'last' drops duplicates except for the last occurrence, and False drops all duplicates.
inplace – bool, default False. If True, the operation is performed in place and the original DataFrame is modified. If False, a new DataFrame is returned.
ignore_index – bool, default False. If True, the resulting axis is labeled 0, 1, …, n – 1.
Indexes, including time indexes, are ignored when identifying duplicates.
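The effect of the three keep values is easiest to see by looking at which row labels survive. The following is a minimal sketch on a small sample DataFrame (the data and variable names are illustrative assumptions, not part of the pandas API).

```python
import pandas as pd

# Illustrative sample data: rows 1 and 2 are identical
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
})

# keep='first' (default): the first PySpark row (label 1) survives
first_labels = df.drop_duplicates(keep="first").index.tolist()

# keep='last': the last PySpark row (label 2) survives
last_labels = df.drop_duplicates(keep="last").index.tolist()

# keep=False: every row that has a duplicate is dropped
none_labels = df.drop_duplicates(keep=False).index.tolist()

print(first_labels)  # [0, 1, 3]
print(last_labels)   # [0, 2, 3]
print(none_labels)   # [0, 3]
```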
Drop Duplicates in DataFrame
To run some examples of the pandas DataFrame.drop_duplicates() function, let's create a pandas DataFrame.
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","PySpark","Pandas"],
    'Fee' :[20000,22000,22000,30000],
    'Duration':['30days','35days','35days','50days'],
}
# Create dataframe
df = pd.DataFrame(technologies)
print(df)
Below is the DataFrame with duplicates. Rows 1 and 2 are identical.
# Output:
   Courses    Fee Duration
0    Spark  20000   30days
1  PySpark  22000   35days
2  PySpark  22000   35days
3   Pandas  30000   50days
Now, applying the drop_duplicates() function on the DataFrame as shown below drops the duplicate rows.
# Drop duplicates
df1 = df.drop_duplicates()
print(df1)
Following is the output.
# Output:
   Courses    Fee Duration
0    Spark  20000   30days
1  PySpark  22000   35days
3   Pandas  30000   50days
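Notice that the surviving rows keep their original labels (0, 1, 3). If a continuous index is preferred, ignore_index=True relabels the result. A small sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
    "Duration": ["30days", "35days", "35days", "50days"],
})

# Default: surviving rows keep their original labels
kept = df.drop_duplicates()

# ignore_index=True relabels the result 0, 1, ..., n - 1
renumbered = df.drop_duplicates(ignore_index=True)

print(kept.index.tolist())        # [0, 1, 3]
print(renumbered.index.tolist())  # [0, 1, 2]
```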
Drop Duplicates on Selected Columns
Use the subset parameter to drop duplicates based on selected columns. This parameter is optional. By default it is None, which means all columns are used to identify duplicates.
# Using subset option
df3 = df.drop_duplicates(subset=['Courses'])
print(df3)
# Output:
   Courses    Fee Duration
0    Spark  20000   30days
1  PySpark  22000   35days
3   Pandas  30000   50days
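In the sample data the rows that share a course name are also identical in every other column, so the result matches the full-row check. The sketch below uses a slightly different, made-up DataFrame in which the choice of subset changes which rows count as duplicates.

```python
import pandas as pd

# Illustrative data: no two rows are fully identical,
# but some share Courses, and some share Courses and Fee
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "PySpark"],
    "Fee": [20000, 22000, 22000, 25000],
    "Duration": ["30days", "35days", "40days", "35days"],
})

# All columns: nothing is dropped
all_cols = df.drop_duplicates()

# One column: only the first row per course is kept
by_course = df.drop_duplicates(subset=["Courses"])

# Two columns: rows must match on both Courses and Fee to be duplicates
by_course_fee = df.drop_duplicates(subset=["Courses", "Fee"])

print(all_cols.index.tolist())       # [0, 1, 2, 3]
print(by_course.index.tolist())      # [0, 1]
print(by_course_fee.index.tolist())  # [0, 1, 3]
```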
FAQ on pandas.DataFrame.drop_duplicates()
The drop_duplicates() function removes rows that are identical to a previous row, keeping the first occurrence by default.
To remove duplicates based on specific columns in a pandas DataFrame, use the drop_duplicates() function with the subset parameter. This lets you specify which columns to consider when identifying duplicates, while keeping the first or last occurrence of each duplicate.
You can keep the last occurrence instead of the first when removing duplicates in a pandas DataFrame. To do this, use the drop_duplicates() function with the keep='last' argument.
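Combining keep='last' with subset is useful when later rows supersede earlier ones. The following sketch assumes a made-up DataFrame in which the PySpark fee was recorded twice and the later value should win.

```python
import pandas as pd

# Illustrative data: the PySpark fee appears twice with different values
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 23000, 30000],
})

# Keep only the last row seen for each course
latest = df.drop_duplicates(subset=["Courses"], keep="last")

print(latest.index.tolist())   # [0, 2, 3]
print(latest["Fee"].tolist())  # [20000, 23000, 30000]
```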
By default, the drop_duplicates() operation is not in-place, meaning it returns a new DataFrame with the duplicates removed while leaving the original DataFrame unchanged. If you want to modify the DataFrame in place (i.e., remove the duplicates directly from the original DataFrame without creating a new one), set the inplace parameter to True.
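The difference between the two modes can be sketched as follows (the sample data and variable names are illustrative). Note that with inplace=True the call returns None, so its result should not be assigned back to the DataFrame variable.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark"],
    "Fee": [20000, 22000, 22000],
})

# Default (inplace=False): a new DataFrame is returned, df is untouched
deduped = df.drop_duplicates()
rows_in_original = len(df)
rows_in_copy = len(deduped)

# inplace=True: df itself is modified and the call returns None
result = df.drop_duplicates(inplace=True)
rows_after_inplace = len(df)

print(rows_in_original, rows_in_copy)  # 3 2
print(result)                          # None
print(rows_after_inplace)              # 2
```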
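For MultiIndex DataFrames, drop_duplicates() works the same way: it compares column values and ignores the index. The subset parameter accepts column labels only, so to take index levels into account you first need to move them into columns, for example with reset_index().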
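Because the index does not take part in the duplicate check, deduplicating on index values needs a different approach. The sketch below shows two common options on a made-up MultiIndex DataFrame: a boolean mask built with Index.duplicated(), and a round trip through reset_index(). The level names and data are assumptions for illustration.

```python
import pandas as pd

# Illustrative MultiIndex frame: the (PySpark, 2023) entry appears twice
index = pd.MultiIndex.from_tuples(
    [("Spark", 2022), ("PySpark", 2023), ("PySpark", 2023)],
    names=["Courses", "Year"],
)
df = pd.DataFrame({"Fee": [20000, 22000, 25000]}, index=index)

# drop_duplicates() compares column values only; all Fee values differ
unchanged = df.drop_duplicates()

# Option 1: keep the first row for each distinct index entry
by_index = df[~df.index.duplicated(keep="first")]

# Option 2: move the levels into columns, deduplicate, restore the index
by_levels = (
    df.reset_index()
      .drop_duplicates(subset=["Courses", "Year"])
      .set_index(["Courses", "Year"])
)

print(len(unchanged))            # 3
print(by_index["Fee"].tolist())  # [20000, 22000]
print(by_levels["Fee"].tolist()) # [20000, 22000]
```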
Conclusion
In this article, you have learned how to drop duplicate rows using pandas.DataFrame.drop_duplicates() and how to use the subset option to limit the duplicate check to selected columns.
Reference
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html