To drop duplicate rows in a pandas DataFrame, you can use the drop_duplicates() method. With this method you can drop duplicate rows based on all columns or on a selection of columns. In this article, we'll explain several ways of dropping duplicate rows from a Pandas DataFrame with examples, using DataFrame.drop_duplicates(), DataFrame.apply() and a lambda function.
Key Points –
- The drop_duplicates() method in Pandas is used to remove duplicate rows from a DataFrame.
- By default, drop_duplicates() keeps the first occurrence of each duplicated row and removes subsequent ones.
- The subset parameter allows specifying specific columns to check for duplicates instead of the entire DataFrame.
- The keep parameter controls which duplicates to keep: 'first' for the first occurrence, 'last' for the last, or False to remove all duplicates.
- Setting inplace=True modifies the original DataFrame directly instead of creating a new one with duplicates removed.
- By default, drop_duplicates() returns a new DataFrame with duplicates removed, unless inplace=True is specified.
Quick Examples of Dropping Duplicate Rows
Below are quick examples of dropping duplicate rows in a Pandas DataFrame.
# Quick examples of drop duplicate rows
# Example 1: Keep first duplicate row
df2 = df.drop_duplicates()
# Example 2: Using DataFrame.drop_duplicates()
# To keep first duplicate row
df2 = df.drop_duplicates(keep='first')
# Example 3: Keep last duplicate row
df2 = df.drop_duplicates(keep='last')
# Example 4: Remove all duplicate rows
df2 = df.drop_duplicates(keep=False)
# Example 5: Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
# Example 6: Drop duplicate rows in place
df.drop_duplicates(inplace=True)
# Example 7: Using DataFrame.apply() and lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
drop_duplicates() Syntax & Examples
Following is the syntax of the DataFrame.drop_duplicates() function that removes duplicate rows from a pandas DataFrame.
# Syntax of drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
- subset – Specifies the subset of columns to consider when identifying duplicates. If None (default), all columns are considered.
- keep – Allowed values are {'first', 'last', False}, default 'first'.
  - 'first' – Duplicate rows except for the first occurrence are dropped.
  - 'last' – Duplicate rows except for the last occurrence are dropped.
  - False – All duplicate rows are dropped.
- inplace – Boolean value; when True, duplicate rows are removed from the existing DataFrame. Default is False.
- ignore_index – Boolean value, default False; when True, the resulting rows are relabeled with a new index 0, 1, ..., n-1.
To run some examples of dropping duplicate rows in a Pandas DataFrame, let's create a Pandas DataFrame from a dictionary.
# Create a Pandas DataFrame from a dictionary
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
'Fee' :[20000,25000,22000,30000,22000,20000,30000],
'Duration':['30days','40days','35days','50days','35days','30days','50days'],
'Discount':[1000,2300,1200,2000,1200,1000,2000]
}
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)
Yields below output.
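# Output:
# DataFrame:
# Courses Fee Duration Discount
# 0 Spark 20000 30days 1000
# 1 PySpark 25000 40days 2300
# 2 Python 22000 35days 1200
# 3 pandas 30000 50days 2000
# 4 Python 22000 35days 1200
# 5 Spark 20000 30days 1000
# 6 pandas 30000 50days 2000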
Pandas Drop Duplicate Rows
To drop duplicate rows in a Pandas DataFrame, you can use the drop_duplicates() method. By default, this method removes rows with identical values across all columns while keeping the first occurrence.
# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)
# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print("After dropping duplicate rows:\n", df2)
In the above example, drop_duplicates() is used without any arguments, which implies subset=None and keep='first'. Therefore, it removes duplicate rows based on all columns while keeping the first occurrence, resulting in a DataFrame with only unique rows. This example yields the below output.
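# Output:
# After dropping duplicate rows:
# Courses Fee Duration Discount
# 0 Spark 20000 30days 1000
# 1 PySpark 25000 40days 2300
# 2 Python 22000 35days 1200
# 3 pandas 30000 50days 2000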
Drop Duplicate Rows and Keep the Last Row
If you want to keep the last occurrence of each duplicate row instead of the first, pass the keep argument as 'last'. For instance, df.drop_duplicates(keep='last').
# Keep last duplicate row
df2 = df.drop_duplicates(keep='last')
print(df2)
# Output:
# Courses Fee Duration Discount
# 1 PySpark 25000 40days 2300
# 4 Python 22000 35days 1200
# 5 Spark 20000 30days 1000
# 6 pandas 30000 50days 2000
Remove All Duplicate Rows from Pandas DataFrame
You can set keep=False in the drop_duplicates() function to remove all duplicate rows, keeping none of the occurrences. For example, df.drop_duplicates(keep=False).
# Remove all duplicate rows
df2 = df.drop_duplicates(keep=False)
print(df2)
# Output:
# Courses Fee Duration Discount
# 1 PySpark 25000 40days 2300
Delete Duplicate Rows based on Specific Columns
To delete duplicate rows based on multiple columns, pass the column names as a list to the subset parameter. Setting keep=False in the drop_duplicates() function then removes every row that is duplicated in those columns.
# Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
print(df2)
# Output:
# Courses Fee Duration Discount
# 1 PySpark 25000 40days 2300
Drop Duplicate Rows In Place
To drop duplicate rows in place in a Pandas DataFrame, use the drop_duplicates() method with the inplace=True parameter. This will modify the original DataFrame instead of returning a new DataFrame.
# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)
# Output:
# Courses Fee Duration Discount
# 0 Spark 20000 30days 1000
# 1 PySpark 25000 40days 2300
# 2 Python 22000 35days 1200
# 3 pandas 30000 50days 2000
Keep in mind that using inplace=True directly modifies the original DataFrame and returns None rather than a new DataFrame, so there is no need to assign the result to a variable. If you want to keep the original DataFrame unchanged and create a new one with duplicate rows dropped, omit inplace=True.
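If you want to confirm this behavior, a quick sketch (using the same df as above) shows that the in-place call returns None:

# drop_duplicates(inplace=True) modifies df and returns None
result = df.drop_duplicates(inplace=True)
print(result)  # None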
Remove Duplicate Rows Using apply() and Lambda Function
You can combine DataFrame.apply() with a lambda function to convert every value to a lowercase string before dropping duplicates. This makes the duplicate check on the selected columns case-insensitive.
# Using DataFrame.apply() and lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
print(df2)
# Output:
# Courses Fee Duration Discount
# 0 spark 20000 30days 1000
# 1 pyspark 25000 40days 2300
# 2 python 22000 35days 1200
# 3 pandas 30000 50days 2000
Complete Example of Dropping Duplicate Rows in a DataFrame
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
'Fee' :[20000,25000,22000,30000,22000,20000,30000],
'Duration':['30days','40days','35days','50days','35days','30days','50days'],
'Discount':[1000,2300,1200,2000,1200,1000,2000]
}
df = pd.DataFrame(technologies)
print(df)
# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)
# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print(df2)
# keep last duplicate row
df2 = df.drop_duplicates(keep='last')
print(df2)
# Remove all duplicate rows
df2 = df.drop_duplicates(keep=False)
print(df2)
# Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
print(df2)
# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)
# Using DataFrame.apply() and lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
print(df2)
FAQ on Dropping Duplicate Rows in a DataFrame
How do you drop duplicate rows in a Pandas DataFrame?
You can use the drop_duplicates() method. By default, it removes duplicate rows based on all columns, keeping the first occurrence.
What does dropping duplicate rows mean?
Dropping duplicate rows means removing rows from a DataFrame that have the same values across all columns or a specified subset of columns.
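If you want to see which rows count as duplicates before dropping anything, a minimal sketch using the related DataFrame.duplicated() method (not covered above, assuming df is the technologies DataFrame created at the start of this article) looks like this:

# Boolean Series marking rows that repeat an earlier row
print(df.duplicated())
# Show only the duplicated rows themselves
print(df[df.duplicated()])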
How do you drop duplicates based on specific columns?
You can drop duplicates based on specific columns by specifying those columns in the subset parameter of drop_duplicates().
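For instance, a minimal sketch that keeps only the first row for each unique value in the Courses column of the same DataFrame:

# Drop duplicates considering only the 'Courses' column
df2 = df.drop_duplicates(subset=['Courses'])
print(df2)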
How do you reset the index after dropping duplicates?
You can reset the index after dropping duplicates by using the ignore_index=True parameter within the drop_duplicates() method. This will reindex the resulting DataFrame, providing a new sequential index starting from 0.
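For example, a minimal sketch that drops duplicates and renumbers the index in a single call (ignore_index requires pandas 1.0 or later):

# Drop duplicate rows and reset the index to 0, 1, 2, ...
df2 = df.drop_duplicates(ignore_index=True)
print(df2)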
Conclusion
In this article, you learned various methods to drop or remove duplicate rows from a Pandas DataFrame. We explored using pandas.DataFrame.drop_duplicates() with different parameters to customize the behavior, as well as leveraging DataFrame.apply() and lambda functions for more advanced scenarios.
Happy Learning !!
Related Articles
- Pandas Drop Rows by Index
- How to drop rows with NaN values?
- Drop Pandas rows with condition
- Pandas Drop Rows Based on Column Value
- Pandas – Drop List of Rows From DataFrame
- Pandas Drop Last N Rows From DataFrame
- pandas.DataFrame.drop_duplicates() – Examples
- How to drop the first row from the Pandas DataFrame
- Pandas Drop the First Three Rows From DataFrame
- How to Drop Rows From Pandas DataFrame Examples