You can remove duplicate rows from a pandas DataFrame with the pandas.DataFrame.drop_duplicates() method. It can compare rows on all columns or only on a selected subset of columns. This article explains several ways to drop duplicate rows from a pandas DataFrame, with examples that use DataFrame.drop_duplicates(), DataFrame.apply(), and a lambda function.
1. Quick Examples of Drop Duplicate Rows
If you are in a hurry, below are some quick examples of how to drop duplicate rows in pandas DataFrame.
# Quick examples of dropping duplicate rows
# keep first duplicate row
df2 = df.drop_duplicates()
# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
# keep last duplicate row
df2 = df.drop_duplicates(keep='last')
# Remove all duplicate rows
df2 = df.drop_duplicates(keep=False)
# Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
# Using DataFrame.apply() and lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
2. drop_duplicates() Syntax & Examples
Below is the syntax of the DataFrame.drop_duplicates() function, which removes duplicate rows from a pandas DataFrame.
# Syntax of drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
- subset – Column label or sequence of labels. Its default value is None. Only the columns passed here are considered when identifying duplicate rows; with None, all columns are used.
- keep – Allowed values are {'first', 'last', False}, default 'first'.
  - 'first' – Duplicate rows are dropped except for the first occurrence.
  - 'last' – Duplicate rows are dropped except for the last occurrence.
  - False – All duplicate rows are dropped.
- inplace – Boolean value, by default False. When True, duplicate rows are removed from the existing DataFrame instead of returning a new one.
- ignore_index – Boolean value, by default False. When True, the resulting rows are relabeled 0, 1, …, n - 1.
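As a small illustration of the ignore_index parameter described above, here is a minimal sketch. The three-row frame and the names df_demo, kept, and renumbered are made up for this example and are not part of the article's data.

```python
import pandas as pd

# Hypothetical three-row frame; rows 0 and 2 are identical.
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark"],
    "Fee": [20000, 25000, 20000],
})

# keep='last' drops row 0, so the surviving index labels are 1 and 2.
kept = df_demo.drop_duplicates(keep="last")
print(kept.index.tolist())        # [1, 2]

# ignore_index=True relabels the surviving rows 0..n-1.
renumbered = df_demo.drop_duplicates(keep="last", ignore_index=True)
print(renumbered.index.tolist())  # [0, 1]
```

Without ignore_index=True, the original index labels are preserved, which is why the outputs later in this article show gaps in the index.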
Now, let’s create a DataFrame that contains a few duplicate rows. Our DataFrame has the columns Courses, Fee, Duration, and Discount.
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
'Fee' :[20000,25000,22000,30000,22000,20000,30000],
'Duration':['30days','40days','35days','50days','35days','30days','50days'],
'Discount':[1000,2300,1200,2000,1200,1000,2000]
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
4   Python  22000   35days      1200
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000
3. Pandas Drop Duplicate Rows
You can use DataFrame.drop_duplicates() without any arguments to drop rows that have the same values in all columns. It uses the default values subset=None and keep='first'. The example below returns four rows after removing the duplicate rows from our DataFrame.
# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)
# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print(df2)
Yields below output.
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
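To see which rows are treated as duplicates before dropping them, one option is DataFrame.duplicated(), which returns a boolean mask. The sketch below rebuilds the article's DataFrame and checks that filtering with the inverted mask gives the same result as drop_duplicates(); the names mask and filtered are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    "Fee": [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    "Duration": ["30days", "40days", "35days", "50days", "35days", "30days", "50days"],
    "Discount": [1000, 2300, 1200, 2000, 1200, 1000, 2000],
})

# duplicated() marks the second and later occurrences of each row (keep='first').
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, False, True, True, True]

# Keeping only the unmarked rows matches drop_duplicates().
filtered = df[~mask]
print(filtered.equals(df.drop_duplicates()))  # True
```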
4. Drop Duplicate Rows and Keep the Last Row
If you want to keep the last occurrence of each duplicated row instead of the first, pass keep='last'. For instance, df.drop_duplicates(keep='last').
# keep last duplicate row
df2 = df.drop_duplicates(keep='last')
print(df2)
Yields below output.
   Courses    Fee Duration  Discount
1  PySpark  25000   40days      2300
4   Python  22000   35days      1200
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000
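The following sketch compares keep='first' and keep='last' on the article's DataFrame. Because the duplicates here match on every column, both options keep the same row values and differ only in which index labels survive. The variable names first, last, and the sorted copies are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    "Fee": [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    "Duration": ["30days", "40days", "35days", "50days", "35days", "30days", "50days"],
    "Discount": [1000, 2300, 1200, 2000, 1200, 1000, 2000],
})

first = df.drop_duplicates(keep="first")
last = df.drop_duplicates(keep="last")
print(first.index.tolist())  # [0, 1, 2, 3]
print(last.index.tolist())   # [1, 4, 5, 6]

# Ignoring the index labels, both results hold the same four rows.
first_sorted = first.sort_values("Courses").reset_index(drop=True)
last_sorted = last.sort_values("Courses").reset_index(drop=True)
print(first_sorted.equals(last_sorted))  # True
```

The choice between the two matters more when duplicates are identified on a subset of columns, since the kept row can then differ in the remaining columns.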
5. Remove All Duplicate Rows from Pandas DataFrame
You can set keep=False in the drop_duplicates() function to remove every row that has a duplicate, including the first occurrence. For example, df.drop_duplicates(keep=False).
# Remove all duplicate rows
df2 = df.drop_duplicates(keep=False)
print(df2)
Yields below output.
   Courses    Fee Duration  Discount
1  PySpark  25000   40days      2300
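As a quick check of how the three keep options differ, the sketch below counts the rows each one leaves in the article's DataFrame and uses duplicated(keep=False) to list the rows that keep=False removes. The names sizes and dupes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    "Fee": [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    "Duration": ["30days", "40days", "35days", "50days", "35days", "30days", "50days"],
    "Discount": [1000, 2300, 1200, 2000, 1200, 1000, 2000],
})

# Number of rows left by each keep option.
sizes = {str(k): len(df.drop_duplicates(keep=k)) for k in ("first", "last", False)}
print(sizes)  # {'first': 4, 'last': 4, 'False': 1}

# duplicated(keep=False) marks every member of a duplicate group.
dupes = df[df.duplicated(keep=False)]
print(len(dupes))  # 6
```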
6. Delete Duplicate Rows based on Specific Columns
To delete duplicate rows based on specific columns, pass the column names as a list to the subset parameter. Combined with keep=False, this removes every row whose values in those columns appear more than once.
# Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
print(df2)
Yields the same output as above.
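In the article's DataFrame the duplicates match on every column, so subset does not change the result. The sketch below uses a made-up frame (df_demo, not from the article) in which two rows share a course but differ in fee, to show where subset matters. Sorting before dropping is one possible way to control which row is kept.

```python
import pandas as pd

# Hypothetical frame: two Spark rows with different fees.
df_demo = pd.DataFrame({
    "Courses": ["Spark", "Spark", "Python"],
    "Fee": [20000, 25000, 22000],
})

# No row is a full duplicate, so nothing is dropped by default.
print(len(df_demo.drop_duplicates()))  # 3

# Comparing on Courses only keeps the first Spark row.
by_course = df_demo.drop_duplicates(subset=["Courses"])
print(by_course["Fee"].tolist())  # [20000, 22000]

# Sort by Fee, then keep the last row per course to retain the highest fee.
top = df_demo.sort_values("Fee").drop_duplicates(subset=["Courses"], keep="last")
print(top["Fee"].tolist())  # [22000, 25000]
```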
7. Drop Duplicate Rows In Place
Pass inplace=True to remove duplicate rows from the existing DataFrame instead of returning a new one.
# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)
Yields below output.
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
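One detail worth knowing when using inplace=True is that the method returns None, so assigning its result to a variable loses the data. A minimal sketch with a made-up frame (df_demo and result are illustrative names):

```python
import pandas as pd

# Hypothetical frame with one duplicated row.
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark"],
    "Fee": [20000, 25000, 20000],
})

# With inplace=True the DataFrame itself is modified and None is returned.
result = df_demo.drop_duplicates(inplace=True)
print(result)        # None
print(len(df_demo))  # 2
```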
8. Remove Duplicate Rows Using DataFrame.apply() and Lambda Function
You can also use DataFrame.apply() with a lambda function to convert every value to a lowercase string before calling drop_duplicates(), so that values differing only in letter case are treated as duplicates.
# Using DataFrame.apply() and lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
print(df2)
Yields the same four rows as above, but with every value converted to a lowercase string (for example, Spark becomes spark).
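If the original casing and column types should be preserved, one alternative (a sketch, not the article's method) is to build the lowercased values only as a comparison key and filter the original DataFrame with duplicated(). The frame and the names df_demo, key, and result are made up for this example.

```python
import pandas as pd

# Hypothetical frame: "Spark" and "spark" differ only in letter case.
df_demo = pd.DataFrame({
    "Courses": ["Spark", "spark", "Python"],
    "Fee": [20000, 20000, 22000],
})

# Lowercased copy of the comparison columns, used only as a key.
key = df_demo[["Courses", "Fee"]].astype(str).apply(lambda col: col.str.lower())

# Keep the first row of each case-insensitive group, with original values intact.
result = df_demo[~key.duplicated(keep="first")]
print(result["Courses"].tolist())  # ['Spark', 'Python']
print(result["Fee"].tolist())      # [20000, 22000]
```

Here Fee stays numeric in the result, whereas applying astype(str) to the whole DataFrame turns every column into strings.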
9. Complete Example For Drop Duplicate Rows in DataFrame
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
'Fee' :[20000,25000,22000,30000,22000,20000,30000],
'Duration':['30days','40days','35days','50days','35days','30days','50days'],
'Discount':[1000,2300,1200,2000,1200,1000,2000]
}
df = pd.DataFrame(technologies)
print(df)
# keep first duplicate row
df2 = df.drop_duplicates()
print(df2)
# Using DataFrame.drop_duplicates() to keep first duplicate row
df2 = df.drop_duplicates(keep='first')
print(df2)
# keep last duplicate row
df2 = df.drop_duplicates(keep='last')
print(df2)
# Remove all duplicate rows
df2 = df.drop_duplicates(keep=False)
print(df2)
# Delete duplicate rows based on specific columns
df2 = df.drop_duplicates(subset=["Courses", "Fee"], keep=False)
print(df2)
# Drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)
# Using DataFrame.apply() and lambda function
df2 = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Courses', 'Fee'], keep='first')
print(df2)
Conclusion
In this article, you have learned how to drop duplicate rows using pandas.DataFrame.drop_duplicates(), DataFrame.apply(), and a lambda function, with examples.
Happy Learning !!