Pandas DataFrame.duplicated()
function is used to get/find/select a list of all duplicate rows(all or selected columns) from pandas. Duplicate rows means, having multiple rows on all columns. Using this method you can get duplicate rows on selected multiple columns or all columns. In this article, I will explain these with several examples.
1. Quick Examples of Get List of All Duplicate Items
If you are in a hurry, below are some quick examples of how to get a list of all duplicate rows in pandas DataFrame.
# Below are quick example
# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]
# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
Now, let’s create a DataFrame with a few duplicate rows on all columns. Our DataFrame contains column names Courses
, Fee
, Duration
, and Discount
.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
'Fee' :[20000,25000,22000,30000,22000,20000,30000],
'Duration':['30days','40days','35days','50days','40days','30days','50days'],
'Discount':[1000,2300,1200,2000,2300,1000,2000]
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
Courses Fee Duration Discount
0 Spark 20000 30days 1000
1 PySpark 25000 40days 2300
2 Python 22000 35days 1200
3 pandas 30000 50days 2000
4 Python 22000 40days 2300
5 Spark 20000 30days 1000
6 pandas 30000 50days 2000
2. Select Duplicate Rows Based on All Columns
You can use df[df.duplicated()]
without any arguments to get rows with the same values on all columns. It takes defaults values subset=None
and keep=‘first’
. The below example returns two rows as these are duplicate rows in our DataFrame.
# Select duplicate rows of all columns
df2 = df[df.duplicated()]
print(df2)
Yields below output.
Courses Fee Duration Discount
5 Spark 20000 30days 1000
6 pandas 30000 50days 2000
You can set 'keep=False'
in the duplicated function to get all the duplicate items without eliminating duplicate rows.
# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)
Yields below output.
Courses Fee Duration Discount
0 Spark 20000 30days 1000
3 pandas 30000 50days 2000
5 Spark 20000 30days 1000
6 pandas 30000 50days 2000
3. Get List of Duplicate Last Rows Based on All Columns
You want to select all the duplicate rows except their last occurrence, we must pass a keep
argument as ”last"
. For instance, df[df.duplicated(keep='last')]
.
# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
print(df2)
Yields below output.
Courses Fee Duration Discount
0 Spark 20000 30days 1000
3 pandas 30000 50days 2000
4. Get List Of Duplicate Rows Using Single Columns
You want to select duplicate rows based on single columns then pass the column name as an argument.
# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
print(df2)
# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
print(df2)
Yields below output.
Courses Fee Duration Discount
4 Python 22000 40days 2300
5 Spark 20000 30days 1000
6 pandas 30000 50days 2000
5. Get List Of Duplicate Rows Using Multiple Columns
To get/find duplicate rows on the basis of multiple columns, specify all column names as a list.
# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
print(df2)
# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
print(df2)
Yields below output.
Courses Fee Duration Discount
5 Spark 20000 30days 1000
6 pandas 30000 50days 2000
6. Get List Of Duplicate Rows Using Sort Values
Let’s see how to sort the results of duplicated() method. You can sort pandas DataFrame by one or multiple (one or more) columns using sort_values()
method.
# Get list Of duplicate rows using sort values
df2 = df[df.duplicated(['Discount'])==True].sort_values('Discount')
print(df2)
Yields below output.
Courses Fee Duration Discount
5 Spark 20000 30days 1000
6 pandas 30000 50days 2000
4 Python 22000 40days 2300
You can use sort_values("Discount")
instead to sort after duplicate filter.
# Using sort values
df2 = df[df.Discount.duplicated(keep=False)].sort_values("Discount")
print(df2)
Yields below output.
Courses Fee Duration Discount
0 Spark 20000 30days 1000
5 Spark 20000 30days 1000
3 pandas 30000 50days 2000
6 pandas 30000 50days 2000
1 PySpark 25000 40days 2300
4 Python 22000 40days 2300
7. Complete Example For Get List of All Duplicate Items
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
'Fee' :[20000,25000,22000,30000,22000,20000,30000],
'Duration':['30days','40days','35days','50days','40days','30days','50days'],
'Discount':[1000,2300,1200,2000,2300,1000,2000]
}
df = pd.DataFrame(technologies)
print(df)
# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]
# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)
# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
print(df2)
# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
print(df2)
# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
print(df2)
# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
print(df2)
# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
print(df2)
# Get list Of duplicate rows using sort values
df2 = df[df.duplicated(['Discount'])==True].sort_values('Discount')
print(df2)
# Using sort values
df2 = df[df.Discount.duplicated(keep=False)].sort_values("Discount")
print(df2)
Conclusion
In this article, you have learned how to get/select a list of all duplicate rows (all or multiple columns) using pandas DataFrame duplicated()
method with examples.
Happy Learning !!
Related Articles
- Select Rows From List of Values in Pandas DataFrame
- Set Order of Columns in Pandas DataFrame
- Pandas Add Constant Column to DataFrame
- Rename Index Values of Pandas DataFrame
- Pandas Rename Index of DataFrame
- pandas.DataFrame.drop_duplicates() – Examples
- Pandas.Index.drop_duplicates() Explained
- How to Drop Duplicate Columns in pandas DataFrame