Pandas Get List of All Duplicate Rows


The pandas DataFrame.duplicated() function is used to get/find/select a list of all duplicate rows (based on all columns or selected columns) from a DataFrame. A duplicate row is one whose values repeat those of another row across the compared columns. Using this method you can get duplicate rows on all columns or on a selected subset of columns. In this article, I will explain these with several examples.

1. Quick Examples of Getting a List of All Duplicate Rows

If you are in a hurry, below are some quick examples of how to get a list of all duplicate rows in pandas DataFrame.


# Below are the quick examples

# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]

# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]

# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]

# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]

# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]

Now, let’s create a DataFrame with a few duplicate rows on all columns. Our DataFrame contains column names Courses, Fee, Duration, and Discount.


# Create DataFrame
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','40days','30days','50days'],
    'Discount':[1000,2300,1200,2000,2300,1000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
4   Python  22000   40days      2300
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000

2. Select Duplicate Rows Based on All Columns

You can use df[df.duplicated()] without any arguments to get rows that have the same values on all columns. By default, duplicated() uses subset=None and keep='first', so the first occurrence of each duplicate is not marked. The example below returns two rows, as these are the duplicate rows in our DataFrame.


# Select duplicate rows of all columns
df2 = df[df.duplicated()]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
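
For reference, duplicated() itself returns a boolean Series marking each row as a duplicate (True) or not (False); the selection above simply uses that Series as a mask. A minimal sketch on the same DataFrame:


# duplicated() returns a boolean Series, which is used as the filter mask
print(df.duplicated())

Yields below output.


# Output:
0    False
1    False
2    False
3    False
4    False
5     True
6     True
dtype: bool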

You can set keep=False in the duplicated() function to get all the duplicate rows, including their first occurrences.


# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
0   Spark  20000   30days      1000
3  pandas  30000   50days      2000
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
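
If you only need to know how many rows are duplicated, you can sum the boolean mask directly; a small sketch, not part of the original examples:


# Count duplicate rows, excluding and including first occurrences
print(df.duplicated().sum())            # 2
print(df.duplicated(keep=False).sum())  # 4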

3. Get List of Duplicate Last Rows Based on All Columns

If you want to select all the duplicate rows except their last occurrence, pass the keep argument as 'last'. For instance, df[df.duplicated(keep='last')].


# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
0   Spark  20000   30days      1000
3  pandas  30000   50days      2000

4. Get List Of Duplicate Rows Using Single Columns

If you want to select duplicate rows based on a single column, pass the column name as an argument. Both of the following approaches produce the same result.


# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
print(df2)

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
4  Python  22000   40days      2300
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
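
As with all columns, you can combine a single-column check with keep=False to also include the first occurrence of each duplicated value; a minimal sketch:


# Get all rows with a duplicated 'Courses' value, including first occurrences
df2 = df[df['Courses'].duplicated(keep=False)]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
0   Spark  20000   30days      1000
2  Python  22000   35days      1200
3  pandas  30000   50days      2000
4  Python  22000   40days      2300
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000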

5. Get List Of Duplicate Rows Using Multiple Columns

To get/find duplicate rows based on multiple columns, specify the column names as a list. Both forms below produce the same result.


# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
print(df2)

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
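
The list of column names is passed as the subset parameter mentioned earlier; writing it out as a keyword argument is equivalent and can read more clearly. A small sketch:


# Pass the column list explicitly as the subset keyword
df2 = df[df.duplicated(subset=['Courses', 'Fee', 'Duration'])]
print(df2)

This yields the same two rows as above.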

6. Get List Of Duplicate Rows Using Sort Values

Let’s see how to sort the results of the duplicated() method. You can sort a pandas DataFrame by one or more columns using the sort_values() method.


# Get list Of duplicate rows using sort values
df2 = df[df.duplicated(['Discount'])==True].sort_values('Discount')
print(df2)

Yields below output.


# Output:
  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
4  Python  22000   40days      2300

Similarly, you can filter with keep=False and then apply sort_values("Discount") so that every occurrence of each duplicated Discount value is kept and grouped together.


# Using sort values
df2 = df[df.Discount.duplicated(keep=False)].sort_values("Discount")
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
5    Spark  20000   30days      1000
3   pandas  30000   50days      2000
6   pandas  30000   50days      2000
1  PySpark  25000   40days      2300
4   Python  22000   40days      2300
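
sort_values() also accepts a list of columns, so you can order the filtered duplicates by more than one field. A small sketch, sorting by Discount and then Fee (my own column choice, not from the original example):


# Sort the duplicate rows by multiple columns
df2 = df[df.Discount.duplicated(keep=False)].sort_values(['Discount', 'Fee'])
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
5    Spark  20000   30days      1000
3   pandas  30000   50days      2000
6   pandas  30000   50days      2000
4   Python  22000   40days      2300
1  PySpark  25000   40days      2300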

7. Complete Example of Getting a List of All Duplicate Rows


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','40days','30days','50days'],
    'Discount':[1000,2300,1200,2000,2300,1000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]

# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)

# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
print(df2)

# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
print(df2)

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
print(df2)

# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
print(df2)

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
print(df2)

# Get list Of duplicate rows using sort values
df2 = df[df.duplicated(['Discount'])==True].sort_values('Discount')
print(df2)

# Using sort values
df2 = df[df.Discount.duplicated(keep=False)].sort_values("Discount")
print(df2)

8. Conclusion

In this article, you have learned how to get/select a list of all duplicate rows (on all or selected columns) using the pandas DataFrame.duplicated() method with examples.

Happy Learning !!
