Pandas Get List of All Duplicate Rows


The pandas DataFrame.duplicated() function is used to find and select all duplicate rows in a DataFrame, based on either all columns or a subset of columns. Duplicate rows are rows that have the same values in every column being compared. Using this method you can get duplicate rows on all columns, a single column, or multiple selected columns. In this article, I will explain these with several examples.

1. Quick Examples of Getting a List of All Duplicate Rows

If you are in a hurry, below are some quick examples of how to get a list of all duplicate rows in a pandas DataFrame.


# Below are quick examples
# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]

# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]

# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]

# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]

# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]

Now, let’s create a DataFrame with a few duplicate rows on all columns. Our DataFrame contains column names Courses, Fee, Duration, and Discount.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','40days','30days','50days'],
    'Discount':[1000,2300,1200,2000,2300,1000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
4   Python  22000   40days      2300
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000

2. Select Duplicate Rows Based on All Columns

You can use df[df.duplicated()] without any arguments to get rows that have the same values on all columns. The duplicated() method defaults to subset=None and keep='first'. The below example returns two rows, as these are the duplicate rows in our DataFrame.


# Select duplicate rows of all columns
df2 = df[df.duplicated()]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
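
For clarity, duplicated() by itself returns a boolean Series with one value per row, and df[...] simply uses that Series as a mask. Below is a minimal sketch on the same DataFrame; the printed values follow from the data above.


# duplicated() returns a boolean Series; True marks every occurrence
# after the first (defaults: subset=None, keep='first')
print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5     True
# 6     True
# dtype: bool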

You can set keep=False in the duplicated() function to get all the duplicate rows, including their first occurrences.


# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
0   Spark  20000   30days      1000
3  pandas  30000   50days      2000
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
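
If you only want to know how many duplicate rows exist rather than the rows themselves, you can sum the boolean mask. This is a small additional sketch on the same DataFrame, not part of the original example.


# Number of duplicate rows, excluding first occurrences
print(df.duplicated().sum())            # 2

# Number of rows that have at least one duplicate anywhere in the DataFrame
print(df.duplicated(keep=False).sum())  # 4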

3. Get List of Duplicate Rows Except the Last Occurrence

To select all the duplicate rows except their last occurrence, pass the keep argument as 'last'. For instance, df[df.duplicated(keep='last')].


# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
0   Spark  20000   30days      1000
3  pandas  30000   50days      2000

4. Get List Of Duplicate Rows Using Single Columns

To select duplicate rows based on a single column, pass the column name as an argument.


# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
print(df2)

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
4  Python  22000   40days      2300
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
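
Note that both statements above exclude the first occurrence of each duplicated Courses value. If you also want the first occurrences, you can combine the single-column check with keep=False; this is a sketch on the same data.


# Keep every row whose Courses value appears more than once
df2 = df[df['Courses'].duplicated(keep=False)]
print(df2)
# Returns all rows except the PySpark row, the only unique Courses value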

5. Get List Of Duplicate Rows Using Multiple Columns

To find duplicate rows based on multiple columns, specify all the column names as a list.


# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
print(df2)

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
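
The keep=False option can also be combined with a list of column names, in case you want the first occurrences included as well; again, this is a sketch on the same DataFrame.


# All rows that share Courses, Fee and Duration with at least one other row
df2 = df[df.duplicated(['Courses', 'Fee', 'Duration'], keep=False)]
print(df2)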

6. Get List Of Duplicate Rows Using Sort Values

Let’s see how to sort the results of the duplicated() method. You can sort a pandas DataFrame by one or more columns using the sort_values() method.


# Get list Of duplicate rows using sort values
df2 = df[df.duplicated(['Discount'])==True].sort_values('Discount')
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
4  Python  22000   40days      2300

In the example above, duplicated(['Discount']) excludes the first occurrence of each Discount value. You can instead pass keep=False to keep all occurrences and then sort the filtered result with sort_values("Discount").


# Using sort values
df2 = df[df.Discount.duplicated(keep=False)].sort_values("Discount")
print(df2)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
5    Spark  20000   30days      1000
3   pandas  30000   50days      2000
6   pandas  30000   50days      2000
1  PySpark  25000   40days      2300
4   Python  22000   40days      2300
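
sort_values() also accepts a list of column names, so you can order the filtered duplicate rows by more than one column. The column choice below is just an illustration on the same data.


# Sort the duplicate Discount rows by Discount first, then by Fee
df2 = df[df.Discount.duplicated(keep=False)].sort_values(['Discount', 'Fee'])
print(df2)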

7. Complete Example of Getting a List of All Duplicate Rows


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','40days','30days','50days'],
    'Discount':[1000,2300,1200,2000,2300,1000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]

# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)

# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
print(df2)

# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
print(df2)

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
print(df2)

# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
print(df2)

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
print(df2)

# Get list Of duplicate rows using sort values
df2 = df[df.duplicated(['Discount'])==True].sort_values('Discount')
print(df2)

# Using sort values
df2 = df[df.Discount.duplicated(keep=False)].sort_values("Discount")
print(df2)

Conclusion

In this article, you have learned how to get a list of all duplicate rows, based on all columns or on selected columns, using the pandas DataFrame duplicated() method, with examples.

Happy Learning !!
