Pandas Get List of All Duplicate Rows


The pandas DataFrame.duplicated() function is used to find and select all duplicate rows in a DataFrame, based on either all columns or a subset of columns. Duplicate rows are rows that have the same values in every column being compared. Using this method you can get duplicate rows on all columns, a single column, or multiple selected columns. In this article, I will explain these with several examples.

1. Quick Examples of Getting a List of All Duplicate Rows

If you are in a hurry, below are some quick examples of how to get a list of all duplicate rows in a pandas DataFrame.


# Below are quick examples
# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]

# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]

# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]

# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]

# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]

Now, let’s create a DataFrame with a few duplicate rows on all columns. Our DataFrame contains column names Courses, Fee, Duration, and Discount.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','40days','30days','50days'],
    'Discount':[1000,2300,1200,2000,2300,1000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   pandas  30000   50days      2000
4   Python  22000   40days      2300
5    Spark  20000   30days      1000
6   pandas  30000   50days      2000

2. Select Duplicate Rows Based on All Columns

You can use df[df.duplicated()] without any arguments to get rows that have the same values on all columns. The duplicated() method defaults to subset=None and keep='first'. The below example returns two rows, as these are the duplicate rows in our DataFrame.


# Select duplicate rows of all columns
df2 = df[df.duplicated()]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
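
For clarity, duplicated() by itself returns a boolean Series with one value per row, and df[...] simply uses that Series as a mask. Below is a minimal sketch on the same DataFrame; the printed values follow from the data above.


# duplicated() returns a boolean Series; True marks every occurrence
# after the first (defaults: subset=None, keep='first')
print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5     True
# 6     True
# dtype: bool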

You can set keep=False in the duplicated() function to get all the duplicate rows, including their first occurrences.


# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
0   Spark  20000   30days      1000
3  pandas  30000   50days      2000
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
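
If you only want to know how many duplicate rows exist rather than the rows themselves, you can sum the boolean mask. This is a small additional sketch on the same DataFrame, not part of the original example.


# Number of duplicate rows, excluding first occurrences
print(df.duplicated().sum())            # 2

# Number of rows that have at least one duplicate anywhere in the DataFrame
print(df.duplicated(keep=False).sum())  # 4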

3. Get List of Duplicate Rows Except the Last Occurrence

To select all the duplicate rows except their last occurrence, pass the keep argument as 'last'. For instance, df[df.duplicated(keep='last')].


# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
0   Spark  20000   30days      1000
3  pandas  30000   50days      2000

4. Get List Of Duplicate Rows Using Single Columns

To select duplicate rows based on a single column, pass the column name as an argument.


# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
print(df2)

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
4  Python  22000   40days      2300
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
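
Note that both statements above exclude the first occurrence of each duplicated Courses value. If you also want the first occurrences, you can combine the single-column check with keep=False; this is a sketch on the same data.


# Keep every row whose Courses value appears more than once
df2 = df[df['Courses'].duplicated(keep=False)]
print(df2)
# Returns all rows except the PySpark row, the only unique Courses value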

5. Get List Of Duplicate Rows Using Multiple Columns

To find duplicate rows based on multiple columns, specify all the column names as a list.


# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
print(df2)

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
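
The keep=False option can also be combined with a list of column names, in case you want the first occurrences included as well; again, this is a sketch on the same DataFrame.


# All rows that share Courses, Fee and Duration with at least one other row
df2 = df[df.duplicated(['Courses', 'Fee', 'Duration'], keep=False)]
print(df2)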

6. Get List Of Duplicate Rows Using Sort Values

Let’s see how to sort the results of the duplicated() method. You can sort a pandas DataFrame by one or more columns using the sort_values() method.


# Get list Of duplicate rows using sort values
df2 = df[df.duplicated(['Discount'])==True].sort_values('Discount')
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
5   Spark  20000   30days      1000
6  pandas  30000   50days      2000
4  Python  22000   40days      2300

In the example above, duplicated(['Discount']) excludes the first occurrence of each Discount value. You can instead pass keep=False to keep all occurrences and then sort the filtered result with sort_values("Discount").


# Using sort values
df2 = df[df.Discount.duplicated(keep=False)].sort_values("Discount")
print(df2)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
5    Spark  20000   30days      1000
3   pandas  30000   50days      2000
6   pandas  30000   50days      2000
1  PySpark  25000   40days      2300
4   Python  22000   40days      2300
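
sort_values() also accepts a list of column names, so you can order the filtered duplicate rows by more than one column. The column choice below is just an illustration on the same data.


# Sort the duplicate Discount rows by Discount first, then by Fee
df2 = df[df.Discount.duplicated(keep=False)].sort_values(['Discount', 'Fee'])
print(df2)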

7. Complete Example of Getting a List of All Duplicate Rows


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas","Python","Spark","pandas"],
    'Fee' :[20000,25000,22000,30000,22000,20000,30000],
    'Duration':['30days','40days','35days','50days','40days','30days','50days'],
    'Discount':[1000,2300,1200,2000,2300,1000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

# Select duplicate rows except first occurrence based on all columns
df2 = df[df.duplicated()]

# Select duplicate row based on all columns
df2 = df[df.duplicated(keep=False)]
print(df2)

# Get duplicate last rows based on all columns
df2 = df[df.duplicated(keep = 'last')]
print(df2)

# Get list Of duplicate rows using single columns
df2 = df[df['Courses'].duplicated() == True]
print(df2)

# Get list of duplicate rows based on 'Courses' column
df2 = df[df.duplicated('Courses')]
print(df2)

# Get list Of duplicate rows using multiple columns
df2 = df[df[['Courses', 'Fee','Duration']].duplicated() == True]
print(df2)

# Get list of duplicate rows based on list of column names
df2 = df[df.duplicated(['Courses','Fee','Duration'])]
print(df2)

# Get list Of duplicate rows using sort values
df2 = df[df.duplicated(['Discount'])==True].sort_values('Discount')
print(df2)

# Using sort values
df2 = df[df.Discount.duplicated(keep=False)].sort_values("Discount")
print(df2)

Conclusion

In this article, you have learned how to get a list of all duplicate rows, based on all columns or on selected columns, using the pandas DataFrame duplicated() method, with examples.

Happy Learning !!
