How to Count Duplicates in Pandas DataFrame

You can count duplicates in a pandas DataFrame by using the DataFrame.pivot_table() function. This approach counts duplicate entries in a single column or across multiple columns, and can also count duplicates when the DataFrame contains NaN values. In this article, I will explain how to count duplicates in a pandas DataFrame with examples.

1. Quick Examples of Count Duplicates in DataFrame

If you are in a hurry, below are some quick examples of how to count duplicates in DataFrame.



# Example 1: Get count duplicates single column 
# using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')

# Example 2: Get count duplicates multiple columns 
# using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')

# Example 3: Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')

# Example 4: Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')

# Example 5: Count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())

# Example 6: Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())

# Example 7: Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()

Now, let's create a pandas DataFrame from a Python dictionary, where the columns are Courses, Fee, Duration and Discount.


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee' :[22000,25000,22000,24000,22000,24000,25000],
    'Duration':['30days',np.nan,'50days','40days','30days', None,np.nan],
    'Discount':[1000,2300,1000,2300,2500,1500,2000]
}
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000      NaN      2300
2   Hadoop  22000   50days      1000
3   Pandas  24000   40days      2300
4    Spark  22000   30days      2500
5   Pandas  24000     None      1500
6   Hadoop  25000      NaN      2000

2. Pandas Count Duplicates

We can use the DataFrame.pivot_table() function to count duplicates in a single column. Pass the column name as a list to the index parameter along with aggfunc='size'; it returns the count of each distinct value in the specified column of the DataFrame.


# Get count duplicates single column using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
print(df2)

Yields below output.


# Output
Courses
Hadoop     2
Pandas     2
PySpark    1
Spark      2
dtype: int64
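As an alternative not shown above, the same single-column counts can be obtained with Series.value_counts(); a minimal sketch on the Courses column:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
})

# value_counts() returns the count of each distinct value,
# sorted by count in descending order
counts = df['Courses'].value_counts()
print(counts)
```

Unlike pivot_table(), the result is sorted by count rather than by value.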

3. Get Count Duplicates of Multiple Columns

We can also use the DataFrame.pivot_table() function to count duplicates across multiple columns. For that, pass a list of multiple column names to the index parameter along with aggfunc='size'; it returns the count of each distinct combination of values in the specified columns.


# Get count duplicates multiple columns using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
print(df2)

Yields below output.


# Output
Courses  Fee  
Hadoop   22000    1
         25000    1
Pandas   24000    2
PySpark  25000    1
Spark    22000    2
dtype: int64
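For multiple columns, recent pandas versions (1.1 and later) also provide DataFrame.value_counts(), which counts distinct row combinations; a sketch under that version assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee': [22000,25000,22000,24000,22000,24000,25000],
})

# Count each distinct (Courses, Fee) combination (pandas >= 1.1)
counts = df[['Courses', 'Fee']].value_counts()
print(counts)
```

The result is a Series with a MultiIndex of (Courses, Fee) pairs, much like the pivot_table() output above.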

4. Get Count Duplicates When having NaN Values

Let's also learn how to count duplicates when the DataFrame contains NaN values, using the pivot_table() function. First, let's see what happens when the column you are checking for duplicates has NaN values.


# Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)

Yields below output. Notice that pivot_table() has dropped all NaN values by default, so the three missing Duration values are excluded from the counts.


# Output
Duration
30days    2
40days    1
50days    1
dtype: int64
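Instead of replacing the missing values first (as the next section does), you can also keep them in the counts with value_counts(dropna=False); a minimal sketch, assuming a recent pandas version where None and np.nan are grouped into a single NaN bucket:

```python
import pandas as pd
import numpy as np

# Same values as the Duration column above
duration = pd.Series(['30days', np.nan, '50days', '40days', '30days', None, np.nan])

# dropna=False keeps the missing values in the result
counts = duration.value_counts(dropna=False)
print(counts)
```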

5. Get Count Duplicate null Values Using fillna()

We can use the fillna() function to replace each NaN with a placeholder string (here 'NULL') and then call the pivot_table() function. Since the placeholder is a regular value, it is included in the counts, which gives us the number of duplicate null values in the DataFrame.


# Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)

Yields below output.


# Output
Duration
30days    2
40days    1
50days    1
NULL      3
dtype: int64
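If you only need the number of missing values rather than a full breakdown, isna().sum() gives it directly; a minimal sketch:

```python
import pandas as pd
import numpy as np

# Same values as the Duration column above
duration = pd.Series(['30days', np.nan, '50days', '40days', '30days', None, np.nan])

# isna() flags both np.nan and None; sum() counts the True flags
null_count = duration.isna().sum()
print(null_count)  # 3
```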

6. Get Count Duplicate Rows in Pandas DataFrame

Similarly, if you want to count duplicates in a particular column or in the entire DataFrame, subtract the length after drop_duplicates() from the original length using len(). This returns the number of duplicate values in a single column or the number of duplicate rows in the whole DataFrame.


# count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
print(df2)

# Output
3

# Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
print(df2)

# Output
0
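The same counts can be computed with duplicated(), which flags every repeat of an earlier value (or row); summing the flags gives the duplicate count. A minimal sketch on the same data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee': [22000,25000,22000,24000,22000,24000,25000],
    'Duration': ['30days',np.nan,'50days','40days','30days', None,np.nan],
    'Discount': [1000,2300,1000,2300,2500,1500,2000]
})

# Count duplicate values in a single column
col_dups = df['Courses'].duplicated().sum()

# Count fully duplicated rows
row_dups = df.duplicated().sum()
print(col_dups, row_dups)  # 3 0
```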

7. Get Count Duplicates for Each Unique Row

You can also use groupby() on all the columns and call size() to get the duplicate counts. It returns the count of each unique row of a given DataFrame. Note that groupby() drops groups whose key contains NaN by default, so the output below (produced from the original DataFrame, before the fillna() in the previous section) excludes the rows with missing Duration values. For example:


# Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
print(df2)

# Output
  Courses    Fee Duration  Discount  size
0  Hadoop  22000   50days      1000     1
1  Pandas  24000   40days      2300     1
2   Spark  22000   30days      1000     1
3   Spark  22000   30days      2500     1
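If you want the rows with NaN keys included in the per-row counts, pandas 1.1 and later support dropna=False in groupby(); a sketch under that version assumption:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee': [22000,25000,22000,24000,22000,24000,25000],
    'Duration': ['30days',np.nan,'50days','40days','30days', None,np.nan],
    'Discount': [1000,2300,1000,2300,2500,1500,2000]
})

# dropna=False keeps groups whose key contains NaN (pandas >= 1.1)
df2 = df.groupby(df.columns.tolist(), as_index=False, dropna=False).size()
print(df2)
```

Here all 7 rows appear in the result, since every row is unique once Discount is considered.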

8. Complete Example For Count Duplicates in DataFrame


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee' :[22000,25000,22000,24000,22000,24000,25000],
    'Duration':['30days',np.nan,'50days','40days','30days', None,np.nan],
    'Discount':[1000,2300,1000,2300,2500,1500,2000]
}
df = pd.DataFrame(technologies)
print(df)

# Get count duplicates single column using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
print(df2)

# Get count duplicates multiple columns using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
print(df2)

# Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)

# Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)

# count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
print(df2)

# Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
print(df2)

# Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
print(df2)

9. Conclusion

In this article, I have explained how to count duplicates in a pandas DataFrame by using the DataFrame.pivot_table() function, with examples such as counting duplicates in a single column, in multiple columns, and when NaN values are present.

Happy Learning !!
