• Post author:
  • Post category:Pandas
  • Post last modified:November 3, 2024
  • Reading time:12 mins read
You are currently viewing How to Count Duplicates in Pandas DataFrame

You can count duplicates in pandas DataFrame by using DataFrame.pivot_table() function. This function counts the number of duplicate entries in a single column, or multiple columns, and counts duplicates when having NaN values in the DataFrame. In this article, I will explain how to count duplicates in pandas DataFrame with examples.

Advertisements

Key Points –

  • Identify duplicates with .duplicated(): Use .duplicated() to find duplicate rows or specify columns to detect duplicates in specific fields.
  • Use .pivot_table() for grouped duplicates: Aggregate duplicates with .pivot_table(), which groups based on column values and provides counts.
  • Apply aggfunc='size' in .pivot_table() to count duplicates: Use .pivot_table() with size aggregation to get a breakdown of duplicates by one or more columns.
  • Count unique duplicates using .groupby(): Group by all columns or specific columns and use .size() to get counts for each unique row or value.
  • Handle NaN values with .fillna(): Replace NaNs with a placeholder value before counting duplicates to avoid NaN being treated as unique.

Quick Examples of Count Duplicates in DataFrame

If you are in a hurry, below are some quick examples of how to count duplicates in DataFrame.


# Quick examples of count duplicates in dataframe

# Example 1: Get count duplicates single column 
# Using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')

# Example 2: Get count duplicates multiple columns 
# Using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')

# Example 3: Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')

# Example 4: Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')

# Example 5: Count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())

# Example 6: Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())

# Example 7: Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()

Now, Let’s create Pandas DataFrame using data from a Python dictionary, where the columns are CoursesFeeDuration and Discount.


# Get The Count Duplicates in DataFrame 
import pandas as pd
import numpy as np
technologies= ({
    'Courses':["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee' :[22000,25000,22000,24000,22000,24000,25000],
    'Duration':['30days',np.nan,'50days','40days','30days', None,np.nan],
    'Discount':[1000,2300,1000,2300,2500,1500,2000]
              })
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

pandas dataframe count duplicates

Pandas Count Duplicates

You can use DataFrame.pivot_table() function to count the duplicates in a single column. Set index parameter as a list with a column along with aggfunc=size into pivot_table() function, it will return the count of the duplicate values of a specified single column of a given DataFrame.


# Get count of duplicates in a single column using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
print("Get count of duplicate values in a single column:\n", df2)

Yields below output.

pandas dataframe count duplicates

Get Count Duplicates of Multiple Columns

You can also use DataFrame.pivot_table() function to count the duplicates in multiple columns. For that, set index parameter as a list of multiple columns along with aggfunc=size into pivot_table() function, it will return the count of the duplicate values of specified multiple columns of a given DataFrame.


# Get count duplicates multiple columns using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
print("Get count of duplicate values in multiple columns:\n", df2)

Yields below output.


# Output:
# Get count of duplicate values in multiple columns:
Courses  Fee  
Hadoop   22000    1
         25000    1
Pandas   24000    2
PySpark  25000    1
Spark    22000    2
dtype: int64

Get Count Duplicates When having NaN Values

To count duplicate values of a column which has NaN values in a DataFrame using pivot_table() function. First, let’s see what happens when we have NaN values on a column you are checking for duplicates.


# Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print("Get count of duplicate values in a column of NaN values:\n", df2)

Yields below output. Notice that this function has ignored all duplicate NaN values.


# Output:
# Get count of duplicate values in a column of NaN values:
Duration
30days    2
40days    1
50days    1
dtype: int64

Get Count Duplicate null Values Using fillna()

You can use fillna() function to assign a null value for a NaN and then call the pivot_table() function, It will return the count of the duplicate null values of a given DataFrame.


# Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print("Get count of duplicate values of NULL values:\n", df2)

Yields below output.


# Output:
# Get count of duplicate values of NULL values:
Duration
30days    2
40days    1
50days    1
NULL      3
dtype: int64

Get the Count of Duplicate Rows in Pandas DataFrame

Similarly, If you like to count duplicates on a particular row or entire DataFrame using the len() function, this will return the count of the duplicate single rows and the entire DataFrame.


# count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
print(df2)

# Output:
# 3

# Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
print(df2)

# Output:
# 0

Get Count Duplicates for Each Unique Row

You can also use groupby on all the columns and call size to get the duplicate values. It will return the count of the duplicate values of each unique row of a given DataFrame. For examples,


# Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
print(df2)

# Output:
#   Courses    Fee Duration  Discount  size
# 0  Hadoop  22000   50days      1000     1
# 1  Pandas  24000   40days      2300     1
# 2   Spark  22000   30days      1000     1
# 3   Spark  22000   30days      2500     1

Complete Example For Count Duplicates in DataFrame


# Complete Example For Count Duplicates in DataFrame
import pandas as pd
import numpy as np
technologies= ({
    'Courses':["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee' :[22000,25000,22000,24000,22000,24000,25000],
    'Duration':['30days',np.nan,'50days','40days','30days', None,np.nan],
    'Discount':[1000,2300,1000,2300,2500,1500,2000]
              })
df = pd.DataFrame(technologies)
print(df)

# Get count duplicates single column using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
print(df2)

# Get count duplicates multiple columns using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
print(df2)

# Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)

# Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)

# Count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
print(df2)

# Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
print(df2)

# Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
print(df2)

Conclusion

In this article, I have explained how to count duplicates in pandas DataFrame by using DataFrame.pivot_table() function with examples like counting duplicates from a single column, multiple columns, counting with NaN etc.

Happy Learning !!

Related Articles

References