• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:11 mins read
You are currently viewing How to Count Duplicates in Pandas DataFrame

You can count duplicates in pandas DataFrame by using DataFrame.pivot_table() function. This function counts the number of duplicate entries in a single column, or multiple columns, and counts duplicates when having NaN values in the DataFrame. In this article, I will explain how to count duplicates in pandas DataFrame with examples.

1. Quick Examples of Count Duplicates in DataFrame

If you are in a hurry, below are some quick examples of how to count duplicates in DataFrame.


# Below are the quick examples

# Example 1: Get count duplicates single column 
# Using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')

# Example 2: Get count duplicates multiple columns 
# Using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')

# Example 3: Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')

# Example 4: Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')

# Example 5: Count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())

# Example 6: Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())

# Example 7: Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()

Now, Let’s create Pandas DataFrame using data from a Python dictionary, where the columns are CoursesFeeDuration and Discount.


# Get The Count Duplicates in DataFrame 
import pandas as pd
import numpy as np
technologies= ({
    'Courses':["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee' :[22000,25000,22000,24000,22000,24000,25000],
    'Duration':['30days',np.nan,'50days','40days','30days', None,np.nan],
    'Discount':[1000,2300,1000,2300,2500,1500,2000]
              })
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

pandas dataframe count duplicates

2. Pandas Count Duplicates

You can use DataFrame.pivot_table() function to count the duplicates in a single column. Set index parameter as a list with a column along with aggfunc=size into pivot_table() function, it will return the count of the duplicate values of a specified single column of a given DataFrame.


# Get count of duplicates in a single column using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
print("Get count of duplicate values in a single column:\n", df2)

Yields below output.

pandas dataframe count duplicates

3. Get Count Duplicates of Multiple Columns

You can also use DataFrame.pivot_table() function to count the duplicates in multiple columns. For that, set index parameter as a list of multiple columns along with aggfunc=size into pivot_table() function, it will return the count of the duplicate values of specified multiple columns of a given DataFrame.


# Get count duplicates multiple columns using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
print("Get count of duplicate values in multiple columns:\n", df2)

Yields below output.


# Output:
# Get count of duplicate values in multiple columns:
Courses  Fee  
Hadoop   22000    1
         25000    1
Pandas   24000    2
PySpark  25000    1
Spark    22000    2
dtype: int64

4. Get Count Duplicates When having NaN Values

To count duplicate values of a column which has NaN values in a DataFrame using pivot_table() function. First, let’s see what happens when we have NaN values on a column you are checking for duplicates.


# Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print("Get count of duplicate values in a column of NaN values:\n", df2)

Yields below output. Notice that this function has ignored all duplicate NaN values.


# Output:
# Get count of duplicate values in a column of NaN values:
Duration
30days    2
40days    1
50days    1
dtype: int64

5. Get Count Duplicate null Values Using fillna()

You can use fillna() function to assign a null value for a NaN and then call the pivot_table() function, It will return the count of the duplicate null values of a given DataFrame.


# Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print("Get count of duplicate values of NULL values:\n", df2)

Yields below output.


# Output:
# Get count of duplicate values of NULL values:
Duration
30days    2
40days    1
50days    1
NULL      3
dtype: int64

6. Get the Count of Duplicate Rows in Pandas DataFrame

Similarly, If you like to count duplicates on a particular row or entire DataFrame using the len() function, this will return the count of the duplicate single rows and the entire DataFrame.


# count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
print(df2)

# Output:
# 3

# Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
print(df2)

# Output:
# 0

7. Get Count Duplicates for Each Unique Row

You can also use groupby on all the columns and call size to get the duplicate values. It will return the count of the duplicate values of each unique row of a given DataFrame. For examples,


# Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
print(df2)

# Output:
#   Courses    Fee Duration  Discount  size
# 0  Hadoop  22000   50days      1000     1
# 1  Pandas  24000   40days      2300     1
# 2   Spark  22000   30days      1000     1
# 3   Spark  22000   30days      2500     1

8. Complete Example For Count Duplicates in DataFrame


# Complete Example For Count Duplicates in DataFrame
import pandas as pd
import numpy as np
technologies= ({
    'Courses':["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
    'Fee' :[22000,25000,22000,24000,22000,24000,25000],
    'Duration':['30days',np.nan,'50days','40days','30days', None,np.nan],
    'Discount':[1000,2300,1000,2300,2500,1500,2000]
              })
df = pd.DataFrame(technologies)
print(df)

# Get count duplicates single column using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
print(df2)

# Get count duplicates multiple columns using dataframe.pivot_table() 
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
print(df2)

# Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)

# Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)

# Count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
print(df2)

# Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
print(df2)

# Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
print(df2)

9. Conclusion

In this article, I have explained how to count duplicates in pandas DataFrame by using DataFrame.pivot_table() function with examples like counting duplicates from a single column, multiple columns, counting with NaN etc.

Happy Learning !!

Related Articles

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium