You can count duplicates in pandas DataFrame by using DataFrame.pivot_table()
function. This function counts the number of duplicate entries in a single column, multiple columns, and count duplicates when having NaN values in the DataFrame. In this article, I will explain how to count duplicates in pandas DataFrame with examples.
1. Quick Examples of Count Duplicates in DataFrame
If you are in a hurry, below are some quick examples of how to count duplicates in DataFrame.
# Below are the quick examples
# Example 1: Get count duplicates single column
# Using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
# Example 2: Get count duplicates multiple columns
# Using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
# Example 3: Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
# Example 4: Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
# Example 5: Count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
# Example 6: Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
# Example 7: Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
Now, Let’s create Pandas DataFrame using data from a Python dictionary, where the columns are Courses
, Fee
, Duration
and Discount
.
# Get The Count Duplicates in DataFrame
import pandas as pd
import numpy as np
technologies= ({
'Courses':["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
'Fee' :[22000,25000,22000,24000,22000,24000,25000],
'Duration':['30days',np.nan,'50days','40days','30days', None,np.nan],
'Discount':[1000,2300,1000,2300,2500,1500,2000]
})
df = pd.DataFrame(technologies)
print(df)
Yields below output.
# Output:
Courses Fee Duration Discount
0 Spark 22000 30days 1000
1 PySpark 25000 NaN 2300
2 Hadoop 22000 50days 1000
3 Pandas 24000 40days 2300
4 Spark 22000 30days 2500
5 Pandas 24000 None 1500
6 Hadoop 25000 NaN 2000
2. Pandas Count Duplicates
We can use DataFrame.pivot_table() function to count the duplicates in a single column. Set index
parameter as a list with a column along with aggfunc=size
into pivot_table()
function, it will return the count of the duplicate values of a specified single column of a given DataFrame.
# Get count duplicates single column using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
print(df2)
Yields below output.
# Output:
Courses
Hadoop 2
Pandas 2
PySpark 1
Spark 2
dtype: int64
3. Get Count Duplicates of Multiple Columns
We can also use DataFrame.pivot_table() function to count the duplicates in multiple columns. For that, set index
parameter as a list of multiple columns along with aggfunc=size
into pivot_table()
function, it will return the count of the duplicate values of specified multiple columns of a given DataFrame.
# Get count duplicates multiple columns using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
print(df2)
Yields below output.
# Output:
Courses Fee
Hadoop 22000 1
25000 1
Pandas 24000 2
PySpark 25000 1
Spark 22000 2
dtype: int64
4. Get Count Duplicates When having NaN Values
Let’s also learn how to count duplicate NaN values in a DataFrame using pivot_table()
function. First, let’s see what happens when we have NaN values on a column you are checking for duplicates.
# Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)
Yields below output. Notice that this function has ignored all duplicate NaN values.
# Output:
Duration
30days 2
40days 1
50days 1
dtype: int64
5. Get Count Duplicate null Values Using fillna()
We can use fillna() function to assign null value for a NaN and then call the pivot_table()
function, It will return the count of the duplicate null values of a given DataFrame.
# Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)
Yields below output.
# Output:
Duration
30days 2
40days 1
50days 1
NULL 3
dtype: int64
6. Get Count Duplicate Rows in Pandas DataFrame
Similarly, If you like to count duplicates on particular row or entire DataFrame using len()
function, this will return the count of the duplicate single rows and entire DataFrame.
# count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
print(df2)
# Output:
# 3
# Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
print(df2)
# Output:
# 0
7. Get Count Duplicates for Each Unique Row
You can also use groupby on all the columns and call size
to get the duplicate values. It will return the count of the duplicate values of each unique row of a given DataFrame. For examples,
# Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
print(df2)
# Output:
# Courses Fee Duration Discount size
# 0 Hadoop 22000 50days 1000 1
# 1 Pandas 24000 40days 2300 1
# 2 Spark 22000 30days 1000 1
# 3 Spark 22000 30days 2500 1
8. Complete Example For Count Duplicates in DataFrame
# Complete Example For Count Duplicates in DataFrame
import pandas as pd
import numpy as np
technologies= ({
'Courses':["Spark","PySpark","Hadoop","Pandas","Spark","Pandas","Hadoop"],
'Fee' :[22000,25000,22000,24000,22000,24000,25000],
'Duration':['30days',np.nan,'50days','40days','30days', None,np.nan],
'Discount':[1000,2300,1000,2300,2500,1500,2000]
})
df = pd.DataFrame(technologies)
print(df)
# Get count duplicates single column using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses'], aggfunc ='size')
print(df2)
# Get count duplicates multiple columns using dataframe.pivot_table()
df2 = df.pivot_table(index = ['Courses', 'Fee'], aggfunc ='size')
print(df2)
# Get count duplicates When having nan values
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)
# Get count duplicate null using fillna()
df['Duration'] = df['Duration'].fillna('NULL')
df2 = df.pivot_table(columns=['Duration'], aggfunc='size')
print(df2)
# Count duplicate values in courses column
df2 = len(df['Courses'])-len(df['Courses'].drop_duplicates())
print(df2)
# Get count duplicate rows
df2 = len(df)-len(df.drop_duplicates())
print(df2)
# Get count duplicates for each unique row
df2 = df.groupby(df.columns.tolist(), as_index=False).size()
print(df2)
9. Conclusion
In this article, I have explained how to count duplicates in pandas DataFrame by using DataFrame.pivot_table()
function with examples like counting duplicates from a single column, multiple columns, counting with NaN e.t.c.
Happy Learning !!
Related Articles
- Pandas DataFrame count() Function
- Pandas Get List of All Duplicate Rows
- Pandas Count Distinct Values DataFrame
- Count NaN Values in Pandas DataFrame
- How to Create Pandas Pivot Table Count
- Pandas Count Unique Values in Column
- Pandas groupby() and count() with Examples
- Pandas Get Count of Each Row of DataFrame
- Pandas Drop Duplicate Rows in DataFrame
- Pandas Count The Frequency of a Value in Column