How to count the unique values of a column in a pandas DataFrame? – When working on machine learning or data analysis with pandas, we often need to get the count of unique (distinct) values from a single column or from multiple columns.
You can get the number of unique values in a DataFrame column in several ways, for example with Series.unique().size, Series.nunique(), or Series.drop_duplicates().size. Since a DataFrame column is internally represented as a Series, you can use these Series functions to perform the operation.
1. Quick Examples of Counting Unique Values in a Column
Following are quick examples of how to count unique values in a column.
# Below are quick examples
# Get Unique Count using Series.unique()
count = df.Courses.unique().size
# Using Series.nunique()
count = df.Courses.nunique()
# Get frequency of each value
frequency = df.Courses.value_counts()
# By using drop_duplicates()
count = df.Courses.drop_duplicates().size
# Count unique on multiple columns
count = df[['Courses','Fee']].drop_duplicates().shape[0]
print(count)
# Count unique on multiple columns using nunique()
count = df[['Courses','Fee']].nunique()
# Count unique values in each row
# df.nunique(axis=1)
Let’s create a DataFrame.
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Python","Pandas","Python","Spark","Pandas"],
    'Fee':[20000,25000,22000,30000,25000,20000,30000],
    'Duration':['30days','40days','35days','50days','40days','30days','50days'],
    'Discount':[1000,2300,1200,2000,2300,1000,2000]
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Python  22000   35days      1200
3   Pandas  30000   50days      2000
4   Python  25000   40days      2300
5    Spark  20000   30days      1000
6   Pandas  30000   50days      2000
2. Pandas Count Unique Values
To count the unique values in a pandas DataFrame column, use the Series.unique() function and then call size on the result to get the count. Series.unique() returns all unique values from a column by removing duplicates; it returns an ndarray with the unique values in their order of appearance (the result is not sorted).
Syntax: Series.unique()
Example:
# Get Unique Count using Series.unique()
count = df.Courses.unique().size
print("Unique values count : "+ str(count))
# Output
# Unique values count : 4
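If you also want to see the unique values themselves rather than just the count, you can print the ndarray returned by Series.unique(); for the sample DataFrame above it holds the four course names in their order of appearance.
# Inspect the unique values returned by Series.unique()
print(df.Courses.unique())
# Output
# ['Spark' 'PySpark' 'Python' 'Pandas']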
3. Series.nunique()
Alternatively, you can use Series.nunique(), which returns the number of unique elements in the Series, excluding NaN values. If you want to include NaN values in the count, set the dropna parameter to False.
Syntax: Series.nunique(dropna=True)
Example:
# Using Series.nunique()
count = df.Courses.nunique()
print("Unique values count : "+ str(count))
# Outputs
# Unique values count : 4
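To illustrate the dropna parameter, here is a minimal sketch using a small standalone Series with a NaN value (this Series is not part of the sample DataFrame above); by default the NaN is excluded from the count, while dropna=False counts it as one more distinct value.
# Illustrate dropna with a small Series containing NaN (hypothetical data)
ser = pd.Series(["Spark", "Python", np.nan, "Spark"])
print(ser.nunique())              # 2 - NaN excluded by default
print(ser.nunique(dropna=False))  # 3 - NaN counted as a distinct value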
4. Using Series.drop_duplicates()
Use Series.drop_duplicates() to remove duplicate values from the column and then use size on the result to get the count.
# By using drop_duplicates()
count = df.Courses.drop_duplicates().size
print("Unique values count : "+ str(count))
# Outputs
# Unique values count : 4
5. Pandas Count Unique Values in Multiple Columns
To get the count of unique values across multiple columns, use pandas DataFrame.drop_duplicates(), which drops duplicate rows and returns a DataFrame containing only unique rows. On the result, use the shape property, which returns a tuple of (rows, columns); shape[0] gives the row count.
# Count unique on multiple columns
count = df[['Courses','Fee']].drop_duplicates().shape[0]
print("Unique multiple columns : "+ str(count))
# Outputs
# Unique multiple columns : 5
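Note that DataFrame.nunique() behaves differently on the same column selection: instead of counting unique row combinations, it returns the number of unique values per column as a Series.
# Count unique values per column with DataFrame.nunique()
count = df[['Courses','Fee']].nunique()
print(count)
# Output
# Courses    4
# Fee        4
# dtype: int64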
6. Get How Many Times a Value Occurs
If you want to get the frequency of each value in a column, use Series.value_counts(). This returns the count of occurrences of each distinct value in the column.
print(df.Courses.value_counts())
# Outputs
Spark 2
Python 2
Pandas 2
PySpark 1
Name: Courses, dtype: int64
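If you only need the frequency of a single value, you can index into the result of value_counts() or sum a boolean comparison; both lines below use the 'Spark' value from the sample data.
# Get the frequency of a single value
print(df.Courses.value_counts()['Spark'])   # 2
print((df.Courses == 'Spark').sum())        # 2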
7. Count Unique Values in Each Row
Use DataFrame.nunique() with axis=1 to get the count of unique values in each row.
# Count unique values in each row
df.nunique(axis=1)
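For the sample DataFrame above, every row holds four distinct values, so each row yields a count of 4; a quick sketch to print the result:
# Print the per-row unique counts
print(df.nunique(axis=1))
# For this sample data the result is a Series of 4s indexed 0 through 6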
8. Complete Example
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Python","Pandas","Python","Spark","Pandas"],
    'Fee':[20000,25000,22000,30000,25000,20000,30000],
    'Duration':['30days','40days','35days','50days','40days','30days','50days'],
    'Discount':[1000,2300,1200,2000,2300,1000,2000]
}
df = pd.DataFrame(technologies)
print(df)
# Get Unique Count using Series.unique()
count = df.Courses.unique().size
print("Unique values count : "+ str(count))
# Using Series.nunique()
count = df.Courses.nunique()
print("Unique values count : "+ str(count))
# Get frequency of each value
frequency = df.Courses.value_counts()
print(frequency)
# By using drop_duplicates()
count = df.Courses.drop_duplicates().size
print("Unique values count : "+ str(count))
# Count unique on multiple columns
count = df[['Courses','Fee']].drop_duplicates().shape[0]
print("Unique multiple columns : "+ str(count))
# Count unique on multiple columns using nunique()
count = df[['Courses','Fee']].nunique()
print(count)
# Count unique values in each row
print(df.nunique(axis=1))
Conclusion
In this article, you have learned how to get the count of unique values in a pandas DataFrame column using Series.unique(), Series.nunique(), and Series.drop_duplicates(), how to get the frequency of each value with Series.value_counts(), and how to count distinct values across multiple columns.