You can get the count of distinct values (equivalent to SQL count(distinct)) in pandas using DataFrame.groupby() with nunique(), DataFrame.agg(), DataFrame.transform(), pandas.crosstab(), Series.value_counts(), and pandas.pivot_table(). In this article, I will cover how to get the count of distinct values for single and multiple columns of a pandas DataFrame.
Key Points –
- Pandas’ nunique() method efficiently calculates the count of unique values in DataFrame columns.
- The nunique() function returns the number of unique elements in a DataFrame column and can be applied to a single column or to the entire DataFrame.
- nunique() excludes NaN values by default but can include them with the dropna=False parameter, as shown in the sketch after this list.
- nunique() is useful for understanding the cardinality of categorical data and identifying unique values in a dataset.
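As a quick sketch of these points, here is a tiny, made-up column (not the article's sample data) showing nunique() on a single column, on the whole DataFrame, and with dropna=False:
import pandas as pd
# Tiny illustrative DataFrame with a NaN value
df_small = pd.DataFrame({'Courses': ["Spark", "PySpark", "Spark", None]})
print(df_small['Courses'].nunique())              # 2 -> NaN excluded by default
print(df_small['Courses'].nunique(dropna=False))  # 3 -> NaN counted as a distinct value
print(df_small.nunique())                         # distinct count for every column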
1. Quick Examples of Count Distinct Values
If you are in a hurry, below are some quick examples of how to get the count of distinct values in a Pandas DataFrame.
# Quick examples of count distinct values
# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
# Get count of unique values in column pandas
df2 = df.groupby('Date')['Courses'].nunique()
# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
# Use DataFrame.groupby() and nunique() to multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
# Min
df2 = df.groupby('Date')['Fee'].transform('min')
# Max
df2 = df.groupby('Date')['Fee'].transform('max')
# Average
df2 = df.groupby('Date')['Fee'].transform('mean')
# Count
df2 = df.groupby('Date')['Fee'].transform('count')
# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Duration, Discount, and Date.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200],
'Date':["2020-11-14","2020-11-17","2021-11-15","2021-11-17","2021-11-14","2021-11-14","2021-11-17","2021-11-15","2021-11-14"]
})
df = pd.DataFrame(technologies)
print(df)
Yields below output.
# Output:
Courses Fee Duration Discount Date
0 Spark 22000 30days 1000 2020-11-14
1 PySpark 25000 50days 2300 2020-11-17
2 Hadoop 23000 55days 1000 2021-11-15
3 Python 24000 40days 1200 2021-11-17
4 Pandas 26000 60days 2500 2021-11-14
5 Hadoop 25000 35days 1200 2021-11-14
6 Spark 25000 30days 1400 2021-11-17
7 Python 22000 50days 1000 2021-11-15
8 Spark 25000 40days 1200 2021-11-14
2. Use nunique() to get Count Distinct Values in Pandas
If you have a SQL background, you have probably run count(distinct) to get the number of distinct values. You can achieve the same in pandas by grouping a column and then counting the unique values for each group with the .nunique() method. For instance, df.groupby('Date')['Courses'].nunique().
# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
print(df2)
# Get count of unique values in column pandas
df2 = df.groupby('Date')['Courses'].nunique()
print(df2)
Yields below output.
# Output:
Date
2020-11-14 1
2020-11-17 1
2021-11-14 3
2021-11-15 2
2021-11-17 2
Name: Courses, dtype: int64
3. Use nunique() for Multiple Columns To get Count Distinct
You can also apply nunique and count to multiple columns at once using agg() to get a count(distinct) equivalent for each of them.
# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
print(df2)
Yields below output.
# Output:
Date Fee
nunique count nunique count
Date
2020-11-14 1 1 1 1
2020-11-17 1 1 1 1
2021-11-14 1 3 2 3
2021-11-15 1 2 2 2
2021-11-17 1 2 2 2
Let’s see another example.
# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
print(df2)
Yields below output.
# Output:
Courses Fee
nunique sum count
Date
2020-11-14 1 22000 1
2020-11-17 1 25000 1
2021-11-14 3 76000 3
2021-11-15 2 45000 2
2021-11-17 2 49000 2
Here is another approach to get a distinct count over multiple columns: list the columns of interest and group them by another column using a slightly different syntax, df[columns].groupby(df['Courses']).nunique(). An equivalent form is shown after the output below.
# Use DataFrame.groupby() and nunique() to multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
print(df2)
Yields below output.
# Output:
Date Fee
Courses
Hadoop 2 2
Pandas 1 1
PySpark 1 1
Python 2 2
Spark 3 2
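An equivalent and perhaps more common spelling (a minor variation, not part of the original examples) groups the full DataFrame by 'Courses' and then selects the columns of interest:
# Equivalent: group by Courses, select columns, then count distinct values
df2 = df.groupby('Courses')[['Date', 'Fee']].nunique()
print(df2)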
4. Use Series.value_counts() Method
Another simple method is Series.value_counts(), which returns the number of occurrences of each distinct value in a column, for example df.Date.value_counts().
# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
print(df2)
Yields below output.
# Output:
2021-11-14 3
2021-11-17 2
2021-11-15 2
2020-11-14 1
2020-11-17 1
Name: Date, dtype: int64
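The length of this result is the number of distinct dates, so if you need the count(distinct) itself rather than the frequency of each date, you can take it directly; a small follow-up sketch:
# Number of distinct dates = length of the value_counts() result
print(df.Date.value_counts().count())   # 5
print(len(df.Date.value_counts()))      # 5, equivalent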
5. Use DataFrame.groupby() & transform() Method
You can also get a count(distinct) per group with transform('nunique'). Unlike the aggregation approaches above, transform() returns a Series with the same length as the original DataFrame, and the same pattern works with other aggregate functions such as 'min', 'max', 'count', or 'mean'.
# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
print(df2)
# Min
df2 = df.groupby('Date')['Fee'].transform('min')
print(df2)
# Max
df2 = df.groupby('Date')['Fee'].transform('max')
print(df2)
# Average
df2 = df.groupby('Date')['Fee'].transform('mean')
print(df2)
# Count
df2 = df.groupby('Date')['Fee'].transform('count')
print(df2)
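Because transform() keeps the original row index, a common use is to attach the per-group distinct count back onto the DataFrame as a new column. A minimal sketch; the DistinctFees column name is only for illustration:
# Attach the per-Date distinct count of Fee to every row (on a copy, leaving df unchanged)
df2 = df.assign(DistinctFees=df.groupby('Date')['Fee'].transform('nunique'))
print(df2[['Date', 'Fee', 'DistinctFees']])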
6. Use pandas.crosstab() Method
Pandas provides multiple ways of achieving the same or similar results. The pandas.crosstab() method returns more information than groupby() with nunique(): it builds a frequency table showing how many rows have each combination of Courses and Date. In this example, we build the frequency table first and then reduce it to a distinct count per course.
# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
print(df2)
Yields below output.
# Output:
Date 2020-11-14 2020-11-17 2021-11-14 2021-11-15 2021-11-17
Courses
Hadoop 0 0 1 1 0
Pandas 0 0 1 0 0
PySpark 0 1 0 0 0
Python 0 0 0 1 1
Spark 1 0 1 0 1
You can reduce this table to a distinct count per course: .ne(0) marks every non-zero cell as True, and .sum(1) sums those booleans across each row, giving the number of distinct dates on which each course appears.
# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
print(df2)
Yields below output.
# Output:
Courses
Hadoop 2
Pandas 1
PySpark 1
Python 2
Spark 3
dtype: int64
7. Use pandas.pivot_table() and nunique Series Function
Similarly, you can use pivot_table() along with the nunique() function to calculate the count of unique values for each column, grouped by a specific column.
# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
print(df2)
Yields below output.
# Output:
Date Discount Duration Fee
Courses
Hadoop 2 2 2 2
Pandas 1 1 1 1
PySpark 1 1 1 1
Python 2 2 2 2
Spark 3 3 2 2
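On reasonably recent pandas versions you can usually pass the aggregation by name instead of the Series method; a small variation of the example above (treat the string aggfunc as an assumption about your pandas version):
# Same pivot, passing the aggregation as a string
df2 = pd.pivot_table(data=df, index='Courses', aggfunc='nunique')
print(df2)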
8. Complete Example For Count Distinct Values
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200],
'Date':["2020-11-14","2020-11-17","2021-11-15","2021-11-17","2021-11-14","2021-11-14","2021-11-17","2021-11-15","2021-11-14"]
})
df = pd.DataFrame(technologies)
print(df)
# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
print(df2)
# Get count of unique values in column pandas
df2 = df.groupby('Date')['Courses'].nunique()
print(df2)
# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
print(df2)
# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
print(df2)
# Use DataFrame.groupby() and nunique() to multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
print(df2)
# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
print(df2)
# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
print(df2)
# Min
df2 = df.groupby('Date')['Fee'].transform('min')
print(df2)
# Max
df2 = df.groupby('Date')['Fee'].transform('max')
print(df2)
# Average
df2 = df.groupby('Date')['Fee'].transform('mean')
print(df2)
# Count
df2 = df.groupby('Date')['Fee'].transform('count')
print(df2)
# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
print(df2)
# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
print(df2)
# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
print(df2)
Frequently Asked Questions on Counting Distinct Values in a Pandas DataFrame
What does it mean to count distinct values in a DataFrame?
Counting distinct values in a DataFrame means determining the number of unique values present in one or more columns of the DataFrame.
How do you count distinct values in a DataFrame?
You can use the nunique() method in Pandas to count the number of unique values in one or more columns of a DataFrame.
Does nunique() include NaN values?
By default, nunique() excludes NaN values when counting distinct values. However, you can include NaN values by setting the dropna=False parameter.
Can you count distinct values for multiple columns at once?
You can count distinct values for multiple columns simultaneously by applying the nunique() method to the entire DataFrame, or by using it in combination with grouping functions like groupby().
Can nunique() be applied to string columns?
The nunique() method can be applied to string columns in a DataFrame; it counts the number of unique strings present in the specified column(s). Here’s how you can apply nunique() to a string column.
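A minimal sketch using the Courses string column from the sample DataFrame above:
# Count distinct strings in the Courses column
print(df['Courses'].nunique())   # 5 -> Spark, PySpark, Hadoop, Python, Pandas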
How do you count distinct values for each group?
You can count distinct values for each group in a DataFrame by using the groupby() function in combination with the nunique() method.
Conclusion
In this article, you have learned how to get the count of distinct values in pandas using DataFrame.groupby(), nunique(), pandas.crosstab(), DataFrame.agg(), DataFrame.transform(), Series.value_counts(), and pandas.pivot_table() with examples.
Happy Learning !!
Related Articles
- Select Pandas DataFrame Rows Between Two Dates
- Groupby Columns and Get Count in Pandas
- Normalize Columns of Pandas DataFrame
- Pandas Count Unique Values in Column
- Pandas Count The Frequency of a Value in Column
- How to Count Duplicates in Pandas DataFrame
- Pandas DataFrame count() Function
- How to Get Pandas Columns Count
- How to Change Column Name in Pandas
- Count NaN Values in Pandas DataFrame
- How to Create Pandas Pivot Table Count
- Pandas groupby() and count() with Examples
- Pandas Get Count of Each Row of DataFrame
- How to add/insert row to Pandas DataFrame?
- Pandas Count Rows with Condition
- Get Statistics For Each Group by pandas DataFrame
- Pandas Filter Rows Using IN and NOT IN Like SQL