You can get the count distinct values (equivalent to SQL count(distinct) ) in pandas using DataFrame.groupby(), nunique()
, DataFrame.agg(), DataFrame.transform()
, pandas.crosstab()
, Series.value_counts()
and pandas.pivot_table()
method.
In Pandas, you can use groupby()
with the combination of nunique()
, agg()
, crosstab()
, pivot()
, transform()
and Series.value_counts()
methods. In this article, I will cover how to get count distinct values of single and multiple columns of pandas DataFrame
1. Quick Examples of Count Distinct Values
If you are in a hurry, below are some quick examples of how to find count distinct values in pandas DataFrame.
# Below are the quick examples
# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
# Get count of unique values in column pandas
df2 = df.groupby('Date')['Courses'].nunique()
# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
# Use DataFrame.groupby() and nunique() to multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
# Min
df2 = df.groupby('Date')['Fee'].transform('min')
# Max
df2 = df.groupby('Date')['Fee'].transform('max')
# Average
df2 = df.groupby('Date')['Fee'].transform('mean')
# Count
df2 = df.groupby('Date')['Fee'].transform('count')
# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
Now, let’s create a DataFrame with a few rows and columns, execute these examples and validate results. Our DataFrame contains column names Courses
, Fee
, Duration
, Discount
, and Date
.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200],
'Date':["2020-11-14","2020-11-17","2021-11-15","2021-11-17","2021-11-14","2021-11-14","2021-11-17","2021-11-15","2021-11-14"]
})
df = pd.DataFrame(technologies)
print(df)
Yields below output.
# Output:
Courses Fee Duration Discount Date
0 Spark 22000 30days 1000 2020-11-14
1 PySpark 25000 50days 2300 2020-11-17
2 Hadoop 23000 55days 1000 2021-11-15
3 Python 24000 40days 1200 2021-11-17
4 Pandas 26000 60days 2500 2021-11-14
5 Hadoop 25000 35days 1200 2021-11-14
6 Spark 25000 30days 1400 2021-11-17
7 Python 22000 50days 1000 2021-11-15
8 Spark 25000 40days 1200 2021-11-14
2. Use nunique() to get Count Distinct Values in Pandas
If you have SQL background you probably would have run count(distinct) to get the count of distinct values, you can achieve the same in pandas by grouping a column and then get unique values for each group along with count. Use the .nunique()
method to get unique. for instance, df.groupby('Date')['Courses'].nunique()
.
# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
print(df2)
# Get count of unique values in column pandas
df2 = df.groupby('Date')['Courses'].nunique()
print(df2)
Yields below output.
# Output:
Date
2020-11-14 1
2020-11-17 1
2021-11-14 3
2021-11-15 2
2021-11-17 2
Name: Courses, dtype: int64
3. Use nunique() for Multiple Columns To get Count Distinct
You can also use nunique, count
on multiple columns to get a distinct count on multiple columns.
# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
print(df2)
Yields below output.
# Output:
Date Fee
nunique count nunique count
Date
2020-11-14 1 1 1 1
2020-11-17 1 1 1 1
2021-11-14 1 3 2 3
2021-11-15 1 2 2 2
2021-11-17 1 2 2 2
Let’s see another example.
# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
print(df2)
Yields below output.
# Output:
Courses Fee
nunique sum count
Date
2020-11-14 1 22000 1
2020-11-17 1 25000 1
2021-11-14 3 76000 3
2021-11-15 2 45000 2
2021-11-17 2 49000 2
Here is an approach to get count distinct over multiple columns. Now, list the columns of interest and use groupby in a slightly modified syntax: df[columns].groupby(df['Courses']).nunique()
.
# Use DataFrame.groupby() and nunique() to multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
print(df2)
Yields below output.
# Output:
Date Fee
Courses
Hadoop 2 2
Pandas 1 1
PySpark 1 1
Python 2 2
Spark 3 2
4. Use Series.value_counts() Method
Here is another method to get a distinct count and it is much simpler by using df.Date.value_counts()
.
# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
print(df2)
Yields below output.
# Output:
2021-11-14 3
2021-11-17 2
2021-11-15 2
2020-11-14 1
2020-11-17 1
Name: Date, dtype: int64
5. Use DataFrame.groupby() & transform() Method
You can also using nunique
but it will be very helpful if you have to use an aggregate function like count(distinct)
, 'min'
, 'max'
, 'count'
or 'mean'
etc.
# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
print(df2)
# Min
df2 = df.groupby('Date')['Fee'].transform('min')
print(df2)
# Max
df2 = df.groupby('Date')['Fee'].transform('max')
print(df2)
# Average
df2 = df.groupby('Date')['Fee'].transform('mean')
print(df2)
# Count
df2 = df.groupby('Date')['Fee'].transform('count')
print(df2)
6. Use pandas.crosstab() Method
Pandas provide multiple ways of achieving the same/similar results. Using crosstab() method
, this will return more information than groupby
, nunique
. In this example we are going to use the method crosstab
.
# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
print(df2)
Yields below output.
# Output:
Date 2020-11-14 2020-11-17 2021-11-14 2021-11-15 2021-11-17
Courses
Hadoop 0 0 1 1 0
Pandas 0 0 1 0 0
PySpark 0 1 0 0 0
Python 0 0 0 1 1
Spark 1 0 1 0 1
You can summarize the information into unique values by using .ne(0)
.sum(1)
.
# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
print(df2)
Yields below output.
# Output:
Courses
Hadoop 2
Pandas 1
PySpark 1
Python 2
Spark 3
dtype: int64
7. Use pandas.pivot_table() and nunique Series Function
You can create a pivot table and use the nunique series function
# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
print(df2)
Yields below output.
# Output:```
Date Discount Duration Fee
Courses
Hadoop 2 2 2 2
Pandas 1 1 1 1
PySpark 1 1 1 1
Python 2 2 2 2
Spark 3 3 2 2
8. Complete Example For Count Distinct Values
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200],
'Date':["2020-11-14","2020-11-17","2021-11-15","2021-11-17","2021-11-14","2021-11-14","2021-11-17","2021-11-15","2021-11-14"]
})
df = pd.DataFrame(technologies)
print(df)
# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
print(df2)
# Get count of unique values in column pandas
df2 = df.groupby('Date')['Courses'].nunique()
print(df2)
# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
print(df2)
# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
print(df2)
# Use DataFrame.groupby() and nunique() to multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
print(df2)
# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
print(df2)
# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
print(df2)
# Min
df2 = df.groupby('Date')['Fee'].transform('min')
print(df2)
# Max
df2 = df.groupby('Date')['Fee'].transform('max')
print(df2)
# Average
df2 = df.groupby('Date')['Fee'].transform('mean')
print(df2)
# Count
df2 = df.groupby('Date')['Fee'].transform('count')
print(df2)
# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
print(df2)
# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
print(df2)
# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
print(df2)
Conclusion
In this article, you have learned how to pandas get count distinct values using DataFrame.groupby()
, nunique()
, pandas.crosstab()
, DataFrame.agg()
, DataFrame.transform()
, Series.value_counts()
and pandas.pivot_table()
method with examples.
Happy Learning !!
Related Articles
- Select Pandas DataFrame Rows Between Two Dates
- Groupby Columns and Get Count in Pandas
- Normalize Columns of Pandas DataFrame
- Get Statistics For Each Group by pandas DataFrame
- Pandas Filter Rows Using IN and NOT IN Like SQL
- Pandas Count Unique Values in Column
- Pandas Count The Frequency of a Value in Column
- How to Count Duplicates in Pandas DataFrame
- Pandas DataFrame count() Function
- How to Get Pandas Columns Count
- Pandas DataFrame count() Function
- Count NaN Values in Pandas DataFrame
- How to Create Pandas Pivot Table Count
- Pandas groupby() and count() with Examples
- Pandas Get Count of Each Row of DataFrame
- How to add/insert row to Pandas DataFrame?
- Pandas Count Rows with Condition