Pandas Count Distinct Values DataFrame

You can get the count of distinct values (equivalent to SQL count(distinct)) in pandas using DataFrame.groupby() with nunique(), DataFrame.agg(), DataFrame.transform(), pandas.crosstab(), Series.value_counts(), and pandas.pivot_table().

In this article, I will cover how to get the count of distinct values for single and multiple columns of a pandas DataFrame using these methods.

1. Quick Examples of Count Distinct Values

If you are in a hurry, below are some quick examples of how to get the count of distinct values in a pandas DataFrame.


# Below are the quick examples 

# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()

# Get count of unique values in a column
df2 = df.groupby('Date')['Courses'].nunique()  

# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})

# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})

# Use DataFrame.groupby() and nunique() on multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique() 

# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()

# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')

# Min
df2 = df.groupby('Date')['Fee'].transform('min')     

# Max
df2 = df.groupby('Date')['Fee'].transform('max')     

# Average
df2 = df.groupby('Date')['Fee'].transform('mean')   

# Count
df2 = df.groupby('Date')['Fee'].transform('count')                                   

# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)

# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)

# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)

Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Duration, Discount, and Date.


import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","Spark"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200],
    'Date':["2020-11-14","2020-11-17","2021-11-15","2021-11-17","2021-11-14","2021-11-14","2021-11-17","2021-11-15","2021-11-14"]
          })
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount        Date
0    Spark  22000   30days      1000  2020-11-14
1  PySpark  25000   50days      2300  2020-11-17
2   Hadoop  23000   55days      1000  2021-11-15
3   Python  24000   40days      1200  2021-11-17
4   Pandas  26000   60days      2500  2021-11-14
5   Hadoop  25000   35days      1200  2021-11-14
6    Spark  25000   30days      1400  2021-11-17
7   Python  22000   50days      1000  2021-11-15
8    Spark  25000   40days      1200  2021-11-14

2. Use nunique() to get Count Distinct Values in Pandas

If you have a SQL background, you have probably run count(distinct) to get the count of distinct values. You can achieve the same in pandas by grouping on a column and then counting the unique values within each group using the nunique() method, for instance df.groupby('Date')['Courses'].nunique().


# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
print(df2)

# Get count of unique values in a column
df2 = df.groupby('Date')['Courses'].nunique()  
print(df2)

Yields below output.


# Output:
Date
2020-11-14    1
2020-11-17    1
2021-11-14    3
2021-11-15    2
2021-11-17    2
Name: Courses, dtype: int64
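
If you prefer the result as a DataFrame with Date as a regular column, you can chain reset_index(). A minimal sketch; the column name DistinctCourses is just illustrative:


# A sketch: turn the grouped Series into a DataFrame with a named count column
df2 = df.groupby('Date')['Courses'].nunique().reset_index(name='DistinctCourses')
print(df2)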

3. Use nunique() for Multiple Columns To get Count Distinct

You can also combine nunique and count over multiple columns with agg() to get the distinct count and the regular count for several columns at once.


# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
print(df2)

Yields below output.


# Output:
              Date           Fee      
           nunique count nunique count
Date                                  
2020-11-14       1     1       1     1
2020-11-17       1     1       1     1
2021-11-14       1     3       2     3
2021-11-15       1     2       2     2
2021-11-17       1     2       2     2

Let’s see another example.


# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
print(df2) 

Yields below output.


# Output:
           Courses    Fee      
           nunique    sum count
Date                           
2020-11-14       1  22000     1
2020-11-17       1  25000     1
2021-11-14       3  76000     3
2021-11-15       2  45000     2
2021-11-17       2  49000     2
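
The agg() call above produces a MultiIndex on the columns. If you prefer flat, single-level column names, one common pattern is to join the levels with an underscore. A minimal sketch; df3 and the joined names are just illustrative:


# A sketch: flatten the MultiIndex columns from agg() into names
# like 'Courses_nunique', 'Fee_sum', 'Fee_count'
df3 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee': ['sum', 'count']})
df3.columns = ['_'.join(col) for col in df3.columns]
print(df3)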

Here is another approach to get a distinct count over multiple columns: list the columns of interest and group them by a Series in a slightly modified syntax, for example df[columns].groupby(df['Courses']).nunique().


# Use DataFrame.groupby() and nunique() on multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
print(df2)  

Yields below output.


# Output:
         Date  Fee
Courses           
Hadoop      2    2
Pandas      1    1
PySpark     1    1
Python      2    2
Spark       3    2
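
For comparison, the same result can be written with the more common syntax that passes the grouping column by name (a minimal sketch):


# A sketch: equivalent distinct counts using the column name as the group key
df2 = df.groupby('Courses')[['Date', 'Fee']].nunique()
print(df2)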

4. Use Series.value_counts() Method

Another simple option is Series.value_counts(), for example df.Date.value_counts(). Note that this returns the number of rows for each unique value (similar to SQL GROUP BY ... COUNT(*)), not the distinct count of another column.


# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
print(df2)

Yields below output.


# Output:
2021-11-14    3
2021-11-17    2
2021-11-15    2
2020-11-14    1
2020-11-17    1
Name: Date, dtype: int64
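
If you only need the total number of distinct values in a column (a single number rather than per-value counts), Series.nunique() on the column itself is enough. A minimal sketch:


# A sketch: total number of distinct dates in the column
print(df['Date'].nunique())  # 5 for this example DataFrame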

5. Use DataFrame.groupby() & transform() Method

You can also use transform() with 'nunique'. Unlike agg(), transform() returns a result aligned to the original rows, which is helpful when you need per-row values from aggregate functions such as count(distinct) ('nunique'), 'min', 'max', 'count', or 'mean'.


# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
print(df2)

# Min
df2 = df.groupby('Date')['Fee'].transform('min')     
print(df2)

# Max
df2 = df.groupby('Date')['Fee'].transform('max')     
print(df2)

# Average
df2 = df.groupby('Date')['Fee'].transform('mean')   
print(df2) 

# Count
df2 = df.groupby('Date')['Fee'].transform('count')  
print(df2)
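
Because transform() returns a Series aligned to the original row index, a common pattern is to attach the result as a new column. A minimal sketch; FeeNuniquePerDate is just an illustrative column name:


# A sketch: attach the per-group distinct count as a new column
df2 = df.assign(FeeNuniquePerDate=df.groupby('Date')['Fee'].transform('nunique'))
print(df2[['Date', 'Fee', 'FeeNuniquePerDate']])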

6. Use pandas.crosstab() Method

Pandas provides multiple ways of achieving the same or similar results. The crosstab() method returns more information than groupby() with nunique(): it shows how many rows exist for every combination of Courses and Date.


# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
print(df2) 

Yields below output.


# Output:
Date     2020-11-14  2020-11-17  2021-11-14  2021-11-15  2021-11-17
Courses                                                            
Hadoop            0           0           1           1           0
Pandas            0           0           1           0           0
PySpark           0           1           0           0           0
Python            0           0           0           1           1
Spark             1           0           1           0           1

You can reduce this cross-tabulation to a distinct count per course by using .ne(0).sum(1), which marks the non-zero cells as True and sums them across each row.


# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
print(df2)

Yields below output.


# Output:
Courses
Hadoop     2
Pandas     1
PySpark    1
Python     2
Spark      3
dtype: int64
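
For comparison, the same per-course distinct-date counts can be obtained directly with groupby() and nunique() (a minimal sketch):


# A sketch: equivalent distinct count of dates per course
df2 = df.groupby('Courses')['Date'].nunique()
print(df2)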

7. Use pandas.pivot_table() and nunique Series Function

You can also create a pivot table and pass the Series.nunique function as the aggregation, which returns the distinct count of every column for each course.


# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
print(df2)   

Yields below output.


# Output:
         Date  Discount  Duration  Fee
Courses                               
Hadoop      2         2         2    2
Pandas      1         1         1    1
PySpark     1         1         1    1
Python      2         2         2    2
Spark       3         3         2    2
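
If you only want the distinct counts for selected columns, you can pass the values parameter to pivot_table(). A minimal sketch, assuming you are only interested in Date and Fee:


# A sketch: limit the pivot table to selected columns via values=
df2 = pd.pivot_table(data=df, index='Courses', values=['Date', 'Fee'],
                     aggfunc=pd.Series.nunique)
print(df2)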

8. Complete Example For Count Distinct Values


import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","Spark"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200],
    'Date':["2020-11-14","2020-11-17","2021-11-15","2021-11-17","2021-11-14","2021-11-14","2021-11-17","2021-11-15","2021-11-14"]
          })
df = pd.DataFrame(technologies)
print(df)

# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
print(df2)

# Get count of unique values in a column
df2 = df.groupby('Date')['Courses'].nunique()  
print(df2)

# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
print(df2)

# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
print(df2) 

# Use DataFrame.groupby() and nunique() on multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
print(df2)  

# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
print(df2)

# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
print(df2)

# Min
df2 = df.groupby('Date')['Fee'].transform('min')     
print(df2)

# Max
df2 = df.groupby('Date')['Fee'].transform('max')     
print(df2)

# Average
df2 = df.groupby('Date')['Fee'].transform('mean')   
print(df2) 

# Count
df2 = df.groupby('Date')['Fee'].transform('count')  
print(df2)                                    

# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
print(df2)

# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
print(df2)

# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
print(df2)   

Conclusion

In this article, you have learned how to get the count of distinct values in pandas using DataFrame.groupby(), nunique(), pandas.crosstab(), DataFrame.agg(), DataFrame.transform(), Series.value_counts(), and pandas.pivot_table() with examples.

Happy Learning !!
