Pandas Count Distinct Values DataFrame

You can get the count of distinct values (equivalent to SQL's count(distinct)) in Pandas using the DataFrame.groupby(), nunique(), DataFrame.agg(), DataFrame.transform(), pandas.crosstab(), Series.value_counts() and pandas.pivot_table() methods.

In Pandas, you can use groupby() in combination with nunique(), agg(), crosstab(), pivot_table(), transform() and Series.value_counts(). In this article, I will cover how to get the count of distinct values for single and multiple columns of a Pandas DataFrame.

Key Points –

  • Pandas provides the nunique() method to count distinct values in a DataFrame column; it returns the number of unique elements.
  • It can be applied to a single column or to the entire DataFrame (see the short sketch after this list).
  • nunique() excludes NaN values by default but can include them with the dropna=False parameter.
  • This function is useful for understanding the cardinality of categorical data and identifying unique values in a dataset.
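
A minimal sketch of these points (standalone, not the article's example DataFrame):


import pandas as pd

df_small = pd.DataFrame({'Courses': ["Spark", "PySpark", "Spark"],
                         'Fee': [22000, 25000, 22000]})

print(df_small['Courses'].nunique())  # 2 -> distinct values in a single column
print(df_small.nunique())             # Series with the distinct count of every column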

1. Quick Examples of Count Distinct Values

If you are in a hurry, below are some quick examples of how to find count distinct values in Pandas DataFrame.


# Quick examples of count distinct values

# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()

# Get the count of unique values in a column
df2 = df.groupby('Date')['Courses'].nunique()  

# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})

# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})

# Use DataFrame.groupby() and nunique() on multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique() 

# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()

# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')

# Min
df2 = df.groupby('Date')['Fee'].transform('min')     

# Max
df2 = df.groupby('Date')['Fee'].transform('max')     

# Average
df2 = df.groupby('Date')['Fee'].transform('mean')   

# Count
df2 = df.groupby('Date')['Fee'].transform('count')                                   

# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)

# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)

# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)

Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Duration, Discount, and Date.


import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","Spark"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200],
    'Date':["2020-11-14","2020-11-17","2021-11-15","2021-11-17","2021-11-14","2021-11-14","2021-11-17","2021-11-15","2021-11-14"]
          })
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount        Date
0    Spark  22000   30days      1000  2020-11-14
1  PySpark  25000   50days      2300  2020-11-17
2   Hadoop  23000   55days      1000  2021-11-15
3   Python  24000   40days      1200  2021-11-17
4   Pandas  26000   60days      2500  2021-11-14
5   Hadoop  25000   35days      1200  2021-11-14
6    Spark  25000   30days      1400  2021-11-17
7   Python  22000   50days      1000  2021-11-15
8    Spark  25000   40days      1200  2021-11-14

2. Use nunique() to Get Count Distinct Values in Pandas

If you have a SQL background, you have probably run count(distinct) to get the number of distinct values. You can achieve the same in Pandas by grouping on a column and then counting the unique values within each group. Use the nunique() method to get the distinct count, for instance df.groupby('Date')['Courses'].nunique().


# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
print(df2)

# Get the count of unique values in a column
df2 = df.groupby('Date')['Courses'].nunique()  
print(df2)

Yields below output.


# Output:
Date
2020-11-14    1
2020-11-17    1
2021-11-14    3
2021-11-15    2
2021-11-17    2
Name: Courses, dtype: int64

3. Use nunique() for Multiple Columns to Get Count Distinct

You can also apply nunique() and count to multiple columns at once to get a count(distinct) equivalent for each of them.


# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
print(df2)

Yields below output.


# Output:
              Date           Fee      
           nunique count nunique count
Date                                  
2020-11-14       1     1       1     1
2020-11-17       1     1       1     1
2021-11-14       1     3       2     3
2021-11-15       1     2       2     2
2021-11-17       1     2       2     2

Let’s see another example.


# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
print(df2) 

Yields below output.


# Output:
           Courses    Fee      
           nunique    sum count
Date                           
2020-11-14       1  22000     1
2020-11-17       1  25000     1
2021-11-14       3  76000     3
2021-11-15       2  45000     2
2021-11-17       2  49000     2

Here is an approach to get a distinct count over multiple columns. List the columns of interest and group them by another column with a slightly modified syntax: df[columns].groupby(df['Courses']).nunique().


# Use DataFrame.groupby() and nunique() on multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
print(df2)  

Yields below output.


# Output:
         Date  Fee
Courses           
Hadoop      2    2
Pandas      1    1
PySpark     1    1
Python      2    2
Spark       3    2

4. Use Series.value_counts() Method

Another, much simpler way to explore distinct values is Series.value_counts(), for example df.Date.value_counts(). Keep in mind that it returns the number of occurrences of each unique value rather than a single distinct count.


# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
print(df2)

Yields below output.


# Output:
2021-11-14    3
2021-11-17    2
2021-11-15    2
2020-11-14    1
2020-11-17    1
Name: Date, dtype: int64
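
Note that value_counts() returns how often each value occurs, not a single distinct count. If you only need the number of distinct dates, calling nunique() on the same Series (or taking the length of the value_counts() result) gives that directly:


# Distinct count of the Date column itself
print(df['Date'].nunique())            # 5
print(len(df['Date'].value_counts()))  # 5, same result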

5. Use DataFrame.groupby() & transform() Method

You can also use transform() with 'nunique'. Unlike agg(), transform() returns a result aligned with the original rows, which is very helpful when you need per-row values of aggregations such as count(distinct), 'min', 'max', 'count', or 'mean'.


# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
print(df2)

# Min
df2 = df.groupby('Date')['Fee'].transform('min')     
print(df2)

# Max
df2 = df.groupby('Date')['Fee'].transform('max')     
print(df2)

# Average
df2 = df.groupby('Date')['Fee'].transform('mean')   
print(df2) 

# Count
df2 = df.groupby('Date')['Fee'].transform('count')  
print(df2)
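
Because transform() keeps the original row order and index, a common pattern is to attach the per-group result back to the DataFrame as a new column. Here is a minimal sketch (the column name FeeNunique is just for illustration, and a copy is used so df stays unchanged):


# Attach the per-group distinct count of Fee back to every row
df_t = df.copy()
df_t['FeeNunique'] = df_t.groupby('Date')['Fee'].transform('nunique')
print(df_t[['Date', 'Fee', 'FeeNunique']])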

6. Use pandas.crosstab() Method

Pandas provides multiple ways of achieving the same or similar results. The crosstab() method returns more information than groupby() with nunique(): it produces the full frequency table of one column against another. In this example, we use crosstab().


# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
print(df2) 

Yields below output.


# Output:
Date     2020-11-14  2020-11-17  2021-11-14  2021-11-15  2021-11-17
Courses                                                            
Hadoop            0           0           1           1           0
Pandas            0           0           1           0           0
PySpark           0           1           0           0           0
Python            0           0           0           1           1
Spark             1           0           1           0           1

You can summarize this table into a distinct count per course by chaining .ne(0).sum(1), which counts the non-zero cells in each row.


# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
print(df2)

Yields below output.


# Output:
Courses
Hadoop     2
Pandas     1
PySpark    1
Python     2
Spark      3
dtype: int64

7. Use pandas.pivot_table() and nunique Series Function

Similarly, you can use pivot_table() with pd.Series.nunique as the aggregation function to calculate the count of unique values for each column, grouped by a specific column.


# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
print(df2)   

Yields below output.


# Output:
         Date  Discount  Duration  Fee
Courses                               
Hadoop      2         2         2    2
Pandas      1         1         1    1
PySpark     1         1         1    1
Python      2         2         2    2
Spark       3         3         2    2
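
Note that recent pandas versions also accept the aggregation by name, which should produce the same result; a minimal equivalent sketch:


# Same pivot using the string alias for the aggregation
df2 = pd.pivot_table(data=df, index='Courses', aggfunc='nunique')
print(df2)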

8. Complete Example For Count Distinct Values


import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","Spark"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200],
    'Date':["2020-11-14","2020-11-17","2021-11-15","2021-11-17","2021-11-14","2021-11-14","2021-11-17","2021-11-15","2021-11-14"]
          })
df = pd.DataFrame(technologies)
print(df)

# Use DataFrame.groupby() & nunique() method
df2 = df.groupby('Date').Courses.nunique()
print(df2)

# Get the count of unique values in a column
df2 = df.groupby('Date')['Courses'].nunique()  
print(df2)

# Use .nunique() for multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Date': ['nunique', 'count'],'Fee': ['nunique', 'count']})
print(df2)

# Use DataFrame.groupby() and agg multiple columns to count(distinct) equivalent
df2 = df.groupby('Date').agg({'Courses': ['nunique'], 'Fee' : ['sum', 'count']})
print(df2) 

# Use DataFrame.groupby() and nunique() on multiple columns
columns = ['Date','Fee']
df2 = df[columns].groupby(df['Courses']).nunique()
print(df2)  

# Use DataFrame.column.value_counts() method
df2 = df.Date.value_counts()
print(df2)

# Count(distinct)
df2 = df.groupby('Date')['Fee'].transform('nunique')
print(df2)

# Min
df2 = df.groupby('Date')['Fee'].transform('min')     
print(df2)

# Max
df2 = df.groupby('Date')['Fee'].transform('max')     
print(df2)

# Average
df2 = df.groupby('Date')['Fee'].transform('mean')   
print(df2) 

# Count
df2 = df.groupby('Date')['Fee'].transform('count')  
print(df2)                                    

# Using pandas.crosstab() method
df2 = pd.crosstab(df.Courses, df.Date)
print(df2)

# Using pandas.crosstab() and .ne(0).sum(1)
df2 = pd.crosstab(df.Courses, df.Date).ne(0).sum(1)
print(df2)

# Use pandas.pivot_table() and nunique series function
df2 = pd.pivot_table(data=df, index='Courses', aggfunc=pd.Series.nunique)
print(df2)   

Frequently Asked Questions on Pandas Count Distinct Values DataFrame

What does it mean to count distinct values in a DataFrame?

Counting distinct values in a DataFrame involves determining the number of unique values present in one or more columns of the DataFrame.

How can I count distinct values in a DataFrame using Pandas?

You can use the nunique() method in Pandas to count the number of unique values in one or more columns of a DataFrame.

Does nunique() count NaN values?

By default, nunique() excludes NaN values when counting distinct values. However, you can include NaN values by setting the dropna=False parameter.
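
For example (a small standalone sketch):


import pandas as pd
import numpy as np

s = pd.Series([1, 2, 2, np.nan])
print(s.nunique())              # 2 -> NaN excluded by default
print(s.nunique(dropna=False))  # 3 -> NaN counted as a distinct value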

Can I count distinct values for multiple columns simultaneously?

You can count distinct values for multiple columns simultaneously by applying the nunique() method to the entire DataFrame or by using it in combination with grouping functions like groupby().

Can nunique() be applied to string columns?

The nunique() method can be applied to string columns in a DataFrame; it counts the number of unique strings in the specified column(s). For example, using the DataFrame from this article:
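
# Count the distinct strings in the Courses column of the example DataFrame
print(df['Courses'].nunique())  # 5 -> Spark, PySpark, Hadoop, Python, Pandas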

How can I count distinct values for each group in a DataFrame?

You can count distinct values for each group in a DataFrame by using the groupby() function in combination with the nunique() method.

Conclusion

In this article, you have learned how to get the count of distinct values in Pandas using DataFrame.groupby(), nunique(), pandas.crosstab(), DataFrame.agg(), DataFrame.transform(), Series.value_counts(), and pandas.pivot_table(), with examples.

Happy Learning !!


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
