Pandas groupby() and count() with Examples

You can use pandas DataFrame.groupby().count() to group columns and compute the count or size aggregate, this calculates a rows count for each group combination.

In this article, I will explain how to use groupby() and count() aggregate together with examples. groupBy() function is used to collect the identical data into groups and perform aggregate functions like size/count on the grouped data.

1. Quick Examples of groupby() and count() of DataFrame

If you are in a hurry, below are some quick examples of how to group by columns and get the count for each group from DataFrame.


# Below are quick examples

# Using groupby() and count()
df2 = df.groupby(['Courses'])['Courses'].count()

# Using GroupBy & count() on multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].count()

# Using GroupBy & size() on multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].size()

# using DataFrame.size() and max()
df2 = df.groupby(['Courses','Duration']).size().groupby(level=0).max() 

# Use size().reset_index() method
df2 = df.groupby(['Courses','Duration']).size().reset_index(name='counts')

# Using pandas DataFrame.reset_index()
df2 = df.groupby(['Courses','Duration'])['Fee'].agg('count').reset_index()

# Using DataFrame.transform()
df2 = df.groupby(['Courses','Duration']).Courses.transform('count')

# Use DataFrame.groupby() and Size() 
print(df.groupby(['Discount','Duration']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   .drop_duplicates(subset='Duration'))

Now, let’s create a DataFrame with a few rows and columns, execute these examples and validate results. Our DataFrame contains column names Courses, Fee, Duration, and Discount.


# Create a pandas DataFrame.
import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration':['30days','50days','35days','40days','60days','35days','55days','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
                })
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount'])
print(df)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   35days      1000
3   Python  24000   40days      1200
4   Pandas  26000   60days      2500
5   Hadoop  25000   35days      1300
6    Spark  25000   55days      1400
7   Python  22000   50days      1600

2. Use count() by Column Name

Use pandas DataFrame.groupby() to group the rows by column and use count() method to get the count for each group by ignoring None and Nan values. It works with non-floating type data as well. The below example does the grouping on Courses column and calculates count how many times each value is present.


# Using groupby() and count()
df2 = df.groupby(['Courses'])['Courses'].count()
print(df2)

Yields below output.


Courses
Hadoop     2
Pandas     1
PySpark    1
Python     2
Spark      2
Name: Courses, dtype: int64

3. pandas groupby() and count() on List of Columns

You can also send a list of columns you wanted group to groupby() method, using this you can apply a groupby on multiple columns and calculate a count over each combination group. For example, df.groupby(['Courses','Duration'])['Fee'].count() does group on Courses and Duration column and finally calculates the count.


# Using groupby() & count() on multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].count()
print(df2)

Yields below output.


Courses  Duration
Hadoop   35days      2
Pandas   60days      1
PySpark  50days      1
Python   40days      1
         50days      1
Spark    30days      1
         55days      1
Name: Fee, dtype: int64

4. Sort after groupby and count

Sometimes you would be required to perform a sort (ascending or descending order) after performing group and count. You can achieve this using the below example.

Note that by default group by sorts results by group key hence it will take additional time, if you have a performance issue and don’t want to sort the group by the result, you can turn this off by using the sort=False param.


# Sorting after groupby() & count()
# Sorting group keys on descending order
groupedDF = df.groupby('Courses',sort=False).count()
sortedDF=groupedDF.sort_values('Courses', ascending=False)['Fee']
print(sortedDF)

Yields below output.


Courses
Spark      2
Python     2
PySpark    1
Pandas     1
Hadoop     2
Name: Fee, dtype: int64

5. Using groupby() with size()

Alternatively, you can also use size() to get the rows count for each group. You can use df.groupby(['Courses','Duration']).size() to get a total number of elements for each group Courses and Duration. columns


# Using GroupBy & size() on multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].size()
print(df2)

Yields same output as above.

Use df.groupby(['Courses','Duration']).size().groupby(level=1).max() to specify which level you want as output. Note that the level starts from zero.


# using DataFrame.size() and max()
df2 = df.groupby(['Courses','Duration']).size().groupby(level=0).max() 
print(df2)

Yields below output.


Courses
Hadoop     2
Pandas     1
PySpark    1
Python     1
Spark      1
dtype: int64

Then use size().reset_index(name='counts') to assign a name to the count column.


# Use size().reset_index() method
df2 = df.groupby(['Courses','Duration']).size().reset_index(name='counts')
print(df2)

Yields below output.


   Courses Duration  counts
0   Hadoop   35days       2
1   Pandas   60days       1
2  PySpark   50days       1
3   Python   40days       1
4   Python   50days       1
5    Spark   30days       1
6    Spark   55days       1

6. Pandas groupby() and using agg(‘count’)

Alternatively, you can also get the group count by using agg() or aggregate() function and passing the aggregate count function as a param. reset_index() function is used to set the index on DataFrame. By using this approach you can compute multiple aggregations.


# Using pandas DataFrame.reset_index()
df2 = df.groupby(['Courses','Duration'])['Fee'].agg('count').reset_index()
print(df2)

Yields below output.


   Courses Duration  Fee
0   Hadoop   35days    2
1   Pandas   60days    1
2  PySpark   50days    1
3   Python   40days    1
4   Python   50days    1
5    Spark   30days    1
6    Spark   55days    1

7. Using DataFrame.transform()

You can use df.groupby(['Courses','Fee']).Courses.transform('count') to add a new column containing the groups counts into the DataFrame.


# Using DataFrame.transform()
df2 = df.groupby(['Courses','Duration']).Courses.transform('count')
print(df2)

Yields below output.


0    1
1    1
2    2
3    1
4    1
5    2
6    1
7    1
Name: Courses, dtype: int64

8. Other Examples

Now let’s see how to sort rows from the result of pandas groupby and drop duplicate rows from pandas DataFrame.


# Use DataFrame.groupby() and Size() 
print(df.groupby(['Discount','Duration']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   .drop_duplicates(subset='Duration'))

Yields below output.


Name: Courses, dtype: int64
   Discount Duration  count
0      1000   30days      1
1      1000   35days      1
2      1200   40days      1
4      1400   55days      1
5      1600   50days      1
7      2500   60days      1

9. Complete Example of groupby and count


# Create a pandas DataFrame.
import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration':['30days','50days','35days','40days','60days','35days','55days','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
                })
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount'])
print(df)

# Using groupby() and count()
df2 = df.groupby(['Courses'])['Courses'].count()
print(df2)

# Using GroupBy & count() on multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].count()
print(df2)

# Using GroupBy & size() on multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].size()
print(df2)

# using DataFrame.size() and max()
df2 = df.groupby(['Courses','Duration']).size().groupby(level=0).max() 
print(df2)

# Use size().reset_index() method
df2 = df.groupby(['Courses','Duration']).size().reset_index(name='counts')
print(df2)

# Using pandas DataFrame.reset_index()
df2 = df.groupby(['Courses','Duration'])['Fee'].agg('count').reset_index()
print(df2)

# Using DataFrame.transform()
df2 = df.groupby(['Courses','Duration']).Courses.transform('count')
print(df2)

# Use DataFrame.groupby() and Size() 
print(df.groupby(['Discount','Duration']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   .drop_duplicates(subset='Duration'))

Conclusion

In this article, you have learned how to groupby single and multiple columns and get the rows counts from pandas DataFrame Using DataFrame.groupby(), size(), count() and DataFrame.transform() methods with examples.

Happy Learning !!

You May Also Like

References

Leave a Reply