Pandas groupby() Explained With Examples

Similar to the SQL GROUP BY clause, the Pandas DataFrame.groupby() function is used to collect identical data into groups and perform aggregate functions on the grouped data. A group by operation involves splitting the data, applying some functions, and finally combining the results.


In Pandas, you can use groupby() in combination with sum(), count(), pivot(), transform(), aggregate(), and many more methods to perform various operations on grouped data. In this article, I will cover how to group by a single column or by multiple columns using groupby(), with examples.

Pandas groupby() Syntax

Below is the syntax of the groupby() function. This function takes several params, which are explained below, and returns a DataFrameGroupBy object that contains information about the groups.


# Syntax of DataFrame.groupby()
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, 
       sort=True, group_keys=True, squeeze=<no_default>, 
       observed=False, dropna=True)
  • by – Column label, or list of labels, used to determine the groups
  • axis – Default to 0. It takes 0 or ‘index’, 1 or ‘columns’
  • level – Used with MultiIndex to group by a particular index level or levels.
  • as_index – When True (the default), the group keys become the index of the result; set to False for SQL-style grouped output (see the sketch after this list).
  • sort – Default to True. Specify whether to sort the group keys.
  • group_keys – Whether to add the group keys to the index when applying a function.
  • squeeze – Deprecated in newer versions.
  • observed – This only applies if any of the groupers are Categoricals.
  • dropna – Default to True. When True, rows whose group key is None/NaN are dropped; set to False to keep them.
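
For example, the as_index parameter controls whether the group keys end up in the result's index or as regular columns. Below is a minimal, self-contained sketch; the sales DataFrame here is hypothetical and is used only to illustrate the parameter, it is not the DataFrame used in the rest of this article.


# Illustrate the as_index parameter (hypothetical mini DataFrame)
import pandas as pd
sales = pd.DataFrame({'Region':['East','West','East'], 'Amount':[10,20,30]})

# Default as_index=True: the group keys become the index of the result
print(sales.groupby('Region').sum())

# as_index=False: the group keys stay as regular columns (SQL-style output)
print(sales.groupby('Region', as_index=False).sum())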

To explain groupby() with several examples, let's first create a simple Pandas DataFrame with a combination of string and numeric columns.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

This creates a DataFrame with the Courses, Fee, Duration, and Discount columns; note that Discount contains a None value and one course is recorded as the literal string "NA".

Use Pandas groupby() Function

As mentioned above, the groupby() function returns a DataFrameGroupBy object after collecting identical data into groups from a pandas DataFrame. You can then perform several operations on this DataFrameGroupBy object using methods such as sum(), mean(), etc.

Let’s apply the groupby() function along with the sum() function to perform the sum operation on grouped data.


# Use groupby() to compute the sum
df2 = df.groupby(['Courses']).sum()
print("Get sum of grouped data:\n", df2)

This groups the rows by the Courses column and returns a DataFrame indexed by Courses with the sum of the remaining columns for each course.
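
Note that depending on your pandas version, sum() may also try to aggregate the non-numeric Duration column (by concatenating the strings) or drop it silently. A minimal sketch that restricts the aggregation to numeric columns explicitly:


# Sum only the numeric columns (Fee, Discount) for each course
df2 = df.groupby(['Courses']).sum(numeric_only=True)
print(df2)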

Pandas groupby() on Two or More Columns

Most of the time, you would need to perform a groupby on multiple columns of a DataFrame. You can do this by passing a list of the column labels you want to group by.


# Group by multiple columns
df2 = df.groupby(['Courses', 'Duration']).sum()
print("Get sum of groupby multiple columns:\n", df2)

Yields below output.


# Output:
# Get sum of groupby multiple columns:
                    Fee  Discount
Courses Duration                 
Hadoop  35days    25000       0.0
        55days    23000    1000.0
NA      40days     1500       0.0
Pandas  60days    26000    2500.0
PySpark 50days    25000    2300.0
Python  40days    24000    1200.0
        50days    22000    1600.0
Spark   30days    47000    2400.0

Add Index to the Grouped Data

By default, the groupby() function returns the group keys as the index of the result rather than as a regular row index. You can convert them back to columns and add a default row index using the DataFrame.reset_index() method.

Related: You can also group a Pandas DataFrame by its index (see the short sketch at the end of this section).


# Add Row Index to the group by result
df2 = df.groupby(['Courses','Duration']).sum().reset_index()
print("After adding index to DataFrame:\n", df2)

Yields below output.


# Output:
# After adding index to DataFrame:
   Courses Duration    Fee  Discount
0   Hadoop   35days  25000       0.0
1   Hadoop   55days  23000    1000.0
2       NA   40days   1500       0.0
3   Pandas   60days  26000    2500.0
4  PySpark   50days  25000    2300.0
5   Python   40days  24000    1200.0
6   Python   50days  22000    1600.0
7    Spark   30days  47000    2400.0
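
As a short sketch of the related group-by-index idea mentioned above, you can move a column into the index and group by an index level instead. This is only an illustration built on the same df; df_idx is a name introduced here for the sketch.


# Group by an index level instead of a column
df_idx = df.set_index('Courses')
print(df_idx.groupby(level=0)['Fee'].sum())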

Drop NA/None/NaN (on group key) from the Result

You can also choose whether to include NA/None/NaN in the group keys by setting the dropna parameter. By default, dropna is set to True, so rows with None/NaN in the group keys are not included in the result. To keep them, pass dropna=False.


# Keep rows that have None/NaN on group keys
df2 = df.groupby(by=['Courses'], dropna=False).sum()
print(df2)
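
Note that in the sample data the Courses column contains the literal string "NA", which pandas does not treat as a missing value, so dropna only changes the result once the group key holds a real NaN. A minimal sketch, assuming you first convert that string to an actual missing value (df_nan is a name introduced here for the sketch):


# Convert the literal string "NA" in Courses to a real missing value
import numpy as np
df_nan = df.replace({'Courses': {'NA': np.nan}})

# dropna=True (default) drops the NaN group; dropna=False keeps it
print(df_nan.groupby('Courses', dropna=True)['Fee'].sum())
print(df_nan.groupby('Courses', dropna=False)['Fee'].sum())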

Sort groupby() result by Group Key

To remove sorting on grouped results, pass the sort=False parameter to the groupby() function. This ensures that the grouped results are not sorted by the group key and preserves the original order of appearance of the courses in the DataFrame.


# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum()
print(df2)

To sort the group keys (courses) in descending order after performing the groupby() operation, you can use the sort_index() method with the ascending=False parameter.


# Grouping by Courses and summing, 
# Then sorting group keys in descending order
sortedDF = df.groupby('Courses', sort=False).sum().sort_index(ascending=False)
print(sortedDF)

# Sorting group keys on descending order
groupedDF = df.groupby('Courses', sort=False).sum()
sortedDF = groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)

The first snippet groups the DataFrame by Courses, calculates the sum of each group, and then sorts the group keys (courses) in descending order using the sort_index() method with ascending=False. The second snippet achieves the same result by sorting on the Courses index level with sort_values().

You can use apply() along with a lambda function to sort each group by the Fee column.


# Using apply() & lambda
df2 = df.groupby('Courses').apply(lambda x: x.sort_values('Fee'))
print(df2)

This code will group the DataFrame by Courses and then apply a lambda function to each group. The lambda function sorts each group by the ‘Fee’ column using sort_values().
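
For this particular case you can often get a closely related result without apply(); a minimal sketch that simply sorts by the group key and then by Fee (the main difference is that apply() also adds the group key as an extra index level):


# Similar ordering without apply(): sort by group key, then by Fee within it
df2 = df.sort_values(['Courses', 'Fee'])
print(df2)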

Apply More Aggregations

You can also compute multiple aggregations at the same time on grouped data, simply by passing a list of aggregate functions to aggregate().


# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print("After applying multiple aggregations on grouped data:\n", result)

Yields below output.


# Output:
# After applying multiple aggregations on grouped data:
           min    max
Courses              
Hadoop   23000  25000
NA        1500   1500
Pandas   26000  26000
PySpark  25000  25000
Python   22000  24000
Spark    22000  25000

To compute different aggregations on different columns in a grouped DataFrame, you can pass a dictionary to the agg() function specifying the aggregation function for each column. The example below calculates the count of the Duration column and the min and max of the Fee column for each group.


# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print("After applying multiple aggregations on grouped data:\n", result)

Notice that this creates a MultiIndex on the columns. Working with multi-indexed columns can be challenging; flattening them by renaming the columns provides better readability and ease of use (see the sketch after the output below). This example yields the below output.


# Output:
# After applying multiple aggregations on grouped data:
        Duration    Fee       
           count    min    max
Courses                       
Hadoop         2  23000  25000
NA             1   1500   1500
Pandas         1  26000  26000
PySpark        1  25000  25000
Python         2  22000  24000
Spark          2  22000  25000
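
The flattening mentioned above is not shown in the snippet itself; a minimal sketch of one common way to do it, joining the two column levels with an underscore:


# Flatten the MultiIndex columns produced by the multi-aggregation
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
result.columns = ['_'.join(col) for col in result.columns]
print(result)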

Complete Example of groupby() Method


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pd.DataFrame(technologies)
print(df)

# Use groupby() to compute the sum
df2 = df.groupby(['Courses']).sum()
print(df2)

# Group by on multiple columns
df2 = df.groupby(['Courses', 'Duration']).sum()
print(df2)

# Set Index on group by results
df2 = df.groupby(['Courses','Duration']).sum().reset_index()
print(df2)

# Using groupby on single column in pandas
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)

# Ignore sorting on group by key
df2 = df.groupby(by=['Courses'], sort=False).sum()
print(df2)

# Sort group key on descending order
groupedDF = df.groupby('Courses', sort=False).sum()
sortedDF = groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)

# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

FAQ on Pandas groupby()

What is groupby() in pandas?

groupby() is a powerful function in pandas that is used for grouping data based on some criteria. It enables you to split a DataFrame into groups based on one or more columns and then apply a function (such as aggregation, transformation, or filtering) to each group independently.

How to use groupby() in pandas?

The basic syntax of groupby() involves selecting a column or columns to group by and then applying an aggregation function.
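
For instance, a minimal sketch on the example DataFrame from this article, grouping by a single column and aggregating one value column:


# Group by one column and apply a single aggregation
print(df.groupby('Courses')['Fee'].sum())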

How to apply custom functions with groupby()?

To apply custom functions with groupby() in pandas, you can use either the apply() method or the agg() method.
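
A minimal sketch of both approaches on the example DataFrame; the helper fee_range is a name introduced here for the sketch, not part of the pandas API:


# Custom aggregation with agg(): fee range (max - min) per course
def fee_range(s):
    return s.max() - s.min()

print(df.groupby('Courses')['Fee'].agg(fee_range))

# Custom function with apply(): operate on each group's sub-DataFrame
print(df.groupby('Courses').apply(lambda g: g['Fee'].max() - g['Fee'].min()))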

Can I perform transformations with groupby()?

You can perform transformations within each group using the transform() method with groupby() in pandas. The transform() method is particularly useful when you want to maintain the original shape of the DataFrame but apply a transformation to each group independently.
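
A minimal sketch on the example DataFrame: transform() returns a result aligned to the original rows, so you can attach a per-group statistic as a new column (the column name Avg_Fee is introduced here just for illustration):


# Add the average fee of each course back onto every row of that course
df['Avg_Fee'] = df.groupby('Courses')['Fee'].transform('mean')
print(df[['Courses', 'Fee', 'Avg_Fee']])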

How to sort the results of groupby()?

You can sort the results of groupby() in pandas using the sort_values() method. For example, in the sketch below, grouped_data contains the mean value for each category, and sort_values() is then used to sort the result in descending order based on those mean values.
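
A minimal sketch of that pattern; the category and value columns here are hypothetical, used only to illustrate the idea:


# Mean per category, then sort the grouped result in descending order
import pandas as pd
df_cat = pd.DataFrame({'category':['A','B','A','B','C'], 'value':[10,40,30,20,50]})
grouped_data = df_cat.groupby('category')['value'].mean()
print(grouped_data.sort_values(ascending=False))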

Conclusion

In conclusion, the groupby() function in Pandas is a powerful tool for splitting data into groups based on one or more criteria, performing operations on each group, and then combining the results. Throughout this article, we’ve explored its syntax and various examples to understand its usage comprehensively.

Happy Learning !!
