Pandas GroupBy Multiple Columns Explained

Post last modified: May 22, 2024

How do you group by multiple columns in a pandas DataFrame and compute multiple aggregations? You can pass a list of column names to groupby() to group by multiple columns, and then use the aggregate functions to apply one or more aggregations at the same time.


Quick Examples of GroupBy Multiple Columns

The following are quick examples of grouping by multiple columns and applying multiple aggregations.


# Quick examples of groupby multiple columns

# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print(result)

# Groupby multiple columns and aggregate on selected column
result = df.groupby(['Courses','Fee'])['Courses'].count()
print(result)

# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print(result)

# Groupby multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

# Groupby & multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

Pandas GroupBy Multiple Columns Example

Most of the time, when working on a real-world project with a Pandas DataFrame, you are required to group by multiple columns. You can do so by passing a list of column names to the DataFrame.groupby() function. Let’s create a DataFrame to understand this with examples.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee' :[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
              }

df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.


# Output:
# Create DataFrame:
#    Courses    Fee  Duration  Discount
# 0    Spark  20000     30day      1000
# 1  PySpark  25000    40days      2300
# 2   Hadoop  26000    35days      1200
# 3   Python  22000    40days      2500
# 4  PySpark  25000    60days      2000
# 5    Spark  20000    60days      2000
# 6    Spark  35000    70days      3000

Now let’s group on multiple columns and then calculate the count aggregation.


# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print("After grouping by multiple columns:\n", result)

Yields below output. Because count() is applied to the entire DataFrame, every non-grouping column shows the same per-group counts (assuming there are no missing values).


# Output:
# After grouping by multiple columns:
#                Duration  Discount
# Courses Fee                      
# Hadoop  26000         1         1
# PySpark 25000         2         2
# Python  22000         1         1
# Spark   20000         2         2
#         35000         1         1

So, when you only need the count, select a single column first; you can even select one of the grouping columns.


# Group by multiple columns and get
# the count of one of the grouping columns
result = df.groupby(['Courses','Fee'])['Courses'].count()
print("Get count of one of the grouping column:\n", result)

# Output:
# Get count of one of the grouping column:
# Courses  Fee  
# Hadoop   26000    1
# PySpark  25000    2
# Python   22000    1
# Spark    20000    2
#          35000    1
# Name: Courses, dtype: int64
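If you prefer the counts as a flat DataFrame rather than a Series with a MultiIndex, one option (a sketch of my own, not from the examples above) is size() with reset_index(). Note that size() counts all rows per group, whereas count() excludes missing values; they agree here because the data has no NaNs.


```python
import pandas as pd

# Same sample data as in the article
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '60days', '70days'],
    'Discount': [1000, 2300, 1200, 2500, 2000, 2000, 3000]
}
df = pd.DataFrame(technologies)

# size() counts rows per group, and reset_index(name='Count')
# flattens the group keys back into ordinary columns
result = df.groupby(['Courses', 'Fee']).size().reset_index(name='Count')
print(result)
```

The result has one row per (Courses, Fee) pair with a plain integer index, which is often easier to pass to downstream code than a multi-indexed Series.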

Using aggregate()

You can also use aggregate() to apply an aggregation to a selected column after grouping by multiple columns. Here it takes the count function as a string parameter.


# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print("After grouping by multiple columns:\n", result)

# Output:
# After grouping by multiple columns:
# Courses  Fee  
# Hadoop   26000    1
# PySpark  25000    2
# Python   22000    1
# Spark    20000    2
#          35000    1
# Name: Duration, dtype: int64
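aggregate() is not limited to string names; it also accepts callables. As a quick illustration of my own (not from the original examples), the built-in len produces the same group sizes as 'count' here, because the Duration column has no missing values:


```python
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '60days', '70days'],
    'Discount': [1000, 2300, 1200, 2500, 2000, 2000, 3000]
}
df = pd.DataFrame(technologies)

# 'count' as a string vs the built-in len as a callable;
# count() skips NaN while len counts every row, so they
# only agree when the column has no missing values
by_string = df.groupby(['Courses', 'Fee'])['Duration'].aggregate('count')
by_callable = df.groupby(['Courses', 'Fee'])['Duration'].aggregate(len)
print(by_string.equals(by_callable))
```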

Pandas Multiple Aggregations

You can also compute multiple aggregations simultaneously in Pandas by passing a list of aggregation functions to the aggregate() function.


# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print("After applying multiple aggregations:\n", result)

# Output:
# After applying multiple aggregations:
#            min    max
# Courses              
# Hadoop   26000  26000
# PySpark  25000  25000
# Python   22000  22000
# Spark    20000  35000

In the above example, we calculated the minimum and maximum values of the Fee column. Now, let’s expand this process to calculate different aggregations on different columns.

When multiple aggregations are applied to a single column, the resulting DataFrame has MultiIndex columns.


# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print("After applying multiple aggregations on single group column:\n", result)

# Output:
# After applying multiple aggregations on single group column:
#         Duration    Fee       
#            count    min    max
# Courses                       
# Hadoop         1  26000  26000
# PySpark        2  25000  25000
# Python         1  22000  22000
# Spark          3  20000  35000

Since this creates a MultiIndex in the columns, and multi-indexed columns can be challenging to work with, I’d suggest simplifying the structure by renaming (flattening) the columns.
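One common way to do that (a sketch of my own, using the same DataFrame) is to join the two levels of each column label into a single string:


```python
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '60days', '70days'],
    'Discount': [1000, 2300, 1200, 2500, 2000, 2000, 3000]
}
df = pd.DataFrame(technologies)

result = df.groupby('Courses').aggregate({'Duration': 'count', 'Fee': ['min', 'max']})

# Join the two column levels, e.g. ('Fee', 'min') -> 'Fee_min'
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```

After flattening, the columns behave like ordinary single-level labels, which makes selecting, renaming, and merging the result much simpler.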

Conclusion

In this article, I have explained how to group DataFrame rows by multiple columns, and you have also learned how to compute different aggregations on a column.