How do you group by multiple columns in a pandas DataFrame and compute multiple aggregations? groupby() accepts a list of column names to group by multiple columns, and the aggregate functions let you apply one or more aggregations at the same time.
Quick Examples of GroupBy Multiple Columns
The following are quick examples of grouping by multiple columns and applying multiple aggregations.
# Quick examples of groupby multiple columns
# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print(result)
# Groupby multiple columns and aggregate on selected column
result = df.groupby(['Courses','Fee'])['Courses'].count()
print(result)
# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print(result)
# Groupby multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)
# Groupby & multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)
Pandas GroupBy Multiple Columns Example
Most of the time, when you are working on a real-world project with a Pandas DataFrame, you are required to group by multiple columns. You can do so by passing a list of column names to the DataFrame.groupby() function. Let's create a DataFrame to understand this with examples.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
'Fee' :[20000,25000,26000,22000,25000,20000,35000],
'Duration':['30day','40days','35days','40days','60days','60days','70days'],
'Discount':[1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
Yields below output.
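# Output:
# Create DataFrame:
#     Courses    Fee Duration  Discount
# 0    Spark  20000    30day      1000
# 1  PySpark  25000   40days      2300
# 2   Hadoop  26000   35days      1200
# 3   Python  22000   40days      2500
# 4  PySpark  25000   60days      2000
# 5    Spark  20000   60days      2000
# 6    Spark  20000   70days      3000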
Now let’s do a group on multiple columns and then calculate count aggregation.
# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print("After grouping by multiple columns:\n", result)
Yields below output. When you apply count() to the entire DataFrame, all of the remaining columns end up with the same counts.
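# Output:
# After grouping by multiple columns:
#                Duration  Discount
# Courses Fee
# Hadoop  26000         1         1
# PySpark 25000         2         2
# Python  22000         1         1
# Spark   20000         2         2
#         35000         1         1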
So, when you only need the count, select a single column first; you can even select one of the grouping columns.
# Group by multiple columns and get
# count of one of grouping column
result = df.groupby(['Courses','Fee'])['Courses'].count()
print("Get count of one of the grouping column:\n", result)
# Output:
# Get count of one of the grouping column:
# Courses Fee
# Hadoop 26000 1
# PySpark 25000 2
# Python 22000 1
# Spark 20000 2
# 35000 1
# Name: Courses, dtype: int64
Using aggregate()
You can use aggregate() to apply an aggregation to a selected column after grouping by multiple columns. Here, the count function is passed as a string parameter.
# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print("After grouping by multiple columns:\n", result)
# Output:
# After grouping by multiple columns:
# Courses Fee
# Hadoop 26000 1
# PySpark 25000 2
# Python 22000 1
# Spark 20000 2
# 35000 1
# Name: Duration, dtype: int64
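Note that agg() is a shorter alias for aggregate(), so the following call (a minimal equivalent sketch) produces the same result.
# agg() is an alias for aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].agg('count')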
Pandas Multiple Aggregations
You can also compute multiple aggregations simultaneously in Pandas by passing a list of aggregation functions to the aggregate() function.
# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print("After applying multiple aggregations on multiple group columns:\n", result)
# Output:
# After applying multiple aggregations on a single column:
# min max
# Courses
# Hadoop 26000 26000
# PySpark 25000 25000
# Python 22000 22000
# Spark 20000 35000
The above example calculates the minimum and maximum values of the Fee column. Now, let's expand this to compute different aggregations on different columns by passing a dictionary to aggregate(). When multiple aggregation functions are applied after grouping, the resulting DataFrame has MultiIndex columns.
# Groupby & multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print("After applying multiple aggregations on single group column:\n", result)
# Output:
# After applying multiple aggregations on different columns:
# Duration Fee
# count min max
# Courses
# Hadoop 1 26000 26000
# PySpark 2 25000 25000
# Python 1 22000 22000
# Spark 3 20000 35000
Note that this creates MultiIndex columns, which can be cumbersome to work with, so I'd suggest simplifying the structure by flattening or renaming the columns, as sketched below.
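A minimal sketch, reusing the result DataFrame from the previous example, that joins the two column levels into single flat names:
# Flatten the MultiIndex columns, e.g. ('Fee','min') becomes 'Fee_min'
result.columns = ['_'.join(col) for col in result.columns]
print("After flattening the MultiIndex columns:\n", result)

# Output:
# After flattening the MultiIndex columns:
#          Duration_count  Fee_min  Fee_max
# Courses
# Hadoop                1    26000    26000
# PySpark               2    25000    25000
# Python                1    22000    22000
# Spark                 3    20000    35000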
Conclusion
In this article, I have explained how to group DataFrame rows by multiple columns and how to compute different aggregations on the grouped columns.
Related Articles
- Pandas groupby Aggregate Explained
- Pandas groupby Transform
- Pandas groupby Sort within Groups
- Pandas groupby() and count() with Examples
- Pandas groupby() Explained With Examples
- Pandas groupby() and sum() With Examples
- Pandas Group Rows into List Using groupby()
- How to groupby Index in Pandas?
- Pandas Percentage Total With groupby
- Pandas Check Column Contains a Value in DataFrame