• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:8 mins read
You are currently viewing Pandas GroupBy Multiple Columns Explained

How to groupby multiple columns in pandas DataFrame and compute multiple aggregations? groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.

Advertisements

1. Quick Examples of GroupBy Multiple Columns

Following are examples of how to groupby on multiple columns & apply multiple aggregations.


# Quick Examples

# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print(result)

# Groupby multiple columns and aggregate on selected column
result = df.groupby(['Courses','Fee'])['Courses'].count()
print(result)

# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print(result)

# Groupby multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

# Groupby & multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

2. Pandas GroupBy Multiple Columns Example

Most of the time when you are working on a real-time project in Pandas DataFrame you are required to do groupby on multiple columns. You can do so by passing a list of column names to DataFrame.groupby() function. Let’s create a DataFrame to understand this with examples.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee' :[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
              }

df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

pandas groupby multiple columns

Now let’s do a group on multiple columns and then calculate count aggregation.


# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print("After grouping by multiple columns:\n", result)

Yields below output. When you apply count on the entire DataFrame, pretty much all columns will have the same values.

pandas groupby multiple columns

So when you want to group by count just select a column, you can even select from your group columns.


# Group by multiple columns and get 
# count of one of grouping column
result = df.groupby(['Courses','Fee'])['Courses'].count(\n", result)
print("Get count of one of the grouping column:\n", result)

# Output:
# Get count of one of the grouping column:
# Courses  Fee  
# Hadoop   26000    1
# PySpark  25000    2
# Python   22000    1
# Spark    20000    2
#          35000    1
# Name: Courses, dtype: int6

2. Using aggregate()

Alternatively, you can also use the aggregate() function. This takes the count function as a string param.


# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Courses'].aggregate('count')
print("After grouping by multiple columns:\n", result)

Yields below output.


# Output:
After grouping by multiple columns:
Courses  Fee  
Hadoop   26000    1
PySpark  25000    2
Python   22000    1
Spark    20000    2
         35000    1
Name: Duration, dtype: int64

3. Pandas Multiple Aggregations Example

You can also compute multiple aggregations at the same time in Pandas by using the list to the aggregate().


# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print("After applying multiple aggregations on multiple group columns:\n", result)

Yields below output.


# Output:
# After applying multiple aggregations on multiple group columns:
       min    max
Courses              
Hadoop   26000  26000
PySpark  25000  25000
Python   22000  22000
Spark    20000  35000

The above example calculates min and max on the Fee column. Let’s extend this to compute different aggregations on different columns.

Note that applying multiple aggregations to a single column in pandas DataFrame will result in a MultiIndex.


# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print("After applying multiple aggregations on single group column:\n", result)

Yields below output. Notice that this creates MultiIndex. Working with multi-indexed columns is not easy so I’d recommend flattening by renaming the columns.


# Output:
# After applying multiple aggregations on single group column:
        Duration    Fee       
           count    min    max
Courses                       
Hadoop         1  26000  26000
PySpark        2  25000  25000
Python         1  22000  22000
Spark          3  20000  35000

Conclusion

In this article, you have learned how to group DataFrame rows by multiple columns and also learned how to compute different aggregations on a column.

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium