Pandas GroupBy Multiple Columns Explained

  • Post author: NNK
  • Post category:Pandas / Python
  • Post last modified:February 8, 2022

How do you group by multiple columns in a pandas DataFrame and compute multiple aggregations? groupby() accepts a list of column names, so you can group on multiple columns, and the aggregate functions let you apply one or more aggregations at the same time.

1. Quick Examples of GroupBy Multiple Columns

Following are quick examples of how to group by multiple columns and apply multiple aggregations.


# Quick Examples

# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print(result)

# Groupby multiple columns and aggregate on selected column
result = df.groupby(['Courses','Fee'])['Courses'].count()
print(result)

# Groupby multiple columns and aggregate()
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print(result)

# Groupby multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

# Groupby & multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

2. pandas GroupBy Multiple Columns Example

When working on real-world projects with a pandas DataFrame, you often need to group by multiple columns. You can do so by passing a list of column names to the DataFrame.groupby() function. Let’s create a DataFrame to understand this with examples.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee' :[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
              }

df = pd.DataFrame(technologies)
print(df)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  20000    30day      1000
1  PySpark  25000   40days      2300
2   Hadoop  26000   35days      1200
3   Python  22000   40days      2500
4  PySpark  25000   60days      2000
5    Spark  20000   60days      2000
6    Spark  35000   70days      3000

Now let’s group on multiple columns and calculate the count aggregation.


# Groupby multiple columns
result = df.groupby(['Courses','Fee']).count()
print(result)

Yields below output. When you apply count() to the entire DataFrame, every remaining column gets the same value per group (they can differ only when a column contains NaN, which count() skips).


               Duration  Discount
Courses Fee                      
Hadoop  26000         1         1
PySpark 25000         2         2
Python  22000         1         1
Spark   20000         2         2
        35000         1         1

So when you want a group-by count, just select a single column first; you can even select one of the grouping columns.


# Groupby multiple columns and count on a selected column
result = df.groupby(['Courses','Fee'])['Courses'].count()
print(result)
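As a related shortcut, assuming the same df as above, size() returns per-group row counts directly without selecting a column first. The difference is that size() includes rows where a value is NaN, while count() skips them; here the data has no NaN, so both agree. A minimal sketch:

```python
import pandas as pd

# Same example data as above
technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee': [20000,25000,26000,22000,25000,20000,35000],
    'Duration': ['30day','40days','35days','40days','60days','60days','70days'],
    'Discount': [1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)

# size() counts rows per group, no column selection needed
counts = df.groupby(['Courses','Fee']).size()
print(counts)
```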

2.1 Using aggregate()

Alternatively, you can use the aggregate() function, which accepts the aggregation as a string param, e.g. 'count'.


# Groupby multiple columns and aggregate() on a selected column
result = df.groupby(['Courses','Fee'])['Duration'].aggregate('count')
print(result)

Yields below output.


Courses  Fee  
Hadoop   26000    1
PySpark  25000    2
Python   22000    1
Spark    20000    2
         35000    1
Name: Duration, dtype: int64
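Note that the result above is a Series with the group keys in the index. If you prefer a flat DataFrame, passing as_index=False keeps the group keys as regular columns; a small sketch with the same data:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee': [20000,25000,26000,22000,25000,20000,35000],
    'Duration': ['30day','40days','35days','40days','60days','60days','70days'],
    'Discount': [1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)

# as_index=False returns the group keys as ordinary columns,
# so the result is a flat DataFrame rather than an indexed Series
result = df.groupby(['Courses','Fee'], as_index=False)['Duration'].count()
print(result)
```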

3. pandas Multiple Aggregations Example

You can also compute multiple aggregations at the same time by passing a list of functions to aggregate().


# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

Yields below output.


           min    max
Courses              
Hadoop   26000  26000
PySpark  25000  25000
Python   22000  22000
Spark    20000  35000

The above example calculates min and max on the Fee column. Let’s extend this to compute different aggregations on different columns.
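Before that, note that pandas (0.25+) also supports named aggregation via keyword arguments to agg(), which lets you choose your own output column names instead of the default 'min'/'max'. The names min_fee and max_fee below are just illustrative choices:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee': [20000,25000,26000,22000,25000,20000,35000],
    'Duration': ['30day','40days','35days','40days','60days','60days','70days'],
    'Discount': [1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)

# Named aggregation: keyword = (source_column, aggregation_function)
result = df.groupby('Courses').agg(
    min_fee=('Fee', 'min'),
    max_fee=('Fee', 'max'),
)
print(result)
```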

Note that applying multiple aggregations to a single column in pandas DataFrame will result in a MultiIndex.


# Groupby with multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

Yields below output. Notice that this creates MultiIndex columns. Working with multi-indexed columns is not easy, so I’d recommend flattening them by renaming the columns.


        Duration    Fee       
           count    min    max
Courses                       
Hadoop         1  26000  26000
PySpark        2  25000  25000
Python         1  22000  22000
Spark          3  20000  35000
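One way to do that flattening, assuming the same aggregation as above, is to join each two-level column label into a single string:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee': [20000,25000,26000,22000,25000,20000,35000],
    'Duration': ['30day','40days','35days','40days','60days','60days','70days'],
    'Discount': [1000,2300,1200,2500,2000,2000,3000]
}
df = pd.DataFrame(technologies)

result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})

# Join the two column levels, e.g. ('Fee', 'min') -> 'Fee_min'
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```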

Conclusion

In this article, you have learned how to group DataFrame rows by multiple columns and how to compute single or multiple aggregations on the grouped data.

NNK

SparkByExamples.com is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment.
