How to GroupBy Index in Pandas?

  • Post category:Pandas
  • Post last modified:December 13, 2023

How do you perform groupby on an index in pandas? Pass the index name of the DataFrame as a parameter to the groupby() function to group rows on that index. DataFrame.groupby() takes a string or a list as a parameter to specify the group columns or indexes, and a list can mix index names and column names.

In this article, I will explain how to perform groupby on a single index, multiple indexes, or a combination of column and index with examples.

1. Quick Examples of Performing GroupBy on Index

Following are quick examples of how to perform groupby on index fields.


# Below are some quick examples
# Note: 'technologies' is the dict defined in the next section
import pandas as pd

# Example 1: Create DataFrame
df = pd.DataFrame(technologies)
# Set Index to DataFrame
df.set_index(['Courses','Fee'], inplace=True)
print(df)

# Example 2: Groupby Index
result = df.groupby('Courses').sum()
print(result)

# Example 3: Groupby Multiple Index
result = df.groupby(['Courses','Fee']).sum()
print(result)

# Example 4: Groupby Column & Index
result = df.groupby(['Courses','Duration']).sum()
print(result)

Let’s create a pandas DataFrame from the Dict object and explore the above examples.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee' :[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
              }

df = pd.DataFrame(technologies)
df.set_index(['Courses','Fee'], inplace=True)
print("Create DataFrame:\n", df)

Yields below output. As you can see, I have used DataFrame.set_index() to set multiple columns as the index. I will use these two index levels to group rows.


# Output:
# Create DataFrame:
                Duration  Discount
Courses Fee                       
Spark   20000    30day       1000
PySpark 25000   40days       2300
Hadoop  26000   35days       1200
Python  22000   40days       2500
PySpark 25000   60days       2000
Spark   20000   60days       2000
        35000   70days       3000

2. GroupBy Index Example

DataFrame.groupby() accepts a string or a list of column or index names to group a pandas DataFrame. The index must have a name to group on it; if yours doesn't, set one using DataFrame.index.name = 'index-name'.


# Groupby Index
result = df.groupby('Courses').sum()
print("Get the grouped data based on index:\n", result)

Yields below output. Note that with the pandas version used here, the non-numeric Duration column is dropped from the aggregation, so the result contains only the Discount totals per course (newer pandas versions may require sum(numeric_only=True) for the same behavior).


# Output:
# Get the grouped data based on index:
         Discount
Courses          
Hadoop       1200
PySpark      4300
Python       2500
Spark        6000
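If your index doesn't carry a name yet, groupby() can't reference it by string. Below is a minimal sketch of naming an unnamed index first; the small DataFrame here is illustrative, not the article's full dataset.

```python
import pandas as pd

# Illustrative data; only the 'Courses' name matches the article's dataset
df = pd.DataFrame({"Discount": [1000, 2300, 1200]},
                  index=["Spark", "PySpark", "Spark"])

df.index.name = "Courses"          # name the unnamed index
result = df.groupby("Courses").sum()
print(result)
```

Without the df.index.name assignment, df.groupby("Courses") would raise a KeyError because no column or index level carries that name.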

3. GroupBy Multiple Indexes

Now let’s see how to group by multiple index fields at the same time; to do so, pass all index names as a list. The below example groups rows by the Courses and Fee indexes.


# Groupby Multiple Index
result = df.groupby(['Courses','Fee']).sum()
print("Get the grouped data based on multiple indexes:\n", result)

Yields below output.


# Output:
# Get the grouped data based on multiple indexes:
               Discount
Courses Fee            
Hadoop  26000      1200
PySpark 25000      4300
Python  22000      2500
Spark   20000      3000
        35000      3000
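As an alternative to names, groupby() also accepts index level positions via its level parameter. A sketch reproducing the multi-index grouping above by level number, using the article's data:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
    'Discount': [1000, 2300, 1200, 2500, 2000, 2000, 3000],
}
df = pd.DataFrame(technologies).set_index(['Courses', 'Fee'])

# level=[0, 1] addresses both MultiIndex levels by position
by_level = df.groupby(level=[0, 1])['Discount'].sum()
by_name = df.groupby(['Courses', 'Fee'])['Discount'].sum()
print(by_level)
```

Grouping by level position gives the same result as grouping by the level names, which is handy when levels are unnamed.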

4. Use Both Column and Index on GroupBy

Sometimes you may also be required to group by a column and an index at the same time; the good thing about the groupby() function is that it accepts both together.


# Groupby Column & Index
result = df.groupby(['Courses','Duration']).sum()
print("Get the grouped data based on index and column:\n", result)

Note that my DataFrame doesn't have the right data to group on Courses and Duration; every (Courses, Duration) combination is unique, so the result lists all rows from the DataFrame. But I hope you understand the concept of how to do it.


# Output:
# Get the grouped data based on index and column:
                  Discount
Courses Duration          
Hadoop  35days        1200
PySpark 40days        2300
        60days        2000
Python  40days        2500
Spark   30day         1000
        60days        2000
        70days        3000
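The mix of an index level ('Courses') and a regular column ('Duration') can be verified programmatically; groupby() resolves each name against both the columns and the index levels. A small sketch using the article's data:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
    'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '60days', '70days'],
    'Discount': [1000, 2300, 1200, 2500, 2000, 2000, 3000],
}
df = pd.DataFrame(technologies).set_index(['Courses', 'Fee'])

# 'Courses' is an index level, 'Duration' is a regular column
result = df.groupby(['Courses', 'Duration'])['Discount'].sum()
print(result)
```

Since all seven (Courses, Duration) pairs are distinct, the result has seven groups, one per original row.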

5. Complete Example


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark","Spark"],
    'Fee' :[20000,25000,26000,22000,25000,20000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days','70days'],
    'Discount':[1000,2300,1200,2500,2000,2000,3000]
              }

df = pd.DataFrame(technologies)
df.set_index(['Courses','Fee'], inplace=True)
print(df)

# Groupby Index
result = df.groupby('Courses').sum()
print(result)

# Groupby Multiple Index
result = df.groupby(['Courses','Fee']).sum()
print(result)

# Groupby Column & Index
result = df.groupby(['Courses','Duration']).sum()
print(result)

Frequently Asked Questions on Pandas GroupBy Index

How do I group data by the index in Pandas?

You can use the groupby() function to group a pandas DataFrame by an index; pass the index name (or a list of index names) as the argument to groupby(). For example, grouped_data = df.groupby('index_column').

How can I group by multiple levels of the index?

You can pass a list of index level names to the groupby() function to group by multiple levels of the index. For example, grouped_data = df.groupby(['index_column1', 'index_column2']).

How to aggregate data after grouping by the index?

You can apply aggregation functions like sum(), mean(), etc., after grouping the data with the groupby() function. For example, grouped_data = df.groupby('index_column').sum().

How can I apply multiple aggregation functions to grouped data?

You can use the agg() method with a dictionary mapping columns to functions to apply multiple aggregations to grouped data. For example, agg_functions = {'column1': 'sum', 'column2': 'mean'}; result = df.groupby('index_column').agg(agg_functions).
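A runnable sketch of the dictionary-based agg() pattern; the column and index names below are illustrative, reusing a slice of the article's data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark"],
    'Fee': [20000, 25000, 35000],
    'Discount': [1000, 2300, 3000],
}).set_index('Courses')

# Each column gets its own aggregation function
agg_functions = {'Fee': 'sum', 'Discount': 'mean'}
result = df.groupby('Courses').agg(agg_functions)
print(result)
```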

How to reset the index after grouping?

You can use the reset_index() method to convert the grouped data back to a DataFrame with a default integer index. For example, grouped_data = df.groupby('index_column').sum(); grouped_data_reset = grouped_data.reset_index().
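A short sketch of the reset_index() round trip; the data here is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark"],
    'Discount': [1000, 2300, 3000],
}).set_index('Courses')

grouped = df.groupby('Courses').sum()
flat = grouped.reset_index()   # 'Courses' moves back to a regular column
print(flat)
```

After reset_index(), 'Courses' is an ordinary column again and the rows get a default RangeIndex.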

Conclusion

In this article, you have learned how to perform group by on a single index, multiple indexes, and a combination of column and index using Pandas groupby() function with multiple examples.

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen on LinkedIn.
