  • Post category:Pandas
  • Post last modified:March 27, 2024

Use the DataFrame.groupby() and sum() functions together to group rows based on one or multiple columns and calculate the sum of each group. The groupby() function returns a DataFrameGroupBy object, which can be used to perform aggregate functions on each group.

In this article, I will explain how to use the groupby() and sum() functions together with examples. Group by and sum on single or multiple columns can be accomplished in several ways in pandas, including the groupby(), pivot(), transform(), and agg() functions.

1. Create Pandas DataFrame With Sample Data

To explain several examples of how to perform pandas group by and sum on a DataFrame, first create a DataFrame with sample data.


# Create Pandas DataFrame With Sample Data
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration':['30days','50days','55days','40days','60days','35days','55days','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.


# Output:
# Create DataFrame:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   55days      1000
3   Python  24000   40days      1200
4   Pandas  26000   60days      2500
5   Hadoop  25000   35days      1300
6    Spark  25000   55days      1400
7   Python  22000   50days      1600

2. Pandas groupby() & sum() by Column Name

The Pandas groupby() method groups identical values together so that you can apply aggregate functions to them; it returns a DataFrameGroupBy object on which the aggregation is performed. For example, df.groupby(['Courses']).sum() groups the data by the Courses column and calculates the sum of all numeric columns of the DataFrame.

Note that the group key you use becomes the index of the resulting DataFrame. To get it back as a regular column instead, use the as_index=False parameter; I will cover this in one of the examples below.


# Use groupby() to compute the sum of numeric columns
# (numeric_only=True avoids "summing" the Duration strings in pandas 2.x)
df2 = df.groupby('Courses').sum(numeric_only=True)
print("Get sum of the grouped data:\n", df2)

Yields below output.


# Output:
# Get sum of the grouped data:
           Fee  Discount
Courses                 
Hadoop   48000      2300
Pandas   26000      2500
PySpark  25000      2300
Python   46000      2800
Spark    47000      2400

You can also explicitly specify the column on which you want to perform the sum() operation. The example below applies sum() to the Fee column.


# Use GroupBy() & compute sum on specific column
df2 = df.groupby('Courses')['Fee'].sum()
print("Get sum of the grouped data:\n", df2)

Yields below output.


# Output:
# Get sum of the grouped data:
Courses
Hadoop     48000
Pandas     26000
PySpark    25000
Python     46000
Spark      47000
Name: Fee, dtype: int64
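If you need sums for more than one specific column, pass a list of column names in the selection; the result is then a DataFrame rather than a Series. A minimal sketch using the same sample data:

```python
import pandas as pd

# Same sample data as above (Duration omitted since it is not summed)
df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000],
    'Discount': [1000,2300,1000,1200,2500,1300,1400,1600]
})

# Select both numeric columns before summing; returns a DataFrame
df2 = df.groupby('Courses')[['Fee','Discount']].sum()
print(df2)
```

Note the double brackets: `['Fee']` selects a Series, while `[['Fee','Discount']]` selects a DataFrame.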

3. Pandas groupby() & sum() on Multiple Columns

You can also pass a list of columns to the groupby() method to group on multiple columns and calculate a sum for each combination of group keys. For example, df.groupby(['Courses','Duration'])['Fee'].sum() groups on the Courses and Duration columns and then calculates the sum of Fee.


# Using GroupBy multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].sum()
print("Get sum of the grouped data:\n", df2)

Yields below output.


# Output:
# Get sum of the grouped data:
Courses  Duration
Hadoop   35days      25000
         55days      23000
Pandas   60days      26000
PySpark  50days      25000
Python   40days      24000
         50days      22000
Spark    30days      22000
         55days      25000
Name: Fee, dtype: int64

4. Group By and Get sum() and count()

You can also use df.groupby('Courses')['Fee'].agg(['sum','count']) to get both sum() and count() in a single groupby(); note that the group key remains the index of the result.


# Groupby and get sum() and count()
df2 = df.groupby('Courses')['Fee'].agg(['sum','count'])
print("Get sum & count of the grouped data:\n", df2)

# Pandas groupby get sum() and count()
df2 = df.groupby('Courses').agg({'Fee': ['sum','count']})
print("Get sum & count of the grouped data:\n", df2)

Yields below output.


# Output:
# Get sum & count of the grouped data:
           Fee      
           sum count
Courses             
Hadoop   48000     2
Pandas   26000     1
PySpark  25000     1
Python   46000     2
Spark    47000     2
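Since pandas 0.25 you can also use named aggregation, which lets you choose the output column names directly instead of getting the nested sum/count headers; a sketch (the names total_fee and course_count are my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000]
})

# Named aggregation: each keyword becomes an output column name
df2 = df.groupby('Courses').agg(
    total_fee=('Fee', 'sum'),
    course_count=('Fee', 'count')
)
print(df2)
```

This avoids the MultiIndex columns produced by agg({'Fee': ['sum','count']}).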

5. Sort groupby Keys in Descending Order

By default, the groupby() method sorts the results by the group key, which takes additional time. If performance matters and you don't need the groups sorted, you can turn sorting off with the sort=False parameter.


# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum(numeric_only=True)
print(df2)

If you want to sort keys in descending order, use the below.


# Sorting group keys in descending order
# (Courses is the index after groupby, so sort by index)
groupedDF = df.groupby('Courses', sort=False).sum(numeric_only=True)
sortedDF = groupedDF.sort_index(ascending=False)
print(sortedDF)

In case you want to sort by a different key, you can use something like the below.


# Sort rows within each group by a different key
df2 = df.groupby('Courses', group_keys=False).apply(lambda x: x.sort_values('Fee'))
print(df2)
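Another common variant is ordering the groups by the aggregated sum itself rather than by the group key; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000]
})

# Sum Fee per course, then order the groups by their total (largest first)
df2 = df.groupby('Courses')['Fee'].sum().sort_values(ascending=False)
print(df2)
```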

6. Add Index to groupby() Result Using reset_index()

As you noticed above, the group key Courses becomes the index of the resulting DataFrame. To get a SQL-like result with the group key as a regular column, use the as_index=False parameter or call reset_index(). The reset_index() function resets the index to the default integer index, turning the existing index into a column.


# Using as_index=False to keep the group key as a column
df2 = df.groupby('Courses', as_index=False)['Fee'].sum()

# Using reset_index() to convert the index back to a column
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)

The above two examples yield the below output.


# Output:
   Courses    Fee
0   Hadoop  48000
1   Pandas  26000
2  PySpark  25000
3   Python  46000
4    Spark  47000

7. Pandas Group By & Sum Using agg() Aggregate Function

Instead of using the groupby.sum() function, you can use the aggregate function groupby.agg('sum') to aggregate the DataFrame. For example, df.groupby(['Courses','Duration'])['Discount'].agg('sum').


# GroupBy multiple columns using agg()
df2 = df.groupby(['Courses','Duration'])['Discount'].agg("sum")
print("Get sum of the grouped data:\n", df2)

Yields below output.


# Output:
# Get sum of the grouped data:
Courses  Duration
Hadoop   35days      1300
         55days      1000
Pandas   60days      2500
PySpark  50days      2300
Python   40days      1200
         50days      1600
Spark    30days      1000
         55days      1400
Name: Discount, dtype: int64

8. Using transform() Function with DataFrame.groupby().sum()

You can also transform the groupby() result. For example, df.groupby(['Courses', 'Fee'])['Discount'].transform('sum') computes the sum of Discount within each group and broadcasts it back to every row, so the result is a Series with the same index as the original DataFrame.


# Using transform() method group the data
df2 = df.groupby(['Courses', 'Fee'])['Discount'].transform('sum')
print(df2)

Yields below output.


# Output:
0    1000
1    2300
2    1000
3    1200
4    2500
5    1300
6    1400
7    1600
Name: Discount, dtype: int64
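Because the transformed result aligns with the original index, a typical use is adding the group total back as a new column of the original DataFrame; a sketch (the column name Fee_Total is my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000]
})

# Broadcast each course's total Fee back onto every row of that course
df['Fee_Total'] = df.groupby('Courses')['Fee'].transform('sum')
print(df)
```

Each Spark row now carries 47000 (22000 + 25000), each Hadoop row 48000, and so on.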

9. Using pivot() Function With DataFrame.GroupBy().sum()

df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0) uses the pivot() function to organize the grouped sums into a Courses-by-Duration grid. Note that pivot() requires keyword arguments in pandas 2.0+; the positional form was removed.


# GroupBy multiple columns and reshape with pivot()
df2 = df.groupby(['Courses','Duration'], as_index=False).sum() \
        .pivot(index='Courses', columns='Duration').fillna(0)
print(df2)

Yields below output.


# Output:
              Fee                                               Discount
Duration   30days   35days   40days   50days   55days   60days   30days  35days  40days  50days  55days  60days
Courses
Hadoop        0.0  25000.0      0.0      0.0  23000.0      0.0      0.0  1300.0     0.0     0.0  1000.0     0.0
Pandas        0.0      0.0      0.0      0.0      0.0  26000.0      0.0     0.0     0.0     0.0     0.0  2500.0
PySpark       0.0      0.0      0.0  25000.0      0.0      0.0      0.0     0.0     0.0  2300.0     0.0     0.0
Python        0.0      0.0  24000.0  22000.0      0.0      0.0      0.0     0.0  1200.0  1600.0     0.0     0.0
Spark     22000.0      0.0      0.0      0.0  25000.0      0.0   1000.0     0.0     0.0     0.0  1400.0     0.0
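The pivot_table() function can do the grouping and reshaping in one step, since it aggregates duplicate index/column combinations itself; a sketch pivoting the Fee sums with aggfunc='sum':

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration': ['30days','50days','55days','40days','60days','35days','55days','50days']
})

# Group and pivot in one call; missing combinations are filled with 0
df2 = df.pivot_table(index='Courses', columns='Duration',
                     values='Fee', aggfunc='sum', fill_value=0)
print(df2)
```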

10. Pandas DataFrame.set_index() and Sum by Index Level

You can also set the group columns as the index with set_index() and then sum by index level. Note that DataFrame.sum(level=...) was deprecated in pandas 1.5 and removed in 2.0, so use groupby(level=...) instead, for example df.set_index(['Courses','Duration']).groupby(level=[0,1]).sum().


# DataFrame.set_index() and sum by index level
df2 = df.set_index(['Courses','Duration']).groupby(level=[0,1]).sum()
print(df2)

Yields below output.


# Output:
                    Fee  Discount
Courses Duration                 
Hadoop  35days    25000      1300
        55days    23000      1000
Pandas  60days    26000      2500
PySpark 50days    25000      2300
Python  40days    24000      1200
        50days    22000      1600
Spark   30days    22000      1000
        55days    25000      1400

11. Complete Example For GroupBy and Sum in Pandas DataFrame


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration':['30days','50days','55days','40days','60days','35days','55days','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies)
print(df)

# Use GroupBy() to compute the sum
df2 = df.groupby('Courses').sum(numeric_only=True)
print(df2)

# Using GroupBy multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].sum()
print(df2)

# Groupby and get sum() and count()
df2 = df.groupby('Courses')['Fee'].agg(['sum','count'])
print(df2)

# Pandas groupby get sum() and count()
df2 = df.groupby('Courses').agg({'Fee': ['sum','count']})
print(df2)

# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum(numeric_only=True)
print(df2)

# Sorting group keys in descending order
groupedDF = df.groupby('Courses', sort=False).sum(numeric_only=True)
sortedDF = groupedDF.sort_index(ascending=False)
print(sortedDF)

# Using as_index=False
df2 = df.groupby('Courses', as_index =False)['Fee'].sum()

# Using reset_index()
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)

# GroupBy multiple columns using agg()
df2 = df.groupby(['Courses','Duration'])['Discount'].agg("sum")
print(df2)

# GroupBy multiple columns using transform()
df2 = df.groupby(['Courses', 'Fee'])['Discount'].transform('sum')
print(df2)

# GroupBy multiple columns using pivot function
df2 = df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0)
print(df2)

# DataFrame.set_index() and sum by index level
df2 = df.set_index(['Courses','Duration']).groupby(level=[0,1]).sum()
print(df2)

Frequently Asked Questions on GroupBy and Sum in Pandas DataFrame

What does the groupby() function do in Pandas?

The groupby() function in Pandas is used to group rows of a DataFrame based on one or more columns. It creates a DataFrameGroupBy object which is useful for performing aggregate functions on grouped data.

How do I use groupby() to group data by a single column?

You can use the groupby() function by specifying the column you want to group by. For example, grouped_df = df.groupby('specified_col').

What does the sum() function do with groupby()?

Using Pandas sum() function on a DataFrameGroupBy object, you can calculate the sum of numeric columns for each group. It’s a common aggregation function used along with groupby().

How can I group by multiple columns?

Group by multiple columns by passing a list of column names to the groupby() method. For example, grouped_df = df.groupby(['column1', 'column2']).

How do I calculate the sum for each group after using groupby()?

Use aggregate functions like sum(), mean(), count(), etc., on the DataFrameGroupBy object (which is returned by the groupby() function) to calculate statistics for each group. For example, grouped_sum = df.groupby('Courses').sum().

How can I apply sum() to specific columns after grouping?

You can select specific columns and apply sum() function after grouping. For example, grouped_sum = df.groupby('Courses')['Fee'].sum()

How do I reset the index after using groupby?

The reset_index() method is used to reset the index of the grouped DataFrame. For example, df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()

Conclusion

In this article, I have explained how to use the groupby() and sum() functions together to group the data on single or multiple columns of a DataFrame and calculate the sum of the grouped data, with multiple examples. I have also shown how to get the sum of grouped data using the pivot(), transform(), and agg() functions.

Happy Learning !!


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
