Use the DataFrame.groupby().sum() combination to group rows based on one or multiple columns and calculate the sum of the grouped data. The groupby() function returns a DataFrameGroupBy object, which can be used for performing aggregate functions on each group.
In this article, I will explain how to use the groupby() and sum() functions together with examples. Group by and sum on single or multiple columns can be accomplished in multiple ways in pandas; some of them are the groupby(), pivot(), transform(), and aggregate() functions.
Key Points –
- The groupby() function is used to group data in a DataFrame based on one or more columns, allowing for aggregation or transformation of the grouped data.
- After grouping data with groupby(), the sum() method can be used to calculate the sum of numeric values for each group.
- When applying groupby() with sum(), you can group by multiple columns, and the sum will be computed for each unique combination of the group keys.
- You can specify which columns to sum after grouping, either by selecting them before applying sum() or by using the agg() method.
- After performing groupby().sum() on multiple keys, the result has a hierarchical index. You can reset the index with reset_index() if you need a flat DataFrame.
- By default, the groups created by groupby() are sorted by the group keys. You can disable sorting using the sort=False parameter if needed for performance.
Create Pandas DataFrame With Sample Data
In order to explain several examples of how to perform pandas group by and sum on DataFrame, first, create a DataFrame.
# Create Pandas DataFrame With Sample Data
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000],
'Duration':['30days','50days','55days','40days','60days','35days','55days','50days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount'])
print("Create DataFrame:\n", df)
Yields below output.
# Output:
# Create DataFrame:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   55days      1000
3   Python  24000   40days      1200
4   Pandas  26000   60days      2500
5   Hadoop  25000   35days      1300
6    Spark  25000   55days      1400
7   Python  22000   50days      1600
Pandas groupby() & sum() by Column Name
The Pandas groupby() method is used to group identical data into groups so that you can apply aggregate functions; it returns a DataFrameGroupBy object which is used to apply aggregate functions on the grouped data. For example, df.groupby(['Courses']).sum() groups the data by the Courses column and calculates the sum for all numeric columns of the DataFrame.
Note that the group key you use becomes the index of the resulting DataFrame. To remove it from the index, use the as_index=False parameter or reset_index(); I will cover this in one of the examples below.
# Use GroupBy() to compute the sum
df2 = df.groupby('Courses').sum()
print("Get sum of the grouped data:\n", df2)
Yields below output.
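A note on newer pandas versions: from pandas 2.0 the non-numeric Duration column is no longer silently dropped, so sum() may concatenate its strings instead. Passing numeric_only=True restricts the sum to the numeric columns. A minimal sketch, using the same sample data as above:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
})

# numeric_only=True keeps only the Fee and Discount columns in the result
df2 = df.groupby('Courses').sum(numeric_only=True)
print(df2)
```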
You can also explicitly specify the column on which you want to perform the sum() operation. The below example applies the sum to the Fee column.
# Use GroupBy() & compute sum on specific column
df2 = df.groupby('Courses')['Fee'].sum()
print("Get sum of the grouped data:\n", df2)
Yields below output.
# Output:
# Get sum of the grouped data:
Courses
Hadoop 48000
Pandas 26000
PySpark 25000
Python 46000
Spark 47000
Name: Fee, dtype: int64
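You can likewise select several columns before summing by passing a list of column names; the double brackets return a DataFrame instead of a Series. A small sketch with the same sample data, summing both Fee and Discount per group:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
})

# Double brackets select a list of columns; the result is a DataFrame
df2 = df.groupby('Courses')[['Fee', 'Discount']].sum()
print(df2)
```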
Pandas groupby() & sum() on Multiple Columns
You can also pass a list of the columns you want to group by to the groupby() method; this way you can group by multiple columns and calculate a sum over each combination of group keys. For example, df.groupby(['Courses','Duration'])['Fee'].sum() groups on the Courses and Duration columns and then calculates the sum.
# Using GroupBy multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].sum()
print("Get sum of the grouped data:\n", df2)
Yields below output.
# Output:
# Get sum of the grouped data:
Courses Duration
Hadoop 35days 25000
55days 23000
Pandas 60days 26000
PySpark 50days 25000
Python 40days 24000
50days 22000
Spark 30days 22000
55days 25000
Name: Fee, dtype: int64
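The result above is a Series with a two-level MultiIndex. If you prefer the second key spread across columns, unstack() reshapes it into a wide table; a sketch with the same sample data:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
})

# Move the Duration level of the MultiIndex into columns;
# fill_value=0 replaces missing Courses/Duration combinations
df2 = df.groupby(['Courses', 'Duration'])['Fee'].sum().unstack(fill_value=0)
print(df2)
```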
Group By Get sum() and count()
You can also use df.groupby('Courses')['Fee'].agg(['sum','count']) to get both sum() and count() on the groupby() result in a single call.
# Groupby and get sum() and count()
df2 = df.groupby('Courses')['Fee'].agg(['sum','count'])
print("Get sum & count of the grouped data:\n", df2)
# Pandas groupby get sum() and count()
df2 = df.groupby('Courses').agg({'Fee': ['sum','count']})
print("Get sum & count of the grouped data:\n", df2)
Yields below output.
# Output:
# Get sum & count of the grouped data:
Fee
sum count
Courses
Hadoop 48000 2
Pandas 26000 1
PySpark 25000 1
Python 46000 2
Spark 47000 2
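If you want readable column names instead of the default 'sum'/'count' labels, agg() also supports named aggregation through keyword arguments; the names total_fee and course_count below are my own choices, not part of the examples above:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
})

# Named aggregation: each keyword becomes an output column name
df2 = df.groupby('Courses').agg(
    total_fee=('Fee', 'sum'),       # sum of Fee per group
    course_count=('Fee', 'count'),  # number of rows per group
)
print(df2)
```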
Sort groupby Keys in Descending Order
By default, the groupby() method sorts the results by the group key, which takes additional time. If you have a performance concern and don't want the result sorted by the group key, you can turn this off with the sort=False param.
# Remove sorting on grouped results
df2=df.groupby(by=['Courses'], sort=False).sum()
print(df2)
If you want to sort keys in descending order, use the below.
# Sorting group keys on descending order
groupedDF = df.groupby('Courses',sort=False).sum()
sortedDF=groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)
In case you want to sort by a different key, you can use something like the below.
# Sort different key
df.groupby('Courses').apply(lambda x: x.sort_values('Fee'))
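A related, common need is ordering the groups by the aggregated value rather than by the key; one way is to sort after summing. A sketch with the same sample data:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
})

# Sum per group, then order the groups by the summed Fee, largest first
df2 = df.groupby('Courses')['Fee'].sum().sort_values(ascending=False)
print(df2)
```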
Add Index to groupby() Result Using reset_index()
As you notice above, the group by column Courses becomes the index of the DataFrame. To get a SQL-like group by result, use the as_index=False param or use reset_index(). The reset_index() function resets the index and moves the group keys back into regular columns.
# Using as_index=False set the index
df2 = df.groupby('Courses', as_index =False)['Fee'].sum()
# Using reset_index() set the index
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)
The above two examples yield the below output.
# Output:
Courses Fee
0 Hadoop 48000
1 Pandas 26000
2 PySpark 25000
3 Python 46000
4 Spark 47000
Pandas Group By & Sum Using agg() Aggregate Function
Instead of using the groupby().sum() function, you can use the aggregate function groupby().agg('sum') to aggregate the Pandas DataFrame results. For example, df.groupby(['Courses','Duration'])['Discount'].agg("sum").
# GroupBy multiple columns using agg()
df2 = df.groupby(['Courses','Duration'])['Discount'].agg("sum")
print("Get sum of the grouped data:\n", df2)
Yields below output.
# Output:
# Get sum of the grouped data:
Courses Duration
Hadoop 35days 1300
55days 1000
Pandas 60days 2500
PySpark 50days 2300
Python 40days 1200
50days 1600
Spark 30days 1000
55days 1400
Name: Discount, dtype: int64
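agg() also accepts a dict mapping each column to its own function, so you can, for example, sum one column while taking the maximum of another in the same pass. A sketch with the same sample data:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
})

# A different aggregation per column in one groupby pass
df2 = df.groupby('Courses').agg({'Fee': 'sum', 'Discount': 'max'})
print(df2)
```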
Using transform() Function with DataFrame.groupby().sum()
You can also transform the groupby() result. For example, df.groupby(['Courses', 'Fee'])['Discount'].transform('sum') calculates the sum of each group and broadcasts it back to the original rows, so the result is a Series with the same index as the original DataFrame.
# Using transform() method group the data
df2 = df.groupby(['Courses', 'Fee'])['Discount'].transform('sum')
print(df2)
Yields below output.
# Output:
0 1000
1 2300
2 1000
3 1200
4 2500
5 1300
6 1400
7 1600
Name: Discount, dtype: int64
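Because transform('sum') keeps the original index, a typical use is attaching the group total as a new column next to each row. The column name CourseTotal below is my own choice for illustration:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
})

# Broadcast each course's total Fee back onto the original rows
df['CourseTotal'] = df.groupby('Courses')['Fee'].transform('sum')
print(df)
```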
Using pivot() Function With DataFrame.groupby().sum()
df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0) uses the pivot() function to organize the data nicely after the group and sum. Note that recent pandas versions require the pivot() arguments to be passed by keyword.
# GroupBy multiple columns using pivot function
df2 = df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0)
print(df2)
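The same group-sum-pivot result can usually be produced in one step with pivot_table(), which aggregates while reshaping; a sketch for the Fee column, using the same sample data:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
})

# One-step alternative to groupby().sum().pivot()
df2 = df.pivot_table(index='Courses', columns='Duration',
                     values='Fee', aggfunc='sum', fill_value=0)
print(df2)
```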
Pandas DataFrame.set_index Using Sum with Level
You can also use df.set_index(['Courses','Duration']) to move the group by columns into the index and then sum by index levels. Note that the older sum(level=[0,1]) form is deprecated in recent pandas versions; group by the index levels instead.
# DataFrame.set_index and sum by index levels
df2 = df.set_index(['Courses','Duration']).groupby(level=[0,1], sort=False).sum()
print(df2)
print(df2)
Yields below output.
# Output:
Fee Discount
Courses Duration
Spark 30days 22000 1000
PySpark 50days 25000 2300
Hadoop 55days 23000 1000
Python 40days 24000 1200
Pandas 60days 26000 2500
Hadoop 35days 25000 1300
Spark 55days 25000 1400
Python 50days 22000 1600
Complete Example For GroupBy and Sum in Pandas DataFrame
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000],
'Duration':['30days','50days','55days','40days','60days','35days','55days','50days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount'])
print(df)
# Use GroupBy() to compute the sum
df2 = df.groupby('Courses').sum()
print(df2)
# Using GroupBy multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].sum()
print(df2)
# Groupby and get sum() and count()
df2 = df.groupby('Courses')['Fee'].agg(['sum','count'])
print(df2)
# Pandas groupby get sum() and count()
df2 = df.groupby('Courses').agg({'Fee': ['sum','count']})
print(df2)
# Remove sorting on grouped results
df2=df.groupby(by=['Courses'], sort=False).sum()
print(df2)
# Sorting group keys on descending order
groupedDF = df.groupby('Courses',sort=False).sum()
sortedDF=groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)
# Using as_index=False
df2 = df.groupby('Courses', as_index =False)['Fee'].sum()
# Using reset_index()
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)
# GroupBy multiple columns using agg()
df2 = df.groupby(['Courses','Duration'])['Discount'].agg("sum")
print(df2)
# GroupBy multiple columns using transform()
df2 = df.groupby(['Courses', 'Fee'])['Discount'].transform('sum')
print(df2)
# GroupBy multiple columns using pivot function
df2 = df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0)
print(df2)
# DataFrame.set_index and sum by index levels
df2 = df.set_index(['Courses','Duration']).groupby(level=[0,1], sort=False).sum()
print(df2)
Frequently Asked Questions on GroupBy and Sum in Pandas DataFrame
The groupby()
function in Pandas is used to group rows of a DataFrame based on one or more columns. It creates a DataFrameGroupBy object which is useful for performing aggregate functions on grouped data.
You can use the groupby() function by specifying the column you want to group by. For example, grouped_df = df.groupby('specified_col')
Using Pandas sum()
function on a DataFrameGroupBy object, you can calculate the sum of numeric columns for each group. It’s a common aggregation function used along with groupby()
.
To group by multiple columns, pass a list of column names to the groupby() method. For example, grouped_df = df.groupby(['column1', 'column2'])
Use aggregate functions like sum(), mean(), count(), etc., on the DataFrameGroupBy object (which is returned by the groupby() function) to calculate statistics for each group. For example, grouped_sum = df.groupby('Courses').sum()
You can select specific columns and apply the sum() function after grouping. For example, grouped_sum = df.groupby('Courses')['Fee'].sum()
The reset_index()
method is used to reset the index of the grouped DataFrame. For example, df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
Conclusion
In this article, I have explained the groupby() and sum() functions and how to use them together to group the data on single or multiple columns of a DataFrame and calculate the sum of the grouped data, with multiple examples. I have also explained how to get the sum of grouped data on single or multiple columns using a few more functions like pivot(), transform(), and aggregate().
Happy Learning !!
Related Articles
- Pandas groupby Transform
- Pandas groupby Aggregate Explained
- How to Shuffle Pandas Rows Randomly
- How to GroupBy Index in Pandas?
- PySpark GroupBy Count – Explained
- Split Pandas DataFrame by Column Value
- Pandas Group Rows into List Using groupby()
- Pandas groupby() and count() with Examples
- Pandas GroupBy Multiple Columns Explained
- Different Ways to Get Row Count in Pandas DataFrame
- Pandas DataFrame – Different Ways to Iterate Over Rows
- Pandas Remap Values in Column with a Dictionary (Dict)