Use DataFrame.groupby().sum() to group rows by one or more columns and compute the sum aggregation. The groupby() function returns a DataFrameGroupBy object, which provides an aggregate function sum() that calculates the sum of a given column for each group.
In this article, I will explain how to use the groupby() and sum() functions together with examples. Group by and sum on single or multiple columns can be accomplished in several ways in pandas, among them the groupby(), pivot(), transform(), and aggregate() functions.
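To get a quick sense of the two-step flow (group, then aggregate) before diving in, here is a minimal sketch using a tiny, made-up DataFrame; the Team/Points names are purely illustrative:

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data)
df = pd.DataFrame({'Team': ['A', 'B', 'A', 'B'], 'Points': [10, 20, 30, 40]})

grouped = df.groupby('Team')          # step 1: returns a DataFrameGroupBy object
print(type(grouped).__name__)         # DataFrameGroupBy

totals = grouped['Points'].sum()      # step 2: aggregate each group to a single sum
print(totals)
# Team
# A    40
# B    60
# Name: Points, dtype: int64
```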
1. Create Pandas DataFrame With Sample Data
To demonstrate several examples of pandas group by and sum on a DataFrame, first create a DataFrame with sample data.
# Create Pandas DataFrame With Sample Data
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000],
'Duration':['30days','50days','55days','40days','60days','35days','55days','50days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount'])
print(df)
Yields below output.
# Output:
Courses Fee Duration Discount
0 Spark 22000 30days 1000
1 PySpark 25000 50days 2300
2 Hadoop 23000 55days 1000
3 Python 24000 40days 1200
4 Pandas 26000 60days 2500
5 Hadoop 25000 35days 1300
6 Spark 25000 55days 1400
7 Python 22000 50days 1600
2. Pandas groupby() & sum() by Column Name
The pandas groupby() method groups identical values together so that you can apply aggregate functions to each group. It returns a DataFrameGroupBy object that provides aggregate methods such as sum(), mean(), etc. For example, df.groupby(['Courses']).sum() groups the data on the Courses column and calculates the sum for all numeric columns of the DataFrame.
Note that the group key you use becomes the index of the resulting DataFrame. To remove it and add a default integer index, use the as_index=False parameter; I will cover this in one of the examples below.
# Use GroupBy() to compute the sum
# numeric_only=True skips non-numeric columns like Duration (needed in pandas >= 2.0)
df2 = df.groupby('Courses').sum(numeric_only=True)
print(df2)
Yields below output.
# Output:
Fee Discount
Courses
Hadoop 48000 2300
Pandas 26000 2500
PySpark 25000 2300
Python 46000 2800
Spark 47000 2400
You can also explicitly specify the column on which you want to apply the sum() operation. The below example applies the sum on the Fee column.
# Use GroupBy() & compute sum on specific column
df2 = df.groupby('Courses')['Fee'].sum()
print(df2)
Yields below output.
# Output:
Courses
Hadoop 48000
Pandas 26000
PySpark 25000
Python 46000
Spark 47000
Name: Fee, dtype: int64
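Similarly, you can select several columns by passing a list of column names; the result is then a DataFrame rather than a Series. A sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600],
})

# Select a list of columns before aggregating; the result is a DataFrame
df2 = df.groupby('Courses')[['Fee', 'Discount']].sum()
print(df2)
```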
3. Pandas groupby() & sum() on Multiple Columns
You can also pass a list of columns to the groupby() method to group by multiple columns and calculate a sum for each combination of group keys. For example, df.groupby(['Courses','Duration'])['Fee'].sum() groups on the Courses and Duration columns and then calculates the sum of the Fee column.
# Using GroupBy multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].sum()
print(df2)
Yields below output.
# Output:
Courses Duration
Hadoop 35days 25000
55days 23000
Pandas 60days 26000
PySpark 50days 25000
Python 40days 24000
50days 22000
Spark 30days 22000
55days 25000
Name: Fee, dtype: int64
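Because the multi-key result above is a Series with a MultiIndex, you can reshape it into a wide table with unstack(), which moves the inner index level (Duration) into columns. A sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
})

# Pivot the inner index level (Duration) into columns; fill missing combinations with 0
wide = df.groupby(['Courses', 'Duration'])['Fee'].sum().unstack(fill_value=0)
print(wide)
```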
4. Group By and Get sum() and count()
You can also use df.groupby('Courses')['Fee'].agg(['sum','count']) to compute both sum() and count() in a single groupby() call, without resetting the index.
# Groupby and get sum() and count()
df2 = df.groupby('Courses')['Fee'].agg(['sum','count'])
print(df2)
# Pandas groupby get sum() and count()
df2 = df.groupby('Courses').agg({'Fee': ['sum','count']})
print(df2)
Yields below output.
# Output:
Fee
sum count
Courses
Hadoop 48000 2
Pandas 26000 1
PySpark 25000 1
Python 46000 2
Spark 47000 2
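If you prefer flat, readable column names instead of the MultiIndex produced by the dictionary form above, pandas (0.25 and later) supports named aggregation; the output names total_fee and course_count below are just illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
})

# Named aggregation: keyword = (column, aggfunc) gives flat column names
df2 = df.groupby('Courses').agg(
    total_fee=('Fee', 'sum'),
    course_count=('Fee', 'count'),
)
print(df2)
```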
5. Sort groupby Keys in Descending Order
By default, the groupby() method sorts the result by the group key, which takes additional time. If you have a performance concern and don't need the result sorted by group key, you can turn sorting off with the sort=False parameter.
# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum(numeric_only=True)
print(df2)
If you want to sort the group keys in descending order, use the below.
# Sorting group keys on descending order
groupedDF = df.groupby('Courses', sort=False).sum(numeric_only=True)
sortedDF=groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)
If you want to sort by a different key, you can use something like the below.
# Sort rows within each group by a different key
df2 = df.groupby('Courses').apply(lambda x: x.sort_values('Fee'))
print(df2)
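If what you actually want is the groups ordered by their totals rather than by key, it is usually simpler to sort the aggregated result itself. A minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
})

# Aggregate first, then order the groups by their total Fee, largest first
top = df.groupby('Courses')['Fee'].sum().sort_values(ascending=False)
print(top)
```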
6. Add Index to groupby() Result Using reset_index()
As you notice above, the group by column Courses becomes the index of the resulting DataFrame. To get a SQL-like flat result instead, use the as_index=False parameter or call reset_index(). The reset_index() function resets the index to the default integer index and turns the group keys back into regular columns.
# Using as_index=False
df2 = df.groupby('Courses', as_index =False)['Fee'].sum()
# Using reset_index()
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)
Above two examples yield below output.
# Output:
Courses Fee
0 Hadoop 48000
1 Pandas 26000
2 PySpark 25000
3 Python 46000
4 Spark 47000
7. Pandas Group By & Sum Using agg() Aggregate Function
Instead of using the GroupBy.sum() function, you can also use GroupBy.agg('sum') to aggregate pandas DataFrame results. For example, df.groupby(['Courses','Duration'])['Discount'].agg("sum").
# GroupBy multiple columns using agg()
df2 = df.groupby(['Courses','Duration'])['Discount'].agg("sum")
print(df2)
Yields below output.
# Output:
Courses Duration
Hadoop 35days 1300
55days 1000
Pandas 60days 2500
PySpark 50days 2300
Python 40days 1200
50days 1600
Spark 30days 1000
55days 1400
Name: Discount, dtype: int64
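Note that sum() skips missing values by default, so a group containing only NaN sums to 0. If you would rather get NaN for such groups, GroupBy.sum() accepts a min_count parameter. A sketch with small, made-up data containing NaN:

```python
import pandas as pd
import numpy as np

# Hypothetical data: the Hadoop group has no valid Discount values
df = pd.DataFrame({
    'Courses': ['Spark', 'Spark', 'Hadoop'],
    'Discount': [1000.0, np.nan, np.nan],
})

# Default: NaNs are skipped, so an all-NaN group sums to 0
print(df.groupby('Courses')['Discount'].sum())

# min_count=1 requires at least one valid value per group, otherwise NaN
s = df.groupby('Courses')['Discount'].sum(min_count=1)
print(s)
```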
8. Using transform() Function with DataFrame.groupby().sum()
You can also transform the groupby() result. For example, df.groupby(['Courses', 'Fee'])['Discount'].transform('sum') calculates the total Discount of each group; the result is a Series with the same index as the original DataFrame.
# GroupBy multiple columns using transform()
df2 = df.groupby(['Courses', 'Fee'])['Discount'].transform('sum')
print(df2)
Yields below output.
# Output:
0 1000
1 2300
2 1000
3 1200
4 2500
5 1300
6 1400
7 1600
Name: Discount, dtype: int64
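Because transform() returns a result aligned with the original rows, a common use is attaching a group total as a new column next to each row; the column name CourseTotal below is just an illustrative choice:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
})

# Attach each course's total Fee to every row belonging to that course
df['CourseTotal'] = df.groupby('Courses')['Fee'].transform('sum')
print(df)
```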
9. Using pivot() Function With DataFrame.GroupBy().sum()
df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0) uses the pivot() function to organize the data nicely after the group and sum. Note that pivot() takes keyword-only arguments in pandas 2.0 and later.
# GroupBy multiple columns using pivot function
df2 = df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0)
print(df2)
Yields below output.
# Output:
              Fee                                                Discount
Duration   30days   35days   40days   50days   55days   60days    30days  35days  40days  50days  55days  60days
Courses
Hadoop        0.0  25000.0      0.0      0.0  23000.0      0.0       0.0  1300.0     0.0     0.0  1000.0     0.0
Pandas        0.0      0.0      0.0      0.0      0.0  26000.0       0.0     0.0     0.0     0.0     0.0  2500.0
PySpark       0.0      0.0      0.0  25000.0      0.0      0.0       0.0     0.0     0.0  2300.0     0.0     0.0
Python        0.0      0.0  24000.0  22000.0      0.0      0.0       0.0     0.0  1200.0  1600.0     0.0     0.0
Spark     22000.0      0.0      0.0      0.0  25000.0      0.0    1000.0     0.0     0.0     0.0  1400.0     0.0
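Alternatively, pivot_table() combines the group-and-sum and the reshape in a single call, so the explicit groupby() step can be skipped entirely. A sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '55days', '50days'],
})

# aggfunc='sum' groups and sums in one step; fill_value replaces missing combinations
df2 = df.pivot_table(index='Courses', columns='Duration', values='Fee',
                     aggfunc='sum', fill_value=0)
print(df2)
```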
10. Pandas DataFrame.set_index Using Sum with Level
You can also use df.set_index(['Courses','Duration']) to set the group columns as the index and then sum per index level. Note that the older df.sum(level=[0,1]) syntax was removed in pandas 2.0; the equivalent is groupby(level=[0,1]).sum().
# DataFrame.set_index then sum per index level
df2 = df.set_index(['Courses','Duration']).groupby(level=[0,1], sort=False).sum()
print(df2)
Yields below output.
# Output:
Fee Discount
Courses Duration
Spark 30days 22000 1000
PySpark 50days 25000 2300
Hadoop 55days 23000 1000
Python 40days 24000 1200
Pandas 60days 26000 2500
Hadoop 35days 25000 1300
Spark 55days 25000 1400
Python 50days 22000 1600
11. Complete Example For GroupBy and Sum in Pandas DataFrame
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000],
'Duration':['30days','50days','55days','40days','60days','35days','55days','50days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount'])
print(df)
# Use GroupBy() to compute the sum
df2 = df.groupby('Courses').sum(numeric_only=True)
print(df2)
# Using GroupBy multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].sum()
print(df2)
# Groupby and get sum() and count()
df2 = df.groupby('Courses')['Fee'].agg(['sum','count'])
print(df2)
# Pandas groupby get sum() and count()
df2 = df.groupby('Courses').agg({'Fee': ['sum','count']})
print(df2)
# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum(numeric_only=True)
print(df2)
# Sorting group keys on descending order
groupedDF = df.groupby('Courses', sort=False).sum(numeric_only=True)
sortedDF=groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)
# Using as_index=False
df2 = df.groupby('Courses', as_index =False)['Fee'].sum()
# Using reset_index()
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)
# GroupBy multiple columns using agg()
df2 = df.groupby(['Courses','Duration'])['Discount'].agg("sum")
print(df2)
# GroupBy multiple columns using transform()
df2 = df.groupby(['Courses', 'Fee'])['Discount'].transform('sum')
print(df2)
# GroupBy multiple columns using pivot function
df2 = df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0)
print(df2)
# DataFrame.set_index then sum per index level
df2 = df.set_index(['Courses','Duration']).groupby(level=[0,1], sort=False).sum()
print(df2)
Conclusion
In this article, you have learned how to group by and sum on a pandas DataFrame using the groupby(), pivot(), transform(), and aggregate() functions. You have also learned how to use pandas groupby() and sum() on multiple columns.
Happy Learning !!
Related Articles
- Pandas Remap Values in Column with a Dictionary (Dict)
- What is pandas groupby – group DataFrame Rows List
- Pandas groupby() and count() with Examples
- Pandas groupby() Explained With Examples
- Pandas Group Rows into List Using groupby()
- Split Pandas DataFrame by Column Value
- Pandas Groupby Transform
- Pandas Groupby Aggregate Explained