Pandas groupby() and sum() With Examples

Pandas provides a groupby() function on DataFrame that takes one or multiple columns (as a list) to group the data, and returns a GroupBy object that exposes aggregate functions such as sum() to calculate the sum of a given column for each group. In this article, I will explain how to use the groupby() and sum() functions together with examples.

Group by and sum on single and multiple columns can be accomplished in multiple ways in pandas; among them are the groupby(), pivot(), transform(), and agg() functions.

1. Create Pandas DataFrame With Sample Data

In order to explain several examples of how to perform Pandas groupby and sum on DataFrame, first, create a DataFrame.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration':['30day','50days','55days','40days','60days','35day','55days','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies)
print(df)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  22000    30day      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   55days      1000
3   Python  24000   40days      1200
4   Pandas  26000   60days      2500
5   Hadoop  25000    35day      1300
6    Spark  25000   55days      1400
7   Python  22000   50days      1600

2. Pandas groupby() & sum() by Column Name

The Pandas groupby() method is used to group identical data so that you can apply aggregate functions; it returns a GroupBy object that provides aggregate methods like sum, mean, etc. For example, df.groupby(['Courses']).sum() groups the data on the Courses column and calculates the sum of all numeric columns of the DataFrame. Note that on newer pandas versions (2.x) you should pass numeric_only=True to sum() so that non-numeric columns like Duration are excluded from the result.

Note that the group key you use becomes the Index of the resulting DataFrame. To remove this and get a default integer index instead, use the as_index=False parameter; I will cover this in one of the examples below.


# Use groupby() to compute the sum
# numeric_only=True excludes non-numeric columns like Duration (needed on pandas 2.x)
df2 = df.groupby('Courses').sum(numeric_only=True)
print(df2)

Yields below output.


           Fee  Discount
Courses                 
Hadoop   48000      2300
Pandas   26000      2500
PySpark  25000      2300
Python   46000      2800
Spark    47000      2400

You can also explicitly specify the column on which you want to perform the sum() operation. The below example applies the sum on the Fee column.


# Use GroupBy() & compute sum on specific column
df2 = df.groupby('Courses')['Fee'].sum()
print(df2)

Yields below output.


Courses
Hadoop     48000
Pandas     26000
PySpark    25000
Python     46000
Spark      47000
Name: Fee, dtype: int64
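If you want to sum more than one column at a time, you can pass a list of column names to the selection instead of a single name. A minimal sketch using the same sample data:


```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000],
    'Discount': [1000,2300,1000,1200,2500,1300,1400,1600]
})

# Sum only the selected columns (Fee and Discount) for each group
df2 = df.groupby('Courses')[['Fee','Discount']].sum()
print(df2)
```

This returns a DataFrame (not a Series) containing only the selected columns.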

3. Pandas groupby() & sum() on Multiple Columns

You can also pass a list of the columns you want to group by to the groupby() method; this lets you group on multiple columns and calculate a sum for each combination of group values. For example, df.groupby(['Courses','Duration'])['Fee'].sum() groups on the Courses and Duration columns and then calculates the sum of Fee.


# Using GroupBy multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].sum()
print(df2)

Yields below output.


Courses  Duration
Hadoop   35day       25000
         55days      23000
Pandas   60days      26000
PySpark  50days      25000
Python   40days      24000
         50days      22000
Spark    30day       22000
         55days      25000
Name: Fee, dtype: int64
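If you prefer the multi-column result as a wide table rather than a MultiIndex Series, you can chain unstack(), which moves the Duration level into columns (fill_value=0 fills the missing combinations):


```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration': ['30day','50days','55days','40days','60days','35day','55days','50days']
})

# Pivot the inner group level (Duration) into columns
df2 = df.groupby(['Courses','Duration'])['Fee'].sum().unstack(fill_value=0)
print(df2)
```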

4. Group By and Get sum() and count()

You can also use df.groupby('Courses')['Fee'].agg(['sum','count']) to get both sum() and count() from a single groupby(); with this approach, you don't need to reset the index.


# Groupby and get sum() and count()
df2 = df.groupby('Courses')['Fee'].agg(['sum','count'])
print(df2)

# Pandas groupby get sum() and count()
df2 = df.groupby('Courses').agg({'Fee': ['sum','count']})
print(df2)

Yields below output.


           Fee      
           sum count
Courses             
Hadoop   48000     2
Pandas   26000     1
PySpark  25000     1
Python   46000     2
Spark    47000     2

5. Sort Group Keys in Descending Order

By default, the groupby() method sorts the result by the group key, which takes additional time. If you have a performance concern and don't need the grouped result sorted, you can turn this off with the sort=False parameter.


# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum(numeric_only=True)
print(df2)

If you want to sort the group keys in descending order, use the below.


# Sorting group keys on descending order
groupedDF = df.groupby('Courses', sort=False).sum(numeric_only=True)
sortedDF = groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)

In case you want to sort the rows within each group by a different column, you can use something like the below.


df.groupby('Courses').apply(lambda x: x.sort_values('Fee'))
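If the goal is instead to order the groups by their aggregated sum rather than by the group key, sort the result of the aggregation itself:


```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000]
})

# Sum per group, then sort the groups by their total, largest first
df2 = df.groupby('Courses')['Fee'].sum().sort_values(ascending=False)
print(df2)
```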

6. Add Index to groupby() Result Using reset_index()

As noticed above, the group key column(s) become the Index of the resulting DataFrame. To get a SQL-like flat result instead, use the as_index=False parameter or call reset_index() on the result.


# Using as_index=False
df2 = df.groupby('Courses', as_index=False)['Fee'].sum()

# Using reset_index()
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)

Both examples above yield the output below.


   Courses    Fee
0   Hadoop  48000
1   Pandas  26000
2  PySpark  25000
3   Python  46000
4    Spark  47000

7. Pandas Group By & Sum Using agg() Aggregate Function

Instead of using the GroupBy.sum() function, you can also use GroupBy.agg('sum') to aggregate pandas DataFrame results. For example, df.groupby(['Courses','Duration'])['Discount'].agg("sum").


# GroupBy multiple columns using agg()
df2 = df.groupby(['Courses','Duration'])['Discount'].agg("sum")
print(df2)

Yields below output.


Courses  Duration
Hadoop   35day       1300
         55days      1000
Pandas   60days      2500
PySpark  50days      2300
Python   40days      1200
         50days      1600
Spark    30day       1000
         55days      1400
Name: Discount, dtype: int64
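agg() also accepts a dict mapping columns to different functions, which lets you mix aggregations in a single pass, for example sum on Fee and mean on Discount:


```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000],
    'Discount': [1000,2300,1000,1200,2500,1300,1400,1600]
})

# Different aggregate function per column in one groupby pass
df2 = df.groupby('Courses').agg({'Fee': 'sum', 'Discount': 'mean'})
print(df2)
```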

8. Using transform() Function With DataFrame.GroupBy().sum()

You can also transform the groupby() result. For example, df.groupby(['Courses', 'Fee'])['Discount'].transform('sum') calculates each group's Discount total with the sum function and broadcasts it back to every row; the result is a Series with the same index as the original DataFrame.


# GroupBy multiple columns using transform()
df2 = df.groupby(['Courses', 'Fee'])['Discount'].transform('sum')
print(df2)

Yields below output.


0    1000
1    2300
2    1000
3    1200
4    2500
5    1300
6    1400
7    1600
Name: Discount, dtype: int64
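Because transform() preserves the original index, its typical use is to attach a per-group total as a new column on the original DataFrame, for example to filter rows by their group's total. The column name CourseTotal below is my own choice:


```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Discount': [1000,2300,1000,1200,2500,1300,1400,1600]
})

# Attach each course's total discount to every row of that course
df['CourseTotal'] = df.groupby('Courses')['Discount'].transform('sum')
print(df)
```

You can then filter, e.g. df[df['CourseTotal'] > 2400], without losing any original columns.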

9. Using pivot() Function With DataFrame.GroupBy().sum()

df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0) uses the pivot() function to organize the data nicely after the group and sum. Note that pivot() now requires keyword arguments; the positional form was removed in pandas 2.0.


# GroupBy multiple columns using pivot function
df2 = df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0)
print(df2)

Yields below output.


              Fee                                                    Discount
Duration    30day    35day   40days   50days   55days   60days    30day   35day  40days  50days  55days  60days
Courses
Hadoop        0.0  25000.0      0.0      0.0  23000.0      0.0      0.0  1300.0     0.0     0.0  1000.0     0.0
Pandas        0.0      0.0      0.0      0.0      0.0  26000.0      0.0     0.0     0.0     0.0     0.0  2500.0
PySpark       0.0      0.0      0.0  25000.0      0.0      0.0      0.0     0.0     0.0  2300.0     0.0     0.0
Python        0.0      0.0  24000.0  22000.0      0.0      0.0      0.0     0.0  1200.0  1600.0     0.0     0.0
Spark     22000.0      0.0      0.0      0.0  25000.0      0.0   1000.0     0.0     0.0     0.0  1400.0     0.0
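The same grouped-and-pivoted table can be produced in one step with pivot_table(), which combines the group, sum, pivot, and fill into a single call:


```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration': ['30day','50days','55days','40days','60days','35day','55days','50days']
})

# Group by Courses, pivot Duration into columns, sum Fee, fill gaps with 0
df2 = pd.pivot_table(df, index='Courses', columns='Duration',
                     values='Fee', aggfunc='sum', fill_value=0)
print(df2)
```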

10. Pandas DataFrame.set_index Using Sum with Level

You can also use df.set_index(['Courses','Duration']) to move the group columns into the index and then sum by index level. Note that the sum(level=[0,1]) syntax was deprecated and removed in pandas 2.0; use groupby(level=[0,1]).sum() on the indexed DataFrame instead.


# DataFrame.set_index using groupby on index levels
# sum(level=[0,1]) was removed in pandas 2.0; groupby(level=...) is the replacement
df2 = df.set_index(['Courses','Duration']).groupby(level=[0,1], sort=False).sum()
print(df2)

Yields below output.


                    Fee  Discount
Courses Duration                 
Spark   30day     22000      1000
PySpark 50days    25000      2300
Hadoop  55days    23000      1000
Python  40days    24000      1200
Pandas  60days    26000      2500
Hadoop  35day     25000      1300
Spark   55days    25000      1400
Python  50days    22000      1600

11. Complete Example For GroupBy and Sum in Pandas DataFrame


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration':['30day','50days','55days','40days','60days','35day','55days','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies)
print(df)

# Use groupby() to compute the sum
df2 = df.groupby('Courses').sum(numeric_only=True)
print(df2)

# Using GroupBy multiple column
df2 = df.groupby(['Courses','Duration'])['Fee'].sum()
print(df2)

# Groupby and get sum() and count()
df2 = df.groupby('Courses')['Fee'].agg(['sum','count'])
print(df2)

# Pandas groupby get sum() and count()
df2 = df.groupby('Courses').agg({'Fee': ['sum','count']})
print(df2)

# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum(numeric_only=True)
print(df2)

# Sorting group keys on descending order
groupedDF = df.groupby('Courses', sort=False).sum(numeric_only=True)
sortedDF = groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)

# Using as_index=False
df2 = df.groupby('Courses', as_index=False)['Fee'].sum()

# Using reset_index()
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)

# GroupBy multiple columns using agg()
df2 = df.groupby(['Courses','Duration'])['Discount'].agg("sum")
print(df2)

# GroupBy multiple columns using transform()
df2 = df.groupby(['Courses', 'Fee'])['Discount'].transform('sum')
print(df2)

# GroupBy multiple columns using pivot function
df2 = df.groupby(['Courses','Duration'], as_index=False).sum().pivot(index='Courses', columns='Duration').fillna(0)
print(df2)

# DataFrame.set_index using groupby on index levels
df2 = df.set_index(['Courses','Duration']).groupby(level=[0,1], sort=False).sum()
print(df2)

Conclusion

In this article, you have learned how to group by and sum on a pandas DataFrame using the groupby(), pivot(), transform(), and agg() functions, as well as how to apply groupby() and sum() on multiple columns.

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page; all examples are simple, easy to understand, and well tested in our development environment.
