Pandas Aggregate Functions with Examples

What are pandas aggregate functions? Similar to SQL, pandas also supports multiple aggregate functions that perform a calculation on a set of values (grouped data) and return a single value.

An aggregate function takes the values of multiple rows and combines them into a single summary value. Below are some of the aggregate functions supported by pandas through DataFrame.aggregate(), Series.aggregate(), and DataFrameGroupBy.aggregate().

Aggregate Function   Description
count()              Returns the count of non-null values for each group
size()               Returns the size (number of rows, including NaN) of each group
sum()                Returns the total sum for each group
mean()               Returns the mean (average) for each group
std()                Returns the standard deviation for each group
var()                Returns the variance for each group
sem()                Returns the standard error of the mean for each group
describe()           Returns a set of descriptive statistics
min()                Returns the minimum value for each group
max()                Returns the maximum value for each group
first()              Returns the first value of each group
last()               Returns the last value of each group
nth()                Returns the nth value of each group
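
Note that count() and size() are easy to confuse: size() returns the number of rows in each group including NaN values, while count() counts only non-null values. A minimal, self-contained sketch of the difference (the data here is made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical data to show the count() vs size() difference
demo = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, 4.0]
})

print(demo.groupby('group')['value'].size())   # rows per group, NaN included: a=2, b=2
print(demo.groupby('group')['value'].count())  # non-null values only: a=1, b=2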

1. Aggregate Functions Syntax

Following are the pandas methods that support aggregate functions. Note that agg() is a shorter alias for aggregate(), so you can use either. All of these take an aggregate function name from the above table (or a list/dict of names) as the func argument, and DataFrame.aggregate() also accepts an axis argument to aggregate over rows or columns.


# Syntax of DataFrame.aggregate() 
DataFrame.aggregate(func=None, axis=0, *args, **kwargs)

# Syntax of Series.aggregate() 
Series.aggregate(func=None, axis=0, *args, **kwargs)

# Syntax of DataFrameGroupBy.aggregate() 
DataFrameGroupBy.aggregate(func=None, *args, engine=None, engine_kwargs=None, **kwargs)
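
As a quick sketch of the func and axis arguments (the two-column DataFrame here is made up purely for illustration):

import pandas as pd

sample_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

print(sample_df.aggregate('sum'))    # aggregate down each column (axis=0): A=3, B=7
print(sample_df.agg('sum', axis=1))  # aggregate across each row: 4 and 6 (agg() is an alias)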

Let me create a simple DataFrame to demonstrate these aggregation functions.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark"],
    'Fee' :[20000,25000,26000,22000,24000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days'],
    'Discount':[1000,2300,1200,2500,2000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

# Output:
#    Courses    Fee Duration  Discount
# 0    Spark  20000    30day      1000
# 1  PySpark  25000   40days      2300
# 2   Hadoop  26000   35days      1200
# 3   Python  22000   40days      2500
# 4  PySpark  24000   60days      2000
# 5    Spark  35000   60days      2000

2. Using Aggregate Functions on DataFrame

Use the pandas DataFrame.aggregate() function to calculate aggregations on selected columns of a DataFrame; it can also apply multiple aggregations at the same time.

In the example below, df[['Fee','Discount']] selects a two-column DataFrame, and aggregate('sum') returns the sum of each column.


# Using Aggregate Function on DataFrame
result = df[['Fee','Discount']].aggregate('sum')
print(result)

# Output:
# Fee         152000
# Discount     11000
# dtype: int64
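
You can also pass a list of function names to compute several aggregations at once; the result is then a DataFrame with one row per function. A short sketch using the same df:

# Multiple aggregations on the selected columns
result = df[['Fee','Discount']].aggregate(['sum','mean'])
print(result)  # one row per function: sum (152000, 11000) and mean (~25333.33, ~1833.33)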

Now, let’s see how to group the rows and calculate the sum for each group. To group, use the DataFrame.groupby() function; it returns a DataFrameGroupBy object, and you can then call aggregate() on it to calculate the sum.


# Use DataFrame.groupby() function
result = df.groupby('Courses')[['Fee','Discount']].aggregate('sum')
print(result)

# Output:
#            Fee  Discount
# Courses                 
# Hadoop   26000      1200
# PySpark  49000      4300
# Python   22000      2500
# Spark    55000      3000

Similarly, you can also calculate aggregation for all other functions specified in the above table.
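
For example, here is a sketch computing the mean() of Fee for each group:

# Mean of Fee for each group
result = df.groupby('Courses')['Fee'].aggregate('mean')
print(result)

# Output:
# Courses
# Hadoop     26000.0
# PySpark    24500.0
# Python     22000.0
# Spark      27500.0
# Name: Fee, dtype: float64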

3. Using Aggregate Functions on Series

Sometimes you may need to calculate an aggregation for a single column of a DataFrame. Since each column of a DataFrame is a Series, you can use Series.aggregate() to compute it.


# Using Aggregate Function on Series
value = df['Fee'].aggregate('sum')
print(value)

# Output:
# 152000

In the above example, df['Fee'] returns a Series.
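
Series.aggregate() also accepts a list of function names, returning one value per function. A quick sketch:

# Multiple aggregations on a single Series
result = df['Fee'].aggregate(['min', 'max', 'mean'])
print(result)  # min=20000, max=35000, mean≈25333.33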

4. Using Aggregate Functions per Group

The DataFrame.groupby() function is used to collect identical values into groups and perform aggregate functions on the grouped data. It returns a DataFrameGroupBy object, on which the aggregate functions listed above are defined.

In older pandas versions, the aggregation is applied to all numeric columns by default and non-numeric columns (like Duration here) are silently dropped; in recent versions you should select the numeric columns explicitly, or pass numeric_only=True to functions such as sum().


# Using groupby() and aggregate()
result = df.groupby('Courses').aggregate('sum')
print(result)

# Output:
#            Fee  Discount
# Courses                 
# Hadoop   26000      1200
# PySpark  49000      4300
# Python   22000      2500
# Spark    55000      3000
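
The DataFrameGroupBy object also exposes the other functions from the table directly. For example, a sketch using describe() to summarize each group in a single call:

# Per-group summary statistics (count, mean, std, min, quartiles, max)
result = df.groupby('Courses')['Fee'].describe()
print(result)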

If you want to calculate the aggregation on selected columns only, select the columns from the DataFrameGroupBy object. For example, df.groupby('Courses')[['Fee','Discount']] selects the Fee and Discount columns (note the double brackets: selecting multiple columns with a plain tuple is deprecated in newer pandas versions).


# Using groupby() and aggregate() on selected columns
result = df.groupby('Courses')[['Fee','Discount']].aggregate('sum')
print(result)

Alternatively, you can also write the above statement as below.


# Alternate way
result = df[['Courses','Fee','Discount']].groupby('Courses').aggregate('sum')
print(result)

Instead of the aggregate() function, you can also call the sum() function directly on the DataFrameGroupBy object.


# Directly using the sum() function
result = df.groupby('Courses').sum(numeric_only=True)  # numeric_only=True skips the string Duration column in newer pandas
print(result)

5. pandas Multiple Aggregate Functions

You can also apply multiple aggregate functions to a group result at the same time by passing a list of function names to aggregate().


# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

Yields below output.


# Output:
#            min    max
# Courses              
# Hadoop   26000  26000
# PySpark  24000  25000
# Python   22000  22000
# Spark    20000  35000

The above example calculates min and max on the Fee column. Let’s extend this to compute different aggregations on different columns.

Note that applying multiple aggregations to a single column of a pandas DataFrame results in MultiIndex columns.


# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

Yields below output. Notice that this creates MultiIndex columns. Working with multi-indexed columns is not easy, so I’d recommend flattening them by renaming the columns, as shown after the output.


# Output:
#         Duration    Fee       
#            count    min    max
# Courses                       
# Hadoop         1  26000  26000
# PySpark        2  24000  25000
# Python         1  22000  22000
# Spark          2  20000  35000
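
One simple way to flatten is to join the two column levels into a single name (the joined names such as Fee_min below are just a naming choice):

# Flatten the MultiIndex columns by joining the two levels with an underscore
result.columns = ['_'.join(col) for col in result.columns]
print(result)  # columns become: Duration_count, Fee_min, Fee_max

Alternatively, recent pandas versions (0.25+) support named aggregation, which produces flat column names directly:

# Named aggregation: the keyword is the output column, the value is (column, function)
result = df.groupby('Courses').aggregate(
    duration_count=('Duration', 'count'),
    fee_min=('Fee', 'min'),
    fee_max=('Fee', 'max'),
)
print(result)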

6. Complete Example of Aggregate Functions


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark"],
    'Fee' :[20000,25000,26000,22000,24000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days'],
    'Discount':[1000,2300,1200,2500,2000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

# Using Aggregate Functions on DataFrame
result = df[['Fee','Discount']].aggregate('sum')
print(result)

# Use DataFrame.groupby() function
result = df.groupby('Courses')[['Fee','Discount']].aggregate('sum')
print(result)

# Using Aggregate Function on Series
value = df['Fee'].aggregate('sum')
print(value)

# Using groupby() and aggregate()
result = df.groupby('Courses').aggregate('sum')
print(result)

# Using groupby() and aggregate() on selected columns
result = df.groupby('Courses')[['Fee','Discount']].aggregate('sum')
print(result)

# Alternate way
result = df[['Courses','Fee','Discount']].groupby('Courses').aggregate('sum')
print(result)

# Directly using the sum() function
result = df.groupby('Courses').sum(numeric_only=True)
print(result)

# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

Conclusion

In this article, I have explained how pandas aggregate functions combine the values of each group into a single summary value, and how to apply them using DataFrame.aggregate(), Series.aggregate(), and DataFrameGroupBy.aggregate(). You have also learned how to apply multiple aggregations at the same time, with examples.
