• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:16 mins read
You are currently viewing Pandas Aggregate Functions with Examples

What are Pandas aggregate functions? Similar to SQL, Pandas also supports multiple aggregate functions that perform a calculation on a set of values (grouped data) and return a single value.

An aggregate is a function where the values of multiple rows are grouped to form a single summary value. Below are some of the aggregate functions supported by Pandas using DataFrame.aggregate(), Series.aggregate(), and DataFrameGroupBy.aggregate().

Aggregate FunctionsReturns the total sum for each group
count()Returns count for each group
size()Returns size for each group
sum()Returns total sum for each group
mean()Returns mean for each group. Same as average()
average()Returns average for each group. Same as mean()
std()Returns standard deviation for each group
var()Return var for each group
sem()Standard error of the mean of groups
describe()Returns different statistics
min()Returns minimum value for each group
max()Returns maximum value for each group
first()Returns the last value for each group
last()Returns the nth value for each group
nth()Returns nth value for each group
Pandas Aggregate Functions

1. Aggregate Functions Syntax

Following are the Pandas methods you can use aggregate functions with. Note that you can also use agg(). All these take the agg function name specified in the above table as the argument and axis for rows/columns.


# Syntax of DataFrame.aggregate() 
DataFrame.aggregate(func=None, axis=0, *args, **kwargs)

# Syntax of Series.aggregate() 
Series.aggregate(func=None, axis=0, *args, **kwargs)

# Syntax of DataFrameGroupBy.aggregate() 
DataFrameGroupBy.aggregate(func=None, *args, engine=None, engine_kwargs=None, **kwargs)

I will create a very simple DataFrame to explain these functions to compute aggregations.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark"],
    'Fee' :[20000,25000,26000,22000,24000,35000],
    'Duration':['30day','40days','35days','40days','60days','60days'],
    'Discount':[1000,2300,1200,2500,2000,2000]
              }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

pandas aggregate functions

2. Using Aggregate Functions on DataFrame

You can use Pandas DataFrame.aggregate() function to calculate any aggregations on the selected columns of DataFrame and apply multiple aggregations at the same time.

The below example df[['Fee','Discount']] returns a DataFrame with two columns and aggregate('sum') returns the sum for each column.


# Using Aggregate Function on DataFrame
result = df[['Fee','Discount']].aggregate('sum')
print(result)

# Output:
# Fee         152000
# Discount     11000
# dtype: int64

Now, let’s see how to group the rows and calculate the sum for each group. To do grouping use DataFrame.groupby() function. This function returns the DataFrameGroupBy object and uses aggregate() function to calculate the sum.


# Use DataFrame.groupby() Function
result = df.groupby('Courses')['Fee','Discount'].aggregate('sum')
print(result)

# Output:
#            Fee  Discount
# Courses                 
# Hadoop   26000      1200
# PySpark  49000      4300
# Python   22000      2500
# Spark    55000      3000

Similarly, you can also calculate aggregation for all other functions specified in the above table.

3. Using Aggregate Functions on Series

Sometimes you may need to calculate aggregation for a single column of a DataFrame. Since each column in DataFrame is a Series, I will use Series.aggregate() to compute.


# Using Aggregate Function on Series
value = df['Fee'].aggregate('sum')
print(value)

# Output:
# 152000

In the above example, df['Fee'] returns a Series.

4. Using Aggregate Functions per Group

DataFrame.groupby() function is used to collect the identical data into groups and perform aggregate functions on the grouped data. This function returns DataFrameGroupBy object where several aggregate functions are defined.

By default, it calculates specified aggregation functions on all numeric columns.


# Using groupby() and aggreaget()
result = df.groupby('Courses').aggregate('sum')
print(result)

# Output:
#            Fee  Discount
# Courses                 
# Hadoop   26000      1200
# PySpark  49000      4300
# Python   22000      2500
# Spark    55000      3000

If you want to calculate the aggregation on selected columns, then select the columns from DataFrameGroupBy object. For example, df.groupby('Courses')['Fee','Duration'] selects Fee and Duration columns.


# Using groupby() and aggreaget()
result = df.groupby('Courses')['Fee','Duration'].aggregate('sum')
print(result)

Alternatively, you can also write the above statement as below.


# Alternate way
result = df[['Courses','Fee','Duration']].groupby('Courses').aggregate('sum')
print(result)

Instead of aggregate() function, you can also directly specify the sum() function.


# Directly using sum() function
result = df.groupby('Courses').sum()
print(result)

5. Pandas Multiple Aggregate Functions

You can also apply multiple aggregate functions at the same time in Pandas on a group results by using the list to the aggregate().


# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

Yields below output.


# Output:
           min    max
Courses              
Hadoop   26000  26000
PySpark  24000  25000
Python   22000  22000
Spark    20000  35000

The above example calculates min and max on the Fee column. Let’s extend this to compute different aggregations on different columns.

Note that applying multiple aggregations to a single column in pandas DataFrame will result in a MultiIndex.


# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

Yields below output. Notice that this creates MultiIndex. Working with multi-indexed columns is not easy so I’d recommend flattening by renaming the columns.


# Output:
        Duration    Fee       
           count    min    max
Courses                       
Hadoop         1  26000  26000
PySpark        2  24000  25000
Python         1  22000  22000
Spark          2  20000  35000

6. Complete Example of Aggregate Functions


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark"],
    'Fee' :[20000,25000,26000,22000,24000,3000],
    'Duration':['30day','40days','35days','40days','60days','60days'],
    'Discount':[1000,2300,1200,2500,2000,2000]
              }
df = pd.DataFrame(technologies)
print(df)

# Using Aggregate Functions on DataFrame
result = df[['Fee','Discount']].aggregate('sum')
print(result)

# Use DataFrame.group() Function
result = df.groupby('Courses')['Fee','Discount'].aggregate('sum')
print(result)

# Using Aggregate Function on Series
value = df['Fee'].aggregate('sum')
print(value)

# Using groupby() and aggreaget()
result = df.groupby('Courses').aggregate('sum')
print(result)

# Using groupby() and aggreaget()
result = df.groupby('Courses')['Fee','Duration'].aggregate('sum')
print(result)

# Alternate way
result = df[['Courses','Fee','Duration']].groupby('Courses').aggregate('sum')
print(result)

# Directly using sum() function
result = df.groupby('Courses').sum()
print(result)

# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)

# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)

Frequently Asked Questions (FAQ) on Pandas Aggregate Functions

What are Pandas Aggregate Functions?

Pandas Aggregate Functions are functions that allow you to perform operations on data, typically in the form of grouping and summarizing, to derive meaningful insights from datasets.

How do I use Aggregate Functions in Pandas?

You can use the groupby() function to group the data based on particular criteria and then apply aggregate functions like sum(), mean(), min(), max(), etc., to the grouped data.

What is the purpose of the agg() function in Pandas?

The agg() function in Pandas is used to apply multiple aggregate functions simultaneously. It allows you to specify different aggregation functions for different columns.

How can I perform custom aggregation in Pandas?

You can use the agg() function with a custom function or use the apply() function to apply a custom aggregation function to your data.

How do I handle missing values when using Pandas Aggregate Functions?

By default, Pandas Aggregate Functions exclude missing values (NaN). You can use the skipna parameter to control this behavior.

How can I use Pandas Aggregate Functions with multiple grouping levels?

You can use multiple columns for grouping to create a multi-level index, and then apply aggregate functions accordingly.

Conclusion

In this article, I have explained aggregate functions calculated agg for each group to form a single summary value. You can do this agg in several ways by using DataFrame.aggregate(), Series.aggregate(), DataFrameGroupBy.aggregate(). Also, learned how to apply multiple aggregations at the same time with examples.

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium