What are Pandas aggregate functions? Similar to SQL, Pandas also supports multiple aggregate functions that perform a calculation on a set of values (grouped data) and return a single value.
An aggregate is a function where the values of multiple rows are grouped to form a single summary value. Below are some of the aggregate functions supported by Pandas using DataFrame.aggregate()
, Series.aggregate()
, and DataFrameGroupBy.aggregate()
.
Aggregate Functions | Returns the total sum for each group |
---|---|
count() | Returns count for each group |
size() | Returns size for each group |
sum() | Returns total sum for each group |
mean() | Returns mean for each group. Same as average() |
average() | Returns average for each group. Same as mean() |
std() | Returns standard deviation for each group |
var() | Return var for each group |
sem() | Standard error of the mean of groups |
describe() | Returns different statistics |
min() | Returns minimum value for each group |
max() | Returns maximum value for each group |
first() | Returns the last value for each group |
last() | Returns the nth value for each group |
nth() | Returns nth value for each group |
1. Aggregate Functions Syntax
Following are the Pandas methods you can use aggregate functions with. Note that you can also use agg(). All these take the agg function name specified in the above table as the argument and axis for rows/columns.
# Syntax of DataFrame.aggregate()
DataFrame.aggregate(func=None, axis=0, *args, **kwargs)
# Syntax of Series.aggregate()
Series.aggregate(func=None, axis=0, *args, **kwargs)
# Syntax of DataFrameGroupBy.aggregate()
DataFrameGroupBy.aggregate(func=None, *args, engine=None, engine_kwargs=None, **kwargs)
I will create a very simple DataFrame to explain these functions to compute aggregations.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark"],
'Fee' :[20000,25000,26000,22000,24000,35000],
'Duration':['30day','40days','35days','40days','60days','60days'],
'Discount':[1000,2300,1200,2500,2000,2000]
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
Yields below output.
2. Using Aggregate Functions on DataFrame
You can use Pandas DataFrame.aggregate()
function to calculate any aggregations on the selected columns of DataFrame and apply multiple aggregations at the same time.
The below example df[['Fee','Discount']]
returns a DataFrame with two columns and aggregate('sum')
returns the sum for each column.
# Using Aggregate Function on DataFrame
result = df[['Fee','Discount']].aggregate('sum')
print(result)
# Output:
# Fee 152000
# Discount 11000
# dtype: int64
Now, let’s see how to group the rows and calculate the sum for each group. To do grouping use DataFrame.groupby() function. This function returns the DataFrameGroupBy
object and uses aggregate()
function to calculate the sum.
# Use DataFrame.groupby() Function
result = df.groupby('Courses')['Fee','Discount'].aggregate('sum')
print(result)
# Output:
# Fee Discount
# Courses
# Hadoop 26000 1200
# PySpark 49000 4300
# Python 22000 2500
# Spark 55000 3000
Similarly, you can also calculate aggregation for all other functions specified in the above table.
3. Using Aggregate Functions on Series
Sometimes you may need to calculate aggregation for a single column of a DataFrame. Since each column in DataFrame is a Series, I will use Series.aggregate() to compute.
# Using Aggregate Function on Series
value = df['Fee'].aggregate('sum')
print(value)
# Output:
# 152000
In the above example, df['Fee']
returns a Series.
4. Using Aggregate Functions per Group
DataFrame.groupby() function is used to collect the identical data into groups and perform aggregate functions on the grouped data. This function returns DataFrameGroupBy
object where several aggregate functions are defined.
By default, it calculates specified aggregation functions on all numeric columns.
# Using groupby() and aggreaget()
result = df.groupby('Courses').aggregate('sum')
print(result)
# Output:
# Fee Discount
# Courses
# Hadoop 26000 1200
# PySpark 49000 4300
# Python 22000 2500
# Spark 55000 3000
If you want to calculate the aggregation on selected columns, then select the columns from DataFrameGroupBy
object. For example, df.groupby('Courses')['Fee','Duration']
selects Fee
and Duration
columns.
# Using groupby() and aggreaget()
result = df.groupby('Courses')['Fee','Duration'].aggregate('sum')
print(result)
Alternatively, you can also write the above statement as below.
# Alternate way
result = df[['Courses','Fee','Duration']].groupby('Courses').aggregate('sum')
print(result)
Instead of aggregate()
function, you can also directly specify the sum()
function.
# Directly using sum() function
result = df.groupby('Courses').sum()
print(result)
5. Pandas Multiple Aggregate Functions
You can also apply multiple aggregate functions at the same time in Pandas on a group results by using the list to the aggregate()
.
# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)
Yields below output.
# Output:
min max
Courses
Hadoop 26000 26000
PySpark 24000 25000
Python 22000 22000
Spark 20000 35000
The above example calculates min
and max
on the Fee
column. Let’s extend this to compute different aggregations on different columns.
Note that applying multiple aggregations to a single column in pandas DataFrame will result in a MultiIndex.
# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)
Yields below output. Notice that this creates MultiIndex. Working with multi-indexed columns is not easy so I’d recommend flattening by renaming the columns.
# Output:
Duration Fee
count min max
Courses
Hadoop 1 26000 26000
PySpark 2 24000 25000
Python 1 22000 22000
Spark 2 20000 35000
6. Complete Example of Aggregate Functions
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","PySpark","Spark"],
'Fee' :[20000,25000,26000,22000,24000,3000],
'Duration':['30day','40days','35days','40days','60days','60days'],
'Discount':[1000,2300,1200,2500,2000,2000]
}
df = pd.DataFrame(technologies)
print(df)
# Using Aggregate Functions on DataFrame
result = df[['Fee','Discount']].aggregate('sum')
print(result)
# Use DataFrame.group() Function
result = df.groupby('Courses')['Fee','Discount'].aggregate('sum')
print(result)
# Using Aggregate Function on Series
value = df['Fee'].aggregate('sum')
print(value)
# Using groupby() and aggreaget()
result = df.groupby('Courses').aggregate('sum')
print(result)
# Using groupby() and aggreaget()
result = df.groupby('Courses')['Fee','Duration'].aggregate('sum')
print(result)
# Alternate way
result = df[['Courses','Fee','Duration']].groupby('Courses').aggregate('sum')
print(result)
# Directly using sum() function
result = df.groupby('Courses').sum()
print(result)
# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)
# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)
Frequently Asked Questions (FAQ) on Pandas Aggregate Functions
Pandas Aggregate Functions are functions that allow you to perform operations on data, typically in the form of grouping and summarizing, to derive meaningful insights from datasets.
You can use the groupby()
function to group the data based on particular criteria and then apply aggregate functions like sum()
, mean()
, min()
, max()
, etc., to the grouped data.
The agg()
function in Pandas is used to apply multiple aggregate functions simultaneously. It allows you to specify different aggregation functions for different columns.
You can use the agg()
function with a custom function or use the apply()
function to apply a custom aggregation function to your data.
By default, Pandas Aggregate Functions exclude missing values (NaN
). You can use the skipna
parameter to control this behavior.
You can use multiple columns for grouping to create a multi-level index, and then apply aggregate functions accordingly.
Conclusion
In this article, I have explained aggregate functions calculated agg for each group to form a single summary value. You can do this agg in several ways by using DataFrame.aggregate()
, Series.aggregate()
, DataFrameGroupBy.aggregate()
. Also, learned how to apply multiple aggregations at the same time with examples.
Related Articles
- Pandas Groupby Aggregate Explained
- Pandas Groupby Transform
- Calculate Summary Statistics in Pandas
- Pandas apply map (applymap()) Explained
- How to Generate Time Series Plot in Pandas
- How to Create Pandas Pivot Table Count
- How to Create Pandas Pivot Multiple Columns
- Pandas Series filter() Function
- How to GroupBy Index in Pandas?