How to get statistics for each group (such as count, mean, max, min, etc.) using pandas GroupBy? You can achieve this by using groupby()
method and agg()
function.
In this article, you can learn pandas.DataFrame.groupby()
to group the single column, two, or multiple columns and get the size()
, count()
for each group combination. groupBy()
function is used to collect the identical data into groups and perform aggregate functions like size/count on the grouped data.
In Pandas, you can use groupby()
with the combination of count()
, size()
, mean()
, min()
, max()
and more methods.
Key Points –
- Utilize the
groupby()
function in Pandas to group data based on specified criteria. - Pandas enables grouping data by specific criteria using the
groupby()
function, facilitating analysis at a granular level. - Apply statistical aggregation functions like
mean()
,median()
,sum()
,min()
,max()
, etc., to compute statistics within each group. - Pandas’
agg()
function allows customization by accepting a dictionary mapping columns to the desired aggregation functions. - Combine groupby operations with other Pandas functionalities like filtering, sorting, and visualization to gain insights from grouped data efficiently.
Quick Examples of Pandas Get Statistics For Each Group
If you are in a hurry, below are some quick examples of pandas get statistics for each group.
# Quick examples of pandas get statistics for each group
# Use DataFrame.size()
df2=df.groupby(['Courses','Duration'])
df2=df.groupby(['Courses', 'Duration']).size().reset_index(name='counts'))
# Pandas groupby() with agg() method
df2 = df.groupby(['Courses', 'Duration']).agg(['mean', 'count'])
# Pandas get statistics Using groupby().describe()
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()
# Pandas DataFrame.groupby() and describe() function
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()[['count', 'mean']]
# Get statistics by DataFrame.value_counts
df2=df.value_counts(subset=['Courses', 'Duration'])
# Using groupby() and agg() function
df2 = df.groupby(['Courses','Duration']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains column names Courses
, Fee
, Duration
, and Discount
.
# Create a DataFrame.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Hadoop","Hadoop","Spark","Python","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
'Duration':['30days','50days','55days', '40days','55days','35days','30days','40days','40days'],
'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200]
})
df = pd.DataFrame(technologies)
print(df)
Yields below output.
# Output:
Courses Fee Duration Discount
0 Spark 22000 30days 1000
1 PySpark 25000 50days 2300
2 Hadoop 23000 55days 1000
3 Python 24000 40days 1200
4 Hadoop 26000 55days 2500
5 Hadoop 25000 35days 1200
6 Spark 25000 30days 1400
7 Python 22000 40days 1000
8 Spark 25000 40days 1200
Pandas groupby() with size() to Get Column Counts
Use DataFrame.groupby() to group the rows and use size()
to get the count on each group. The size
property is used to get an int representing the number of elements in this object. For the Series object, it returns the number of rows. For the DataFrame object, it returns the number of rows times the number of columns (rows * columns).
# Use DataFrame.size()
df2=df.groupby(['Courses','Duration'])
df2=df.groupby(['Courses', 'Duration']).size().reset_index(name='counts')
print(df2)
Yields below output. Note the groupby()
method returns the pandas.core.groupby.generic.DataFrameGroupBy
. reset_index(name='counts')
is used to set the label name for the size column.
# Output:
Courses Duration counts
0 Hadoop 35days 1
1 Hadoop 55days 2
2 PySpark 50days 1
3 Python 40days 2
4 Spark 30days 2
5 Spark 40days 1
Pandas groupby() with agg() Method
Alternatively, you can also use groupby()
method and use agg()
function to the size/count. The agg()
method allows you to apply a function or a list of function names to be executed along with one of the axis of the DataFrame, axis by default set to 0, which is the index (row) axis.
To get the counts and mean for each group combination use count
and mean
aggregations.
# Pandas .groupby() and using agg() Method.
df2 = df.groupby(['Courses', 'Duration']).agg(['mean', 'count'])
print(df2)
Yields below output.
# Output:
Fee Discount
mean count mean count
Courses Duration
Hadoop 35days 25000.0 1 1200.0 1
55days 24500.0 2 1750.0 2
PySpark 50days 25000.0 1 2300.0 1
Python 40days 23000.0 2 1100.0 2
Spark 30days 23500.0 2 1200.0 2
40days 25000.0 1 1200.0 1
Pandas Get Statistics Using groupby().describe()
In this section, we can get statistics using groupby().describe()
function. The describe() function is used as a summarization tool that quickly displays statistics for any variable
or group
it is applied to. The describe()
output varies depending on whether you apply it to a numeric
or character
column.
# Pandas Get Statistics Using groupby().describe()
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()
print(df2)
Yields below output.
# Output:
count mean std ... 50% 75% max
Courses Duration ...
Hadoop 35days 1.0 1200.0 NaN ... 1200.0 1200.0 1200.0
55days 2.0 1750.0 1060.660172 ... 1750.0 2125.0 2500.0
PySpark 50days 1.0 2300.0 NaN ... 2300.0 2300.0 2300.0
Python 40days 2.0 1100.0 141.421356 ... 1100.0 1150.0 1200.0
Spark 30days 2.0 1200.0 282.842712 ... 1200.0 1300.0 1400.0
40days 1.0 1200.0 NaN ... 1200.0 1200.0 1200.0
You can just use the built-in function count()
and mean()
follow by the DataFrame.groupby()
and describe()
function.
# Pandas DataFrame.groupby() and describe() function.
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()[['count', 'mean']]
print(df2)
Yields below output.
# Output:
count mean
Courses Duration
Hadoop 35days 1.0 1200.0
55days 2.0 1750.0
PySpark 50days 1.0 2300.0
Python 40days 2.0 1100.0
Spark 30days 2.0 1200.0
40days 1.0 1200.0
Get Statistics of Each Group by DataFrame.value_counts
To get Pandas statistics of each group by DataFrame.value_counts
. The value_counts()
function is used to get a Series containing counts of unique values.
# Get statistics by DataFrame.value_counts.
df2=df.value_counts(subset=['Courses', 'Duration'])
print(df2)
Yields below output.
# Output:
Courses Duration
Hadoop 55days 2
Python 40days 2
Spark 30days 2
Hadoop 35days 1
PySpark 50days 1
Spark 40days 1
dtype: int64
Other Examples
In this section, To get multiple stats, collapse the index, and retain column names. For example-
# Using groupby() and agg() function.
df2 = df.groupby(['Courses','Duration']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
print(df2)
Yields below output.
# Output:
Courses Duration
Hadoop 55days 2
Python 40days 2
Spark 30days 2
Hadoop 35days 1
PySpark 50days 1
Spark 40days 1
dtype: int64
Complete Examples
# Create a DataFrame
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Hadoop","Hadoop","Spark","Python","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
'Duration':['30days','50days','55days', '40days','55days','35days','30days','40days','40days'],
'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200]
})
df = pd.DataFrame(technologies)
print(df)
# Use DataFrame.size()
df2=df.groupby(['Courses','Duration'])
df2=df.groupby(['Courses', 'Duration']).size().reset_index(name='counts'))
# Pandas groupby() with agg() method
df2 = df.groupby(['Courses', 'Duration']).agg(['mean', 'count'])
# Pandas Get Statistics Using groupby().describe()
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()
# Pandas DataFrame.groupby() and describe() function
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()[['count', 'mean']]
# Get statistics by DataFrame.value_counts
df2=df.value_counts(subset=['Courses', 'Duration'])
# Using groupby() and agg() function
df2 = df.groupby(['Courses','Duration']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
Frequently Asked Questions on Get Statistics For Each Group
Grouping data in Pandas allows for analyzing subsets of data based on specific criteria. It’s particularly useful for computing statistics or applying functions to each group separately.
Pandas offer a variety of statistical aggregation functions such as mean()
, median()
, sum()
, min()
, max()
, std()
, var()
, and more. These functions provide insights into the characteristics of each group.
You can compute multiple statistics for each group simultaneously in Pandas. One common way to achieve this is by using the agg()
function, which allows you to specify multiple aggregation functions for each column within each group.
It is possible to visualize the statistics computed for each group in Pandas. After computing the statistics using methods like groupby()
and agg()
, you can use various plotting functions available in Pandas or integrate with visualization libraries like Matplotlib or Seaborn to create visual representations of the data.
You can apply custom functions to compute statistics for each group in Pandas. This can be done by defining your own custom aggregation function and then passing it to the agg()
method within the groupby()
operation.
Conclusion
In this article, I have explained how to groupby()
single and multiple columns and get counts, size, max, min, and mean for each group from Pandas DataFrame. Also, learn how to get the stats using describe()
build-in function.
Related Articles
- How to Merge Series into Pandas DataFrame
- Change the Order of Pandas DataFrame Columns
- How to Combine Two Series into Pandas DataFrame
- Get the Count of Each Row of Pandas DataFrame
- Pandas Get Statistics For Each Group?
- How to Get Size of Pandas DataFrame?
- pandas.DataFrame.where() Examples
- Set Order of Columns in Pandas DataFrame
- How to GroupBy Index in Pandas?
- Pandas apply() function to Single & Multiple Column(s)