Similar to the SQL GROUP BY clause Pandas DataFrame.groupby()
function is used to collect identical data into groups and perform aggregate functions on the grouped data. Group by operation involves splitting the data, applying some functions, and finally aggregating the results.
In Pandas, you can use groupby()
with the combination of sum()
, count()
, pivot()
, transform(), aggregate()
, and many more methods to perform various operations on grouped data. In this article, I will cover how to group by a single column, or multiple columns by using groupby()
with examples.
1. Pandas groupby() Syntax
Below is the syntax of the groupby()
function, this function takes several params that are explained below and returns DataFrameGroupBy
object that contains information about the groups.
# Syntax of DataFrame.groupby()
DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
sort=True, group_keys=True, squeeze=<no_default>,
observed=False, dropna=True)
by
– List of column names to group byaxis
– Default to 0. It takes 0 or ‘index’, 1 or ‘columns’level
– Used with MultiIndex.as_index
– sql style grouped output.sort
– Default to True. Specify whether to sort after the groupgroup_keys
– add group keys or notsqueeze
– deprecated in new versionsobserved
– This only applies if any of the groupers are Categoricals.dropna
– Default to False. Use True to drop None/Nan on sorry keys
To explain several examples of how to perform groupby, first, let’s create a simple Pandas DataFrame with a combination of string and numeric columns.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,1500],
'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
})
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
Yields below output.
2. Pandas groupby() Example
As I said above groupby()
function returns DataFrameGroupBy
object after collecting the identical data into groups from pandas DataFrame. To perform several operations on DataFrameGroupby object
using sum()
, mean()
e.t.c.
Let’s apply the groupby() function along with the sum() function to perform the sum operation on grouped data.
# Use groupby() to compute the sum
df2 =df.groupby(['Courses']).sum()
print("Get sum of grouped data:\n", df2)
Yields below output.
3. Pandas groupby() on Two or More Columns
Most of the time we would need to perform groupby on multiple columns of DataFrame, you can do this by passing a list of column labels you want to perform groupby on.
# Group by multiple columns
df2 =df.groupby(['Courses', 'Duration']).sum()
print("Get sum of groupby multiple columns:\n", df2)
Yields below output
# Output:
# Get sum of groupby multiple columns:
Fee Discount
Courses Duration
Hadoop 35days 25000 0.0
55days 23000 1000.0
NA 40days 1500 0.0
Pandas 60days 26000 2500.0
PySpark 50days 25000 2300.0
Python 40days 24000 1200.0
50days 22000 1600.0
Spark 30days 47000 2400.0
4. Add Index to the grouped data
By default groupby()
function doesn’t return the row Index, you can add the index using the DataFrame.reset_index() method.
Related: You can group the Pandas DataFrame by index.
# Add Row Index to the group by result
df2 = df.groupby(['Courses','Duration']).sum().reset_index()
print("After adding index to DataFrame:\n", df2)
Yields below output
# Output:
# After adding index to DataFrame:
Courses Duration Fee Discount
0 Hadoop 35days 25000 0.0
1 Hadoop 55days 23000 1000.0
2 NA 40days 1500 0.0
3 Pandas 60days 26000 2500.0
4 PySpark 50days 25000 2300.0
5 Python 40days 24000 1200.0
6 Python 50days 22000 1600.0
7 Spark 30days 47000 2400.0
5. Drop NA /None/Nan (on group key) from the Result
You can also choose whether to include NA/None/Nan in group keys or not by setting dropna
parameter. By default the value of dropna
set to True
. So, it does not include None/Nan values on the group keys set dropna=False
parameter.
# Drop rows that have None/Nan on group keys
df2=df.groupby(by=['Courses'], dropna=False).sum()
print(df2)
6. Sort groupby() result by Group Key
By default groupby()
function sorts results by group key hence it will take additional time, if you have a performance issue and don’t want to sort the group by the result, you can turn this off by using the sort=False
param.
# Remove sorting on grouped results
df2=df.groupby(by=['Courses'], sort=False).sum()
print(df2)
If you want to sort keys in descending order, use them below.
# Sorting group keys on descending order
groupedDF = df.groupby('Courses',sort=False).sum()
sortedDF=groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)
In case you want to sort by a different key, you can do so by using DataFrame.apply() function.
# Using apply() & lambda
df.groupby('Courses').apply(lambda x: x.sort_values('Fee'))
7. Apply More Aggregations
You can also compute multiple aggregations at the same time in grouped data simply bypassing the list of aggregate functions to the aggregate().
# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print("After applying multiple aggregations on grouped data:\n", result)
Yields below output.
# Output:
# After applying multiple aggregations on grouped data:
min max
Courses
Hadoop 23000 25000
NA 1500 1500
Pandas 26000 26000
PySpark 25000 25000
Python 22000 24000
Spark 22000 25000
The above example calculates min
and max
on the Fee
column. To extend this to compute different aggregations on different columns. Here, calculates the count on the Duration
grouped column and calculates min
and max
on the Fee
grouped column
# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print("After applying multiple aggregations on grouped data:\n", result)
Yields below output. Notice that this creates MultiIndex. Working with multi-indexed columns is not easy so I’d recommend flattening by renaming the columns.
# Output:
# After applying multiple aggregations on grouped data:
Duration Fee
count min max
Courses
Hadoop 2 23000 25000
NA 1 1500 1500
Pandas 1 26000 26000
PySpark 1 25000 25000
Python 2 22000 24000
Spark 2 22000 25000
8. Complete Example of groupby() in DataFrame
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,1500],
'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
})
df = pd.DataFrame(technologies)
print(df))
# Use groupby() to compute the sum
df2 =df.groupby(['Courses']).sum()
print(df2)
# Group by on multiple columns
df2 =df.groupby(['Courses', 'Duration']).sum()
print(df2)
# Set Index on group by results
df2 = df.groupby(['Courses','Duration']).sum().reset_index()
print(df2)
# Using groupby on single column in pandas
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)
# Ignore sorting on group by key
df2=df.groupby(by=['Courses'], sort=False).sum()
print(df2)
# Sort group key on descending order
groupedDF = df.groupby('Courses',sort=False).sum()
sortedDF=groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)
# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)
# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)
Frequently Asked Questions on Pandas groupby()
groupby()
is a powerful function in pandas that is used for grouping data based on some criteria. It enables you to split a DataFrame into groups based on one or more columns and then apply a function (such as aggregation, transformation, or filtering) to each group independently.
The basic syntax of groupby()
involves selecting a column or columns to group by and then applying an aggregation function.
Common aggregation functions used with the groupby()
method in pandas are functions that summarize or aggregate data within each group.
To apply custom functions with groupby()
in pandas, you can use either the apply()
method or the agg()
method.
You can perform transformations within each group using the transform()
method with groupby()
in pandas. The transform()
method is particularly useful when you want to maintain the original shape of the DataFrame but apply a transformation to each group independently.
You can sort the results of groupby()
in pandas using the sort_values()
method. For example, grouped_data
contains the mean value for each category. The sort_values()
method is then used to sort the result in descending order based on the mean values.
Conclusion
In this article, I have covered Pandas groupby()
syntax and several examples of how to collect identical data into groups. I hope you have learned how to run group by on several columns, sort grouped data, ignore null values, and many more with examples.
Happy Learning !!
Related Articles
- Pandas – Get Column Index For Column Name
- Pandas Insert List into Cell of DataFrame
- How to Shuffle Pandas Rows Randomly
- Pandas groupby() and count() with Examples
- Convert groupby() output from series to DatatFrame
- Pandas Group Rows into List Using groupby()
- Pandas Drop Last Column From DataFrame
- Pandas Percentage Total With Groupby
- Convert groupby() output from series to DatatFrame
- Pandas DataFrame – Different Ways to Iterate Over Rows
- Different Ways to Get Row Count in Pandas DataFrame
This is great, thanks so much. The reset_index operation was what I was looking for…