Similar to the SQL GROUP BY clause, the pandas DataFrame.groupby() function is used to collect identical data into groups and perform aggregate functions on the grouped data. A group by operation involves splitting the data, applying a function to each group, and combining the results.
In pandas, you can use groupby() in combination with sum(), pivot(), transform(), aggregate(), and many more methods. In this article, I will cover how to group by a single column or multiple columns and apply aggregations, with examples.
1. Pandas groupby() Syntax
Below is the syntax of the groupby() function. It takes several params, which are explained below, and returns a DataFrameGroupBy object that contains information about the groups.
# Syntax of DataFrame.groupby()
DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
        sort=True, group_keys=True, squeeze=<no_default>,
        observed=False, dropna=True)
by – A label, list of labels, mapping, or function used to determine the groups.
axis – Default is 0. Takes 0 or 'index', 1 or 'columns'.
level – Used with a MultiIndex to group by a particular level or levels.
as_index – Default is True. Set to False for SQL-style grouped output with the group keys as columns.
sort – Default is True. Specifies whether to sort the group keys.
group_keys – Whether to add the group keys to the result index when calling apply().
squeeze – Deprecated in newer versions.
observed – This only applies if any of the groupers are Categoricals; when True, only observed values are shown.
dropna – Default is True, which excludes None/NaN values from the group keys. Set to False to keep them.
In order to explain several examples of how to perform group by, first, let's create a simple DataFrame with a combination of string and numeric columns.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,1500],
'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
Courses Fee Duration Discount
0 Spark 22000 30days 1000.0
1 PySpark 25000 50days 2300.0
2 Hadoop 23000 55days 1000.0
3 Python 24000 40days 1200.0
4 Pandas 26000 60days 2500.0
5 Hadoop 25000 35days NaN
6 Spark 25000 30days 1400.0
7 Python 22000 50days 1600.0
8 NA 1500 40days 0.0
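Before aggregating, you can also inspect the returned DataFrameGroupBy object itself. Below is a minimal sketch using the df created above; groups and get_group() are standard methods on the grouped object.
# Inspect the groups before aggregating
grouped = df.groupby('Courses')
print(grouped.groups)              # Mapping of each group key to its row labels
print(grouped.get_group('Spark'))  # Rows that belong to the 'Spark' group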
2. Pandas groupby() Example
As I said above, the groupby() function returns a DataFrameGroupBy object after collecting the identical data into groups from the pandas DataFrame. This object contains several methods (sum(), mean(), etc.) that can be used to aggregate the grouped rows.
# Use groupby() to compute the sum
df2 = df.groupby(['Courses']).sum()
print(df2)
Yields below output.
Fee Discount
Courses
Hadoop 48000 1000.0
NA 1500 0.0
Pandas 26000 2500.0
PySpark 25000 2300.0
Python 46000 2800.0
Spark 47000 2400.0
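Note that newer pandas versions (2.0 and later) no longer silently drop non-numeric columns during sum(), so the string Duration column would be concatenated or raise an error. To reproduce the output above there, restrict the aggregation to numeric columns with numeric_only=True.
# On pandas 2.x, sum only the numeric columns
df2 = df.groupby(['Courses']).sum(numeric_only=True)
print(df2)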
3. Pandas groupby() on Two or More Columns
Most of the time you would need to perform group by on multiple columns of a DataFrame; you can do this by passing a list of the column labels you want to group on.
# Group by multiple columns
df2 = df.groupby(['Courses', 'Duration']).sum()
print(df2)
Yields below output
Fee Discount
Courses Duration
Hadoop 35days 25000 0.0
55days 23000 1000.0
NA 40days 1500 0.0
Pandas 60days 26000 2500.0
PySpark 50days 25000 2300.0
Python 40days 24000 1200.0
50days 22000 1600.0
Spark 30days 47000 2400.0
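If you need just one aggregated column after grouping on multiple keys, select it from the grouped object before aggregating.
# Aggregate a single column after grouping on multiple keys
df2 = df.groupby(['Courses', 'Duration'])['Fee'].sum()
print(df2)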
4. Add Index to the grouped data
By default, the group keys become the row index of the groupby() result. To turn them back into regular columns (and get a default integer row index), use the DataFrame.reset_index() method.
# Add Row Index to the group by result
df2 = df.groupby(['Courses','Duration']).sum().reset_index()
print(df2)
Yields below output
Courses Duration Fee Discount
0 Hadoop 35days 25000 0.0
1 Hadoop 55days 23000 1000.0
2 NA 40days 1500 0.0
3 Pandas 60days 26000 2500.0
4 PySpark 50days 25000 2300.0
5 Python 40days 24000 1200.0
6 Python 50days 22000 1600.0
7 Spark 30days 47000 2400.0
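Alternatively, you can get the same flat result in one step by passing as_index=False to groupby(); this is the as_index parameter described in the syntax section.
# Equivalent: keep group keys as regular columns
df2 = df.groupby(['Courses', 'Duration'], as_index=False).sum()
print(df2)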
5. Keep or Drop None/NaN on Group Keys
You can choose whether None/NaN values in the group keys appear in the result by setting the dropna parameter. By default, dropna is set to True, so rows whose group key is None/NaN are excluded from the result. To include them, set dropna=False.
# Keep rows that have None/NaN on group keys
df2 = df.groupby(by=['Courses'], dropna=False).sum()
print(df2)
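Note that in our sample data the Courses value "NA" is a plain string, not a real missing value, so dropna has no visible effect here. Below is a minimal sketch with an actual missing key (df3 is hypothetical, for illustration only).
# Sketch: dropna with a real missing group key
df3 = df.copy()
df3.loc[8, 'Courses'] = None  # Replace the 'NA' string with a real missing value
print(df3.groupby('Courses', dropna=True)['Fee'].sum())   # Missing key excluded (default)
print(df3.groupby('Courses', dropna=False)['Fee'].sum())  # Missing key kept as NaN group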
6. Sort groupby() result by Group Key
By default, the groupby() function sorts the results by group key, which takes additional time. If you have a performance concern and don't need sorted group keys, you can turn this off with the sort=False param.
# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum()
print(df2)
If you want the group keys in descending order, sort the grouped result afterwards as shown below.
# Sorting group keys on descending order
groupedDF = df.groupby('Courses', sort=False).sum()
sortedDF = groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)
In case you want to sort the rows within each group by a different column, you can do so by using the DataFrame.apply() function with a lambda.
# Using apply() & lambda
df.groupby('Courses').apply(lambda x: x.sort_values('Fee'))
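Depending on your pandas version, apply() adds the group keys to the result index; if you prefer to keep the original row index, pass group_keys=False, a small variation on the example above.
# Sort within groups without adding group keys to the index
df2 = df.groupby('Courses', group_keys=False).apply(lambda x: x.sort_values('Fee'))
print(df2)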
7. Apply More Aggregations
You can also compute several aggregations at the same time in pandas by passing a list of aggregation functions to aggregate().
# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)
Yields below output.
min max
Courses
Hadoop 23000 25000
NA 1500 1500
Pandas 26000 26000
PySpark 25000 25000
Python 22000 24000
Spark 22000 25000
The above example calculates min and max on the Fee column. Let's extend this to compute different aggregations on different columns.
Note that applying several aggregations to a single column in pandas DataFrame will result in a MultiIndex.
# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)
Yields below output. Notice that this creates a MultiIndex on the columns. Working with multi-indexed columns is not easy, so I'd recommend flattening them by renaming the columns, for example with named aggregation as shown after the output below.
Duration Fee
count min max
Courses
Hadoop 2 23000 25000
NA 1 1500 1500
Pandas 1 26000 26000
PySpark 1 25000 25000
Python 2 22000 24000
Spark 2 22000 25000
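One convenient way to get flat, descriptive column names directly is named aggregation (available since pandas 0.25). The output column names below are my own choices.
# Named aggregation produces flat column names
result = df.groupby('Courses').aggregate(
    duration_count=('Duration', 'count'),
    fee_min=('Fee', 'min'),
    fee_max=('Fee', 'max'))
print(result)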
8. Complete Example of groupby() in DataFrame
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,1500],
'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pd.DataFrame(technologies)
print(df)
# Use groupby() to compute the sum
df2 = df.groupby(['Courses']).sum()
print(df2)
# Group by on multiple columns
df2 = df.groupby(['Courses', 'Duration']).sum()
print(df2)
# Add row index to the group by result
df2 = df.groupby(['Courses','Duration']).sum().reset_index()
print(df2)
# Using groupby on single column in pandas
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)
# Ignore sorting on group by key
df2 = df.groupby(by=['Courses'], sort=False).sum()
print(df2)
# Sort group key on descending order
groupedDF = df.groupby('Courses', sort=False).sum()
sortedDF = groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)
# Groupby & multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min','max'])
print(result)
# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate({'Duration':'count','Fee':['min','max']})
print(result)
Conclusion
In this article, I have covered the pandas groupby() syntax and several examples of how to collect identical data into groups. I hope you have learned how to run group by on several columns, sort grouped data, handle null group keys, and more, with examples.
Happy Learning !!
Related Articles
- Pandas DataFrame – Different Ways to Iterate Over Rows
- Different Ways to Get Row Count in Pandas DataFrame
- How to Shuffle Pandas Rows Randomly
- Pandas GroupBy Multiple Columns Explained
- How to GroupBy Index in Pandas?
- PySpark GroupBy Count – Explained
- Pandas Groupby Sort within Groups
- Pandas groupby() and count() with Examples