Pandas groupby() Explained With Examples

Similar to the SQL GROUP BY clause, the pandas.DataFrame.groupby() function is used to collect identical data into groups and perform aggregate functions on the grouped data. A group by operation involves splitting the data, applying a function to each group, and combining the results.

In Pandas, you can use groupby() in combination with the sum(), pivot(), transform(), and aggregate() methods. In this article, I will cover how to group by a single column, by multiple columns, and with aggregations, using examples.

1. Pandas DataFrame groupby() Syntax

Below is the syntax of the groupby() method. This function takes several params that are explained below and returns a GroupBy object that contains information about the groups.


DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
       sort=True, group_keys=True, squeeze=NoDefault.no_default,
       observed=False, dropna=True)
  • by – A list of column labels (or a single label) to group by.
  • axis – Defaults to 0. Takes 0 or ‘index’, 1 or ‘columns’.
  • level – Used with a MultiIndex to group by a particular level or levels.
  • as_index – Defaults to True; set to False for SQL-style grouped output with group keys as regular columns.
  • sort – Defaults to True. Specifies whether to sort the result by group keys.
  • group_keys – Whether to add group keys to the result index.
  • squeeze – Deprecated in newer versions.
  • observed – Applies only if any of the groupers are Categoricals.
  • dropna – Defaults to True, which drops None/NaN group keys. Use False to keep them.

In order to explain several examples of how to perform pandas groupby(), first, let’s create a simple DataFrame with the combination of string and numeric columns.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30day','50days','55days','40days','60days','35day','30day','50days','2days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pd.DataFrame(technologies)
print(df)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  22000    30day    1000.0
1  PySpark  25000   50days    2300.0
2   Hadoop  23000   55days    1000.0
3   Python  24000   40days    1200.0
4   Pandas  26000   60days    2500.0
5   Hadoop  25000    35day       NaN
6    Spark  25000    30day    1400.0
7   Python  22000   50days    1600.0
8       NA   1500    2days       0.0

2. Pandas groupby() on a Single Column

As mentioned above, the groupby() method returns a GroupBy object after grouping the data. This object contains several methods (sum(), mean(), etc.) that can be used to aggregate the grouped rows.


# Use groupby() to compute the sum
df2 = df.groupby(['Courses']).sum()
print(df2)

Yields below output.


           Fee  Discount
Courses                 
Hadoop   48000    1000.0
NA        1500       0.0
Pandas   26000    2500.0
PySpark  25000    2300.0
Python   46000    2800.0
Spark    47000    2400.0
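Beyond sum(), the returned GroupBy object can apply several aggregations at once via agg(). Below is a minimal sketch on a cut-down version of the sample data; the particular list of functions passed to agg() is just one possible choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas",
                "Hadoop", "Spark", "Python", "NA"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 1500],
})

# Apply several aggregate functions to the grouped 'Fee' column at once
df2 = df.groupby('Courses')['Fee'].agg(['sum', 'mean', 'count'])
print(df2)
```

For example, the two Spark rows (22000 and 25000) produce a sum of 47000, a mean of 23500, and a count of 2.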

3. Pandas groupby() on Multiple Columns

Most of the time you will need to group by multiple columns. You can do this in pandas by passing a list of the column labels you want to group by to the groupby() method.


# Group by multiple columns
df2 = df.groupby(['Courses', 'Duration']).sum()
print(df2)

Yields below output


                   Fee  Discount
Courses Duration                 
Hadoop  35day     25000       0.0
        55days    23000    1000.0
NA      2days      1500       0.0
Pandas  60days    26000    2500.0
PySpark 50days    25000    2300.0
Python  40days    24000    1200.0
        50days    22000    1600.0
Spark   30day     47000    2400.0
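When grouping on multiple columns you can also control the output column names with named aggregation. A small sketch on a trimmed version of the sample data (the output names Total_Fee and Max_Discount are my own, not from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "Spark", "Hadoop"],
    'Duration': ['30day', '30day', '55days'],
    'Fee': [22000, 25000, 23000],
    'Discount': [1000, 1400, 1000],
})

# Named aggregation: each output column picks its source column and function
df2 = df.groupby(['Courses', 'Duration']).agg(
    Total_Fee=('Fee', 'sum'),
    Max_Discount=('Discount', 'max'),
)
print(df2)
```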

4. Add Row Index to the groupby() Result

By default, the groupby() result uses the group keys as the index rather than a row index; you can convert the keys back to regular columns and add a default row index using the reset_index() method.


# Add Row Index to the group by result
df2 = df.groupby(['Courses','Duration']).sum().reset_index()
print(df2)

Yields below output


   Courses Duration    Fee  Discount
0   Hadoop    35day  25000       0.0
1   Hadoop   55days  23000    1000.0
2       NA    2days   1500       0.0
3   Pandas   60days  26000    2500.0
4  PySpark   50days  25000    2300.0
5   Python   40days  24000    1200.0
6   Python   50days  22000    1600.0
7    Spark    30day  47000    2400.0
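An equivalent way to get the same flat result is passing as_index=False to groupby() itself, which keeps the group keys as regular columns instead of building them into the index. A small sketch on a trimmed version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "Spark", "Hadoop"],
    'Duration': ['30day', '30day', '55days'],
    'Fee': [22000, 25000, 23000],
    'Discount': [1000, 1400, 1000],
})

# as_index=False keeps 'Courses' and 'Duration' as ordinary columns
df2 = df.groupby(['Courses', 'Duration'], as_index=False).sum()
print(df2)
```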

5. Drop or Keep NA/None/NaN Group Keys in the groupby() Result

You can choose whether to include None/NaN values in the group keys by setting the dropna parameter. By default, dropna is set to True, which drops rows whose group key is None/NaN; to keep those rows in the result, set dropna=False.


# Keep None/NaN group keys in the result
df2 = df.groupby(by=['Courses'], dropna=False).sum()
print(df2)
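Note that in the sample data above, "NA" is a literal string, which pandas does not treat as a missing value, so dropna has no visible effect there. A tiny sketch with a real NaN key shows the difference:

```python
import pandas as pd

# A small frame with an actual missing value in the group key
df = pd.DataFrame({'Courses': ['Spark', None, 'Spark'],
                   'Fee': [100, 200, 300]})

dropped = df.groupby('Courses').sum()             # default dropna=True drops the NaN key
kept = df.groupby('Courses', dropna=False).sum()  # dropna=False keeps NaN as its own group
print(dropped)
print(kept)
```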

6. Sort groupby() result by Group Key

By default, the groupby() method sorts results by the group key, which takes additional time. If you have a performance concern and don’t need the result sorted, you can turn sorting off with the sort=False param.


# Remove sorting on grouped results
df2 = df.groupby(by=['Courses'], sort=False).sum()
print(df2)

If you want to sort the group keys in descending order, use the approach below.


# Sort group keys in descending order
groupedDF = df.groupby('Courses', sort=False).sum()
sortedDF = groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)

If you want to sort by a different key, you can use something like below.


df.groupby('Courses').apply(lambda x: x.sort_values('Fee'))
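The intro also mentions transform(); unlike calling sum() on a GroupBy, it returns a result aligned to the original rows, which is handy for broadcasting a group statistic back onto every row. A minimal sketch (the Course_Total column name is my own, not from the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark"],
    'Fee': [22000, 25000, 25000],
})

# Broadcast each group's total Fee back onto its member rows
df['Course_Total'] = df.groupby('Courses')['Fee'].transform('sum')
print(df)
```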

7. Complete Example of Using groupby() in Pandas DataFrame


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30day','50days','55days','40days','60days','35day','30day','50days','2days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pd.DataFrame(technologies)
print(df)

# Use groupby() to compute the sum
df2 = df.groupby(['Courses']).sum()
print(df2)

# Group by on multiple columns
df2 = df.groupby(['Courses', 'Duration']).sum()
print(df2)

# Set Index on group by results
df2 = df.groupby(['Courses','Duration']).sum().reset_index()
print(df2)

# Using groupby on single column in pandas
df2 = df.groupby(['Courses'])['Fee'].sum().reset_index()
print(df2)

# Ignore sorting on group by key
df2 = df.groupby(by=['Courses'], sort=False).sum()
print(df2)

# Sort group key on descending order
groupedDF = df.groupby('Courses', sort=False).sum()
sortedDF = groupedDF.sort_values('Courses', ascending=False)
print(sortedDF)

Conclusion

In this article, I have covered the pandas groupby() syntax and several examples of how to group your data. I hope you have learned how to group by multiple columns, sort grouped data, keep or drop null group keys, and more with examples.

Happy Learning !!
