  • Post category:Pandas
  • Post last modified:May 6, 2024
Pandas Groupby Transform

A Pandas groupby transform is performed with the DataFrameGroupBy.transform() function. It applies the specified function to each group and returns a DataFrame that has the same index as the original object. Are you unsure when to use the Pandas groupby transform? This guide walks through examples of how to use groupby transform on a Pandas DataFrame and a Pandas Series.


1. Syntax and Usage of Pandas Groupby Transform

There are two variations of the pandas groupby transform function: one for the Pandas Series and one for the Pandas DataFrame. They take the same parameters and differ only in the type of the return value.

1.1 DataFrame GroupBy Transform Syntax

This function returns a DataFrame with the same index as the original DataFrame, filled with the transformed values. The values are produced by the function passed to DataFrameGroupBy.transform().


# Syntax of DataFrameGroupBy.transform()
DataFrameGroupBy.transform(func, *args, engine=None, engine_kwargs=None, **kwargs)

1.2 Pandas Series GroupBy Transform Syntax

Like DataFrameGroupBy.transform(), this function takes the same set of parameters, but it returns a Pandas Series filled with the values transformed by the function. See the syntax of SeriesGroupBy.transform() below.


# Syntax of SeriesGroupBy.transform()
SeriesGroupBy.transform(func, *args, engine=None, engine_kwargs=None, **kwargs)

1.3 Parameter List for Pandas GroupBy Transform

The parameter list is the same for both DataFrameGroupBy and SeriesGroupBy. Below are the parameters along with their descriptions. We will use them in the examples that follow.

  • func: The function that we want to apply to each group.
  • *args: Positional arguments that are passed on to func.
  • engine: The engine used to execute the function. Possible values are ‘cython’, ‘numba’, or None. The default is None, which uses ‘cython’ unless the global option compute.use_numba is set. If the ‘numba’ engine is chosen, the function must be a user-defined function with values and index as the first and second arguments respectively in the function signature.
  • engine_kwargs: A dict of options for the engine; the default is None. The ‘cython’ engine accepts no engine_kwargs, while the ‘numba’ engine accepts nopython, nogil, and parallel.
  • **kwargs: Keyword arguments that are passed on to func.
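To make the role of *args and **kwargs concrete, here is a minimal sketch. The helper scale_and_shift and its parameters shift and factor are hypothetical names made up for this illustration, and the data is a cut-down copy of the sample used later in this article.

```python
import pandas as pd

# Small sample frame (same course data as the examples in this article)
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
})

def scale_and_shift(col, shift, factor=1):
    """Hypothetical helper: multiply a column by factor, then add shift."""
    return col * factor + shift

# 100 is forwarded to the helper through *args, factor=2 through **kwargs
result = df.groupby("Courses")["Fee"].transform(scale_and_shift, 100, factor=2)
print(result)
```

Anything after func that is not engine or engine_kwargs is simply handed to your function, so the helper itself stays free of any groupby-specific code.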

2. Pandas GroupBy Transform

The DataFrameGroupBy.transform() function applies the specified function to the result of a groupby and returns a DataFrame that has the same index as the original object. So before you can call transform(), you first need to call the Pandas groupby().


# Import
import pandas as pd
technologies = {
    'Courses': ["Spark", "PySpark", "PySpark", "Pandas"],
    'Fee': [20000, 22000, 22000, 30000],
    'hours': [30, 35, 30, 35],
}
# Create dataframe
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

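From our example above, let’s say we want to compare courses by hours and Fee. Grouping by hours and transforming with sum() replaces every value with the total for its hours group while keeping one row per original row: numeric columns such as Fee are added up, and string columns such as Courses are concatenated.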


# Transform Groupby Object
transformed_df = df.groupby('hours').transform(lambda x: x.sum())
print(transformed_df)

Yields below output


# Output:
         Courses    Fee
0   SparkPySpark  42000
1  PySparkPandas  52000
2   SparkPySpark  42000
3  PySparkPandas  52000
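Because the result keeps the original index, it lines up row by row with the source DataFrame and can be used directly as a filter. The following is a small sketch of that idea, not part of the original example; the variable names rows_per_course and repeated are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
    "hours": [30, 35, 30, 35],
})

# Number of rows in each course, repeated on every row of that course
rows_per_course = df.groupby("Courses")["Fee"].transform("count")

# Same index as df, so it can be used directly as a boolean mask
repeated = df[rows_per_course > 1]
print(repeated)
```

Only the two PySpark rows survive the filter, since PySpark is the only course that appears more than once.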

3. Pandas GroupBy Transform on Multiple Columns

An important thing to know about the Pandas GroupBy Transform is that the function works on one column at a time. You can transform several columns in one call, but the function cannot combine values from different columns. If your computation needs multiple columns at once, pre-compute the combined values on the DataFrame and then apply transform() to the result.

In the example below, we want to find the Courses that have a Fee value within our budget. We cannot get this result from the Pandas GroupBy Transform alone; we need to pre-process the DataFrame first. See the following example.


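# Add the Budget column (values as shown in the outputs later in this article)
df.insert(2, 'Budget', [2000, 22000, 24000, 30000])
# Count, per course, the rows whose Fee is within Budget
df['Perfect Course'] = (
    (df['Fee'] <= df['Budget'])
    .groupby(df['Courses']).transform('sum')
)
print(df)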

Yields the following output. You can see that the Perfect Course value for Spark is 0 because our budget is not enough for that course. This is one use case for the transform function.

Figure: The output of the code shows the courses that fit within our budget.

As the example above shows, transform() alone cannot compare two columns. We first compute the Fee <= Budget comparison on the DataFrame and then group and transform the result.
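The same pre-compute-then-transform pattern works for any calculation that mixes columns. Here is a minimal sketch under that assumption; the column names fee_per_hour and is_cheapest are invented for this illustration and do not appear in the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
    "hours": [30, 35, 30, 35],
})

# Step 1: combine the two columns on the plain DataFrame
df["fee_per_hour"] = df["Fee"] / df["hours"]

# Step 2: transform the single pre-computed column within each course
lowest_in_course = df.groupby("Courses")["fee_per_hour"].transform("min")

# Flag the row with the lowest fee per hour in each course
df["is_cheapest"] = df["fee_per_hour"] == lowest_in_course
print(df)
```

The multi-column arithmetic happens before the groupby, so transform() only ever sees one column.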

4. Different Methods of Passing a Function to GroupBy Transform

One of the most important parameters discussed above is func. The function that we want to apply to our DataFrame or Pandas Series can be passed in three different ways: as the name of a built-in function, as a lambda, or as a user-defined function. In most cases you will use a built-in function, but it is good to know the alternatives.

4.1 Passing a Pre-Defined function to GroupBy Transform

In the example above, we used the sum() function with the Pandas GroupBy Transform function. Other built-in functions, such as mean() and median(), work the same way. You can pass the name of the function directly as a string parameter.


# Transform using the sum function
df['Perfect Course'] = df.groupby(['Courses']).hours.transform('sum')
print(df)

Yields below output. Look at the Perfect Course column in the PySpark rows: both show 65, the sum of the hours in that group (35 + 30).


# Output:
  Courses    Fee  Budget  hours  Perfect Course
0    Spark  20000    2000     30              30
1  PySpark  22000   22000     35              65
2  PySpark  22000   24000     30              65
3   Pandas  30000   30000     35              35

Note that a built-in function passed by name is applied to one column at a time, each as a 1D Series. Here we selected the hours column first, so the result is a single Series that can be assigned to a new column.
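The same string shorthand works for other built-in reductions. A small sketch on a fresh copy of the sample data, using 'sum', 'mean', and 'median' on the hours column; the summary frame is only there to show the results side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "hours": [30, 35, 30, 35],
})

# Select one column of the grouped data, then pass function names as strings
by_course = df.groupby("Courses")["hours"]

summary = pd.DataFrame({
    "hours": df["hours"],
    "sum": by_course.transform("sum"),
    "mean": by_course.transform("mean"),
    "median": by_course.transform("median"),
})
print(summary)
```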

4.2 Passing a Lambda function

When the built-in functions are not enough, pandas lets you pass a lambda function instead. You can use a lambda expression in the following way.


# Transform the Fee and hours columns with a lambda
transformed_df = df.groupby('Courses')[['Fee', 'hours']].transform(lambda x: x.sum() - x.mean())
print(transformed_df)

Yields the following output. The x in the lambda function is a pandas Series holding one column of one group, so we can apply almost any of the pandas Series methods to it.


# Output:
       Fee  hours
0      0.0    0.0
1  22000.0   32.5
2  22000.0   32.5
3      0.0    0.0
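A common practical use of a lambda with transform is filling missing values with a per-group statistic. The sketch below assumes a variant of the sample data in which one hours value is missing; that missing value is made up for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data with one missing hours value (hypothetical, for illustration)
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "hours": [30.0, 35.0, np.nan, 35.0],
})

# x is the hours Series of one course; fill its gaps with that course's mean
filled = df.groupby("Courses")["hours"].transform(lambda x: x.fillna(x.mean()))
print(filled)
```

The missing PySpark value is replaced by 35.0, the mean of the remaining PySpark rows, and the other rows are left unchanged.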

4.3 Passing a User-Defined Function

We can also pass a user-defined function. This is helpful when the same logic is used in many places, because we can call a general-purpose utility function instead of repeating a lambda. In the following example, we pass 45 as an extra argument, and the function adds it to each value of the DataFrame.


# Use a user-defined function
def add_value(df_col, value):
    '''Add value to each item in the column'''
    return df_col + value

# Create dataframe, including the Budget column shown in the earlier output
df = pd.DataFrame(technologies)
df.insert(2, 'Budget', [2000, 22000, 24000, 30000])
# Transform using the user-defined function; 45 is passed on as value
transformed_df = df.groupby('Courses').transform(add_value, 45)
print(transformed_df)

Yields the following output, where the passed value has been added to every cell of the transformed DataFrame.


# Output:
     Fee  Budget  hours
0  20045    2045     75
1  22045   22045     80
2  22045   24045     75
3  30045   30045     80
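A user-defined function can also use information about the whole group, not just add a constant. As a sketch, the hypothetical helper rank_desc below ranks the values inside each course from highest to lowest; the same function is reused for both the Fee and hours columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
    "hours": [30, 35, 30, 35],
})

def rank_desc(col):
    """Hypothetical helper: rank values from highest (1) to lowest."""
    return col.rank(ascending=False)

# The helper is applied to each selected column within each course
ranks = df.groupby("Courses")[["Fee", "hours"]].transform(rank_desc)
print(ranks)
```

The two PySpark fees are tied, so with the default ranking method they share the average rank of 1.5, while the PySpark hours get ranks 1.0 and 2.0.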

5. GroupBy Aggregate Vs GroupBy Transform

The main difference between groupby aggregate() and groupby transform() is that transform() broadcasts the result back to every row of the group and returns an object with the same shape and index as the input, filled with the transformed values, while aggregate() returns one result per group.


def add_value(df_col,value):
    return df_col+value


# Using the transform() function
transformed_df = df.groupby('Courses').transform(add_value,45)
print(transformed_df)
# Using the aggregate() function
agg_df=df.groupby('Courses').Fee.aggregate(add_value,45)
print(agg_df)

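Yields below output. Note that add_value() does not reduce a group to a single value, so this aggregate() call is not a true aggregation: the result has one entry per course, and the PySpark entry holds both of its values. Depending on the pandas version, such a call may raise an error instead.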


# Output:
     Fee  Budget  hours
0  20045    2045     75
1  22045   22045     80
2  22045   24045     75
3  30045   30045     80
Courses
Pandas              30045
PySpark    [22045, 22045]
Spark               20045
Name: Fee, dtype: object
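The difference in shape is easier to see with a function that really reduces each group. The following sketch uses 'sum' for both calls on a fresh copy of the sample data; the variable names are chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
})

# aggregate(): one value per course, indexed by the group key
agg_result = df.groupby("Courses")["Fee"].aggregate("sum")

# transform(): the same totals, repeated on every original row
transform_result = df.groupby("Courses")["Fee"].transform("sum")

print(agg_result)
print(transform_result)
```

Here aggregate() returns three rows (one per course), while transform() returns four rows that line up with the original DataFrame.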

6. Pandas GroupBy Apply Vs GroupBy Transform

The Pandas groupby apply(), applymap(), and groupby transform() functions have a lot in common. However, the key differences between apply() and transform() are these:

  • Pandas groupby apply() passes each group to the function as a whole DataFrame, while groupby transform() works on each column of the group individually as a Series.
  • The return value of groupby transform() is either a Series or a DataFrame with the same index as the input, while the function passed to apply() can return a scalar, a Series, or a DataFrame.
  • Because groupby transform() works on one Series at a time, the function cannot combine multiple columns; with groupby apply() it can.

def add_cols_value(group, value):
    '''This works only with the apply() function: it reads two columns of the group'''
    return group['Fee'] + group['hours'] + value

def add_value(df_col, value):
    return df_col + value


# The add_value function can be used by both apply() and transform()
transformed_df = df.groupby('Courses').transform(add_value, 45)
print(transformed_df)
# This function can only be passed to apply()
app_df = df.groupby('Courses')[['Fee', 'hours']].apply(add_cols_value, 45)
print(app_df)

Yields the following output. The exact layout of the apply() result can vary between pandas versions.


# Output:
     Fee  Budget  hours
0  20045    2045     75
1  22045   22045     80
2  22045   24045     75
3  30045   30045     80
Courses   
Pandas   3    30080
PySpark  1    22080
         2    22075
Spark    0    20075
dtype: int64
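To see the multi-column limitation in action, the sketch below passes the same two-column function to both apply() and transform(). The helper fee_per_hour is a hypothetical name for this illustration, and the exception is caught broadly because the exact error type raised by transform() is not something this article specifies.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "PySpark", "Pandas"],
    "Fee": [20000, 22000, 22000, 30000],
    "hours": [30, 35, 30, 35],
})

def fee_per_hour(group):
    """Hypothetical helper that needs two columns of the group at once."""
    return group["Fee"].sum() / group["hours"].sum()

grouped = df.groupby("Courses")[["Fee", "hours"]]

# apply() hands the whole group to the function, so both columns are available
per_course = grouped.apply(fee_per_hour)
print(per_course)

# transform() works column by column, so looking up "Fee" inside it fails
try:
    grouped.transform(fee_per_hour)
    transform_failed = False
except Exception as exc:
    transform_failed = True
    print("transform() failed:", type(exc).__name__)
```

With apply() the function returns one scalar per course, which is also something transform() is not designed for unless the value is meant to be broadcast to every row.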

Conclusion

In this article, I have explained the different use cases and examples of the Pandas groupby transform function. The DataFrameGroupBy.transform() function applies the specified function to each group and returns a DataFrame that has the same index as the original object.