Pandas groupby transform can be performed using the DataFrameGroupBy.transform() function. This function transforms the DataFrame with the specified function and returns a DataFrame having the same indexes as the original object. Are you confused about when to use the Pandas groupby transform? Here is a complete guide, with examples, on how to use groupby transform on a Pandas DataFrame and a Pandas Series.
1. Syntax and Usage of Pandas Groupby Transform
There are two variations of the Pandas groupby transform function: one for a Pandas Series and one for a Pandas DataFrame. They take the same parameter list; the difference lies only in the return value.
1.1 DataFrame GroupBy Transform Syntax
This function returns a DataFrame with the same indexes as the original DataFrame, filled with the transformed values. The values are transformed by the function passed to DataFrameGroupBy.transform(), and the return value is a DataFrame.
# Syntax of DataFrameGroupBy.transform()
DataFrameGroupBy.transform(func, *args, engine=None, engine_kwargs=None, **kwargs)
1.2 Pandas Series GroupBy Transform Syntax
Same as the DataFrameGroupBy transform, this function takes the same set of parameters, but it returns a Pandas Series filled with the values transformed by the applied function. See the below syntax for the Pandas SeriesGroupBy.transform() function.
# Syntax of SeriesGroupBy.transform()
SeriesGroupBy.transform(func, *args, engine=None, engine_kwargs=None, **kwargs)
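As a quick sketch of the Series variant (the data and variable names here are my own), you can group a Series by a parallel list of labels and get back a Series aligned with the original index:

```python
import pandas as pd

# A small Series of fees, grouped by a parallel list of course names
fees = pd.Series([20000, 22000, 22000, 30000])
courses = ["Spark", "PySpark", "PySpark", "Pandas"]

# SeriesGroupBy.transform() returns a Series with the same index as the original
group_total = fees.groupby(courses).transform('sum')
print(group_total)
```

Each row receives the total of its own group, so the two PySpark rows both show 44000.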
1.3 Parameter List for Pandas GroupBy Transform
The parameter list for both DataFrameGroupBy and SeriesGroupBy is the same. Below are the parameters along with their descriptions. We will be using them in the examples that follow.
- func — Stands for function, the function that we want to apply to each group.
- *args — Positional arguments to pass to func.
- engine — The engine to execute your code on. Possible values are ‘cython’, ‘numba’, or None; None defaults to ‘cython’. If the ‘numba’ engine is chosen, the function must be a user-defined function with values and index as the first and second arguments respectively in the function signature.
- engine_kwargs — A dict of keyword arguments to pass to the engine; the default is None.
- **kwargs — Keyword arguments to be passed into func.
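To illustrate how *args and **kwargs are forwarded into func, here is a minimal sketch (the scale_and_shift function and column names are my own, not part of the pandas API):

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1, 2, 3]})

def scale_and_shift(col, factor, shift=0):
    # factor arrives through *args, shift through **kwargs
    return col * factor + shift

# Extra positional and keyword arguments are passed straight through to func
out = df.groupby('grp')[['val']].transform(scale_and_shift, 10, shift=5)
print(out)
```

Every value is multiplied by 10 and shifted by 5, group by group.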
2. Pandas GroupBy Transform
The DataFrameGroupBy.transform() function is used to transform the Pandas groupby result with the specified function, and it returns a DataFrame having the same indexes as the original object. So, to use transform(), you first need to call Pandas groupby().
# Import
import pandas as pd

technologies = {
    'Courses':["Spark","PySpark","PySpark","Pandas"],
    'Fee' :[20000,22000,22000,30000],
    'Budget':[2000,22000,24000,30000],
    'hours':[30,35,30,35]
}
# Create dataframe
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
From our example above, let's say we want to choose a course based on hours and fees: group the rows by hours and find the group total of the Courses and Fee columns. Note that the DataFrame also includes a Budget column, which the later examples rely on; here we select only the Courses and Fee columns before transforming so that Budget stays out of this result.
# Transform Groupby Object
transformed_df = df.groupby('hours')[['Courses','Fee']].transform(lambda x: x.sum())
print(df)
print(transformed_df)
Yields below output
# Output:
Courses Fee
0 SparkPySpark 42000
1 PySparkPandas 52000
2 SparkPySpark 42000
3 PySparkPandas 52000
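Because the transform() result keeps the original row index, it can be assigned straight back to the DataFrame as a new column. A small sketch with the same data (the GroupFee column name is mine):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "PySpark", "Pandas"],
    'Fee': [20000, 22000, 22000, 30000],
    'hours': [30, 35, 30, 35]
})

# Total Fee of all rows sharing the same hours value, aligned row by row
df['GroupFee'] = df.groupby('hours')['Fee'].transform('sum')
print(df)
```

Rows 0 and 2 (hours=30) both receive 42000, and rows 1 and 3 (hours=35) both receive 52000.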
3. Pandas GroupBy Transform on Multiple Columns
The most important thing about Pandas groupby transform is that it is applied to one column at a time; it cannot access multiple columns at once. If you need a computation that involves multiple columns, pre-compute the result on the DataFrame and then apply transform().
In the below example, we want to find the Courses whose Fee fits within our budget. We cannot achieve this result with Pandas groupby transform alone; it needs some pre-processing on the DataFrame first. See the following example.
# Transform Groupby Object
df['Perfect Course'] = (
    (df['Fee'] <= df['Budget'])
    .groupby(df['Courses']).transform('sum')
)
print(df)
Yields the following output, where you can see that Perfect Course for Spark is 0 because our budget is not enough for that course. This is a typical use case for the transform() function: we could not get the result directly from groupby transform alone, so we pre-computed the comparison on the DataFrame first.
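The same pre-compute-then-transform idea applies to any expression that needs more than one column. As a sketch (the fee_per_hour and AvgFeePerHour names are my own), we can first compute a fee-per-hour value from two columns and then average it per course group:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "PySpark", "Pandas"],
    'Fee': [20000, 22000, 22000, 30000],
    'hours': [30, 35, 30, 35]
})

# Step 1: pre-compute a value that needs two columns
fee_per_hour = df['Fee'] / df['hours']

# Step 2: transform the pre-computed Series per Courses group
df['AvgFeePerHour'] = fee_per_hour.groupby(df['Courses']).transform('mean')
print(df)
```

The transform step itself still only ever sees the single pre-computed Series.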
4. Different Ways of Passing a Function to GroupBy Transform
One of the most important parameters discussed above is func. The function that we want to apply to our DataFrame or Pandas Series can be passed in three different ways. In most cases you will use a predefined function, but it is always good to know the alternatives.
4.1 Passing a Pre-Defined function to GroupBy Transform
In the above examples, you have seen how to use the sum() function with Pandas groupby transform. For built-in aggregations such as sum, mean, and median, you can pass the name of the function directly as a string.
# Transform using the sum function
df['Perfect Course'] = df.groupby(['Courses']).hours.transform('sum')
print(df)
Yields below output. Look at the Perfect Course column for the PySpark rows.
# Output:
Courses Fee Budget hours Perfect Course
0 Spark 20000 2000 30 30
1 PySpark 22000 22000 35 65
2 PySpark 22000 24000 30 65
3 Pandas 30000 30000 35 35
It is worth mentioning here that some built-in functions can only be applied to a 1D array and will raise an error like “Expected a 1D array, got an array with shape (x, y)” if applied to a whole DataFrame, which is why a single column is selected before calling transform() above.
4.2 Passing a Lambda function
You have now seen the limitations of the built-in functions that you can apply with Pandas groupby transform. Pandas also allows you to use a lambda function instead of a built-in function. You can use a lambda expression in the following way.
# Transform Groupby Object
transformed_df = df.groupby('Courses')[['Fee','hours']].transform(lambda x: x.sum() - x.mean())
print(transformed_df)
Yields the following output. x in the lambda function is a Pandas Series (one column of one group), so we can apply almost any Pandas Series function to it.
# Output:
Fee hours
0 0.0 0.0
1 22000.0 32.5
2 22000.0 32.5
3 0.0 0.0
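A common practical use of a lambda with transform() is group-wise centering (or full standardization). As a sketch on the same kind of data (the hours_centered column name is mine):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "PySpark", "Pandas"],
    'Fee': [20000, 22000, 22000, 30000],
    'hours': [30, 35, 30, 35]
})

# Subtract each course group's mean hours from every row in that group
df['hours_centered'] = df.groupby('Courses')['hours'].transform(lambda x: x - x.mean())
print(df)
```

Single-row groups (Spark, Pandas) center to 0, while the two PySpark rows land at +2.5 and -2.5 around their group mean of 32.5.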
4.3 Passing a User-Defined Function
We can also use a user-defined function. This is really helpful when a function is used a lot: we can call a utility function that is defined for general-purpose use. In the following example, we pass the value 45 to the function, which is added to each value of the DataFrame.
# Use user defined function
def add_value(df_col, value):
    '''Add value to each item in the column'''
    return df_col + value

# Recreate the dataframe
df = pd.DataFrame(technologies)

# Transform using the user-defined function
transformed_df = df.groupby('Courses').transform(add_value, 45)
print(transformed_df)
Yields the following output, where the passed value has been added to every cell of the transformed DataFrame.
# Output:
Fee Budget hours
0 20045 2045 75
1 22045 22045 80
2 22045 24045 75
3 30045 30045 80
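A user-defined function passed to transform() does not have to be element-wise; it can return any same-length Series per group. As a sketch (share_of_group is a hypothetical helper of my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "PySpark", "Pandas"],
    'Fee': [20000, 22000, 22000, 30000]
})

def share_of_group(col):
    # Each value as a fraction of its group's total
    return col / col.sum()

df['FeeShare'] = df.groupby('Courses')['Fee'].transform(share_of_group)
print(df)
```

The two equal PySpark fees each take a 0.5 share of their group, while single-row groups get 1.0.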
5. GroupBy Aggregate Vs GroupBy Transform
The main difference between groupby aggregate() and groupby transform() is that transform() broadcasts the values back to the complete DataFrame and returns a DataFrame of the same shape as the original but with transformed values, while aggregate() returns one aggregated value per group for the specified columns.
def add_value(df_col, value):
    return df_col + value

# Using the transform() function
transformed_df = df.groupby('Courses').transform(add_value, 45)
print(transformed_df)

# Using the aggregate() function
agg_df = df.groupby('Courses').Fee.aggregate(add_value, 45)
print(agg_df)
Yields below output.
# Output:
Fee Budget hours
0 20045 2045 75
1 22045 22045 80
2 22045 24045 75
3 30045 30045 80
Courses
Pandas 30045
PySpark [22045, 22045]
Spark 20045
Name: Fee, dtype: object
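This index alignment is exactly what makes transform() useful for filtering rows, something aggregate() cannot do directly. A sketch of the common keep-the-group-maximum pattern:

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "PySpark", "Pandas"],
    'Fee': [20000, 22000, 24000, 30000]
})

# transform('max') aligns each group's maximum back to every row,
# so the comparison works row by row on the original DataFrame
max_fee = df.groupby('Courses')['Fee'].transform('max')
top_rows = df[df['Fee'] == max_fee]
print(top_rows)
```

An aggregate('max') result has one row per group and a different index, so it could not be compared against df['Fee'] without a merge.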
6. Pandas GroupBy Apply Vs GroupBy Transform
There are a lot of similarities between Pandas groupby apply() and groupby transform(). However, the key differences between apply() and transform() are:
- The Pandas groupby apply() function passes each group as an entire DataFrame, but groupby transform() passes each column individually as a Series.
- The return value of groupby transform() is either a Series or a DataFrame, but apply() can return a scalar, a Series, or a DataFrame.
- Pandas groupby transform() works on just one Series at a time, while groupby apply() works on the whole group DataFrame at once. This means you cannot access multiple columns inside the function passed to transform().
def add_cols_value(group, value):
    '''This works only with the apply() function'''
    return group['Fee'] + group['hours'] + value

def add_value(df_col, value):
    return df_col + value

# The add_value function can be used by both apply() and transform()
transformed_df = df.groupby('Courses').transform(add_value, 45)
print(transformed_df)

# This function can only be used with apply()
app_df = df.groupby('Courses').apply(add_cols_value, 45)
print(app_df)
Yields the following output.
# Output:
Fee Budget hours
0 20045 2045 75
1 22045 22045 80
2 22045 24045 75
3 30045 30045 80
0 1 2 3
Courses
Pandas NaN NaN NaN 30080.0
PySpark NaN 22080.0 22075.0 NaN
Spark 20075.0 NaN NaN NaN
Conclusion
In this article, I have explained the different use cases and examples of the Pandas groupby transform function. The DataFrameGroupBy.transform() function transforms the DataFrame with the specified function and returns a DataFrame having the same indexes as the original object.