The DataFrame.group_by() method in Polars is used to group the DataFrame by one or more columns (or expressions) and allows for applying aggregations or transformations to each group. It functions similarly to SQL’s GROUP BY but is optimized for better performance, particularly with large datasets.
In Polars, you can use group_by() in combination with sum(), mean(), max(), min(), agg(), and many more methods to perform various operations on grouped data. In this article, I will explain the Polars DataFrame group_by() method, including its syntax, parameters, and usage, and how to group by a single column or multiple columns, with examples.
Key Points –
- The group_by() method groups rows in a DataFrame based on the unique values of one or more specified columns or expressions.
- After grouping, you can apply aggregation functions such as sum(), mean(), min(), max(), count(), and more on the grouped data.
- You can group data by a single column or by multiple columns by providing a list of column names.
- It supports grouping by expressions, allowing more complex operations, such as applying functions or transformations during the grouping process.
- When maintain_order is set to True, the groups appear in the order they first occur in the data, which can be useful in certain analyses.
- It allows custom aggregations by using agg() with column expressions, enabling flexible data transformations (see the quick sketch after this list).
- Designed for high performance, group_by() operations in Polars are efficient and suitable for large-scale datasets.
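To tie these points together, here is a minimal, self-contained sketch using a small made-up DataFrame (separate from the examples later in this article).

import polars as pl

# A small made-up DataFrame to illustrate the key points above
df_demo = pl.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Group by one column, keep first-seen group order, and apply a custom aggregation
out = df_demo.group_by("group", maintain_order=True).agg(
    (pl.col("value") * 2).sum().alias("double_sum")
)
print(out)
# group "a" -> (1 + 3) * 2 = 8, group "b" -> (2 + 4) * 2 = 12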
Polars DataFrame.group_by() Introduction
Let’s look at the syntax of the Polars DataFrame group_by() method.
# Syntax of group_by()
DataFrame.group_by(
    by: IntoExpr | Iterable[IntoExpr],  # Column(s) or expression(s) to group by
    maintain_order: bool = False,       # Whether to keep groups in the order they first appear
    **named_by: IntoExpr,               # Additional named grouping expressions (optional)
) -> GroupBy
Parameters of the Polars DataFrame.group_by()
Following are the parameters of the Polars DataFrame.group_by() method.
- by – Specifies the column(s) or expression(s) to group by. This can be a single column or a list of columns (or expressions).
- maintain_order – A boolean flag that, when set to True, keeps the groups in the order they first appear in the data. Default: False.
- **named_by – Optional additional grouping columns or expressions passed as keyword arguments, where each keyword becomes the name of the resulting column. For example, fee_band=pl.col("Fee") // 10000 groups by the computed expression and names the output column fee_band (see the sketch after this list).
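Here is a minimal sketch of **named_by in action, assuming the df created in the Usage section below; the keyword fee_band is an illustrative name, not part of the API.

# Sketch of **named_by: the keyword 'fee_band' (an illustrative name)
# becomes the output column name for the grouping expression
result = df.group_by("Courses", fee_band=pl.col("Fee") // 10000).agg(
    pl.col("Discount").sum()
)
print(result)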
Return Value
It returns a GroupBy object, which can then be used to apply aggregation functions (e.g., sum(), mean(), etc.) using the .agg() method.
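For example, the GroupBy object can be stored in a variable and aggregated later; no computation happens until agg() is called. A minimal sketch, again assuming the df created in the next section:

# group_by() alone returns a GroupBy object; nothing is computed yet
gb = df.group_by("Courses")
# Calling agg() materializes the result, with one output row per group
result = gb.agg(pl.col("Fee").mean())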
Usage of Polars DataFrame.group_by()
The group_by() method is used to group rows in a DataFrame based on one or more columns or expressions, allowing for efficient aggregation and transformation of grouped data.
To demonstrate several examples of how to perform a Polars group by on a DataFrame, first, let’s create a DataFrame.
import polars as pl

technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration': ['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount': [1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
Yields the output below.
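The printed DataFrame should look similar to this (exact formatting may vary across Polars versions):

# Output:
# shape: (9, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Hadoop  ┆ 23000 ┆ 55days   ┆ 1000     │
│ Python  ┆ 24000 ┆ 40days   ┆ 1200     │
│ Pandas  ┆ 26000 ┆ 60days   ┆ 2500     │
│ Hadoop  ┆ 25000 ┆ 35days   ┆ null     │
│ Spark   ┆ 25000 ┆ 30days   ┆ 1400     │
│ Python  ┆ 22000 ┆ 50days   ┆ 1600     │
│ NA      ┆ 1500  ┆ 40days   ┆ 0        │
└─────────┴───────┴──────────┴──────────┘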
Here’s how you can group by one column and compute the grouped sum of another column using Polars.
# Group by 'Courses' and calculate the sum of 'Fee'
df2 = df.group_by("Courses").agg(pl.col("Fee").sum())
print("Get sum of grouped data:\n", df2)
Here,
- df.group_by("Courses") groups the DataFrame by the unique values in the "Courses" column. Each unique course becomes a group.
- agg(pl.col("Fee").sum()) calculates the total of the values in the “Fee” column for each group. This operation combines all rows within the same group into a single row, with the “Fee” column holding the summed value.
- The output is a new DataFrame (df2) where each row represents a unique course, along with the total fee for that course.
The group_by() function groups the DataFrame based on a specific column, while the agg() function performs aggregate operations on the grouped data. In the example below, we group the DataFrame by the “Courses” column and aggregate the “Discount” column.
# Use grouping with maintain_order
result = df.group_by("Courses", maintain_order=True).agg(pl.col("Discount"))
print(result)
# Output:
# shape: (6, 2)
┌─────────┬──────────────┐
│ Courses ┆ Discount │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═════════╪══════════════╡
│ Spark ┆ [1000, 1400] │
│ PySpark ┆ [2300] │
│ Hadoop ┆ [1000, null] │
│ Python ┆ [1200, 1600] │
│ Pandas ┆ [2500] │
│ NA ┆ [0] │
└─────────┴──────────────┘
In the above example, the maintain_order=True argument ensures that the groups appear in the same order in which they first occur in the original DataFrame. Aggregating the “Discount” column without an aggregation function results in a list of values for each “Courses” group.
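Note that without maintain_order=True, the order of the groups in the output is not guaranteed. If you only need a deterministic result, an alternative sketch is to sort after aggregating:

# Alternative: leave group order unspecified, then sort the result explicitly
result = df.group_by("Courses").agg(pl.col("Discount")).sort("Courses")
print(result)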
Polars group_by() on Two or More Columns
Alternatively, you can group data by two or more columns by passing a list of column names to the group_by() function. After grouping, you can apply aggregations, transformations, or other operations to the grouped data.
# Group by both 'Courses' and 'Duration', and aggregate using sum
result = df.group_by(['Courses', 'Duration']).agg(
pl.col('Fee').sum().alias('Total'))
print(result)
# Output:
# shape: (8, 3)
┌─────────┬──────────┬───────────┐
│ Courses ┆ Duration ┆ Total │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════╪══════════╪═══════════╡
│ Spark ┆ 30days ┆ 47000 │
│ NA ┆ 40days ┆ 1500 │
│ Hadoop ┆ 55days ┆ 23000 │
│ PySpark ┆ 50days ┆ 25000 │
│ Python ┆ 40days ┆ 24000 │
│ Python ┆ 50days ┆ 22000 │
│ Hadoop ┆ 35days ┆ 25000 │
│ Pandas ┆ 60days ┆ 26000 │
└─────────┴──────────┴───────────┘
Here,
- The group_by() function is used to group the DataFrame by both the Courses and Duration columns.
- The agg() function is then applied to perform an aggregation on the Fee column, calculating the sum for each group.
- The result shows the sum of Fee for each combination of Courses and Duration (a row-count companion sketch follows below).
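As referenced above, here is a small companion sketch that counts the rows in each (Courses, Duration) group using pl.len(), Polars’ row-count expression (older Polars versions use pl.count() instead):

# Count the rows in each ('Courses', 'Duration') combination
counts = df.group_by(['Courses', 'Duration']).agg(pl.len().alias('Count'))
print(counts)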
GroupBy with Multiple Aggregations
You can perform group_by() operations with multiple aggregations on a Polars DataFrame. By passing multiple aggregation functions within the agg() method, you can compute a variety of summary statistics for each group in a single step.
# Group by both 'Courses' and 'Duration', apply multiple aggregations
result = df.group_by(['Courses', 'Duration']).agg([
pl.col('Fee').sum().alias('Total_Fee'),
pl.col('Fee').mean().alias('Mean_Fee'),
pl.col('Fee').max().alias('Max_Fee'),
pl.col('Discount').min().alias('Min_Discount'),
pl.col('Discount').max().alias('Max_Discount')
])
print(result)
# Output:
# shape: (8, 7)
┌─────────┬──────────┬───────────┬──────────┬─────────┬──────────────┬──────────────┐
│ Courses ┆ Duration ┆ Total_Fee ┆ Mean_Fee ┆ Max_Fee ┆ Min_Discount ┆ Max_Discount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ f64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪══════════╪═══════════╪══════════╪═════════╪══════════════╪══════════════╡
│ Python ┆ 50days ┆ 22000 ┆ 22000.0 ┆ 22000 ┆ 1600 ┆ 1600 │
│ PySpark ┆ 50days ┆ 25000 ┆ 25000.0 ┆ 25000 ┆ 2300 ┆ 2300 │
│ Hadoop ┆ 55days ┆ 23000 ┆ 23000.0 ┆ 23000 ┆ 1000 ┆ 1000 │
│ Pandas ┆ 60days ┆ 26000 ┆ 26000.0 ┆ 26000 ┆ 2500 ┆ 2500 │
│ Spark ┆ 30days ┆ 47000 ┆ 23500.0 ┆ 25000 ┆ 1000 ┆ 1400 │
│ Hadoop ┆ 35days ┆ 25000 ┆ 25000.0 ┆ 25000 ┆ null ┆ null │
│ Python ┆ 40days ┆ 24000 ┆ 24000.0 ┆ 24000 ┆ 1200 ┆ 1200 │
│ NA ┆ 40days ┆ 1500 ┆ 1500.0 ┆ 1500 ┆ 0 ┆ 0 │
└─────────┴──────────┴───────────┴──────────┴─────────┴──────────────┴──────────────┘
Here,
- The group_by() function groups the DataFrame by both the Courses and Duration columns.
- The agg() function is used to perform multiple aggregation operations. The sum(), mean(), and max() functions are applied to the Fee column, while the min() and max() functions are applied to the Discount column.
- The result includes each group (defined by the combination of Courses and Duration) along with the respective aggregation results (a compact multi-column sketch follows below).
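As a compact alternative, recent Polars versions let you apply one aggregation across several columns at once and rename the outputs with name.suffix(); a sketch, assuming both features are available in your Polars version:

# Sum both 'Fee' and 'Discount' per course in a single expression,
# appending '_sum' to each output column name
result = df.group_by("Courses").agg(
    pl.col("Fee", "Discount").sum().name.suffix("_sum")
)
print(result)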
Using Expressions with group_by()
You can also pass positional arguments to group by multiple columns in the same way, and expressions are supported as well for defining custom groupings or computations.
# Using expressions with group_by()
result = df.group_by("Courses", pl.col("Fee") // 2).agg(pl.col("Discount").mean())
print(result)
# Output:
# shape: (9, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fee ┆ Discount │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═════════╪═══════╪══════════╡
│ Pandas ┆ 13000 ┆ 2500.0 │
│ Python ┆ 12000 ┆ 1200.0 │
│ Spark ┆ 11000 ┆ 1000.0 │
│ Hadoop ┆ 11500 ┆ 1000.0 │
│ Hadoop ┆ 12500 ┆ null │
│ NA ┆ 750 ┆ 0.0 │
│ Python ┆ 11000 ┆ 1600.0 │
│ PySpark ┆ 12500 ┆ 2300.0 │
│ Spark ┆ 12500 ┆ 1400.0 │
└─────────┴───────┴──────────┘
Similarly, you can enhance your group_by() operations by using expressions, which allow for more flexibility and customization in grouping and aggregations. Expressions can be used to create computed columns or perform complex transformations during the grouping process.
# Group by 'Courses' and use expressions for custom aggregations
result = df.group_by("Courses").agg([
(pl.col("Fee") + 100).sum().alias("Sum_Plus_100"),
(pl.col("Discount") * 2).mean().alias("Mean_Double_Values"),
(pl.col("Fee").rank(descending=True)).max().alias("Max_Rank")
])
print(result)
# Output:
# shape: (6, 4)
┌─────────┬──────────────┬────────────────────┬──────────┐
│ Courses ┆ Sum_Plus_100 ┆ Mean_Double_Values ┆ Max_Rank │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ f64 │
╞═════════╪══════════════╪════════════════════╪══════════╡
│ Hadoop ┆ 48200 ┆ 2000.0 ┆ 2.0 │
│ Python ┆ 46200 ┆ 2800.0 ┆ 2.0 │
│ NA ┆ 1600 ┆ 0.0 ┆ 1.0 │
│ Spark ┆ 47200 ┆ 2400.0 ┆ 2.0 │
│ PySpark ┆ 25100 ┆ 4600.0 ┆ 1.0 │
│ Pandas ┆ 26100 ┆ 5000.0 ┆ 1.0 │
└─────────┴──────────────┴────────────────────┴──────────┘
Conclusion
In this article, I have explained the Polars DataFrame group_by() method, including its syntax, parameters, and usage. This method returns a GroupBy object, allowing you to perform aggregation operations on the grouped data. I also covered how to group by a single column or multiple columns using group_by(), with examples.
Happy Learning !!
Related Articles
- Polars DataFrame.filter() Usage & Examples
- Polars DataFrame.sort() Method
- Polars DataFrame.melt() Method
- Polars DataFrame.explode() Method
- Polars DataFrame.join() Explained With Examples