
The DataFrame.group_by() method in Polars is used to group a DataFrame by one or more columns (or expressions) and apply aggregations or transformations to each group. It works much like SQL’s GROUP BY, but is optimized for performance, particularly on large datasets.


In Polars, you can combine group_by() with methods such as sum(), mean(), max(), min(), and agg() to perform various operations on grouped data. In this article, I will explain the Polars DataFrame group_by() method, covering its syntax, parameters, and usage, and show how to group by a single column or by multiple columns, with examples.

Key Points –

  • The group_by() method groups rows in a DataFrame based on the unique values of one or more specified columns or expressions.
  • After grouping, you can apply aggregation functions such as sum(), mean(), min(), max(), count(), and more on the grouped data (a quick count sketch follows this list).
  • You can group data by a single column or multiple columns by providing a list of column names.
  • Supports grouping by expressions, allowing more complex operations, such as applying functions or transformations during the grouping process.
  • The maintain_order parameter, when set to True, preserves the original row order within groups, which can be useful in certain analyses.
  • Allows custom aggregations by using agg() with column expressions, enabling flexible data transformations.
  • Designed for high performance, group_by() operations in Polars are efficient and suitable for large-scale datasets.
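Since the count aggregation mentioned above isn’t demonstrated in the examples that follow, here is a minimal sketch. It assumes a recent Polars version where pl.len() counts rows per group (older releases expose this as pl.count()):


# Count the number of rows in each group
import polars as pl

df = pl.DataFrame({'Courses': ["Spark", "Spark", "Hadoop"]})
print(df.group_by("Courses").agg(pl.len().alias("Count")))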

Polars DataFrame.group_by() Introduction

Let’s look at the syntax of the Polars DataFrame group_by() method.


# Syntax of group_by()
DataFrame.group_by(
    by: IntoExpr | Iterable[IntoExpr],   # Column(s) or expression(s) to group by
    maintain_order: bool = False,         # Whether to maintain the original order of rows
    **named_by: IntoExpr,                 # Additional columns/expressions to group by, passed as keyword arguments (optional)
) → GroupBy

Parameters of the Polars DataFrame.group_by()

Following are the parameters of the Polars DataFrame.group_by() method.

  • by – Specifies the column(s) or expression(s) to group by. This can be a single column or a list of columns (or expressions).
  • maintain_order – A boolean flag that, when set to True, keeps the original row order within each group. Default: False.
  • **named_by – Optional additional columns or expressions to group by, passed as keyword arguments. The keyword becomes the column name in the output, e.g., fee_band=pl.col("Fee") // 1000 (see the sketch below).
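A minimal sketch of grouping by a named expression (the fee_band keyword is an illustrative name of my choosing, not part of the API):


# Group by a derived key passed as a keyword argument; the keyword
# ('fee_band', chosen here for illustration) becomes the column name
import polars as pl

df = pl.DataFrame({'Fee': [22000, 25000, 23000]})
print(df.group_by(fee_band=pl.col("Fee") // 1000).agg(pl.col("Fee").mean()))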

Return Value

It returns a GroupBy object, which can then be used to apply aggregation functions (e.g., sum(), mean(), etc.) using the .agg() method.
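To make this two-step flow concrete, here is a minimal sketch; group_by() itself performs no computation until agg() is called on the returned object:


# group_by() returns a GroupBy object; aggregation happens in agg()
import polars as pl

df = pl.DataFrame({'Courses': ["Spark", "Spark", "Hadoop"],
                   'Fee': [22000, 25000, 23000]})

gb = df.group_by("Courses")   # GroupBy object, nothing computed yet
print(gb.agg(pl.col("Fee").sum()))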

Usage of Polars DataFrame.group_by()

The group_by() method groups rows in a DataFrame based on one or more columns or expressions, allowing for efficient aggregation and transformation of the grouped data.

To demonstrate several group-by examples, let’s first create a DataFrame.


import polars as pl

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python", "NA"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 1500],
    'Duration': ['30days', '50days', '55days', '40days', '60days', '35days', '30days', '50days', '40days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, None, 1400, 1600, 0]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

The above example creates a Polars DataFrame with the columns Courses, Fee, Duration, and Discount.

Here’s how you can group by one column and compute the grouped sum of another column using Polars.


# Group by 'Courses' and calculate the sum of 'Fee'
df2 = df.group_by("Courses").agg(pl.col("Fee").sum())
print("Get sum of grouped data:\n", df2)

Here,

  • df.group_by("Courses") groups the DataFrame by unique values in the column "Courses". Each unique course becomes a group.
  • agg(pl.col("Fee").sum()) calculates the total of the values in the “Fee” column for each group. This operation combines all rows within the same group into a single row, with the “Fee” column displaying the summed value.
  • The output is a new DataFrame (df2) where each row represents a unique course, along with the total fee for that course.

The group_by() function groups the DataFrame based on a specific column, while the agg() function performs aggregate operations on the grouped data. In the following example, we group the DataFrame by the “Courses” column and aggregate the “Discount” column.


# Use grouping with maintain_order
result = df.group_by("Courses", maintain_order=True).agg(pl.col("Discount"))
print(result)

# Output:
# shape: (6, 2)
┌─────────┬──────────────┐
│ Courses ┆ Discount     │
│ ---     ┆ ---          │
│ str     ┆ list[i64]    │
╞═════════╪══════════════╡
│ Spark   ┆ [1000, 1400] │
│ PySpark ┆ [2300]       │
│ Hadoop  ┆ [1000, null] │
│ Python  ┆ [1200, 1600] │
│ Pandas  ┆ [2500]       │
│ NA      ┆ [0]          │
└─────────┴──────────────┘

In the above example, the maintain_order=True argument ensures that the order of the rows in the output remains the same as in the original DataFrame. The aggregation of the “Discount” column for each group will result in a list of values for each “Courses” group.

Polars group_by() on Two or More Columns

Alternatively, you can group data by two or more columns by passing a list of column names to the group_by() function. After grouping, you can apply aggregations, transformations, or other operations to the grouped data.


# Group by both 'Courses' and 'Duration', and aggregate using sum
result = df.group_by(['Courses', 'Duration']).agg(
    pl.col('Fee').sum().alias('Total'))
print(result)

# Output:
# shape: (8, 3)
┌─────────┬──────────┬───────────┐
│ Courses ┆ Duration ┆ Total     │
│ ---     ┆ ---      ┆ ---       │
│ str     ┆ str      ┆ i64       │
╞═════════╪══════════╪═══════════╡
│ Spark   ┆ 30days   ┆ 47000     │
│ NA      ┆ 40days   ┆ 1500      │
│ Hadoop  ┆ 55days   ┆ 23000     │
│ PySpark ┆ 50days   ┆ 25000     │
│ Python  ┆ 40days   ┆ 24000     │
│ Python  ┆ 50days   ┆ 22000     │
│ Hadoop  ┆ 35days   ┆ 25000     │
│ Pandas  ┆ 60days   ┆ 26000     │
└─────────┴──────────┴───────────┘

Here,

  • The group_by() function is used to group the DataFrame by both the Courses and Duration columns.
  • The agg() function is then applied to perform an aggregation on the Fee column, calculating the sum for each group.
  • The result shows the sum of Fee for each combination of Courses and Duration.

GroupBy with Multiple Aggregations

You can perform group_by() operations with multiple aggregations on a Polars DataFrame. By passing multiple aggregation expressions within the agg() method, you can compute a variety of summary statistics for each group in a single step.


# Group by both 'Courses' and 'Duration', apply multiple aggregations
result = df.group_by(['Courses', 'Duration']).agg([
    pl.col('Fee').sum().alias('Total_Fee'),
    pl.col('Fee').mean().alias('Mean_Fee'),
    pl.col('Fee').max().alias('Max_Fee'),
    pl.col('Discount').min().alias('Min_Discount'),
    pl.col('Discount').max().alias('Max_Discount')
])
print(result)

# Output:
# shape: (8, 7)
┌─────────┬──────────┬───────────┬──────────┬─────────┬──────────────┬──────────────┐
│ Courses ┆ Duration ┆ Total_Fee ┆ Mean_Fee ┆ Max_Fee ┆ Min_Discount ┆ Max_Discount │
│ ---     ┆ ---      ┆ ---       ┆ ---      ┆ ---     ┆ ---          ┆ ---          │
│ str     ┆ str      ┆ i64       ┆ f64      ┆ i64     ┆ i64          ┆ i64          │
╞═════════╪══════════╪═══════════╪══════════╪═════════╪══════════════╪══════════════╡
│ Python  ┆ 50days   ┆ 22000     ┆ 22000.0  ┆ 22000   ┆ 1600         ┆ 1600         │
│ PySpark ┆ 50days   ┆ 25000     ┆ 25000.0  ┆ 25000   ┆ 2300         ┆ 2300         │
│ Hadoop  ┆ 55days   ┆ 23000     ┆ 23000.0  ┆ 23000   ┆ 1000         ┆ 1000         │
│ Pandas  ┆ 60days   ┆ 26000     ┆ 26000.0  ┆ 26000   ┆ 2500         ┆ 2500         │
│ Spark   ┆ 30days   ┆ 47000     ┆ 23500.0  ┆ 25000   ┆ 1000         ┆ 1400         │
│ Hadoop  ┆ 35days   ┆ 25000     ┆ 25000.0  ┆ 25000   ┆ null         ┆ null         │
│ Python  ┆ 40days   ┆ 24000     ┆ 24000.0  ┆ 24000   ┆ 1200         ┆ 1200         │
│ NA      ┆ 40days   ┆ 1500      ┆ 1500.0   ┆ 1500    ┆ 0            ┆ 0            │
└─────────┴──────────┴───────────┴──────────┴─────────┴──────────────┴──────────────┘

Here,

  • The group_by() function groups the DataFrame by both the Courses and Duration columns.
  • The agg() function is used to perform multiple aggregation operations. The sum(), mean(), and max() functions are applied to the Fee column, while the min() and max() functions are applied to the Discount column.
  • The result includes each group (defined by the combination of Courses and Duration) along with the respective aggregation results.

Using Expressions with group_by()

You can also pass multiple columns or expressions as positional arguments to define custom groupings or computations. In the example below, rows are grouped by Courses together with the computed expression pl.col("Fee") // 2.


# Using expressions with group_by()
result = df.group_by("Courses", pl.col("Fee") // 2).agg(pl.col("Discount").mean())
print(result)

# Output:
# shape: (9, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fee   ┆ Discount │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ f64      │
╞═════════╪═══════╪══════════╡
│ Pandas  ┆ 13000 ┆ 2500.0   │
│ Python  ┆ 12000 ┆ 1200.0   │
│ Spark   ┆ 11000 ┆ 1000.0   │
│ Hadoop  ┆ 11500 ┆ 1000.0   │
│ Hadoop  ┆ 12500 ┆ null     │
│ NA      ┆ 750   ┆ 0.0      │
│ Python  ┆ 11000 ┆ 1600.0   │
│ PySpark ┆ 12500 ┆ 2300.0   │
│ Spark   ┆ 12500 ┆ 1400.0   │
└─────────┴───────┴──────────┘

Similarly, you can enhance your group_by() operations by using expressions, which allow for more flexibility and customization in grouping and aggregations. Expressions can be used to create computed columns or perform complex transformations during the grouping process.


# Group by Category and use expressions for custom aggregations
result = df.group_by("Courses").agg([
    (pl.col("Fee") + 100).sum().alias("Sum_Plus_100"),
    (pl.col("Discount") * 2).mean().alias("Mean_Double_Values"),
    (pl.col("Fee").rank(descending=True)).max().alias("Max_Rank")
])
print(result)

# Output:
# shape: (6, 4)
┌─────────┬──────────────┬────────────────────┬──────────┐
│ Courses ┆ Sum_Plus_100 ┆ Mean_Double_Values ┆ Max_Rank │
│ ---     ┆ ---          ┆ ---                ┆ ---      │
│ str     ┆ i64          ┆ f64                ┆ f64      │
╞═════════╪══════════════╪════════════════════╪══════════╡
│ Hadoop  ┆ 48200        ┆ 2000.0             ┆ 2.0      │
│ Python  ┆ 46200        ┆ 2800.0             ┆ 2.0      │
│ NA      ┆ 1600         ┆ 0.0                ┆ 1.0      │
│ Spark   ┆ 47200        ┆ 2400.0             ┆ 2.0      │
│ PySpark ┆ 25100        ┆ 4600.0             ┆ 1.0      │
│ Pandas  ┆ 26100        ┆ 5000.0             ┆ 1.0      │
└─────────┴──────────────┴────────────────────┴──────────┘

Conclusion

In this article, I have explained the Polars DataFrame group_by() method, covering its syntax, parameters, and usage. The method returns a GroupBy object, allowing you to perform aggregation operations on the grouped data. I also covered how to group by a single column or by multiple columns using group_by(), with examples.

Happy Learning !!
