  • Post category:Polars
  • Post last modified:March 3, 2025
  • Reading time:12 mins read
Polars DataFrame partition_by() Usage & Examples

In Polars, the partition_by() function is used to split a DataFrame into multiple smaller DataFrames based on unique values in one or more columns. It is similar to SQL’s PARTITION BY but returns a collection of DataFrames instead of modifying values within the original DataFrame. This is particularly useful when you need to perform operations on distinct groups within your data.


In this article, I will explain the partition_by() function of a Polars DataFrame, covering its syntax, parameters, and usage, and show how it returns a list of DataFrames, where each DataFrame corresponds to a unique group based on the specified column(s).

Key Points –

  • partition_by() splits a DataFrame into multiple DataFrames based on unique values in one or more specified columns.
  • It takes column names as arguments and returns a list or dictionary of partitioned DataFrames.
  • You can partition by one or multiple columns for more granular grouping.
  • The maintain_order parameter controls whether the original row order is preserved within partitions.
  • By default, it returns a list of DataFrames, but setting as_dict=True returns a dictionary whose keys are tuples of the group values.
  • Unlike SQL’s PARTITION BY, which is used in window functions, Polars’ partition_by() splits the DataFrame into separate subsets.
  • partition_by() does not modify the original DataFrame; it returns new, independent DataFrames.
  • Setting include_key=False excludes the partitioning column(s) from each resulting DataFrame.

Polars DataFrame partition_by() Introduction

The following is the syntax of the Polars DataFrame partition_by() function.


# Syntax of partition_by()
DataFrame.partition_by(
    by: ColumnNameOrSelector | Sequence[ColumnNameOrSelector],
    *more_by: ColumnNameOrSelector,
    maintain_order: bool = True,
    include_key: bool = True,
    as_dict: bool = False,
) → list[DataFrame] | dict[tuple[object, ...], DataFrame]

Parameters of the Polars DataFrame.partition_by()

Following are the parameters of the partition_by() function.

  • by (str or list[str]) – Column(s) used for partitioning.
  • *more_by (str, optional) – Additional column(s) for partitioning.
  • maintain_order (bool, default=True) – Maintains the original row order within partitions if True.
  • include_key (bool, default=True) – Includes partitioning column(s) in the output if True.
  • as_dict (bool, default=False) –
    • If False: Returns a list of DataFrames.
    • If True: Returns a dictionary, where keys are tuples representing group values, and values are DataFrames.

Return Value

This function returns a list of DataFrames by default, or a dictionary {(key, …): DataFrame} if as_dict=True.

Usage of Polars DataFrame partition_by() Function

The partition_by() function in Polars splits a DataFrame into multiple smaller DataFrames based on unique values in one or more specified columns. Each resulting DataFrame contains only the rows belonging to a specific partition.

To run some examples of the Polars DataFrame partition_by() function, let’s create a Polars DataFrame.


import polars as pl

technologies = {
    'Courses': ["spark", "python", "spark", "python", "pandas"],
    'Fees': [22000, 25000, 22000, 25000, 24000],
    'Duration': ['30days', '40days', '60days', '40days', '50days'],
    'Discount': [1000, 1500, 1000, 2000, 2500]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

Yields below output.


To partition the Polars DataFrame by a single column (e.g., "Courses"), we can use the partition_by() function. This will split the DataFrame into multiple DataFrames based on unique values in the "Courses" column.


# Partitioning by 'Courses' column
partitions = df.partition_by("Courses")

# Printing each partitioned DataFrame
for partition in partitions:
    print(partition)

Here,

  • df.partition_by("Courses") splits the DataFrame based on unique values in the "Courses" column.
  • The result is a list of DataFrames, each containing rows with the same "Courses" value.
  • We iterate through the partitions and print each subset.

Partitioning by Multiple Columns

To partition a Polars DataFrame by multiple columns, we can pass a list of column names to the partition_by() function. This will create partitions based on unique combinations of values from the specified columns.


# Partitioning by multiple columns: 'Courses' and 'Duration'
partitions = df.partition_by(["Courses", "Duration"])
for partition in partitions:
    print(partition)
    
# Output:
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ spark   ┆ 22000 ┆ 30days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ python  ┆ 25000 ┆ 40days   ┆ 1500     │
│ python  ┆ 25000 ┆ 40days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ spark   ┆ 22000 ┆ 60days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas  ┆ 24000 ┆ 50days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • df.partition_by(["Courses", "Duration"]) groups rows based on unique combinations of "Courses" and "Duration".
  • The output is a list of DataFrames, where each subset contains rows sharing the same "Courses" and "Duration".
  • We iterate through the partitions and print each subset.

Returning Partitions as a Dictionary

To return partitions as a dictionary, pass as_dict=True to partition_by(). This generates a dictionary where keys are tuples representing unique combinations of values from the partitioning columns, and values are the corresponding DataFrames for each partition.


# Partitioning by 'Courses' and returning as a dictionary
partitions = df.partition_by("Courses", as_dict=True)

for key, value in partitions.items():
    print(f"Partition for {key}:\n", value)
    
# Output:
# Partition for ('spark',):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ spark   ┆ 22000 ┆ 60days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘
# Partition for ('python',):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ python  ┆ 25000 ┆ 40days   ┆ 1500     │
│ python  ┆ 25000 ┆ 40days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘
# Partition for ('pandas',):
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas  ┆ 24000 ┆ 50days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • df.partition_by("Courses", as_dict=True) creates a dictionary of partitions.
  • The keys are one-element tuples of the unique "Courses" values (e.g., ("spark",), ("python",), ("pandas",)).
  • The values are DataFrames containing only rows matching each "Courses" value.
  • We iterate over the dictionary to print each partition.

Improving Performance by Disabling Order Maintenance

By default, partition_by() maintains the original row order within each partition. This guarantee adds overhead on large datasets; passing maintain_order=False lets Polars skip it, which can improve performance at the cost of a non-deterministic partition order.


# Partitioning without maintaining order
partitions = df.partition_by("Courses", maintain_order=False, as_dict=True)

for key, value in partitions.items():
    print(f"Partition for {key}:\n", value)
    
# Output:
# Partition for ('spark',):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ spark   ┆ 22000 ┆ 60days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘
# Partition for ('pandas',):
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas  ┆ 24000 ┆ 50days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘
# Partition for ('python',):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ python  ┆ 25000 ┆ 40days   ┆ 1500     │
│ python  ┆ 25000 ┆ 40days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘

Conclusion

In conclusion, the partition_by() function in Polars is a powerful tool for splitting DataFrames into multiple partitions based on column values. It enables efficient data partitioning for analysis, storage, and parallel processing.

Happy Learning!!
