In Polars, the partition_by() function is used to split a DataFrame into multiple smaller DataFrames based on unique values in one or more columns. It is similar to SQL’s PARTITION BY but returns a collection of DataFrames instead of modifying values within the original DataFrame. This is particularly useful when you need to perform operations on distinct groups within your data.
In this article, I will explain the partition_by() function of a Polars DataFrame, covering its syntax, parameters, and usage, and show how it returns a collection of DataFrames where each one corresponds to a unique group based on the specified column(s).
Key Points –
- partition_by() splits a DataFrame into multiple DataFrames based on unique values in one or more specified columns.
- It takes column names as arguments and returns a list or dictionary of partitioned DataFrames.
- You can partition by one or multiple columns for more granular grouping.
- The maintain_order parameter controls whether the original row order is preserved within partitions.
- By default, it returns a list of DataFrames, but setting as_dict=True returns a dictionary where keys are tuples of the unique group values.
- Unlike SQL’s PARTITION BY, which is used in window functions, Polars’ partition_by() splits the DataFrame into separate subsets.
- partition_by() does not modify the original DataFrame; it returns new DataFrames containing copies of the grouped rows.
Polars DataFrame partition_by() Introduction
Let’s look at the syntax of the Polars DataFrame partition_by() function.
# Syntax of partition_by()
DataFrame.partition_by(
by: ColumnNameOrSelector | Sequence[ColumnNameOrSelector],
*more_by: ColumnNameOrSelector,
maintain_order: bool = True,
include_key: bool = True,
as_dict: bool = False,
) → list[DataFrame] | dict[tuple[object, ...], DataFrame]
Parameters of the Polars DataFrame.partition_by()
Following are the parameters of the partition_by() function.

- by (str or list[str]) – Column(s) used for partitioning.
- *more_by (str, optional) – Additional column(s) for partitioning.
- maintain_order (bool, default=True) – Maintains the original row order within partitions if True.
- include_key (bool, default=True) – Includes the partitioning column(s) in the output if True.
- as_dict (bool, default=False) – If False, returns a list of DataFrames. If True, returns a dictionary where keys are tuples representing group values and values are DataFrames.
Return Value
This function returns a list of DataFrames by default, or a dictionary {(key, …): DataFrame} when as_dict=True.
Usage of Polars DataFrame partition_by() Function
The partition_by() function in Polars splits a DataFrame into multiple smaller DataFrames based on unique values in one or more specified columns. Each resulting DataFrame contains only the rows belonging to a specific partition.
To run some examples of the Polars DataFrame partition_by() function, let’s create a Polars DataFrame.
import polars as pl
technologies = {
    'Courses': ["spark", "python", "spark", "python", "pandas"],
    'Fees': [22000, 25000, 22000, 25000, 24000],
    'Duration': ['30days', '40days', '60days', '40days', '50days'],
    'Discount': [1000, 1500, 1000, 2000, 2500]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
Yields below output.

To partition the Polars DataFrame by a single column (e.g., "Courses"), we can use the partition_by() function. This will split the DataFrame into multiple DataFrames based on unique values in the "Courses" column.
# Partitioning by 'Courses' column
partitions = df.partition_by("Courses")
# Printing each partitioned DataFrame
for partition in partitions:
    print(partition)
Here,

- df.partition_by("Courses") splits the DataFrame based on unique values in the "Courses" column.
- The result is a list of DataFrames, each containing rows with the same "Courses" value.
- We iterate through the partitions and print each subset.

Partitioning by Multiple Columns
To partition a Polars DataFrame by multiple columns, we can pass a list of column names to the partition_by() function. This will create partitions based on unique combinations of values from the specified columns.
# Partitioning by multiple columns: 'Courses' and 'Duration'
partitions = df.partition_by(["Courses", "Duration"])
for partition in partitions:
    print(partition)
# Output:
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ spark ┆ 22000 ┆ 30days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ python ┆ 25000 ┆ 40days ┆ 1500 │
│ python ┆ 25000 ┆ 40days ┆ 2000 │
└─────────┴───────┴──────────┴──────────┘
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ spark ┆ 22000 ┆ 60days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas ┆ 24000 ┆ 50days ┆ 2500 │
└─────────┴───────┴──────────┴──────────┘
Here,

- df.partition_by(["Courses", "Duration"]) groups rows based on unique combinations of "Courses" and "Duration".
- The output is a list of DataFrames, where each subset contains rows sharing the same "Courses" and "Duration" values.
- We iterate through the partitions and print each subset.
Returning Partitions as a Dictionary
To return partitions as a dictionary, pass as_dict=True to partition_by(). This produces a dictionary where keys are tuples representing unique combinations of values from the partitioning columns, and values are the corresponding DataFrames for each partition.
# Partitioning by 'Courses' and returning as a dictionary
partitions = df.partition_by("Courses", as_dict=True)
for key, value in partitions.items():
    print(f"Partition for {key}:\n", value)
# Output:
# Partition for ('spark',):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ spark ┆ 22000 ┆ 30days ┆ 1000 │
│ spark ┆ 22000 ┆ 60days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
# Partition for ('python',):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ python ┆ 25000 ┆ 40days ┆ 1500 │
│ python ┆ 25000 ┆ 40days ┆ 2000 │
└─────────┴───────┴──────────┴──────────┘
# Partition for ('pandas',):
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas ┆ 24000 ┆ 50days ┆ 2500 │
└─────────┴───────┴──────────┴──────────┘
Here,

- df.partition_by("Courses", as_dict=True) creates a dictionary of partitions.
- The keys are tuples of the unique values in "Courses" (e.g., ("spark",), ("python",), ("pandas",)), as shown in the output above.
- The values are DataFrames containing only rows matching each "Courses" value.
- We iterate over the dictionary to print each partition.
Improving Performance by Disabling Order Maintenance
By default, partition_by() maintains the original row order within each partition. This bookkeeping can slow things down on large datasets; passing maintain_order=False lets Polars skip it, at the cost of a non-deterministic partition order.
# Partitioning without maintaining order
partitions = df.partition_by("Courses", maintain_order=False, as_dict=True)
for key, value in partitions.items():
    print(f"Partition for {key}:\n", value)
# Output:
# Partition for ('spark',):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ spark ┆ 22000 ┆ 30days ┆ 1000 │
│ spark ┆ 22000 ┆ 60days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
# Partition for ('pandas',):
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas ┆ 24000 ┆ 50days ┆ 2500 │
└─────────┴───────┴──────────┴──────────┘
# Partition for ('python',):
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ python ┆ 25000 ┆ 40days ┆ 1500 │
│ python ┆ 25000 ┆ 40days ┆ 2000 │
└─────────┴───────┴──────────┴──────────┘
Conclusion
In conclusion, the partition_by() function in Polars is a powerful tool for splitting DataFrames into multiple partitions based on column values. It enables efficient data partitioning for analysis, storage, and parallel processing.
Happy Learning!!