  • Post last modified:December 18, 2024

In Polars, the sort() method sorts a DataFrame by one or more columns, in ascending or descending order for each column. Sorting is multithreaded by default, which helps it scale to large datasets.


In this article, I will explain the Polars DataFrame.sort() method, covering its syntax, parameters, and usage, and demonstrate how it returns a new DataFrame sorted according to the specified conditions.

Key Points –

  • The sort() method organizes rows of a DataFrame based on specified column(s).
  • It allows sorting by a single column or multiple columns simultaneously.
  • The by parameter takes a column name (str), an expression, or a list of column names or expressions.
  • The sort() method does not modify the original DataFrame but returns a new, sorted DataFrame.
  • It works with columns of integers, floats, strings, and other sortable data types.

Polars DataFrame.sort() Introduction

Following is the syntax of the Polars DataFrame sort() method.


# Syntax of polars DataFrame.sort()
DataFrame.sort(
    by: IntoExpr | Iterable[IntoExpr],       # Column(s) or expressions to sort by
    *more_by: IntoExpr,                      # Additional columns/expressions for sorting
    descending: bool | Sequence[bool] = False,  # Sort order: descending or ascending
    nulls_last: bool | Sequence[bool] = False,  # Place nulls at the end or start
    multithreaded: bool = True,              # Use multithreading for sorting
    maintain_order: bool = False             # Maintain order of equal elements
) → DataFrame

Parameters of the Polars DataFrame.sort()

Following are the parameters of the polars DataFrame.sort() method.

  • by – Specifies the column name(s) or expression(s) to sort by. Accepts a single column, multiple columns, or an expression.
  • more_by – Additional columns or expressions to sort by, passed as extra positional arguments after by.
  • descending – Accepts a single True/False value that applies to all sort columns, or a list with one value per column.
    • True – Sort in descending order.
    • False – Sort in ascending order.
  • nulls_last – Specifies whether null values appear at the end (True) or start (False) of the sorted result.
  • multithreaded – Enables multithreaded sorting for better performance. Default is True.
  • maintain_order – Ensures that rows with equal values maintain their original relative order when sorting. Default is False (faster sorting without order preservation).

Usage of Polars DataFrame.sort() Method

The DataFrame.sort() method sorts the rows of a Polars DataFrame based on one or more specified columns. The sorting can be done in ascending (default) or descending order.

Now, let’s create a Polars DataFrame using data from a dictionary.


import polars as pl

# Creating a new Polars DataFrame
technologies = {
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fees": [22000, 25000, 20000, 24000, 26000],
    "Duration": ["30days", "50days", "40days", "50days", "40days"],
    "Discount": [1000, 2300, 1500, 1200, 2500],
}

df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

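This yields the output below.

# Output:
# Original DataFrame:
# shape: (5, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Hadoop  ┆ 20000 ┆ 40days   ┆ 1500     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘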

Sort by a Single Column (Ascending)

To sort by a single column in ascending order in Polars, you can use the sort() method and specify the column name.


# Sorting by 'Fees' in ascending order
sorted_df = df.sort("Fees")
print("DataFrame sorted by 'Fees' (ascending):\n", sorted_df)

Here,

  • The sort() method is used to sort the DataFrame by the column Fees.
  • By default, the sorting is done in ascending order.
  • The Fees column is now sorted from smallest to largest, and the entire DataFrame rows are reordered accordingly.
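
# Output:
# DataFrame sorted by 'Fees' (ascending):
# shape: (5, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Hadoop  ┆ 20000 ┆ 40days   ┆ 1500     │
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘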

Sorting by a Single Column (Descending)

To sort the DataFrame by a single column in descending order, you can use the sort() method with the descending=True parameter.


# Sorting by "Fees" in descending order
sorted_df = df.sort(by="Fees", descending=True)
print("Sorted DataFrame by Fees (Descending):\n", sorted_df)

# Output:
# Sorted DataFrame by Fees (Descending):
# shape: (5, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Hadoop  ┆ 20000 ┆ 40days   ┆ 1500     │
└─────────┴───────┴──────────┴──────────┘

The above example sorts the df DataFrame by the "Fees" column in descending order, from the highest fee to the lowest.

Sorting by Multiple Columns

To sort the DataFrame by multiple columns, you can specify multiple column names in the by parameter and set the corresponding sorting orders in the descending parameter.


# Sorting by "Duration" (ascending) and then by "Fees" (descending)
sorted_df = df.sort(by=["Duration", "Fees"], descending=[False, True])
print("Sorted DataFrame by Duration (Ascending) and Fees (Descending):\n", sorted_df)

# Output:
# Sorted DataFrame by Duration (Ascending) and Fees (Descending):
# shape: (5, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
│ Hadoop  ┆ 20000 ┆ 40days   ┆ 1500     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The DataFrame is first sorted by "Duration" in ascending order (False).
  • Within the same "Duration" values, the DataFrame is sorted by "Fees" in descending order (True).

To sort every column in the same direction, pass a single boolean to the descending parameter; it applies to all columns listed in by.


# Sorting by "Duration" and then by "Fees", both in descending order
sorted_df = df.sort(by=["Duration", "Fees"], descending=True)
print("Sorted DataFrame:\n", sorted_df)

# Output:
# Sorted DataFrame:
# shape: (5, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
│ Hadoop  ┆ 20000 ┆ 40days   ┆ 1500     │
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘

Sort by a Column of Strings

To sort a Polars DataFrame by a column of strings, you can use the sort() method, just like with numeric columns. Polars sorts string columns in lexicographical order by default, comparing values character by character, so uppercase letters sort before lowercase ones (which is why "PySpark" comes before "Python" below). Use the descending parameter to choose between ascending and descending order.


# Sorting by "Courses" (strings) in ascending order
sorted_df = df.sort(by="Courses", descending=False)
print("Sorted DataFrame by Courses (Ascending):\n", sorted_df)

# Output:
# Sorted DataFrame by Courses (Ascending):
# shape: (5, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Hadoop  ┆ 20000 ┆ 40days   ┆ 1500     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The DataFrame is sorted by the "Courses" column in ascending order. Since descending=False is the default, it can be omitted.
  • The string values are sorted lexicographically (alphabetically).

Sorting by Strings in Descending Order

If you want to sort the "Courses" column in descending order, you can set descending=True.


# Sorting by strings in descending order
sorted_df = df.sort(by="Courses", descending=True)
print("Sorted DataFrame by Courses (Descending):\n", sorted_df)

# Output:
# Sorted DataFrame by Courses (Descending):
# shape: (5, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
│ Hadoop  ┆ 20000 ┆ 40days   ┆ 1500     │
└─────────┴───────┴──────────┴──────────┘

Conclusion

In conclusion, the Polars DataFrame.sort() method provides an efficient and versatile approach to sorting data within a Polars DataFrame. It enables sorting by one or more columns or expressions, supports both ascending and descending orders, and returns a new sorted DataFrame rather than modifying the original.

Happy Learning!!
