  • Post category:Polars
  • Post last modified:December 13, 2024

In Polars, the DataFrame.filter() method filters the rows of a DataFrame based on a condition given as a boolean expression. It returns a new DataFrame that includes only the rows where the condition evaluates to True.


In this article, I will explain the Polars DataFrame.filter() method, covering its syntax, parameters, and usage, and demonstrate how it returns a new DataFrame containing only the rows that meet the specified condition or boolean expression.

Syntax of Polars DataFrame.filter()

The syntax of the DataFrame.filter() method is shown below.


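# Syntax of polars DataFrame.filter()
DataFrame.filter(*predicates, **constraints) -> DataFrame

Parameters of the Polars DataFrame.filter()

Following are the parameters of the polars DataFrame.filter() method.

  • predicates – One or more Polars expressions (typically built with pl.col() and comparison or logical operators) that define the condition to filter the rows. Each expression must evaluate to a boolean value for every row; when several are passed, they are combined with AND.
  • constraints – Optional keyword arguments of the form column_name=value that keep the rows where the column equals the value; several constraints are combined with AND.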

Return Value

  • It returns a new DataFrame containing only the rows that satisfy the given condition.

Usage of Polars DataFrame.filter()

The filter() method in Polars keeps only the rows for which the condition evaluates to True; rows where it evaluates to False or null are dropped.

Now, let’s create a Polars DataFrame using data from a dictionary.


import polars as pl

# Creating a new Polars DataFrame
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fees' :[22000,25000,20000,24000,26000],
    'Duration':['30days','50days','40days','50days','40days'],
    'Discount':[1000,2300,1500,1200,2500]
}

df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

This prints the original DataFrame, which has five rows and four columns: Courses, Fees, Duration, and Discount.

Filter Rows based on a Single Condition

You can use the filter() method in Polars to filter rows of a DataFrame based on a single condition.


# Filter rows where Fees > 24000
filtered_df = df.filter(pl.col("Fees") > 24000)
print("Filtered DataFrame:\n", filtered_df)

Here,

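  • pl.col("Fees") > 24000 builds a boolean expression that is True for rows whose Fees value is greater than 24000.
  • filter() keeps the rows where that expression is True and returns them as a new DataFrame.

With the sample data, the result contains two rows: PySpark (Fees 25000) and Pandas (Fees 26000).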

Filter Rows based on Multiple Conditions (AND)

To filter rows based on multiple conditions with AND logic in Polars, you can combine conditions using the & operator.


# Filter rows with Fees > 24000 AND Discount > 2000
filtered_df = df.filter((pl.col("Fees") > 24000) & (pl.col("Discount") > 2000))
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • pl.col("Fees") > 24000: Selects rows where the Fees column has values greater than 24000.
  • pl.col("Discount") > 2000: Selects rows where the Discount column has values greater than 2000.
  • The & operator combines the two conditions, and only rows that satisfy both are included in the filtered DataFrame.
  • Returns a new DataFrame containing only the rows that meet the combined conditions.

Filter Rows Using Logical OR

To filter rows using a logical OR condition in Polars, you can combine multiple conditions using the | operator.


# Filter using Logical OR
filtered_df = df.filter((df["Fees"] > 24000) | (df["Duration"] == "50days"))
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘

Here,

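  • The condition (df["Fees"] > 24000) | (df["Duration"] == "50days") keeps the rows where either Fees is greater than 24000 or Duration is '50days'. Because df["Fees"] and df["Duration"] are Series rather than expressions, the condition is evaluated eagerly into a boolean mask before filter() is called; the same condition can also be written with pl.col("Fees") and pl.col("Duration").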

Filter Rows with a Column Value in a Range

To filter rows where the value of a column falls within a specific range in Polars, you can use the is_between() method or combine two comparisons with the & (AND) operator to specify the lower and upper bounds.


# Filter rows where Fees is between 22000 and 25000
filtered_df = df.filter((pl.col("Fees") >= 22000) & (pl.col("Fees") <= 25000))
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • (pl.col("Fees") >= 22000) & (pl.col("Fees") <= 25000) filters the rows where Fees is between 22000 and 25000 (inclusive).
  • Returns a new DataFrame containing only rows that satisfy this condition.

Conclusion

In this article, I have explained the Polars DataFrame.filter() method: its syntax, parameters, and usage, and how it returns a new DataFrame containing only the rows that meet the specified conditions. Using comparison operators together with logical operators such as & (AND) and | (OR), you can filter rows on single or multiple conditions, and by combining two comparisons you can filter rows on a range of values.

Happy Learning!!
