In Polars, the DataFrame.filter() method filters the rows of a DataFrame based on a specified condition or boolean expression. It returns a new DataFrame that includes only the rows where the condition evaluates to True.
In this article, I will explain the Polars DataFrame.filter() method, covering its syntax, parameters, and usage, and show how it returns a new DataFrame containing only the rows that meet the specified condition.
Syntax of Polars DataFrame.filter()
Let’s look at the syntax of the DataFrame.filter() method.
# Syntax of polars DataFrame.filter()
DataFrame.filter(expr: pl.Expr) -> DataFrame
Parameters of the Polars DataFrame.filter()
The following is the parameter of the Polars DataFrame.filter() method.
- expr: A Polars expression (typically created using pl.col() together with comparison or logical operators) that defines the condition used to filter the rows. The condition must return a boolean value for each row.
Return Value
- It returns a new DataFrame containing only the rows that satisfy the given condition.
Usage of Polars DataFrame.filter()
The filter() method in Polars lets you filter rows based on a condition, returning only the rows that satisfy it.
Now, let’s create a Polars DataFrame using data from a dictionary.
import polars as pl
# Creating a new Polars DataFrame
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fees': [22000, 25000, 20000, 24000, 26000],
    'Duration': ['30days', '50days', '40days', '50days', '40days'],
    'Discount': [1000, 2300, 1500, 1200, 2500]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
This creates a DataFrame with five rows and four columns: Courses, Fees, Duration, and Discount.
Filter Rows based on a Single Condition
You can use the filter() method in Polars to filter rows of a DataFrame based on a single condition.
# Filter rows where Fees > 24000
filtered_df = df.filter(pl.col("Fees") > 24000)
print("Filtered DataFrame:\n", filtered_df)
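Here,
- pl.col("Fees") > 24000 checks, row by row, whether the Fees column value is greater than 24000.
- filter() keeps only the rows that pass this check and returns them as a new DataFrame. In this example, those are the PySpark (25000) and Pandas (26000) rows.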
Filter Rows based on Multiple Conditions (AND)
To filter rows based on multiple conditions with AND logic in Polars, combine the conditions using the & operator. Wrap each condition in parentheses, because & binds more tightly than comparison operators in Python.
# Filter rows with Fees > 24000 AND Discount > 2000
filtered_df = df.filter((pl.col("Fees") > 24000) & (pl.col("Discount") > 2000))
print("Filtered DataFrame:\n", filtered_df)
# Output:
# Filtered DataFrame:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘
Here,
- pl.col("Fees") > 24000: Selects rows where the Fees column has values greater than 24000.
- pl.col("Discount") > 2000: Selects rows where the Discount column has values greater than 2000.
- The & operator combines the two conditions, so only rows that satisfy both are included in the filtered DataFrame.
- filter() returns a new DataFrame containing only the rows that meet the combined conditions.
Filter Rows Using Logical OR
To filter rows using a logical OR condition in Polars, combine multiple conditions using the | operator.
# Filter using Logical OR
filtered_df = df.filter((pl.col("Fees") > 24000) | (pl.col("Duration") == "50days"))
print("Filtered DataFrame:\n", filtered_df)
# Output:
# Filtered DataFrame:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘
Here,
- The condition (pl.col("Fees") > 24000) | (pl.col("Duration") == "50days") keeps the rows where either Fees is greater than 24000 OR Duration is '50days'. A row that satisfies both conditions, such as PySpark, appears only once.
Filter Rows with a Column Value in a Range
To filter rows where the value of a column falls within a specific range in Polars, you can use the is_between() method or combine two comparisons with the & (AND) operator to specify the lower and upper bounds. The example here uses the & operator.
# Filter rows where Fees is between 22000 and 25000
filtered_df = df.filter((pl.col("Fees") >= 22000) & (pl.col("Fees") <= 25000))
print("Filtered DataFrame:\n", filtered_df)
# Output:
# Filtered DataFrame:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
└─────────┴───────┴──────────┴──────────┘
Here,
- (pl.col("Fees") >= 22000) & (pl.col("Fees") <= 25000) keeps the rows where Fees is between 22000 and 25000 (inclusive).
- filter() returns a new DataFrame containing only the rows that satisfy this condition.
Conclusion
In this article, I have explained the Polars DataFrame filter() method, covering its syntax, parameters, and usage, and how it returns a new DataFrame containing the rows that meet the specified conditions. By using comparison operators together with logical operators like & (AND) and | (OR), you can filter rows based on single or multiple conditions. You can also filter rows on a range of values by combining a lower-bound and an upper-bound condition.
Happy Learning!!