• Post author:
  • Post category:Polars
  • Post last modified:December 16, 2024
  • Reading time:12 mins read

The Polars filter() function is used to filter rows in a DataFrame based on one or more conditions. When you want to filter a DataFrame with multiple conditions, you can combine these conditions using logical operators like AND (&), OR (|), and NOT (~). This allows you to apply complex filters that can involve different columns and data types, including numerical, string, and date values. In this article, I will explain the polars filter DataFrame with multiple conditions.

Advertisements

Key Points –

  • Use the .filter() method in Polars to filter rows based on specified conditions.
  • Combine multiple conditions using logical operators: & (AND): Requires all conditions to be true. | (OR): Requires at least one condition to be true. ~ (NOT): Negates a condition.
  • Access columns in conditions using df["column_name"] syntax.
  • Filters support a wide range of comparison operators, including >, <, >=, <=, ==, and !=.
  • Filtering in Polars is optimized for performance, handling large datasets efficiently.
  • Filters can be applied to various data types, such as numerical, string, datetime, and categorical columns.

Usage of Polars Filter DataFrame with Multilple Conditions

In Polars, filter DataFrame with multiple conditions allows you to retrieve rows that meet certain criteria. The filter method in Polars works with expressions that can combine conditions using logical operators such as & (AND), | (OR), and ~ (NOT).

Now, let’s create a Polars DataFrame using data from a dictionary.


import polars as pl

# Creating a new Polars DataFrame
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fees' :[22000,25000,20000,24000,26000],
    'Duration':['30days','50days','40days','50days','40days'],
    'Discount':[1000,2300,1500,1200,2500]
}

df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

Yields below output.

Polars filter dataframe multilple conditions

Filter Rows Based on Numerical Conditions

To filter rows based on numerical conditions in Polars, you can use the filter method with expressions that specify the desired conditions.


# Filtering rows where Fees are greater than 23000
filtered_df = df.filter(df["Fees"] > 23000)
print("Filtered DataFrame:\n", filtered_df)

Here,

  • The filter expression (df["Fees"] > 23000) selects rows where the Fees column is greater than 23,000.
  • You can combine multiple conditions using logical operators like & (AND), | (OR), and ~ (NOT).
  • Polars processes filters efficiently, making it suitable for large datasets.
Polars filter dataframe multilple conditions

Alternatively, to filter rows where the Discount is greater than the average Discount in Polars, you can calculate the average discount using the .mean() function and then filter the DataFrame based on this condition.


# Calculate the average discount
avg_discount = df["Discount"].mean()

# Filter rows where Discount is greater than the average
filtered_df = df.filter(df["Discount"] > avg_discount)
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The average of the Discount column is calculated using df["Discount"].mean().
  • The filter method is used to select rows where the Discount is greater than the calculated average.
  • Rows with discounts greater than the average are returned.

Filter Rows with a String Condition

To filter rows based on a string condition in Polars, you can use the filter method with expressions that evaluate string columns.


# Filter rows where the course is "Python"
filtered_df = df.filter(df["Courses"] == "Python")
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The condition df["Courses"] == "Python" checks if the Courses column is equal to "Python".
  • The filter method filters the rows that meet the condition.
  • Only the row with "Python" as the course name is returned.

Filter Using Logical OR

To filter rows using logical OR in Polars, you can combine multiple conditions with the | operator. This allows you to select rows where at least one of the conditions is true.


# Filter rows where Fees are greater than 24000 OR Course is "Python"
filtered_df = df.filter((df["Fees"] > 24000) | (df["Courses"] == "Python"))
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Python  ┆ 24000 ┆ 50days   ┆ 1200     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The condition (df["Fees"] > 24000)|(df["Courses"] == "Python") uses | (OR) to select rows where either: The Fees are greater than 24,000, OR The Courses column is "Python".
  • The filter method is used to apply this condition to the DataFrame.
  • Rows are returned where either condition is met, which includes: "PySpark" (Fees > 24000), "Python" (Exact match in Courses), "Pandas" (Fees > 24000).

Filter with Negation (NOT)

To filter rows where a column does not match a specific value, you can use the negation operator (~) in combination with the equality condition. This will allow you to exclude rows that match the given value, retaining only those that don’t match.


# Filter where the course is NOT "Python"
filtered_df = df.filter(~(df["Courses"] == "Python"))
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 50days   ┆ 2300     │
│ Hadoop  ┆ 20000 ┆ 40days   ┆ 1500     │
│ Pandas  ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴───────┴──────────┴──────────┘

Filter with Date Conditions

You can filter rows based on date conditions in Polars by using expressions that compare date columns with specific date values. Polars supports operations like ==, >, <, >=, and <= for filtering DataFrames based on dates.


import polars as pl
from datetime import date

# Creating a Polars DataFrame with a Date column
data = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Start_Date': [date(2024, 1, 15), date(2024, 5, 20), date(2024, 3, 10), date(2024, 7, 5), date(2024, 4, 18)],
    'Fees': [22000, 25000, 20000, 24000, 26000],
    'Duration': ['30days', '50days', '40days', '50days', '40days'],
    'Discount': [1000, 2300, 1500, 1200, 2500]
}

df = pl.DataFrame(data)

# Filter rows where the Start_Date is after January 1, 2024
filtered_df = df.filter(df["Start_Date"] > date(2024, 1, 15))
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (4, 5)
┌─────────┬────────────┬───────┬──────────┬──────────┐
│ Courses ┆ Start_Date ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---        ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ date       ┆ i64   ┆ str      ┆ i64      │
╞═════════╪════════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 2024-05-20 ┆ 25000 ┆ 50days   ┆ 2300     │
│ Hadoop  ┆ 2024-03-10 ┆ 20000 ┆ 40days   ┆ 1500     │
│ Python  ┆ 2024-07-05 ┆ 24000 ┆ 50days   ┆ 1200     │
│ Pandas  ┆ 2024-04-18 ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴────────────┴───────┴──────────┴──────────┘

Here,

  • The condition df["Start_Date"] > date(2024, 1, 15) filters rows where the Start_Date is after January 15, 2024.
  • The filter method applies this condition to the DataFrame to return rows meeting the date condition.

Filter with Multiple Conditions Across Data Types

Similarly, you can filter rows with multiple conditions across different data types by combining conditions that apply to different columns in Polars. You can filter based on numerical, string, or date columns, and combine these conditions using logical operators like AND (&), OR (|), and NOT (~).


# Filter rows where:
# Fees are greater than 22000
# The course is "Python" or "Pandas"
# Start date is after March 1, 2023
filtered_df = df.filter(
    (df["Fees"] > 22000) &
    (df["Courses"].is_in(["Python", "Pandas"])) &
    (df["Start_Date"] > date(2024, 3, 1))
)
print("Filtered DataFrame:\n", filtered_df)

# Output:
# Filtered DataFrame:
# shape: (2, 5)
┌─────────┬────────────┬───────┬──────────┬──────────┐
│ Courses ┆ Start_Date ┆ Fees  ┆ Duration ┆ Discount │
│ ---     ┆ ---        ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ date       ┆ i64   ┆ str      ┆ i64      │
╞═════════╪════════════╪═══════╪══════════╪══════════╡
│ Python  ┆ 2024-07-05 ┆ 24000 ┆ 50days   ┆ 1200     │
│ Pandas  ┆ 2024-04-18 ┆ 26000 ┆ 40days   ┆ 2500     │
└─────────┴────────────┴───────┴──────────┴──────────┘

Here,

  • (df["Fees"] > 22000) filters rows where Fees are greater than 22,000.
  • (df["Courses"].is_in(["Python", "Pandas"])) filters rows where the Courses column is either "Python" or "Pandas".
  • (df["Start_Date"] > date(2024, 3, 1)) filters rows where the Start_Date is after March 1, 2024.
  • The conditions are combined using & (logical AND), meaning that all conditions must be true for a row to be included in the filtered DataFrame.

Conclusion

In conclusion, filtering a Polars DataFrame with multiple conditions provides a powerful approach to extract rows based on intricate criteria across different column types, including numerical, string, and date values. By using logical operators like AND (&), OR (|), and NOT (~), you can combine multiple conditions in a flexible and efficient way.

Happy Learning!!

Reference