In Polars, the isin()
function is used to filter rows by checking if the values in a column are present in a specified list, Series, or another column, similar to the SQL IN clause or Pandas’ isin()
. It returns a Boolean mask that indicates whether each value in the column matches any of the values in the given list or Series. In this article, I will explain how to use isin
in Polars DataFrame.
Key Points –
isin()
is used to check if values in a column exist in a provided list, Series, or another column.- It returns a Boolean mask indicating whether each element in the column matches any value from the list or Series.
isin()
works with various data types, including strings, numbers, and Booleans.- You can combine
isin()
with thefilter()
function to select rows based on matching values. - The
isin()
method can be used with numeric columns, just like with string columns, to filter rows by numbers. - Logical operators like
&
,|
, and~
can be used to combineisin()
with other conditions for more complex filtering. isin()
can be used insidewith_columns()
to create new columns based on membership conditions.- To exclude values, you can negate
isin()
using the~
(tilde) operator. isin()
can be used on Boolean columns to filter or mark rows based on True/False values.
Usage of isin in Polars DataFrame
The isin()
method in Polars is used to check whether the values in a column exist in a specified list or Series. It returns a Boolean mask (True/False) that you can use to filter or select rows where the values in the specified column are in the given set.
To run some examples of how to use isin in Polars DataFrame, let’s create a Polars DataFrame.
import polars as pl
# Creating a new Polars DataFrame
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Fees' :[22000,25000,20000,24000,26000],
'Duration':['30days','50days','40days','50days','40days'],
'Discount':[1000,2300,1500,1200,2500]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
Yields below output.
To filter rows with matching values in a Polars DataFrame, use the filter()
function along with the isin()
method. The isin()
method checks whether values in a column are present in a given list or Series and returns a Boolean mask, which can then be used to select the matching rows.
# Filter: only keep rows where 'Courses' is 'Spark' or 'Python'
df2 = df.filter(pl.col("Courses").is_in(["Spark", "Python"]))
print("Filtered DataFrame:\n", df2)
Here,
pl.col("Courses")
: selects the"Courses"
column.is_in(["Spark", "Python"])
: checks if each value in that column is in the list["Spark", "Python"]
.df.filter(...)
: keeps only the rows where the condition isTrue
.
Use isin() with Numeric Column
You can also apply the isin()
method to numeric columns in Polars to filter rows based on whether the values in the column match any value in a given list of numbers. The process is the same as with string columns; simply pass a list of values and filter for rows where the numeric values are present in the list.
# Filter rows where 'Fees' is either 25000 or 24000
df2 = df.filter(pl.col("Fees").is_in([25000, 24000]))
print("Filtered DataFrame:\n", df2)
# Output:
# Filtered DataFrame:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 │
│ Python ┆ 24000 ┆ 50days ┆ 1200 │
└─────────┴───────┴──────────┴──────────┘
Here,
pl.col("Fees").is_in([25000, 24000])
: This checks if each value in theFees
column is either25000
or24000
.- The
filter()
method keeps only the rows where the condition isTrue
.
Combine with Multiple Conditions
You can combine multiple conditions in Polars using logical operators: &
for AND, |
for OR, and ~
for NOT. Each condition is written as an expression, such as pl.col("column") == value
, pl.col("column").is_in([...])
, or using comparison operators like >
, <
, and so on.
# Combine conditions with AND (&)
df2 = df.filter(
(pl.col("Fees").is_in([25000, 24000])) & (pl.col("Discount") > 2000)
)
print("Filtered DataFrame:\n", df2)
# Output:
# Filtered DataFrame:
# shape: (1, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 │
└─────────┴───────┴──────────┴──────────┘
Here,
(pl.col("Fees").is_in([25000, 24000]))
: filters rows where Fees match 25000 or 24000.(pl.col("Discount") > 2000)
: filters rows where Discount is above 2000.&
: ensures both conditions areTrue
.
Using isin() inside a with_columns() Expression
You can use isin()
within a with_columns()
expression to create a new column that indicates whether each value in a specific column exists in a given list. This allows you to add a column based on membership conditions.
# Add a new column using isin() inside with_columns()
df2 = df.with_columns(
(pl.col("Courses").is_in(["Python", "Pandas"])).alias("Is_Popular")
)
print("DataFrame with 'Is_Popular':\n", df2)
# Output:
# DataFrame with 'Is_Popular':
# shape: (5, 5)
┌─────────┬───────┬──────────┬──────────┬────────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount ┆ Is_Popular │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 ┆ bool │
╞═════════╪═══════╪══════════╪══════════╪════════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 ┆ false │
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 ┆ false │
│ Hadoop ┆ 20000 ┆ 40days ┆ 1500 ┆ false │
│ Python ┆ 24000 ┆ 50days ┆ 1200 ┆ true │
│ Pandas ┆ 26000 ┆ 40days ┆ 2500 ┆ true │
└─────────┴───────┴──────────┴──────────┴────────────┘
Here,
pl.col("Courses").is_in(["Python", "Pandas"])
: Returns a boolean Series indicating which rows match.alias("Is_Popular")
: Names the new column.with_columns(...)
: Adds it to the DataFrame.
Negate isin() to Exclude Values
To negate isin()
in Polars and exclude specific values, you can use the ~
(tilde) operator. This will filter out the rows where the condition is not true.
# Exclude specific courses
excluded_courses = ["Python", "Pandas"]
df2 = df.filter(
~pl.col("Courses").is_in(excluded_courses)
)
print("Courses excluding Python and Pandas:\n", df2)
# Output:
# Courses excluding Python and Pandas:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 │
│ Hadoop ┆ 20000 ┆ 40days ┆ 1500 │
└─────────┴───────┴──────────┴──────────┘
Here,
pl.col("Courses").is_in([...])
: Returns a boolean mask for matching rows.~
: Inverts the mask, returning rows NOT in the list.
Using isin() on Boolean Columns
The isin()
method also works with Boolean columns in Polars. It’s useful for filtering, marking, or manipulating rows based on whether a True
or False
value matches any value in a given list. This is particularly helpful when working with columns that contain Boolean values and you want to filter or tag specific conditions.
import polars as pl
# Sample DataFrame with a boolean column
data = {
'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
'Fees' :[22000,25000,20000,26000],
"Is_Active": [True, False, True, False]
}
df = pl.DataFrame(data)
# Filter where Is_Active is in [True]
df2 = df.filter(pl.col("Is_Active").is_in([True]))
print("Filtered where Is_Active is True:\n", df2)
# Output:
# Filtered where Is_Active is True:
# shape: (2, 3)
┌─────────┬───────┬───────────┐
│ Courses ┆ Fees ┆ Is_Active │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool │
╞═════════╪═══════╪═══════════╡
│ Spark ┆ 22000 ┆ true │
│ Hadoop ┆ 20000 ┆ true │
└─────────┴───────┴───────────┘
Here,
pl.col("Is_Active").is_in([True])
: Filters only rows where"Is_Active"
isTrue
.- You can also do
is_in([False])
oris_in([True, False])
(which returns all rows).
Conclusion
In conclusion, the isin() method in Polars is a powerful tool for filtering, marking, and manipulating rows based on whether column values match any value in a specified list, Series, or another column. Whether you’re working with strings, numbers, or Boolean values, isin()
enables efficient membership checks, which can be combined with logical operators or used within expressions like with_columns()
for enhanced data manipulation.
Happy Learning!!
Related Articles
- Retrieve Date from DateTime Column in Polars
- Strip Entire Polars DataFrame
- How to Update the Polars DataFrame
- Mapping a Python Dict to a Polars Series
- How to Convert Struct to Series in Polars?
- How to Remove Duplicate Columns in Polars?
- Conditional Assignment in Polars DataFrame
- How to Transform a Series of a Polars DataFrame?
- Add a New Column into an Existing Polars DataFrame
- How to Effectively Create Duplicate Rows in Polars?
- Efficient way to Update a Single Element of a Polars DataFrame?
- How to Append a Python List to Another List (Series) of a Polars DataFrame?