In Polars, you drop rows with the filter() method, which keeps only the rows that satisfy a condition. To remove specific rows, you write a condition that excludes them. Unlike in pandas, the Polars drop() method removes columns rather than rows, so row removal is normally expressed as a filter. Dedicated methods such as drop_nulls() and unique() cover the common cases of missing values and duplicates. In this article, I will explain the different methods to drop rows in a Polars DataFrame.
Key Points –
- Use filter() to remove rows based on specific conditions or criteria.
- Use conditions like !=, >, <, etc., to specify which rows to drop.
- Use drop_nulls() to eliminate rows containing missing (null) values in one or more columns.
- Combine conditions with logical operators (&, |) to drop rows that meet multiple criteria.
- Use the .unique() method to drop rows that are duplicates across all columns.
- Use .is_in() to filter out rows based on a list of values.
- You can drop rows if a column value meets a specific condition (e.g., greater than a threshold).
- Polars operations like filter() do not modify the original DataFrame, so assign the result to a new variable or overwrite the existing one.
Create a Polars DataFrame
Creating a Polars DataFrame is simple and resembles the process of creating a DataFrame in Pandas. Polars offers a versatile API that allows you to build DataFrames from various data structures, including dictionaries, lists of lists, and NumPy arrays.
Let’s start by creating a basic DataFrame using Polars.
import polars as pl
# Creating a new Polars DataFrame
technologies = {
'Courses': ["Spark", "Pandas", "Hadoop", "Python", "Pandas", "Spark"],
'Fees': [22000, 26000, 25000, 20000, 26000, 22000],
'Duration': ['30days', '60days', '50days', '40days', '60days', '30days'],
'Discount': [1000, 2000, 1500, 1200, 2000, 1000]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
This produces a DataFrame with six rows and four columns (Courses, Fees, Duration, Discount).
Dropping Rows Based on Condition
In Polars, you can drop rows based on a condition using the filter() method. filter() keeps the rows for which the condition is true and discards the rest, so you express a drop by writing the condition for the rows you want to keep.
Drop Rows Where a Column Value is Equal to a Given Value
To drop rows where a column value is equal to a given value in Polars, you can use the filter() method and exclude the rows that match the specified condition.
# Drop rows where 'Courses' is "Pandas"
df2 = df.filter(pl.col("Courses") != "Pandas")
print(df2)
Here,
- pl.col("Courses") != "Pandas" filters out rows where the Courses column is equal to “Pandas”.
- The filter() method keeps rows that meet the condition (not equal to “Pandas”).
Drop Rows Where a Numeric Column is Below a Threshold
To drop rows where a numeric column is below a threshold in Polars, you can use the filter() method and specify a condition that retains only the rows where the column value is greater than or equal to the threshold.
# Drop rows where 'Fees' is below 25000
df2 = df.filter(pl.col("Fees") >= 25000)
print(df2)
# Output:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Pandas ┆ 26000 ┆ 60days ┆ 2000 │
│ Hadoop ┆ 25000 ┆ 50days ┆ 1500 │
│ Pandas ┆ 26000 ┆ 60days ┆ 2000 │
└─────────┴───────┴──────────┴──────────┘
Here,
- pl.col("Fees") >= 25000: This condition keeps rows where the value in the Fees column is greater than or equal to 25000.
- The filter() method excludes rows that don’t meet this condition.
- In this example, the rows where Fees was less than 25000 are dropped, and only the rows where Fees is greater than or equal to 25000 are retained.
Drop Rows Based on Multiple Column Conditions
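Dropping rows based on multiple column conditions in Polars involves using the filter() method along with logical operators such as & (AND), | (OR), and ~ (NOT). Wrap each comparison in parentheses, because in Python these operators bind more tightly than comparison operators such as == and >.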
# Drop rows where 'Courses' is "Pandas" or 'Fees' is greater than 22000
df2 = df.filter(~((pl.col("Courses") == "Pandas") | (pl.col("Fees") > 22000)))
print(df2)
# Output:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ Python ┆ 20000 ┆ 40days ┆ 1200 │
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
Here,
- pl.col("Courses") == "Pandas" checks if the Courses column has the value “Pandas”.
- pl.col("Fees") > 22000 checks if the Fees column is greater than 22000.
- | combines these conditions with a logical OR, meaning either condition being true is enough to exclude the row.
- ~ negates the condition to drop the rows that meet it.
Drop Rows Where a Column Value Exists in a Specific List
To drop rows where a column value exists in a specific list in Polars, you can use the .is_in() method along with a boolean negation (~).
# Define the list of values to exclude
values_to_exclude = ["Pandas", "Python", "Hadoop"]
# Drop rows where 'Courses' column value exists in the list
df2 = df.filter(~df['Courses'].is_in(values_to_exclude))
print(df2)
# Output:
# shape: (2, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
If you need to filter rows based on multiple columns being in specific lists, you can combine conditions.
# Define lists for multiple columns
courses_to_exclude = ["Pandas", "Python"]
fees_to_exclude = [26000, 30000]
# Drop rows where 'Courses' or 'Fees' match the respective lists
df2 = df.filter(~df['Courses'].is_in(courses_to_exclude) & ~df['Fees'].is_in(fees_to_exclude))
print(df2)
# Output:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ Hadoop ┆ 25000 ┆ 50days ┆ 1500 │
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
Dropping Duplicate Rows
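To drop duplicate rows in Polars, you can use the unique() method. This method removes duplicate rows from the DataFrame based on all columns or a subset of columns. By default it does not preserve the original row order, so the order of the rows in the output may vary.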
# Drop all duplicate rows
df2 = df.unique()
print(df2)
# Output:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Hadoop ┆ 25000 ┆ 50days ┆ 1500 │
│ Python ┆ 20000 ┆ 40days ┆ 1200 │
│ Pandas ┆ 26000 ┆ 60days ┆ 2000 │
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
If you want to consider only specific columns when identifying duplicates, you can pass the column names to the subset parameter.
# Drop duplicates based on the 'Courses' column
df2 = df.unique(subset=["Courses"])
print(df2)
# Output:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Hadoop ┆ 25000 ┆ 50days ┆ 1500 │
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ Python ┆ 20000 ┆ 40days ┆ 1200 │
│ Pandas ┆ 26000 ┆ 60days ┆ 2000 │
└─────────┴───────┴──────────┴──────────┘
Dropping Rows with Missing Values
To drop rows with missing values in Polars, you can use the drop_nulls() method. This method removes rows that contain null values in any column or in specific columns.
import polars as pl
# Create a Polars DataFrame with missing values
df = pl.DataFrame({
'Courses': ["Spark", "Pandas", None, "Python", "Pandas", "Spark"],
'Fees': [22000, 26000, 25000, None, 26000, 22000],
'Duration': ['30days', None, '50days', '40days', '60days', '30days'],
'Discount': [1000, 2000, None, 1200, 2000, 1000]
})
# Drop rows with missing values in any column
df2 = df.drop_nulls()
print(df2)
# Output:
# shape: (3, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ Pandas ┆ 26000 ┆ 60days ┆ 2000 │
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
If you want to drop rows where specific columns have null values, pass the column names to the subset parameter.
# Drop rows with missing values in 'Fees' and 'Duration' columns
df2 = df.drop_nulls(subset=['Fees', 'Duration'])
print(df2)
# Output:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fees ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ null ┆ 25000 ┆ 50days ┆ null │
│ Pandas ┆ 26000 ┆ 60days ┆ 2000 │
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
└─────────┴───────┴──────────┴──────────┘
Summary of Methods
Scenario | Method | Example |
---|---|---|
Drop rows by condition | .filter() | df.filter(df['Fees'] > 25000) |
Drop rows by index | .slice() or .with_row_index() (.with_row_count() in older Polars versions) | df.slice(2, len(df)-2) |
Drop duplicate rows | .unique() | df.unique() |
Drop rows with missing values | .drop_nulls() | df.drop_nulls() |
Drop rows where value in a list | .is_in() | df.filter(~df['Courses'].is_in(lst)) |
Drop rows based on string length | .str.len_chars() (.str.lengths() in older Polars versions) | df.filter(df['Courses'].str.len_chars() > 5) |
Drop rows by compound conditions | Logical operators with .filter() | df.filter((cond1) & (cond2)) |
Conclusion
In summary, dropping rows in Polars is a simple and flexible process that facilitates various data-cleaning tasks. With methods like drop_nulls(), filter(), and unique(), you can efficiently eliminate rows based on criteria such as null values, duplicates, or specific column conditions.
Happy Learning!!