To replace a string in multiple columns of a Polars DataFrame, you can use the str.replace() or str.replace_all() methods for each column, and combine them with the with_columns() method to apply the changes to multiple columns simultaneously.
Replacing strings in multiple columns in Polars means replacing specific substrings, patterns, or values within several columns of a DataFrame. This operation is often part of data cleaning or preprocessing, where you might want to standardize, modify, or remove unwanted text across several columns at once. In this article, I will explain how to replace strings in multiple columns of a Polars DataFrame.
Key Points –
- Use with_columns() with a list comprehension to apply string replacements across multiple columns.
- Apply str.replace() to replace the first occurrence of a substring in each string.
- Use str.replace_all() to replace all occurrences of a substring or regex pattern.
- Regex patterns can be used with str.replace_all() for advanced string manipulation like removing digits or special characters.
- Regular expressions are supported in both str.replace() and str.replace_all() functions.
- When using str.replace or str.replace_all, ensure the columns are of string (Utf8) type.
- You can chain multiple str.replace() calls for multiple replacements in a single column.
- If a substring doesn’t exist in a column, str.replace() safely leaves the value unchanged.
Usage of Polars Replace String in Multiple Columns
To replace a string in multiple columns of a Polars DataFrame, you can use the with_columns() method in combination with str.replace() or str.replace_all() for string manipulation. This allows you to apply the replacement across specific columns of your DataFrame.
First, let’s create a Polars DataFrame.
import polars as pl

technologies = {
    'Courses': ["spark", "python", "spark", "python", "pandas"],
    'Fees': [22000, 25000, 22000, 25000, 24000],
    'Duration': ['30days', '40days', '60days', '45days', '50days'],
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
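Yields below output.
# Output:
# Original DataFrame:
# shape: (5, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees  ┆ Duration │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ str      │
╞═════════╪═══════╪══════════╡
│ spark   ┆ 22000 ┆ 30days   │
│ python  ┆ 25000 ┆ 40days   │
│ spark   ┆ 22000 ┆ 60days   │
│ python  ┆ 25000 ┆ 45days   │
│ pandas  ┆ 24000 ┆ 50days   │
└─────────┴───────┴──────────┘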
To replace a substring in multiple columns using Polars, you can use the str.replace() function inside with_columns() for each column you want to update. For instance, you might replace "days" with "day(s)" in the Duration column, or change "python" to "Python" in the Courses column.
# Replace in multiple columns
df2 = df.with_columns([
    pl.col("Courses").str.replace("python", "Python"),
    pl.col("Duration").str.replace("days", "day(s)")
])
print("Modified DataFrame:\n", df2)
Here,
- str.replace("python", "Python"): This replaces the lowercase "python" with "Python" in the "Courses" column.
- str.replace("days", "day(s)"): This replaces the substring "days" with "day(s)" in the "Duration" column.
Alternatively, to replace a substring in two columns of a Polars DataFrame, you can apply the str.replace_all() method to each column individually. For instance, you might replace the substring "days" in the Duration column and "python" in the Courses column.
# Replace 'days' in Duration column and 'python' in Courses column
df2 = df.with_columns([
    pl.col("Duration").str.replace_all("days", "D"),   # Replace 'days' with 'D'
    pl.col("Courses").str.replace_all("python", "py")  # Replace 'python' with 'py'
])
print("Modified DataFrame:\n", df2)
# Output:
# Modified DataFrame:
# shape: (5, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees  ┆ Duration │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ str      │
╞═════════╪═══════╪══════════╡
│ spark   ┆ 22000 ┆ 30D      │
│ py      ┆ 25000 ┆ 40D      │
│ spark   ┆ 22000 ┆ 60D      │
│ py      ┆ 25000 ┆ 45D      │
│ pandas  ┆ 24000 ┆ 50D      │
└─────────┴───────┴──────────┘
Here,
- In the Duration column, we replaced "days" with "D".
- In the Courses column, we replaced "python" with "py".
Replace a Specific Word with Another in Multiple Columns
To replace a specific word with another across multiple columns in Polars, you can use with_columns() combined with a list comprehension to apply str.replace() to each target column.
# Columns to apply replacement
columns_to_modify = ["Courses", "Duration"]

# Replace words in multiple columns
df2 = df.with_columns([
    pl.col(col).str.replace("spark", "Spark (Big Data)").str.replace("days", " Day(s)")
    for col in columns_to_modify
])
print("Modified DataFrame:\n", df2)
# Output:
# Modified DataFrame:
# shape: (5, 3)
┌──────────────────┬───────┬───────────┐
│ Courses          ┆ Fees  ┆ Duration  │
│ ---              ┆ ---   ┆ ---       │
│ str              ┆ i64   ┆ str       │
╞══════════════════╪═══════╪═══════════╡
│ Spark (Big Data) ┆ 22000 ┆ 30 Day(s) │
│ python           ┆ 25000 ┆ 40 Day(s) │
│ Spark (Big Data) ┆ 22000 ┆ 60 Day(s) │
│ python           ┆ 25000 ┆ 45 Day(s) │
│ pandas           ┆ 24000 ┆ 50 Day(s) │
└──────────────────┴───────┴───────────┘
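Here,
- str.replace() is chainable, so you can apply multiple replacements. Both replacements run on every listed column, and a replacement simply has no effect where its substring is absent.
- Using a list comprehension allows you to update multiple columns at once, making it an efficient approach for data cleaning and normalization.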
Remove Numbers from Multiple Columns
To remove numbers from multiple columns in a Polars DataFrame, you can use the str.replace_all() method with a regular expression. The regular expression pattern \d+ matches one or more consecutive digits, and replacing each match with an empty string removes it. Note that str.replace() would remove only the first run of digits in each value, which happens to be enough for this sample data but not for a value such as "1month15days".
# Remove numbers from 'Courses' and 'Duration' columns
df2 = df.with_columns([
    pl.col(col).str.replace_all(r"\d+", "") for col in ["Courses", "Duration"]
])
print("Updated DataFrame:\n", df2)
# Output:
# Updated DataFrame:
# shape: (5, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees  ┆ Duration │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ str      │
╞═════════╪═══════╪══════════╡
│ spark   ┆ 22000 ┆ days     │
│ python  ┆ 25000 ┆ days     │
│ spark   ┆ 22000 ┆ days     │
│ python  ┆ 25000 ┆ days     │
│ pandas  ┆ 24000 ┆ days     │
└─────────┴───────┴──────────┘
Here,
- pl.col(col) selects the column.
- str.replace_all(r"\d+", "") finds every run of digits (\d+) and replaces it with an empty string ("").
- with_columns([...]) applies the change to both the Courses and Duration columns.
Replace a Substring with an Empty String in Specific Columns
To replace a substring with an empty string in specific columns using Polars, you can use str.replace_all() (or str.replace() if it’s just once) to remove the substring by replacing it with "".
# Replace substrings with empty string in specific columns
df2 = df.with_columns([
    pl.col("Courses").str.replace_all("python", ""),
    pl.col("Duration").str.replace_all("days", "")
])
print("Updated DataFrame:\n", df2)
# Output:
# Updated DataFrame:
# shape: (5, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees  ┆ Duration │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ str      │
╞═════════╪═══════╪══════════╡
│ spark   ┆ 22000 ┆ 30       │
│         ┆ 25000 ┆ 40       │
│ spark   ┆ 22000 ┆ 60       │
│         ┆ 25000 ┆ 45       │
│ pandas  ┆ 24000 ┆ 50       │
└─────────┴───────┴──────────┘
Here,
- If the substring doesn’t exist in some rows, it’s safely ignored.
- str.replace_all() removes all occurrences of the substring.
Conclusion
In conclusion, replacing substrings across multiple columns in Polars is straightforward and efficient using with_columns() along with string methods like str.replace() or str.replace_all(). Whether you’re removing unwanted text, standardizing formats, or stripping out digits, using list comprehensions with regular expressions helps you apply changes consistently across your DataFrame.
Happy Learning!!