In Polars, string manipulation of cell contents is done through the str namespace, which is available on columns of the String data type (called Utf8 in older Polars releases). These operations inspect, transform, or analyze the text stored in each cell of a Polars DataFrame or Series. In this article, I will explain how to manipulate the string contents of cells in Polars.
Key Points –
- Polars provides a str namespace to perform vectorized string operations on DataFrame columns, allowing fast and efficient operations on entire columns.
- You can convert strings to lowercase using str.to_lowercase().
- Strings can be converted to uppercase with str.to_uppercase().
- Leading and trailing whitespace, or specified characters, can be removed using str.strip_chars().
- Substrings within strings can be replaced using str.replace().
- You can check for the presence of substrings using str.contains(), which returns a Boolean mask.
Usage of Polars String Manipulation of Cell Contents
Polars provides powerful, efficient, and easy-to-use string manipulation methods to clean, transform, and analyze textual data inside DataFrame columns. These methods are accessed through the str namespace on string columns and work in a vectorized manner for high performance.
First, let’s create a Polars DataFrame.
import polars as pl

technologies = {
    'Courses': ["Spark", "Python", "Spark", "Python", "Pandas"],
    'Fees': [22000, 25000, 22000, 25000, 24000],
    'Duration': ['30days', '40days', '60days', '45days', '50days'],
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
This creates a DataFrame with five rows and three columns: Courses (str), Fees (i64), and Duration (str).
Convert Strings to Lowercase
To convert string columns to lowercase in a Polars DataFrame, apply the str.to_lowercase() method to them. For instance, the following converts the values in the 'Courses' and 'Duration' columns to lowercase.
# Convert 'Courses' and 'Duration' columns to lowercase
df2 = df.with_columns([
    pl.col("Courses").str.to_lowercase(),
    pl.col("Duration").str.to_lowercase()
])
print("DataFrame with lowercase strings:\n", df2)
Here,
- pl.col("Courses") selects the Courses column.
- .str.to_lowercase() changes all the string values in that column to lowercase. The same is done for Duration.
- .with_columns() replaces those columns with the updated lowercase versions.
Convert Strings to Uppercase
To convert string columns in a Polars DataFrame to uppercase, use the str.to_uppercase() method on the relevant columns. For instance, you can apply this method to the 'Courses' and 'Duration' columns to transform their values to uppercase.
# Convert 'Courses' and 'Duration' columns to uppercase
df2 = df.with_columns([
    pl.col("Courses").str.to_uppercase(),
    pl.col("Duration").str.to_uppercase()
])
print("DataFrame with uppercase strings:\n", df2)
# Output:
# DataFrame with uppercase strings:
# shape: (5, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees  ┆ Duration │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ str      │
╞═════════╪═══════╪══════════╡
│ SPARK   ┆ 22000 ┆ 30DAYS   │
│ PYTHON  ┆ 25000 ┆ 40DAYS   │
│ SPARK   ┆ 22000 ┆ 60DAYS   │
│ PYTHON  ┆ 25000 ┆ 45DAYS   │
│ PANDAS  ┆ 24000 ┆ 50DAYS   │
└─────────┴───────┴──────────┘
Here,
- pl.col("Courses") selects the Courses column.
- .str.to_uppercase() converts all string values in the column to uppercase.
- .with_columns() applies these changes to the DataFrame.
Check if a String Contains a Substring
To check whether a string contains a substring in Polars, use the str.contains() method on a string column. It produces a Boolean column indicating whether each string contains the given substring or pattern. By default, the pattern is interpreted as a regular expression.
# Check if 'Courses' contains 'Spark' (case sensitive)
df2 = df.select(
    pl.col("Courses").str.contains("Spark").alias("Contains_Spark")
)
print(df2)
# Output:
# shape: (5, 1)
┌────────────────┐
│ Contains_Spark │
│ ---            │
│ bool           │
╞════════════════╡
│ true           │
│ false          │
│ true           │
│ false          │
│ false          │
└────────────────┘
Replace Substring
To replace a substring within a string column in Polars, use the str.replace() method on the target column. For example, to replace "Spark" with "Flink" in the 'Courses' column of a Polars DataFrame, apply str.replace() to that column. Note that str.replace() replaces only the first match in each string; use str.replace_all() to replace every match.
# Replace 'Spark' with 'Flink' in the 'Courses' column
# (assigned to df2 so the original df stays unchanged for later examples)
df2 = df.with_columns([
    pl.col("Courses").str.replace("Spark", "Flink")
])
print("DataFrame after replacing part of strings:\n", df2)
# Output:
# DataFrame after replacing part of strings:
# shape: (5, 3)
┌─────────┬───────┬──────────┐
│ Courses ┆ Fees  ┆ Duration │
│ ---     ┆ ---   ┆ ---      │
│ str     ┆ i64   ┆ str      │
╞═════════╪═══════╪══════════╡
│ Flink   ┆ 22000 ┆ 30days   │
│ Python  ┆ 25000 ┆ 40days   │
│ Flink   ┆ 22000 ┆ 60days   │
│ Python  ┆ 25000 ┆ 45days   │
│ Pandas  ┆ 24000 ┆ 50days   │
└─────────┴───────┴──────────┘
Extract Substring (slice)
To extract a substring (slice) from the strings in a Polars DataFrame column, use the str.slice() method. It takes the start offset and, optionally, the length of the slice.
# Extract first 2 characters from 'Duration'
df2 = df.with_columns(
    pl.col("Duration").str.slice(0, 2).alias("Duration_slice")
)
print(df2)
# Output:
# shape: (5, 4)
┌─────────┬───────┬──────────┬────────────────┐
│ Courses ┆ Fees  ┆ Duration ┆ Duration_slice │
│ ---     ┆ ---   ┆ ---      ┆ ---            │
│ str     ┆ i64   ┆ str      ┆ str            │
╞═════════╪═══════╪══════════╪════════════════╡
│ Spark   ┆ 22000 ┆ 30days   ┆ 30             │
│ Python  ┆ 25000 ┆ 40days   ┆ 40             │
│ Spark   ┆ 22000 ┆ 60days   ┆ 60             │
│ Python  ┆ 25000 ┆ 45days   ┆ 45             │
│ Pandas  ┆ 24000 ┆ 50days   ┆ 50             │
└─────────┴───────┴──────────┴────────────────┘
Strip Whitespace from Both Ends
To remove whitespace from both ends of string columns in Polars, you can use the str.strip_chars() method without any arguments, as it defaults to stripping whitespace. This method can be applied to a Polars DataFrame column to trim spaces from the start and end of the string values.
import polars as pl

technologies = {
    'Courses': [" Spark ", " Python ", " Spark", "Python ", " Pandas "],
    'Fees': [22000, 25000, 22000, 25000, 24000],
    'Duration': ['30days', '40days', '60days', '45days', '50days'],
}
df = pl.DataFrame(technologies)

# Strip whitespace from both ends of 'Courses' and 'Duration' columns
df2 = df.with_columns([
    pl.col("Courses").str.strip_chars().alias("Courses_stripped"),
    pl.col("Duration").str.strip_chars().alias("Duration_stripped")
])
print("DataFrame after stripping whitespace:\n", df2)
# Output:
# DataFrame after stripping whitespace:
# shape: (5, 5)
┌──────────┬───────┬──────────┬──────────────────┬───────────────────┐
│ Courses  ┆ Fees  ┆ Duration ┆ Courses_stripped ┆ Duration_stripped │
│ ---      ┆ ---   ┆ ---      ┆ ---              ┆ ---               │
│ str      ┆ i64   ┆ str      ┆ str              ┆ str               │
╞══════════╪═══════╪══════════╪══════════════════╪═══════════════════╡
│  Spark   ┆ 22000 ┆ 30days   ┆ Spark            ┆ 30days            │
│  Python  ┆ 25000 ┆ 40days   ┆ Python           ┆ 40days            │
│  Spark   ┆ 22000 ┆ 60days   ┆ Spark            ┆ 60days            │
│ Python   ┆ 25000 ┆ 45days   ┆ Python           ┆ 45days            │
│  Pandas  ┆ 24000 ┆ 50days   ┆ Pandas           ┆ 50days            │
└──────────┴───────┴──────────┴──────────────────┴───────────────────┘
Here,
- str.strip_chars() removes all leading and trailing whitespace characters (spaces, tabs, newlines) by default.
- It is applied here to both the Courses and Duration columns.
Conclusion
In summary, Polars simplifies text processing by providing str methods that allow you to transform, clean, and analyze string columns efficiently. With functions like to_lowercase(), to_uppercase(), strip_chars(), replace(), contains(), and slice(), you can handle most common string-cleaning tasks directly on DataFrame columns.
Happy Learning!!