In Polars, extracting the first N characters from a string column means retrieving a substring that starts at the first character (index 0) and includes only the next N characters of each value. This ensures that only the initial part of the string is preserved. The str.slice() method in Polars allows you to extract a substring of a specified length from each string within a column. In this article, I will demonstrate how to get the first N characters from a string column in Polars.
Key Points –
- Use str.slice(start, length) to extract the first N characters from a string column efficiently.
- str.slice(0, N) ensures that extraction starts from the beginning of the string.
- str.slice() counts characters rather than bytes, so multi-byte UTF-8 text is not cut in the middle of a character.
- str.extract(pattern) allows regex-based extraction, useful for pattern-matching scenarios.
- To filter based on extracted values, combine str.slice() with filter().
- Missing (null) values are preserved when using str.slice(), preventing errors.
- To handle nulls, use fill_null(value) before applying str.slice().
- Multiple columns can be processed simultaneously using select() or with_columns().
Usage of Getting the First N Characters from a String Column
Extracting the first N characters from a string column is a common text-processing task in Polars, often used for data cleaning, transformation, and feature engineering. Polars provides more than one way to do this: str.slice() is the most direct, and recent Polars versions also offer str.head() as a shorthand for taking the first N characters. If more flexibility is required, str.extract() can be used for regex-based extraction.
To run some examples of how to get the first N characters from a string column in Polars, let’s create a Polars DataFrame.
import polars as pl
df = pl.DataFrame({
"Courses":["Spark","PySpark","Hadoop","Python","Pandas"],
'Duration':['30days','50days','35days', '40days','55days']
})
print("Original DataFrame:\n", df)
You can use the str.slice(start, length) method to extract a specific portion of a string column. To extract the first two characters (numbers) from the "Duration" column, you can apply str.slice(0, 2).
# Extract the first 2 characters from "Duration"
df2 = df.with_columns(pl.col("Duration").str.slice(0, 2).alias("Duration_Num"))
print("After getting the first N characters:\n",df2)
Here,
- str.slice(0, 2) extracts the first two characters, starting from index 0.
- alias("Duration_Num") creates a new column "Duration_Num" with the extracted values.

Using str.extract() with Regex
You can use str.extract() with regular expressions (regex) to extract the first n characters from a string column. This function is particularly useful for extracting specific patterns from text data.
# Using str.extract() with regex
df2 = df.with_columns(
pl.col("Courses").str.extract(r"^(.{3})").alias("first_n_chars")
)
print("After getting the first N characters:\n",df2)
# Number of characters to extract
n = 3
# Extract first n characters using str.extract
pattern = f"^.{{0,{n}}}"
df2 = df.with_columns(
pl.col("Courses").str.extract(pattern, 0).alias("first_n_chars")
)
print("After getting the first N characters:\n",df2)
# Output:
# After getting the first N characters:
# shape: (5, 3)
┌─────────┬──────────┬───────────────┐
│ Courses ┆ Duration ┆ first_n_chars │
│ ---     ┆ ---      ┆ ---           │
│ str     ┆ str      ┆ str           │
╞═════════╪══════════╪═══════════════╡
│ Spark   ┆ 30days   ┆ Spa           │
│ PySpark ┆ 50days   ┆ PyS           │
│ Hadoop  ┆ 35days   ┆ Had           │
│ Python  ┆ 40days   ┆ Pyt           │
│ Pandas  ┆ 55days   ┆ Pan           │
└─────────┴──────────┴───────────────┘
Here,
- r"^(.{3})" is a regex pattern where ^ anchors the match to the start of the string, .{3} matches exactly three characters of any kind (except a newline), and () captures them as a group.
- str.extract(r"^(.{3})") returns that captured group for each value in "Courses".
- f"^.{{0,{n}}}" builds the pattern ^.{0,3}, which matches up to n characters; passing 0 as the second argument returns the whole match instead of a capture group.
- alias("first_n_chars") creates a new column "first_n_chars".
Extracting First n Characters from Multiple Columns
To extract the first n characters from multiple columns in a Polars DataFrame, pass one str.slice() expression per column to with_columns(); str.extract() can be used the same way for regex-based extraction.
# Extract the first 4 characters from "Courses" and the first 2 from "Duration"
df2 = df.with_columns([
    pl.col("Courses").str.slice(0, 4).alias("Courses_Short"),
    pl.col("Duration").str.slice(0, 2).alias("Duration_Short")
])
print("After extracting the first n characters from multiple columns:\n", df2)
# Output:
# After extracting the first n characters from multiple columns:
# shape: (5, 4)
┌─────────┬──────────┬───────────────┬────────────────┐
│ Courses ┆ Duration ┆ Courses_Short ┆ Duration_Short │
│ ---     ┆ ---      ┆ ---           ┆ ---            │
│ str     ┆ str      ┆ str           ┆ str            │
╞═════════╪══════════╪═══════════════╪════════════════╡
│ Spark   ┆ 30days   ┆ Spar          ┆ 30             │
│ PySpark ┆ 50days   ┆ PySp          ┆ 50             │
│ Hadoop  ┆ 35days   ┆ Hado          ┆ 35             │
│ Python  ┆ 40days   ┆ Pyth          ┆ 40             │
│ Pandas  ┆ 55days   ┆ Pand          ┆ 55             │
└─────────┴──────────┴───────────────┴────────────────┘
Here,
- str.slice(0, 4) extracts the first 4 characters.
- alias("Courses_Short") names the new column.
- Using with_columns([]) allows adding multiple new columns at once.
Using String Expression Methods
Polars exposes its string operations through the str expression namespace. For the first n characters, the usual choice is str.slice(0, n), which runs as a vectorized expression over the whole column rather than in a Python loop.
# Number of characters to extract
n = 3
# Extract first n characters
# Using string expression slice() method
df2 = df.with_columns(
pl.col("Courses").str.slice(0, n).alias("first_n_chars")
)
print("After getting the first N characters:\n",df2)
# Output:
# After getting the first N characters:
# shape: (5, 3)
┌─────────┬──────────┬───────────────┐
│ Courses ┆ Duration ┆ first_n_chars │
│ ---     ┆ ---      ┆ ---           │
│ str     ┆ str      ┆ str           │
╞═════════╪══════════╪═══════════════╡
│ Spark   ┆ 30days   ┆ Spa           │
│ PySpark ┆ 50days   ┆ PyS           │
│ Hadoop  ┆ 35days   ┆ Had           │
│ Python  ┆ 40days   ┆ Pyt           │
│ Pandas  ┆ 55days   ┆ Pan           │
└─────────┴──────────┴───────────────┘
Here,
- pl.col("Courses").str.slice(0, 3) extracts the first 3 characters from the "Courses" column.
- alias("first_n_chars") creates a new column "first_n_chars" with the extracted values.
Getting First n Characters Using str.slice() and Filtering
You can use str.slice(start, length) to extract the first n characters from a string column and apply filtering based on the extracted values. This allows you to filter rows based on the extracted substring.
# Extract first 2 characters and create a new column
df = df.with_columns(
pl.col("Duration").str.slice(0, 2).alias("Duration_Num")
)
# Cast the extracted values to integers and keep rows where Duration_Num > 40
result = df.with_columns(
pl.col("Duration_Num").cast(pl.Int64)
).filter(pl.col("Duration_Num") > 40)
print(result)
# Output:
# shape: (2, 3)
┌─────────┬──────────┬──────────────┐
│ Courses ┆ Duration ┆ Duration_Num │
│ ---     ┆ ---      ┆ ---          │
│ str     ┆ str      ┆ i64          │
╞═════════╪══════════╪══════════════╡
│ PySpark ┆ 50days   ┆ 50           │
│ Pandas  ┆ 55days   ┆ 55           │
└─────────┴──────────┴──────────────┘
Here,
- Extract the first 2 characters from "Duration" using str.slice(0, 2).
- Store the result as a new column named "Duration_Num".
- Convert "Duration_Num" to an integer using cast(pl.Int64).
- Filter rows where "Duration_Num" > 40.
Handling null or Missing Values
When working with text data, null (missing) values can cause issues if not handled properly. In Polars, missing values are represented as None or null. We can handle them using fill_null(), drop_nulls(), or apply conditional logic before extracting substrings.
import polars as pl
df = pl.DataFrame({
"Courses": ["Spark", "PySpark", "Hadoop", "Python", None],
"Duration": ["30days", "50days", None, "40days", "55days"]
})
# Extract first 2 characters from "Duration" column, handling null values
df2 = df.with_columns(
pl.col("Duration").str.slice(0, 2).alias("Duration_Num"))
print(df2)
# Output:
# shape: (5, 3)
┌─────────┬──────────┬──────────────┐
│ Courses ┆ Duration ┆ Duration_Num │
│ ---     ┆ ---      ┆ ---          │
│ str     ┆ str      ┆ str          │
╞═════════╪══════════╪══════════════╡
│ Spark   ┆ 30days   ┆ 30           │
│ PySpark ┆ 50days   ┆ 50           │
│ Hadoop  ┆ null     ┆ null         │
│ Python  ┆ 40days   ┆ 40           │
│ null    ┆ 55days   ┆ 55           │
└─────────┴──────────┴──────────────┘
Here,
- Null values are preserved in "Duration_Num" without causing errors.
Filling Missing Values Before Extracting
str.slice() passes null values through unchanged, which is fine for the extraction itself but can be a problem if later steps expect a string in every row. One way to handle this is to fill missing values with fill_null() before extracting the first n characters.
# Filling missing values before extracting
df2 = df.with_columns(
pl.col("Duration").fill_null("Unknown").str.slice(0, 2).alias("Duration_Num"))
print(df2)
# Output:
# shape: (5, 3)
┌─────────┬──────────┬──────────────┐
│ Courses ┆ Duration ┆ Duration_Num │
│ ---     ┆ ---      ┆ ---          │
│ str     ┆ str      ┆ str          │
╞═════════╪══════════╪══════════════╡
│ Spark   ┆ 30days   ┆ 30           │
│ PySpark ┆ 50days   ┆ 50           │
│ Hadoop  ┆ null     ┆ Un           │
│ Python  ┆ 40days   ┆ 40           │
│ null    ┆ 55days   ┆ 55           │
└─────────┴──────────┴──────────────┘
Here,
- Null values are replaced with "Unknown" before slicing, so the null row yields "Un" in "Duration_Num" instead of null. The original "Duration" column is left unchanged.
Dropping Rows with Missing Values
If we don’t want null values at all, we can remove rows with null values using drop_nulls().
# Dropping rows with missing values
df2 = df.drop_nulls()
print(df2)
# Output:
# shape: (3, 2)
┌─────────┬──────────┐
│ Courses ┆ Duration │
│ ---     ┆ ---      │
│ str     ┆ str      │
╞═════════╪══════════╡
│ Spark   ┆ 30days   │
│ PySpark ┆ 50days   │
│ Python  ┆ 40days   │
└─────────┴──────────┘
Here,
- Removes all rows that contain null values.
Conclusion
In summary, Polars offers powerful and efficient string manipulation methods for extracting the first n characters from a column. The preferred approach is str.slice(), while str.extract() is useful for regex-based extraction. Choosing the right method ensures both efficiency and readability when handling text data in Polars.
Happy Learning!!