Get First N Characters from a String Column in Polars

  • Post category: Polars
  • Post last modified: April 2, 2025

In Polars, extracting the first N characters from a string column means retrieving a substring that starts at the first character (index 0) and includes only the next N characters of each value. This ensures that only the initial part of the string is preserved. The str.slice() method in Polars allows you to extract a substring of a specified length from each string within a column. In this article, I will demonstrate how to get the first N characters from a string column in Polars.


Key Points 

  • Use str.slice(start, length) to extract the first N characters from a string column efficiently.
  • str.slice(0, N) ensures that extraction starts from the beginning of the string.
  • In recent Polars releases, str.head(N) returns the first N characters directly and reads more clearly than str.slice(0, N).
  • str.extract(pattern) allows regex-based extraction, useful for pattern-matching scenarios.
  • To filter based on extracted values, combine str.slice() with filter().
  • Missing (null) values are preserved when using str.slice(), preventing errors.
  • To handle nulls, use fill_null(value) before applying str.slice().
  • Multiple columns can be processed simultaneously using select() or with_columns().

Getting the First N Characters from a String Column

Extracting the first N characters from a string column is a common text-processing task in Polars, often used for data cleaning, transformation, and feature engineering. Polars provides several ways to do this: str.slice() is the most direct, and recent releases also offer str.head() as a shorthand. If more flexibility is required, str.extract() can be used for regex-based extraction.

To run some examples of how to get the first N characters from a string column in Polars, let’s create a Polars DataFrame.


import polars as pl

df = pl.DataFrame({
     "Courses":["Spark","PySpark","Hadoop","Python","Pandas"],
     'Duration':['30days','50days','35days', '40days','55days']
})
print("Original DataFrame:\n", df)

This yields the output below.

[Image: printed output of the original DataFrame]

You can use the str.slice(start, length) method to extract a specific portion of a string column. To extract the first two characters (numbers) from the "Duration" column, you can apply str.slice(0, 2).


# Extract the first 2 characters from "Duration"
df2 = df.with_columns(pl.col("Duration").str.slice(0, 2).alias("Duration_Num"))
print("After getting the first N characters:\n",df2)

Here,

  • str.slice(0, 2) extracts the first two characters, starting from index 0.
  • alias("Duration_Num") creates a new column "Duration_Num" with the extracted values.

[Image: DataFrame with the new "Duration_Num" column]

Using str.extract() with Regex

You can use str.extract() with regular expressions (regex) to extract the first n characters from a string column. This function is particularly useful for extracting specific patterns from text data.


# Using str.extract() with regex
df2 = df.with_columns(
    pl.col("Courses").str.extract(r"^(.{3})").alias("first_n_chars")
)
print("After getting the first N characters:\n",df2)

# Number of characters to extract
n = 3
# Extract first n characters using str.extract
pattern = f"^.{{0,{n}}}"
df2 = df.with_columns(
    pl.col("Courses").str.extract(pattern, 0).alias("first_n_chars")
)
print("After getting the first N characters:\n",df2)

# Output:
# After getting the first N characters:
# shape: (5, 3)
┌─────────┬──────────┬───────────────┐
│ Courses ┆ Duration ┆ first_n_chars │
│ ---     ┆ ---      ┆ ---           │
│ str     ┆ str      ┆ str           │
╞═════════╪══════════╪═══════════════╡
│ Spark   ┆ 30days   ┆ Spa           │
│ PySpark ┆ 50days   ┆ PyS           │
│ Hadoop  ┆ 35days   ┆ Had           │
│ Python  ┆ 40days   ┆ Pyt           │
│ Pandas  ┆ 55days   ┆ Pan           │
└─────────┴──────────┴───────────────┘

Here,

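  • r"^(.{3})" is a regex pattern where:
    • ^ anchors the match to the start of the string.
    • .{3} matches exactly three characters of any kind (except a newline).
    • () captures the matched characters as a group.
  • str.extract(r"^(.{3})") extracts the first three characters from "Courses" and returns null for values shorter than three characters.
  • The second variant, ^.{0,n} with group index 0, returns the whole match and also accepts values shorter than n characters.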
  • alias("first_n_chars") creates a new column "first_n_chars".

Extracting First n Characters from Multiple Columns

To extract the first n characters from multiple columns in a Polars DataFrame, apply a string expression such as str.slice() (or str.extract() for regex-based extraction) to each column inside with_columns().


# Extract the first 4 characters from "Courses" and the first 2 from "Duration"
df2 = df.with_columns([
    pl.col("Courses").str.slice(0, 4).alias("Courses_Short"),
    pl.col("Duration").str.slice(0, 2).alias("Duration_Short")
])
print("After extracting the first n characters from multiple columns:\n",df2)

# Output:
# After extracting the first n characters from multiple columns:
# shape: (5, 4)
┌─────────┬──────────┬───────────────┬────────────────┐
│ Courses ┆ Duration ┆ Courses_Short ┆ Duration_Short │
│ ---     ┆ ---      ┆ ---           ┆ ---            │
│ str     ┆ str      ┆ str           ┆ str            │
╞═════════╪══════════╪═══════════════╪════════════════╡
│ Spark   ┆ 30days   ┆ Spar          ┆ 30             │
│ PySpark ┆ 50days   ┆ PySp          ┆ 50             │
│ Hadoop  ┆ 35days   ┆ Hado          ┆ 35             │
│ Python  ┆ 40days   ┆ Pyth          ┆ 40             │
│ Pandas  ┆ 55days   ┆ Pand          ┆ 55             │
└─────────┴──────────┴───────────────┴────────────────┘

Here,

  • str.slice(0, 4) extracts the first 4 characters.
  • alias("Courses_Short") renames the new column.
  • Using with_columns([]) allows adding multiple new columns at once.

Using String Expression Methods

Polars offers efficient string expression methods for extracting the first n characters from a column. The usual approach is str.slice(offset, length), which takes the starting position and the number of characters to keep. Recent Polars releases also provide str.head(n) as a shorthand for slicing from the start of the string.


# Number of characters to extract
n = 3

# Extract first n characters 
# Using string expression slice() method
df2 = df.with_columns(
    pl.col("Courses").str.slice(0, n).alias("first_n_chars")
)
print("After getting the first N characters:\n",df2)

# Output:
# After getting the first N characters:
# shape: (5, 3)
┌─────────┬──────────┬───────────────┐
│ Courses ┆ Duration ┆ first_n_chars │
│ ---     ┆ ---      ┆ ---           │
│ str     ┆ str      ┆ str           │
╞═════════╪══════════╪═══════════════╡
│ Spark   ┆ 30days   ┆ Spa           │
│ PySpark ┆ 50days   ┆ PyS           │
│ Hadoop  ┆ 35days   ┆ Had           │
│ Python  ┆ 40days   ┆ Pyt           │
│ Pandas  ┆ 55days   ┆ Pan           │
└─────────┴──────────┴───────────────┘

Here,

  • pl.col("Courses").str.slice(0, 3) extracts the first 3 characters from the "Courses" column.
  • alias("first_n_chars") creates a new column "first_n_chars" with the extracted values.

Getting First n Characters Using str.slice() and Filtering

You can use str.slice(start, length) to extract the first n characters from a string column and apply filtering based on the extracted values. This allows you to filter rows based on the extracted substring.


# Extract first 2 characters and create a new column
df = df.with_columns(
    pl.col("Duration").str.slice(0, 2).alias("Duration_Num")
)

# Convert the extracted values to integers
# and keep rows where Duration_Num > 40
result = df.with_columns(
    pl.col("Duration_Num").cast(pl.Int64)
).filter(pl.col("Duration_Num") > 40)
print(result)

# Output:
# shape: (2, 3)
┌─────────┬──────────┬──────────────┐
│ Courses ┆ Duration ┆ Duration_Num │
│ ---     ┆ ---      ┆ ---          │
│ str     ┆ str      ┆ i64          │
╞═════════╪══════════╪══════════════╡
│ PySpark ┆ 50days   ┆ 50           │
│ Pandas  ┆ 55days   ┆ 55           │
└─────────┴──────────┴──────────────┘

Here,

  • Extract first 2 characters from "Duration" using str.slice(0, 2).
  • Store it as a new column named "Duration_Num".
  • Convert "Duration_Num" to an integer using cast(pl.Int64).
  • Filter rows where "Duration_Num" > 40.

Handling null or Missing Values

When working with text data, null (missing) values need deliberate handling. In Polars, a missing value is stored as null (created from None in Python), and str.slice() returns null for a null input. You can keep the nulls, replace them with fill_null(), remove them with drop_nulls(), or apply conditional logic before extracting substrings.


import polars as pl

df = pl.DataFrame({
     "Courses": ["Spark", "PySpark", "Hadoop", "Python", None],
     "Duration": ["30days", "50days", None, "40days", "55days"]
})

# Extract first 2 characters from "Duration" column, handling null values
df2 = df.with_columns(
    pl.col("Duration").str.slice(0, 2).alias("Duration_Num"))
print(df2)

# Output:
# shape: (5, 3)
┌─────────┬──────────┬──────────────┐
│ Courses ┆ Duration ┆ Duration_Num │
│ ---     ┆ ---      ┆ ---          │
│ str     ┆ str      ┆ str          │
╞═════════╪══════════╪══════════════╡
│ Spark   ┆ 30days   ┆ 30           │
│ PySpark ┆ 50days   ┆ 50           │
│ Hadoop  ┆ null     ┆ null         │
│ Python  ┆ 40days   ┆ 40           │
│ null    ┆ 55days   ┆ 55           │
└─────────┴──────────┴──────────────┘

Here,

  • Null values are preserved in "Duration_Num" without causing errors.

Filling Missing Values Before Extracting

String operations in Polars pass null values through unchanged, so a null input produces a null output. If a placeholder is preferable to null in the result, fill the missing values with fill_null() before extracting the first n characters.


# Filling missing values before extracting
df2 = df.with_columns(
    pl.col("Duration").fill_null("Unknown").str.slice(0, 2).alias("Duration_Num"))
print(df2)

# Output:
# shape: (5, 3)
┌─────────┬──────────┬──────────────┐
│ Courses ┆ Duration ┆ Duration_Num │
│ ---     ┆ ---      ┆ ---          │
│ str     ┆ str      ┆ str          │
╞═════════╪══════════╪══════════════╡
│ Spark   ┆ 30days   ┆ 30           │
│ PySpark ┆ 50days   ┆ 50           │
│ Hadoop  ┆ null     ┆ Un           │
│ Python  ┆ 40days   ┆ 40           │
│ null    ┆ 55days   ┆ 55           │
└─────────┴──────────┴──────────────┘

Here,

  • Null values are replaced with "Unknown" before slicing, so the new column shows the first two characters of the placeholder ("Un") instead of null. The original "Duration" column is unchanged.

Dropping Rows with Missing Values

If we don’t want null values at all, we can remove rows with null values using drop_nulls().


# Dropping rows with missing values
df2 = df.drop_nulls()
print(df2)

# Output:
# shape: (3, 2)
┌─────────┬──────────┐
│ Courses ┆ Duration │
│ ---     ┆ ---      │
│ str     ┆ str      │
╞═════════╪══════════╡
│ Spark   ┆ 30days   │
│ PySpark ┆ 50days   │
│ Python  ┆ 40days   │
└─────────┴──────────┘

Here,

  • Removes all rows that contain null values.

Conclusion

In summary, Polars offers powerful and efficient string manipulation methods for extracting the first n characters from a column. The preferred approach is str.slice(), while str.extract() is useful for regex-based extraction. Choosing the right method ensures both efficiency and readability when handling text data in Polars.

Happy Learning!!
