To convert a string column to an integer type in a Polars DataFrame, use the cast() method, which changes the data type of a column. To convert a column from Utf8 (string) to Int32 or Int64, apply cast() to that column. This is particularly useful when the column contains numeric values stored as text and you want to perform mathematical or comparison operations on them. In this article, I will explain how to cast a string to an integer.
Key Points –
- The cast() method converts a column from one data type to another, including from string to integer.
- Use pl.Int32, pl.Int64, or another integer type depending on the range of values in the data.
- You can cast multiple columns to integer types in a single operation using the with_columns() method.
- Use alias() to give the cast result a new column name.
- Casting is written as a Polars expression, so it runs inside the Polars engine rather than in a Python loop.
- The cast() method supports multiple integer types, including Int8, Int16, Int32, Int64, and more.
- You can select a column for casting using pl.col('column_name') before applying cast().
Usage of Polars cast string to integer
In Polars, the cast() method is used to convert a column from one data type to another. When you need to cast a string column to an integer, you can use pl.Int32, pl.Int64, or other integer types, depending on the size of the values you’re working with.
To run some examples of casting a string to an integer, let’s create a Polars DataFrame.
import polars as pl
# Every column, including the numeric-looking ones, is created as a string
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Pandas"],
    'Fee': ['22000', '25000', '24000', '26000'],
    'Duration': ['30days', '50days', '40days', '60days'],
    'Discount': ['1000', '2300', '2500', '1400'],
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
This creates a DataFrame with four rows in which every column, including Fee and Discount, has the string type.
To convert a single column from a string to an integer in Polars, use the cast() method on that specific column. Below is an example where the Fee column, which is currently in string format, is cast to an integer.
# Cast 'Fee' column from string to integer (Int32)
df2 = df.with_columns(pl.col('Fee').cast(pl.Int32).alias('Fee_int'))
print("DataFrame with Fee column cast to integer:\n", df2)
Here,
- pl.col('Fee') – Selects the Fee column.
- cast(pl.Int32) – Converts the Fee column from a string type to an integer type (Int32).
- alias('Fee_int') – Creates a new column named Fee_int to store the cast integer values.
Cast Multiple String Columns to Integer
You can apply the cast() method to each column and perform the transformation using with_columns(). In Polars, this approach allows you to cast multiple string columns to integers efficiently. Here’s how you can convert multiple columns, such as Fee and Discount, from strings to integers.
# Cast 'Fee' and 'Discount' columns from string to integer
df2 = df.with_columns([
    pl.col("Fee").cast(pl.Int32).alias("Fee"),
    pl.col("Discount").cast(pl.Int32).alias("Discount"),
])
print("Updated DataFrame with Fee and Discount as Integer:\n", df2)
# Output:
# Updated DataFrame with Fee and Discount as Integer:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ str ┆ i32 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 │
│ Hadoop ┆ 24000 ┆ 40days ┆ 2500 │
│ Pandas ┆ 26000 ┆ 60days ┆ 1400 │
└─────────┴───────┴──────────┴──────────┘
Here,
- pl.col("Fee").cast(pl.Int32) – Casts the Fee column from string to integer (Int32).
- pl.col("Discount").cast(pl.Int32) – Casts the Discount column from string to integer (Int32).
- with_columns() – Applies both transformations in one step.
Cast String to Integer with pl.Int64
To cast a string column to an integer using pl.Int64 in Polars, you can modify the code to cast the "Fee" and "Discount" columns to Int64 instead of Int32.
# Casting 'Fee' and 'Discount' columns to Int64
df2 = df.with_columns([
    pl.col("Fee").cast(pl.Int64),
    pl.col("Discount").cast(pl.Int64),
])
print("DataFrame after casting to Int64:\n", df2)
# Output:
# DataFrame after casting to Int64:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee ┆ Duration ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 │
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 │
│ Hadoop ┆ 24000 ┆ 40days ┆ 2500 │
│ Pandas ┆ 26000 ┆ 60days ┆ 1400 │
└─────────┴───────┴──────────┴──────────┘
Here,
- Using pl.Int64 ensures that your numeric columns (like "Fee" and "Discount") are cast to 64-bit integers, which is suitable for large numeric values.
- The "Courses" and "Duration" columns remain strings because they contain non-numeric data.
Cast and Apply Mathematical Operations
To cast a string column to an integer and then apply various mathematical operations in Polars, you can use the cast() method along with arithmetic operators (+, -, *, /, %, **). Here are a few examples demonstrating how to cast a string column to an integer and apply mathematical operations on it.
Cast String to Integer and Add a Value
To cast a string column to an integer and then add a specific value in Polars, you can use the cast() method along with the addition operator (+).
# Cast to Int64 and add 1000 to each value
df2 = df.with_columns((pl.col("Fee").cast(pl.Int64) + 1000).alias("Fee_plus_1000"))
print(df2)
# Output:
# shape: (4, 5)
┌─────────┬───────┬──────────┬──────────┬───────────────┐
│ Courses ┆ Fee ┆ Duration ┆ Discount ┆ Fee_plus_1000 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╪═══════════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 ┆ 23000 │
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 ┆ 26000 │
│ Hadoop ┆ 24000 ┆ 40days ┆ 2500 ┆ 25000 │
│ Pandas ┆ 26000 ┆ 60days ┆ 1400 ┆ 27000 │
└─────────┴───────┴──────────┴──────────┴───────────────┘
Here,
- pl.col("Fee").cast(pl.Int64) – Casts the 'Fee' column from string to integer (Int64).
- + 1000 – Adds 1000 to each value in the 'Fee' column.
- alias("Fee_plus_1000") – Names the new column 'Fee_plus_1000'.
Cast String to Integer and Apply Exponentiation
To cast a string column to an integer and then apply exponentiation in Polars, you can use the cast() method along with the ** operator (or the pow() method).
# Cast 'Fee' column to Int64 and apply exponentiation (square each value)
df2 = df.with_columns(
    (pl.col("Fee").cast(pl.Int64) ** 2).alias("Squared_Value")
)
print(df2)
# Output:
# shape: (4, 5)
┌─────────┬───────┬──────────┬──────────┬───────────────┐
│ Courses ┆ Fee ┆ Duration ┆ Discount ┆ Squared_Value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╪═══════════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 ┆ 484000000 │
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 ┆ 625000000 │
│ Hadoop ┆ 24000 ┆ 40days ┆ 2500 ┆ 576000000 │
│ Pandas ┆ 26000 ┆ 60days ┆ 1400 ┆ 676000000 │
└─────────┴───────┴──────────┴──────────┴───────────────┘
Here,
- pl.col("Fee").cast(pl.Int64) – Casts the 'Fee' column from string to integer (Int64).
- ** 2 – Applies exponentiation (squares each value in this case).
- alias("Squared_Value") – Names the new column 'Squared_Value'.
Cast String to Integer and Find the Modulus
To cast a string column to an integer and then find the modulus (remainder after division) in Polars, you can use the cast() method along with the modulus operator (%).
# Cast 'Fee' column to Int64 and find modulus when divided by 7000
# Assign to df2 so the original df is left unchanged for the later examples
df2 = df.with_columns((pl.col("Fee").cast(pl.Int64) % 7000).alias("Fee_Modulus_7000"))
print(df2)
# Output:
# shape: (4, 5)
┌─────────┬───────┬──────────┬──────────┬──────────────────┐
│ Courses ┆ Fee ┆ Duration ┆ Discount ┆ Fee_Modulus_7000 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╪══════════════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 ┆ 1000 │
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 ┆ 4000 │
│ Hadoop ┆ 24000 ┆ 40days ┆ 2500 ┆ 3000 │
│ Pandas ┆ 26000 ┆ 60days ┆ 1400 ┆ 5000 │
└─────────┴───────┴──────────┴──────────┴──────────────────┘
Here,
- pl.col("Fee").cast(pl.Int64) – Converts the 'Fee' column from string to integer (Int64).
- % 7000 – Finds the remainder when each value is divided by 7000.
- alias("Fee_Modulus_7000") – Names the resulting column 'Fee_Modulus_7000'.
Cast String Column and Rename
You can use the cast() method to change the column type and the alias() method to assign a new name to the converted column. This is how you cast a string column to an integer and rename it in Polars.
# Cast 'Fee' column to Int64 and rename it to 'Fee_in_int'
df2 = df.with_columns(pl.col("Fee").cast(pl.Int64).alias("Fee_in_int"))
print(df2)
# Output:
# shape: (4, 5)
┌─────────┬───────┬──────────┬──────────┬────────────┐
│ Courses ┆ Fee ┆ Duration ┆ Discount ┆ Fee_in_int │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞═════════╪═══════╪══════════╪══════════╪════════════╡
│ Spark ┆ 22000 ┆ 30days ┆ 1000 ┆ 22000 │
│ PySpark ┆ 25000 ┆ 50days ┆ 2300 ┆ 25000 │
│ Hadoop ┆ 24000 ┆ 40days ┆ 2500 ┆ 24000 │
│ Pandas ┆ 26000 ┆ 60days ┆ 1400 ┆ 26000 │
└─────────┴───────┴──────────┴──────────┴────────────┘
Here,
- pl.col("Fee").cast(pl.Int64) – Casts the 'Fee' column from string to integer (Int64).
- alias("Fee_in_int") – Names the resulting column 'Fee_in_int'.
Cast String to Integer and Handle Missing Values
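When casting a string column to an integer in Polars, invalid values (such as empty strings or non-numeric text) cause the default strict cast to raise an error, while missing values (null) simply stay null. To handle such cases, pass strict=False to cast(), or use pl.col().str.to_integer(strict=False), so that unparseable values become null, and then apply fill_null() to replace the nulls.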
import polars as pl
# Sample DataFrame with missing and invalid string values
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Pandas"],
    'Fee': ['22000', '25000', None, '26000'],  # Contains a missing value (None)
    'Discount': ['1000', '2300', 'invalid', '1400'],  # Contains an invalid string ("invalid")
}
df = pl.DataFrame(technologies)
# Cast 'Fee' and 'Discount' columns to integer, handling missing/invalid values
# strict=False turns unparseable strings into null; fill_null(0) then replaces nulls with 0
df2 = df.with_columns([
    pl.col('Fee').cast(pl.Int32, strict=False).fill_null(0).alias('Fee_int'),
    pl.col('Discount').cast(pl.Int32, strict=False).fill_null(0).alias('Discount_int'),
])
print(df2)
# Output:
# shape: (4, 5)
┌─────────┬───────┬──────────┬─────────┬──────────────┐
│ Courses ┆ Fee ┆ Discount ┆ Fee_int ┆ Discount_int │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i32 ┆ i32 │
╞═════════╪═══════╪══════════╪═════════╪══════════════╡
│ Spark ┆ 22000 ┆ 1000 ┆ 22000 ┆ 1000 │
│ PySpark ┆ 25000 ┆ 2300 ┆ 25000 ┆ 2300 │
│ Hadoop ┆ null ┆ invalid ┆ 0 ┆ 0 │
│ Pandas ┆ 26000 ┆ 1400 ┆ 26000 ┆ 1400 │
└─────────┴───────┴──────────┴─────────┴──────────────┘
Conclusion
In summary, converting a string column to an integer type in Polars is easy with the cast() method. This transformation is essential for data preparation, particularly when numbers are stored as strings. By using cast(pl.Int32) or other integer types like Int64, you ensure the column is properly formatted for numerical computations or further data manipulation.
Happy Learning!!