In Polars, you can cast multiple columns to different data types by using the select() or with_columns() method along with the pl.col() expression and the cast() function. By combining these, you can specify both the columns and the desired target types for conversion in a Polars DataFrame. In this article, I will explain how to cast multiple columns in Polars.
Key Points –
- Casting allows you to change the data type of one or more columns in a Polars DataFrame, ensuring consistency and enabling specific operations.
- The with_columns method is typically used to apply transformations, including type casting, to multiple columns at once.
- Use pl.col("<column_name>").cast(<data_type>) to specify the column and its target data type.
- Pass a list of transformations to with_columns to cast multiple columns simultaneously.
- Common data types include pl.Int32, pl.Int64, pl.Float32, pl.Float64, pl.Boolean, pl.Utf8 (string), and pl.Date.
- The with_columns() method can be chained with other DataFrame operations for streamlined workflows.
- Columns can be selected by name (pl.col("name")), by data type, or by a condition for dynamic transformations.
Usage of Polars Cast Multiple Columns
Casting multiple columns in Polars involves transforming the data types of selected columns in a DataFrame to match the desired format. This operation is crucial for data cleaning, preprocessing, and ensuring compatibility with downstream analysis or storage systems.
First, let’s create a Polars DataFrame.
import polars as pl
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000, 30000],
'Duration':['30days','40days','35days','50days'],
'Discount':[1000,2300,1200,2000]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
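This creates a DataFrame with four rows and four columns. Polars infers Courses and Duration as str, and Fee and Discount as i64.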
To cast the numeric columns (such as 'Fee' and 'Discount') to floats in your Polars DataFrame, you can use the cast() function within the with_columns() method.
# Cast 'Fee' and 'Discount' columns to Float64
df2 = df.with_columns([
pl.col("Fee").cast(pl.Float64), # Cast 'Fee' to Float64
pl.col("Discount").cast(pl.Float64) # Cast 'Discount' to Float64
])
print("DataFrame with Casted Columns:\n", df2)
Here,
- We used pl.col("Fee").cast(pl.Float64) to cast the 'Fee' column to Float64 and pl.col("Discount").cast(pl.Float64) for the 'Discount' column.
- The Duration and Courses columns remain unaffected because no cast expression targets them; with_columns() keeps every other column as it is, so they stay of str type.
Cast Mixed Columns to Strings
To cast several columns to strings in Polars, whatever their current data types, apply cast(pl.Utf8) to each of them. In recent Polars releases, pl.Utf8 is an alias for pl.String.
# Cast 'Fee' and 'Discount' columns to strings
df2 = df.with_columns([
pl.col("Fee").cast(pl.Utf8), # Cast 'Fee' to String (Utf8)
pl.col("Discount").cast(pl.Utf8) # Cast 'Discount' to String (Utf8)
])
print("DataFrame with Casted Columns:\n", df2)
# Output:
# DataFrame with Casted Columns:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ str   ┆ str      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘
Here,
- The pl.col("Fee").cast(pl.Utf8) expression casts the 'Fee' column to a string (Utf8), and the same is done for the 'Discount' column.
- The 'Courses' and 'Duration' columns already contain string values, so they remain unchanged.
Cast Multiple Columns Dynamically
To cast multiple columns dynamically based on a specific condition (e.g., casting all numeric columns to strings), you can loop through the columns and apply the cast operation conditionally.
# Dynamically cast all numeric columns to strings (Utf8)
df2 = df.select([
pl.col(col).cast(pl.Utf8) if df[col].dtype in [pl.Int64, pl.Float64] else pl.col(col)
for col in df.columns
])
print("DataFrame with Dynamically Casted Columns:\n", df2)
Here,
- We loop through all the columns in the DataFrame using df.columns.
- For each column, we check if its data type is numeric (Int64 or Float64) using df[col].dtype. If it is numeric, we cast it to a string (Utf8); otherwise, we leave it unchanged.
- pl.col(col) selects each column, and cast(pl.Utf8) is applied conditionally.
This yields the same output as above.
Cast Columns to Unsigned Integers
To cast columns to unsigned integers, you can use the cast() method and specify an unsigned integer type, such as UInt8, UInt16, UInt32, or UInt64. This is particularly useful when you are working with columns that contain non-negative values and want to ensure the data type is unsigned.
# Cast 'Fee' and 'Discount' columns
# To unsigned integers (UInt32)
df2 = df.with_columns([
pl.col("Fee").cast(pl.UInt32), # Cast 'Fee' to UInt32 (unsigned int)
pl.col("Discount").cast(pl.UInt32) # Cast 'Discount' to UInt32 (unsigned int)
])
print("DataFrame with Casted Unsigned Integer Columns:\n", df2)
# Output:
# DataFrame with Casted Unsigned Integer Columns:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ u32   ┆ str      ┆ u32      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘
Here,
- The pl.col("Fee").cast(pl.UInt32) expression casts the 'Fee' column to an unsigned 32-bit integer (UInt32), and the same is done for the 'Discount' column.
- These types ensure the columns can only store non-negative integers (i.e., no negative values allowed).
Cast Boolean Columns to Integers
To cast Boolean columns to integers in Polars, you can use the cast() method and specify the appropriate integer type, such as pl.Int32, pl.Int64, etc. In this case, True will be cast to 1 and False will be cast to 0.
import polars as pl
technologies = {
'Courses': ["Spark", "PySpark", "Python", "pandas"],
'Is_Active': [True, False, True, False],
'Duration': ['30days', '40days', '35days', '50days'],
}
df = pl.DataFrame(technologies)
# Cast 'Is_Active' column to integers (Int32)
df2 = df.with_columns([
pl.col("Is_Active").cast(pl.Int32) # Cast Boolean to Integer (Int32)
])
print("DataFrame with Casted Boolean to Integer Column:\n", df2)
# Output:
# DataFrame with Casted Boolean to Integer Column:
# shape: (4, 3)
┌─────────┬───────────┬──────────┐
│ Courses ┆ Is_Active ┆ Duration │
│ ---     ┆ ---       ┆ ---      │
│ str     ┆ i32       ┆ str      │
╞═════════╪═══════════╪══════════╡
│ Spark   ┆ 1         ┆ 30days   │
│ PySpark ┆ 0         ┆ 40days   │
│ Python  ┆ 1         ┆ 35days   │
│ pandas  ┆ 0         ┆ 50days   │
└─────────┴───────────┴──────────┘
Here,
- True is converted to 1 and False is converted to 0 when casting from Boolean to integer.
- You can cast to different integer types, such as Int32, Int64, etc., depending on your requirements.
Cast Timestamps to Datetime
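To cast timestamps to datetime, use the cast() method with the pl.Datetime data type. This is commonly used when Unix timestamps stored as integers need to be converted into human-readable datetime values. The time unit passed to pl.Datetime ("ms", "us", or "ns") must match the unit of the stored integers; pl.Datetime has no seconds unit, so timestamps in seconds have to be scaled or converted another way first. The example below uses milliseconds.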
import polars as pl
# Sample data with Unix timestamps in milliseconds
technologies = {
'Courses': ["Spark", "PySpark", "Python", "pandas"],
'Start_Timestamp': [1672531200000, 1672617600000, 1672704000000, 1672790400000]
}
# Create DataFrame
df = pl.DataFrame(technologies)
# Cast 'Start_Timestamp' to Datetime
df_casted = df.with_columns([
pl.col("Start_Timestamp").cast(pl.Datetime("ms")) # Specify "ms" for milliseconds
])
print("DataFrame with Timestamps Cast to Datetime:\n", df_casted)
# Output:
# DataFrame with Timestamps Cast to Datetime:
# shape: (4, 2)
┌─────────┬─────────────────────┐
│ Courses ┆ Start_Timestamp     │
│ ---     ┆ ---                 │
│ str     ┆ datetime[ms]        │
╞═════════╪═════════════════════╡
│ Spark   ┆ 2023-01-01 00:00:00 │
│ PySpark ┆ 2023-01-02 00:00:00 │
│ Python  ┆ 2023-01-03 00:00:00 │
│ pandas  ┆ 2023-01-04 00:00:00 │
└─────────┴─────────────────────┘
Conclusion
In conclusion, casting multiple columns in Polars is a powerful and versatile operation that allows you to transform data types for efficient processing and analysis. Whether you’re converting strings to numbers, timestamps to datetime, or booleans to integers, Polars provides robust tools to handle such transformations seamlessly.
Happy Learning!!