  • Post last modified:January 8, 2025

In Polars, you can cast multiple columns to different data types by using the select() or with_columns() method together with pl.col() expressions and the cast() method. Combining these lets you specify both the columns and the target types for conversion in a Polars DataFrame. In this article, I will explain how to cast multiple columns in Polars.


Key Points –

  • Casting allows you to change the data type of one or more columns in a Polars DataFrame, ensuring consistency and enabling specific operations.
  • The with_columns method is typically used to apply transformations, including type casting, to multiple columns at once.
  • Use pl.col("<column_name>").cast(<data_type>) to specify the column and its target data type.
  • Pass a list of transformations to with_columns to cast multiple columns simultaneously.
  • Common data types include pl.Int32, pl.Int64, pl.Float32, pl.Float64, pl.Boolean, pl.Utf8 (string, also available as pl.String in recent Polars releases), and pl.Date.
  • The with_columns() method can be seamlessly chained with other DataFrame operations for streamlined workflows.
  • Columns can be selected by name (pl.col("name")), by data type, or by a condition for dynamic transformations.

Usage of Polars Cast Multiple Columns

Casting multiple columns in Polars involves transforming the data types of selected columns in a DataFrame to match the desired format. This operation is crucial for data cleaning, preprocessing, and ensuring compatibility with downstream analysis or storage systems.

First, let’s create a Polars DataFrame.


import polars as pl

technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas"],
    'Fee': [20000, 25000, 22000, 30000],
    'Duration': ['30days', '40days', '35days', '50days'],
    'Discount': [1000, 2300, 1200, 2000]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

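Yields the below output.

# Output:
# Original DataFrame:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘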

To cast the numeric columns (such as 'Fee' and 'Discount') to floats in your Polars DataFrame, you can use the cast() function within the with_columns() method.


# Cast 'Fee' and 'Discount' columns to Float64
df2 = df.with_columns([
    pl.col("Fee").cast(pl.Float64),       # Cast 'Fee' to Float64
    pl.col("Discount").cast(pl.Float64)   # Cast 'Discount' to Float64
])
print("DataFrame with Casted Columns:\n", df2)

Here,

  • We used pl.col("Fee").cast(pl.Float64) to cast the 'Fee' column to Float64 and pl.col("Discount").cast(pl.Float64) for the 'Discount' column.
  • The Duration and Courses columns remain unchanged because with_columns() only replaces the columns named in the expressions; every other column passes through as-is.
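
# Output:
# DataFrame with Casted Columns:
# shape: (4, 4)
┌─────────┬─────────┬──────────┬──────────┐
│ Courses ┆ Fee     ┆ Duration ┆ Discount │
│ ---     ┆ ---     ┆ ---      ┆ ---      │
│ str     ┆ f64     ┆ str      ┆ f64      │
╞═════════╪═════════╪══════════╪══════════╡
│ Spark   ┆ 20000.0 ┆ 30days   ┆ 1000.0   │
│ PySpark ┆ 25000.0 ┆ 40days   ┆ 2300.0   │
│ Python  ┆ 22000.0 ┆ 35days   ┆ 1200.0   │
│ pandas  ┆ 30000.0 ┆ 50days   ┆ 2000.0   │
└─────────┴─────────┴──────────┴──────────┘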

Cast Mixed Columns to Strings

To cast mixed columns (columns with different data types) to strings in Polars, you can use the cast(pl.Utf8) method to convert them into string format.


# Cast 'Fee' and 'Discount' columns to strings
df2 = df.with_columns([
    pl.col("Fee").cast(pl.Utf8),        # Cast 'Fee' to String (Utf8)
    pl.col("Discount").cast(pl.Utf8)    # Cast 'Discount' to String (Utf8)
])
print("DataFrame with Casted Columns:\n", df2)

# Output:
# DataFrame with Casted Columns:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ str   ┆ str      ┆ str      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The pl.col("Fee").cast(pl.Utf8) casts the 'Fee' column to a string (Utf8), and the same is done for the 'Discount' column.
  • The 'Courses' and 'Duration' columns already contain string values, so they remain unchanged.

Cast Multiple Columns Dynamically

To cast multiple columns dynamically based on a specific condition (e.g., casting all numeric columns to strings), you can loop through the columns and apply the cast operation conditionally.


# Dynamically cast all numeric columns to strings (Utf8)
df2 = df.select([
    pl.col(col).cast(pl.Utf8) if df[col].dtype in [pl.Int64, pl.Float64] else pl.col(col)
    for col in df.columns
])
print("DataFrame with Dynamically Casted Columns:\n", df2)

Here,

  • We loop through all the columns in the DataFrame using df.columns.
  • For each column, we check whether its data type is Int64 or Float64 using df[col].dtype. If it is, we cast the column to a string (Utf8); otherwise, we leave it unchanged. Columns with other numeric types, such as Int32 or UInt8, are not matched by this check.
  • pl.col(col) selects each column, and cast(pl.Utf8) is applied conditionally.

Yields the same output as above.

Cast Columns to Unsigned Integers

To cast columns to unsigned integers, you can use the cast() method and specify an unsigned integer type, such as UInt8, UInt16, UInt32, or UInt64. This is particularly useful when you are working with columns that contain non-negative values and want to ensure the data type is unsigned.


# Cast 'Fee' and 'Discount' columns to unsigned integers (UInt32)
df2 = df.with_columns([
    pl.col("Fee").cast(pl.UInt32),        # Cast 'Fee' to UInt32 (unsigned int)
    pl.col("Discount").cast(pl.UInt32)    # Cast 'Discount' to UInt32 (unsigned int)
])
print("DataFrame with Casted Unsigned Integer Columns:\n", df2)

# Output:
# DataFrame with Casted Unsigned Integer Columns:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ u32   ┆ str      ┆ u32      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The pl.col("Fee").cast(pl.UInt32) casts the 'Fee' column to an unsigned 32-bit integer (UInt32), and the same is done for the 'Discount' column.
  • These types ensure the columns can only store non-negative integers (i.e., no negative values allowed).

Cast Boolean Columns to Integers

To cast Boolean columns to integers in Polars, you can use the cast() method and specify the appropriate integer type, such as pl.Int32 or pl.Int64. In this case, True is cast to 1 and False is cast to 0.


import polars as pl

technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas"],
    'Is_Active': [True, False, True, False],
    'Duration': ['30days', '40days', '35days', '50days'],
}

df = pl.DataFrame(technologies)

# Cast 'Is_Active' column to integers (Int32)
df2 = df.with_columns([
    pl.col("Is_Active").cast(pl.Int32)  # Cast Boolean to Integer (Int32)
])
print("DataFrame with Casted Boolean to Integer Column:\n", df2)

# Output:
# DataFrame with Casted Boolean to Integer Column:
# shape: (4, 3)
┌─────────┬───────────┬──────────┐
│ Courses ┆ Is_Active ┆ Duration │
│ ---     ┆ ---       ┆ ---      │
│ str     ┆ i32       ┆ str      │
╞═════════╪═══════════╪══════════╡
│ Spark   ┆ 1         ┆ 30days   │
│ PySpark ┆ 0         ┆ 40days   │
│ Python  ┆ 1         ┆ 35days   │
│ pandas  ┆ 0         ┆ 50days   │
└─────────┴───────────┴──────────┘

Here,

  • True is converted to 1 and False is converted to 0 when casting from Boolean to integer.
  • You can cast to different integer types, such as Int32, Int64, etc., depending on your requirements.

Cast Timestamps to Datetime

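To cast timestamps to datetime, use the cast() method with the pl.Datetime data type. This is commonly used when Unix timestamps stored as integers need to be converted into human-readable datetimes. pl.Datetime accepts a time unit of "ms" (milliseconds), "us" (microseconds), or "ns" (nanoseconds), and the integer values are interpreted as a count of that unit since the Unix epoch, so timestamps in seconds must be scaled to one of these units before casting.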


import polars as pl

# Sample data with Unix timestamps in milliseconds
technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas"],
    'Start_Timestamp': [1672531200000, 1672617600000, 1672704000000, 1672790400000]
}

# Create DataFrame
df = pl.DataFrame(technologies)

# Cast 'Start_Timestamp' to Datetime
df_casted = df.with_columns([
    pl.col("Start_Timestamp").cast(pl.Datetime("ms"))  # Specify "ms" for milliseconds
])
print("DataFrame with Timestamps Cast to Datetime:\n", df_casted)

# Output:
# DataFrame with Timestamps Cast to Datetime:
# shape: (4, 2)
┌─────────┬─────────────────────┐
│ Courses ┆ Start_Timestamp     │
│ ---     ┆ ---                 │
│ str     ┆ datetime[ms]        │
╞═════════╪═════════════════════╡
│ Spark   ┆ 2023-01-01 00:00:00 │
│ PySpark ┆ 2023-01-02 00:00:00 │
│ Python  ┆ 2023-01-03 00:00:00 │
│ pandas  ┆ 2023-01-04 00:00:00 │
└─────────┴─────────────────────┘

Conclusion

In conclusion, casting multiple columns in Polars lets you bring a DataFrame's data types in line with what later processing and analysis expect. Whether you are converting integers to floats or strings, timestamps to datetime, or booleans to integers, the combination of with_columns() or select() with pl.col() and cast() handles these conversions in a single step.

Happy Learning!!
