Polars DataFrame schema() Usage & Examples

Post last modified: February 14, 2025

In Polars, a schema refers to the structure of a DataFrame, which defines the names and types of columns it contains. It’s a way to understand and enforce the data types for each column, which is important for efficient operations and computations. You can access a DataFrame’s schema using the .schema attribute, which returns a dictionary-like object mapping each column name to its corresponding data type. In this article, I will explain the schema property of a Polars DataFrame.

Key Points –

  • The schema property returns the structure of a Polars DataFrame: its column names and their respective data types.
  • schema is a property, not a method, so it is accessed without parentheses (df.schema, not df.schema()).
  • The schema is returned as a dictionary-like object, where column names are the keys and data types are the values.
  • The data types returned include common types such as String (named Utf8 in older Polars versions), Int64 (integer), Float64 (float), Boolean, Date, Datetime, and List, among others.
  • The schema can describe more complex data structures like lists or nested types, such as columns containing lists of integers.
  • The schema helps identify and enforce the correct data types, which is essential when performing transformations or data validation.
  • The schema does not record nullability. Every Polars column can hold missing (null) values, so a column with nulls shows the same data type as one without; use null_count() to check for missing values.
  • After operations like cast(), with_columns(), or select(), the schema property of the resulting DataFrame shows the updated column data types.

Polars DataFrame schema() Introduction

Let’s look at the syntax of the Polars DataFrame schema property.


# Syntax of schema()
property DataFrame.schema: Schema

Usage of Polars DataFrame schema()

The Polars DataFrame.schema property retrieves the schema of a DataFrame: the column names and their corresponding data types. This lets you examine the DataFrame’s structure in a compact, readable form.

First, let’s create a Polars DataFrame.


import polars as pl

# Creating a new Polars DataFrame
technologies = {
    'Courses': ["Spark", "Hadoop", "Python", "Pandas"],
    'Fees': [22000, 25000, 20000, 26000]}

df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

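Yields the below output.

# Output:
# Original DataFrame:
# shape: (4, 2)
┌─────────┬───────┐
│ Courses ┆ Fees  │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ Spark   ┆ 22000 │
│ Hadoop  ┆ 25000 │
│ Python  ┆ 20000 │
│ Pandas  ┆ 26000 │
└─────────┴───────┘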

The DataFrame above is built from a dictionary whose keys are the column names and whose values are the column data. To inspect its structure, access the schema property.


# Displaying the schema of the DataFrame
df2 = df.schema
print("DataFrame Schema:\n", df2)

Here,

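  • The Courses column has string data (String type, named Utf8 in older Polars versions), and the Fees column has integer data (Int64 type).
  • The df.schema property gives the schema of the DataFrame, showing the type of each column.

# Output:
# DataFrame Schema:
# Schema({'Courses': String, 'Fees': Int64})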

Schema with Float and Boolean Columns

To create a basic Polars DataFrame with columns of float and boolean types, you can define the appropriate data types for each column.


import polars as pl

# Creating a new Polars DataFrame with Float and Boolean columns
df = pl.DataFrame({
    'Price': [19.99, 25.50, 15.75, 30.00],
    'Available': [True, False, True, True]
})

df2 = df.schema
print("\nDataFrame Schema:\n", df2)

# Output:
# DataFrame Schema:
# Schema({'Price': Float64, 'Available': Boolean})

Here,

  • The Price column has float values (Float64 type).
  • The Available column has boolean values (Boolean type).
  • The df.schema property shows the schema of the DataFrame, which indicates the data types for each column.

Schema with Date and Time Columns

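To create a Polars DataFrame with date and time columns, you can use the Date and Datetime types. In the example below, StartTime is built from Python datetime objects, so Polars infers the Datetime type. StartDate is given as “YYYY-MM-DD” strings, so Polars infers String rather than Date; a string column must be parsed explicitly (for example with str.to_date()) to become a Date column.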


import polars as pl
from datetime import datetime

# Creating a new Polars DataFrame with Date and Time columns
df = pl.DataFrame({
    'StartDate': ["2025-02-01", "2025-02-02", "2025-02-03", "2025-02-04"],
    'StartTime': [
        datetime(2025, 2, 1, 14, 30, 0),
        datetime(2025, 2, 2, 15, 45, 0),
        datetime(2025, 2, 3, 16, 0, 0),
        datetime(2025, 2, 4, 9, 15, 0),
    ]
})
print("Original DataFrame:\n", df)

# Displaying the schema of the DataFrame
print("DataFrame Schema:\n", df.schema)

# Output:
# Original DataFrame:
# shape: (4, 2)
┌────────────┬─────────────────────┐
│ StartDate  ┆ StartTime           │
│ ---        ┆ ---                 │
│ str        ┆ datetime[μs]        │
╞════════════╪═════════════════════╡
│ 2025-02-01 ┆ 2025-02-01 14:30:00 │
│ 2025-02-02 ┆ 2025-02-02 15:45:00 │
│ 2025-02-03 ┆ 2025-02-03 16:00:00 │
│ 2025-02-04 ┆ 2025-02-04 09:15:00 │
└────────────┴─────────────────────┘
# DataFrame Schema:
# Schema({'StartDate': String, 'StartTime': Datetime(time_unit='us', time_zone=None)})

Here,

  • The StartDate column contains dates written as “YYYY-MM-DD” strings, so its type is String, not Date.
  • The StartTime column contains Datetime data, which includes both date and time, and its type is Datetime with microsecond precision (time_unit='us', shown as datetime[μs] in the table).
  • The df.schema shows the data types of each column as String and Datetime.

Schema with Multiple Integer Columns

To create a Polars DataFrame with multiple integer columns, you can define a dictionary with columns that have integer values.


import polars as pl

# Creating a new Polars DataFrame with multiple Integer columns
df = pl.DataFrame({
    'ProductID': [101, 102, 103, 104],
    'Quantity': [50, 60, 40, 30],
    'Price': [200, 250, 180, 300]
})

# Displaying the schema of the DataFrame
df2 = df.schema
print("DataFrame Schema:\n", df2)

# Output:
# DataFrame Schema:
# Schema({'ProductID': Int64, 'Quantity': Int64, 'Price': Int64})

Here,

  • The DataFrame df has three columns: ProductID, Quantity, and Price, all of which contain integer data (Int64 type).
  • The df.schema property outputs the schema of the DataFrame, showing that all three columns are of type Int64.

Schema with Mixed Types

To create a Polars DataFrame with mixed data types in different columns, you can define columns with different types, such as strings, integers, floats, and booleans.


import polars as pl

# Creating a new Polars DataFrame with mixed data types
df = pl.DataFrame({
    'Name': ["Duckett", "smith", "Charlie", "David"],
    'Age': [28, 31, 36, 43],
    'Height': [5.6, 6.0, 5.8, 6.1],
    'IsActive': [True, False, True, False]
})
print("DataFrame:\n", df)

# Displaying the schema of the DataFrame
print("\nDataFrame Schema:\n", df.schema)

# Output:
# DataFrame:
# shape: (4, 4)
┌─────────┬─────┬────────┬──────────┐
│ Name    ┆ Age ┆ Height ┆ IsActive │
│ ---     ┆ --- ┆ ---    ┆ ---      │
│ str     ┆ i64 ┆ f64    ┆ bool     │
╞═════════╪═════╪════════╪══════════╡
│ Duckett ┆ 28  ┆ 5.6    ┆ true     │
│ smith   ┆ 31  ┆ 6.0    ┆ false    │
│ Charlie ┆ 36  ┆ 5.8    ┆ true     │
│ David   ┆ 43  ┆ 6.1    ┆ false    │
└─────────┴─────┴────────┴──────────┘

# DataFrame Schema:
# Schema({'Name': String, 'Age': Int64, 'Height': Float64, 'IsActive': Boolean})

Here,

  • The Name column contains string data (String type, named Utf8 in older Polars versions).
  • The Age column contains integer data (Int64 type).
  • The Height column contains float data (Float64 type).
  • The IsActive column contains boolean data (Boolean type).
  • The df.schema shows the data types for each column: String, Int64, Float64, and Boolean.

Schema with List Column

To create a Polars DataFrame with a list column, you can define a column where each entry is a list. Polars supports list columns, allowing you to store multiple values in a single column entry.


import polars as pl

# Creating a new Polars DataFrame with a List column
df = pl.DataFrame({
    'ID': [1, 2, 3],
    'Courses': [["Spark", "Hadoop"], ["Polars", "Python", "Pandas"], ["Pyspark", "C++"]],
    'Fees': [22000, 25000, 20000]
})
print("DataFrame:\n", df)

# Displaying the schema of the DataFrame
print("DataFrame Schema:\n", df.schema)

# Output:
# DataFrame:
# shape: (3, 3)
┌─────┬────────────────────────────────┬───────┐
│ ID  ┆ Courses                        ┆ Fees  │
│ --- ┆ ---                            ┆ ---   │
│ i64 ┆ list[str]                      ┆ i64   │
╞═════╪════════════════════════════════╪═══════╡
│ 1   ┆ ["Spark", "Hadoop"]            ┆ 22000 │
│ 2   ┆ ["Polars", "Python", "Pandas"] ┆ 25000 │
│ 3   ┆ ["Pyspark", "C++"]             ┆ 20000 │
└─────┴────────────────────────────────┴───────┘
# DataFrame Schema:
# Schema({'ID': Int64, 'Courses': List(String), 'Fees': Int64})

Here,

  • The Courses column contains list data, with each entry being a list of strings. The type for this column is List(String) which means the list contains strings.
  • The ID and Fees columns contain integer data (Int64 type).
  • The df.schema shows the data types for each column: Int64, List(String), and Int64.

Conclusion

In summary, the schema property in Polars is an essential feature for examining a DataFrame’s structure. It returns a dictionary-like object that maps column names to their corresponding data types, which makes it easy to confirm that each column has the type you expect before processing the data.

Happy Learning!!
