In Polars, a schema describes the structure of a DataFrame: the names of its columns and the data type of each one. Knowing the schema lets you check and enforce column types, which matters for efficient operations and correct computations. You can access a DataFrame’s schema through the .schema attribute, which returns a dictionary-like Schema object mapping each column name to its data type. In this article, I will explain the schema property of a Polars DataFrame.
Key Points –
- The schema property returns the structure of a Polars DataFrame: the column names and their respective data types.
- schema is a property, not a method, so it is accessed without parentheses (i.e., df.schema).
- The schema is returned as a dictionary-like object, where column names are the keys and data types are the values.
- Common data types include String (called Utf8 in older Polars versions), Int64 (integer), Float64 (float), Boolean, Date, Datetime, and List, among others.
- The schema can describe nested types, such as a column containing lists of integers.
- The schema helps identify and enforce the correct data types, which is essential when performing transformations or data validation.
- The schema does not record nullability. Any Polars column may contain missing (null) values, regardless of its data type.
- After operations like cast(), with_columns(), or select(), reading schema on the resulting DataFrame shows the updated column data types.
Polars DataFrame schema Introduction
Let’s look at the syntax of the Polars DataFrame schema property.
# Syntax of schema
property DataFrame.schema: Schema
Usage of Polars DataFrame schema
The Polars DataFrame.schema property retrieves the schema of a DataFrame, that is, the column names and their corresponding data types. This lets you examine the DataFrame’s structure without printing the data itself.
First, let’s create a Polars DataFrame.
import polars as pl
# Creating a new Polars DataFrame
technologies = {
'Courses': ["Spark", "Hadoop", "Python", "Pandas"],
'Fees': [22000, 25000, 20000, 26000]}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
This creates a basic Polars DataFrame with a string column and an integer column from a dictionary that has the column names as keys and the column data as values. Next, read its schema.
# Displaying the schema of the DataFrame
df2 = df.schema
print("DataFrame Schema:\n", df2)
Here,
- The Courses column has string data (String type, called Utf8 in older Polars versions), and the Fees column has integer data (Int64 type).
- The df.schema gives the schema of the DataFrame, showing the type of each column.
Schema with Float and Boolean Columns
To create a basic Polars DataFrame with columns of float and boolean types, you can define the appropriate data types for each column.
import polars as pl
# Creating a new Polars DataFrame with Float and Boolean columns
df = pl.DataFrame({
'Price': [19.99, 25.50, 15.75, 30.00],
'Available': [True, False, True, True]
})
df2 = df.schema
print("\nDataFrame Schema:\n", df2)
# Output:
# DataFrame Schema:
# Schema({'Price': Float64, 'Available': Boolean})
Here,
- The Price column has float values (Float64 type).
- The Available column has boolean values (Boolean type).
- The df.schema property shows the schema of the DataFrame, which indicates the data type of each column.
Schema with Date and Time Columns
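Polars has dedicated Date and Datetime types for date and time data. The example below has a StartDate column built from date strings and a StartTime column built from Python datetime objects. Note that Polars does not parse strings automatically, so StartDate is inferred as String rather than Date.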
import polars as pl
from datetime import datetime
# Creating a new Polars DataFrame with Date and Time columns
df = pl.DataFrame({
'StartDate': ["2025-02-01", "2025-02-02", "2025-02-03", "2025-02-04"],
'StartTime': [datetime(2025, 2, 1, 14, 30, 0), datetime(2025, 2, 2, 15, 45, 0),datetime(2025, 2, 3, 16, 0, 0), datetime(2025, 2, 4, 9, 15, 0)]
})
print("Original DataFrame:\n", df)
# Displaying the schema of the DataFrame
print("DataFrame Schema:\n", df.schema)
# Output:
# Original DataFrame:
# shape: (4, 2)
# ┌────────────┬─────────────────────┐
# │ StartDate  ┆ StartTime           │
# │ ---        ┆ ---                 │
# │ str        ┆ datetime[μs]        │
# ╞════════════╪═════════════════════╡
# │ 2025-02-01 ┆ 2025-02-01 14:30:00 │
# │ 2025-02-02 ┆ 2025-02-02 15:45:00 │
# │ 2025-02-03 ┆ 2025-02-03 16:00:00 │
# │ 2025-02-04 ┆ 2025-02-04 09:15:00 │
# └────────────┴─────────────────────┘
# DataFrame Schema:
# Schema({'StartDate': String, 'StartTime': Datetime(time_unit='us', time_zone=None)})
Here,
- The StartDate column contains dates written as “YYYY-MM-DD” strings. Polars does not parse strings on construction, so its type is String, not Date.
- The StartTime column is built from Python datetime objects, which include both date and time. Its type is Datetime with microsecond precision (datetime[μs]) and no time zone.
- The df.schema shows the data types of the two columns as String and Datetime.
Schema with Multiple Integer Columns
To create a Polars DataFrame with multiple integer columns, you can define a dictionary with columns that have integer values.
import polars as pl
# Creating a new Polars DataFrame with multiple Integer columns
df = pl.DataFrame({
'ProductID': [101, 102, 103, 104],
'Quantity': [50, 60, 40, 30],
'Price': [200, 250, 180, 300]
})
# Displaying the schema of the DataFrame
df2 = df.schema
print("DataFrame Schema:\n", df2)
# Output:
# DataFrame Schema:
# Schema({'ProductID': Int64, 'Quantity': Int64, 'Price': Int64})
Here,
- The DataFrame df has three columns: ProductID, Quantity, and Price, all of which contain integer data (Int64 type).
- The df.schema property outputs the schema of the DataFrame, showing that all three columns are of type Int64.
Schema with Mixed Types
To create a Polars DataFrame with mixed data types in different columns, you can define columns with different types, such as strings, integers, floats, and booleans.
import polars as pl
# Creating a new Polars DataFrame with mixed data types
df = pl.DataFrame({
'Name': ["Duckett", "smith", "Charlie", "David"],
'Age': [28, 31, 36, 43],
'Height': [5.6, 6.0, 5.8, 6.1],
'IsActive': [True, False, True, False]
})
print("DataFrame:\n", df)
# Displaying the schema of the DataFrame
print("\nDataFrame Schema:\n", df.schema)
# Output:
# DataFrame:
# shape: (4, 4)
# ┌─────────┬─────┬────────┬──────────┐
# │ Name    ┆ Age ┆ Height ┆ IsActive │
# │ ---     ┆ --- ┆ ---    ┆ ---      │
# │ str     ┆ i64 ┆ f64    ┆ bool     │
# ╞═════════╪═════╪════════╪══════════╡
# │ Duckett ┆ 28  ┆ 5.6    ┆ true     │
# │ smith   ┆ 31  ┆ 6.0    ┆ false    │
# │ Charlie ┆ 36  ┆ 5.8    ┆ true     │
# │ David   ┆ 43  ┆ 6.1    ┆ false    │
# └─────────┴─────┴────────┴──────────┘
# DataFrame Schema:
# Schema({'Name': String, 'Age': Int64, 'Height': Float64, 'IsActive': Boolean})
Here,
- The Name column contains string data (String type).
- The Age column contains integer data (Int64 type).
- The Height column contains float data (Float64 type).
- The IsActive column contains boolean data (Boolean type).
- The df.schema shows the data types for each column: String, Int64, Float64, and Boolean.
Schema with List Column
To create a Polars DataFrame with a list column, you can define a column where each entry is a list. Polars supports list columns, allowing you to store multiple values in a single column entry.
import polars as pl
# Creating a new Polars DataFrame with a List column
df = pl.DataFrame({
'ID': [1, 2, 3],
'Courses': [["Spark", "Hadoop"], ["Polars", "Python", "Pandas"], ["Pyspark", "C++"]],
'Fees': [22000, 25000, 20000]
})
print("DataFrame:\n", df)
# Displaying the schema of the DataFrame
print("DataFrame Schema:\n", df.schema)
# Output:
# DataFrame:
# shape: (3, 3)
# ┌─────┬────────────────────────────────┬───────┐
# │ ID  ┆ Courses                        ┆ Fees  │
# │ --- ┆ ---                            ┆ ---   │
# │ i64 ┆ list[str]                      ┆ i64   │
# ╞═════╪════════════════════════════════╪═══════╡
# │ 1   ┆ ["Spark", "Hadoop"]            ┆ 22000 │
# │ 2   ┆ ["Polars", "Python", "Pandas"] ┆ 25000 │
# │ 3   ┆ ["Pyspark", "C++"]             ┆ 20000 │
# └─────┴────────────────────────────────┴───────┘
# DataFrame Schema:
# Schema({'ID': Int64, 'Courses': List(String), 'Fees': Int64})
Here,
- The Courses column contains list data, with each entry being a list of strings. The type for this column is List(String), which means the list contains strings.
- The ID and Fees columns contain integer data (Int64 type).
- The df.schema shows the data types for each column: Int64, List(String), and Int64.
Conclusion
In summary, the schema property in Polars is an essential feature for examining a DataFrame’s structure. It returns a dictionary-like Schema object that maps column names to their data types, which makes it easy to check types before and after transformations.
Happy Learning!!