In Polars, you can select columns by their data type using the select() method along with the pl.col() function. This is useful when you want to operate on specific types of data, such as selecting only numeric columns or excluding string columns. Additionally, pl.selectors.by_dtype() lets you pick columns by data type without listing their names. In this article, I will explain how to select columns by data type in Polars.
Key Points –
- Use pl.selectors.by_dtype() to select columns based on specific data types dynamically.
- pl.Utf8 represents string columns in Polars (roughly the counterpart of object columns in pandas). Recent Polars versions also call it pl.String, and pl.Utf8 remains an alias.
- Use pl.col(dtype) inside select() to filter columns of a specific data type.
- Common data types in Polars include pl.Int64, pl.Float64, pl.Utf8 (string), and pl.Boolean.
- Mixed numeric types (Int64, Float64) can be selected together by passing a list of types.
- cs.numeric() selects all numeric columns, including both integers (Int64) and floats (Float64). Here cs is the conventional alias from import polars.selectors as cs.
- cs.by_dtype([pl.Int64, pl.Float64]) allows selecting multiple data types at once.
- ~cs.numeric() can be used to exclude all numeric columns. Older code writes this as ~cs.by_dtype(pl.NUMERIC_DTYPES), but that constant is deprecated in recent Polars releases.
- Combining these with select() or with_columns() enables transformations after selection.
Usage of Select Columns by Data Type
Selecting columns by data type in Polars is useful when working with large datasets containing mixed data types. Instead of manually specifying column names, you can select columns dynamically based on their data type by passing a data type to the pl.col() function inside the select() method.
To run some examples of how to select columns by data type in Polars, let’s create a Polars DataFrame.
import polars as pl
import polars.selectors as cs  # alias used by the selector examples below
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas"],
'Fees' :[20000,25000,26000,22000,24000],
'Discount':[1000.0,2300.0,1200.0,2500.0,2000.0]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
This creates a DataFrame with three columns: Courses (str), Fees (i64), and Discount (f64).
To select only integer (Int64) columns from the Polars DataFrame, use the select() method with pl.col(pl.Int64). This keeps only the columns whose data type is exactly Int64.
# Selecting only integer (Int64) columns
df2 = df.select(pl.col(pl.Int64))
print(df2)
Here,
- pl.col(pl.Int64) selects only the columns of type Int64.
- The resulting DataFrame contains only the "Fees" column, since "Discount" is of type f64 and "Courses" is str.
Select All Float Columns
Alternatively, to select all float (f64) columns from the Polars DataFrame, you can use the select() method with pl.col(pl.Float64).
# Selecting only float (Float64) columns
df2 = df.select(pl.col(pl.Float64))
print(df2)
# Output:
# shape: (5, 1)
┌──────────┐
│ Discount │
│ ---      │
│ f64      │
╞══════════╡
│ 1000.0   │
│ 2300.0   │
│ 1200.0   │
│ 2500.0   │
│ 2000.0   │
└──────────┘
Here,
- The pl.col(pl.Float64) selects only columns of float (Float64) type.
- In this case, only the "Discount" column is of Float64 type, so it is returned.
Select All String (Utf8) Columns
To select all string (Utf8) columns in a Polars DataFrame, use the select() method along with pl.col(pl.Utf8).
# Selecting only string (Utf8) columns
df2 = df.select(pl.col(pl.Utf8))
print(df2)
# Output:
# shape: (5, 1)
┌─────────┐
│ Courses │
│ ---     │
│ str     │
╞═════════╡
│ Spark   │
│ PySpark │
│ Hadoop  │
│ Python  │
│ pandas  │
└─────────┘
Here,
- pl.col(pl.Utf8) selects only columns with Utf8 (string) type.
- The output contains only the "Courses" column, as it is the only string column in the dataset.
Select Multiple Data Types Together (String + Integer)
Similarly, to select multiple data types together (e.g., String (Utf8) + Integer (Int64)) in a Polars DataFrame, use the select() method with multiple data types inside pl.col().
# Selecting String (Utf8) and Integer (Int64) columns
df2 = df.select(pl.col(pl.Utf8, pl.Int64))
print(df2)
# Equivalent selection using pl.String, the newer name for pl.Utf8
df2 = df.select(pl.col(pl.String, pl.Int64))
print(df2)
# Output:
# shape: (5, 2)
┌─────────┬───────┐
│ Courses ┆ Fees  │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ Spark   ┆ 20000 │
│ PySpark ┆ 25000 │
│ Hadoop  ┆ 26000 │
│ Python  ┆ 22000 │
│ pandas  ┆ 24000 │
└─────────┴───────┘
Here,
- pl.col(pl.Utf8, pl.Int64) selects both string (Utf8) and integer (Int64) columns.
- The "Courses" column (String) and the "Fees" column (Integer) are returned.
Select Multiple Data Types Using pl.selectors.by_dtype()
The pl.selectors.by_dtype() function in Polars allows you to select columns based on their data types efficiently. You can pass one or more data types to filter specific columns.
# Select String (Utf8) and Integer (Int64) columns
df2 = df.select(pl.selectors.by_dtype([pl.Utf8, pl.Int64]))
print(df2)
# Same selection using the conventional cs alias
import polars.selectors as cs
df2 = df.select(cs.by_dtype([pl.Utf8, pl.Int64]))
print(df2)
# Output:
# shape: (5, 2)
┌─────────┬───────┐
│ Courses ┆ Fees  │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ Spark   ┆ 20000 │
│ PySpark ┆ 25000 │
│ Hadoop  ┆ 26000 │
│ Python  ┆ 22000 │
│ pandas  ┆ 24000 │
└─────────┴───────┘
Here,
- cs.by_dtype([pl.Utf8, pl.Int64]) selects String (Utf8) and Integer (Int64) columns.
- The "Courses" column (String) and "Fees" column (Integer) match the criteria.
- The "Discount" column (Float64) is excluded.
Select All Non-Numeric Columns
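To select all non-numeric columns in Polars, apply the negation (~) operator to a column selector (cs). Older tutorials pass pl.NUMERIC_DTYPES to cs.by_dtype() for this; that constant is deprecated in recent Polars releases, so cs.numeric() is the safer choice.
# Select only non-numeric columns
df2 = df.select(~cs.numeric())
print(df2)
# Older form, kept for reference (pl.NUMERIC_DTYPES is deprecated in recent releases)
# df2 = df.select(~cs.by_dtype(pl.NUMERIC_DTYPES))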
# Output:
# shape: (5, 1)
┌─────────┐
│ Courses │
│ ---     │
│ str     │
╞═════════╡
│ Spark   │
│ PySpark │
│ Hadoop  │
│ Python  │
│ pandas  │
└─────────┘
Here,
- cs.numeric() selects all numeric columns (Int64, Float64, etc.); cs.by_dtype(pl.NUMERIC_DTYPES) is the older spelling of the same selection.
- ~ (tilde) negates the selection, excluding numeric columns.
- As a result, only non-numeric columns (like "Courses", which is of type Utf8) remain.
Convert All Numeric Types to Float32
Finally, to convert all numeric columns (both Int64 and Float64) to Float32 in Polars, use the cs.numeric() selector with cast(pl.Float32) inside with_columns().
# Convert all numeric columns (Int64 and Float64) to Float32
df2 = df.with_columns(cs.numeric().cast(pl.Float32))
print(df2)
# Output:
# shape: (5, 3)
┌─────────┬─────────┬──────────┐
│ Courses ┆ Fees    ┆ Discount │
│ ---     ┆ ---     ┆ ---      │
│ str     ┆ f32     ┆ f32      │
╞═════════╪═════════╪══════════╡
│ Spark   ┆ 20000.0 ┆ 1000.0   │
│ PySpark ┆ 25000.0 ┆ 2300.0   │
│ Hadoop  ┆ 26000.0 ┆ 1200.0   │
│ Python  ┆ 22000.0 ┆ 2500.0   │
│ pandas  ┆ 24000.0 ┆ 2000.0   │
└─────────┴─────────┴──────────┘
Here,
- cs.numeric() selects all numeric columns (Int64 and Float64).
- cast(pl.Float32) converts all selected columns to Float32.
- String columns (Utf8) remain unchanged.
Conclusion
In summary, selecting columns by data type in Polars provides a powerful and efficient way to dynamically manipulate DataFrames. With select(), pl.selectors.by_dtype(), pl.col(), and cs.numeric(), you can filter and operate on specific column types without manually specifying column names.
Happy Learning!!