In Polars, the var() function on a DataFrame calculates the variance of its numeric columns. Variance is a statistical measure of how far the values in a column spread out from the mean: a higher variance means more spread, and a lower variance means the values sit closer to the mean.
In this article, I will explain the Polars DataFrame var() method, covering its syntax, parameters, and practical usage. This function returns a new single-row Polars DataFrame containing the variance of each numeric column. Non-numeric columns such as strings stay in the result with a null value.
Key Points –
- The var() method computes the variance of each numeric column (integers and floats) in a DataFrame.
- By default, it calculates the sample variance using ddof=1 (delta degrees of freedom).
- To compute the population variance, set ddof=0.
- Non-numeric columns such as strings are not dropped from the result; they appear with a null value.
- Null values are ignored in the calculation. Only the non-null values in a column are used.
- Polars is optimized for speed, so var() can handle large datasets efficiently by leveraging parallel processing.
- The result of var() is a DataFrame with a single row, where each column holds the variance of the corresponding column in the original DataFrame.
Polars DataFrame var() Introduction
Following is the syntax of the DataFrame var() function.
# Syntax of var() function
DataFrame.var(ddof: int = 1) → DataFrame
Parameters of the Polars DataFrame var()
It takes only one parameter.
- ddof: stands for Delta Degrees of Freedom. It adjusts the divisor used in the variance calculation:
  - ddof=1 (default): calculates the sample variance (divides by n - 1).
  - ddof=0: calculates the population variance (divides by n).
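To make the effect of ddof concrete, here is a minimal pure-Python sketch of the formula described above. The helper name variance is hypothetical and is not part of Polars. It only illustrates that the sum of squared deviations is divided by n - ddof.

```python
import statistics


def variance(values, ddof=1):
    """Hypothetical helper: sum of squared deviations divided by (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - ddof)


a = [15, 38, 13, 24]

sample_var = variance(a, ddof=1)      # divides by n - 1
population_var = variance(a, ddof=0)  # divides by n

# The standard library offers the same two flavours for comparison.
print(sample_var, statistics.variance(a))
print(population_var, statistics.pvariance(a))
```

The sample variance is always the larger of the two, because it divides the same sum by a smaller number.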
Return Value
This function returns a new single-row Polars DataFrame holding the variance of each numeric column. Non-numeric columns are kept in the result with a null value.
Usage of Polars DataFrame var() Method
The var() function in Polars calculates the variance of a column or expression. Variance indicates how much the values in a dataset deviate from the average (mean).
First, let’s create a Polars DataFrame.
import polars as pl
# Creating a sample DataFrame
data = {
    'A': [15, 38, 13, 24],
    'B': [32, 21, 49, 11],
    'C': [12, 22, 36, 18]
}
df = pl.DataFrame(data)
print("Original DataFrame:\n", df)
This creates a DataFrame with four rows and three integer columns: A, B, and C.
Calling var() with no arguments computes the sample variance (ddof=1) of each numeric column in the DataFrame.
# Calculating the variance of numeric columns
result = df.var()
print("Variance of Numeric Columns:\n", result)
Here,
- The variance is calculated for each of the numeric columns (A, B, and C).
- The var() method returns the variance of each numeric column in a new DataFrame.
- The result is a single row with the variance of each column: 129.666667 for A, 264.916667 for B, and 104.0 for C.
Using ddof=0 (Population Variance)
To calculate the population variance, set the ddof parameter of the var() method to 0. This computes the variance as if the data represents the entire population rather than a sample (which is the default behavior with ddof=1).
# Calculating the population variance (ddof=0)
result = df.var(ddof=0)
print("Population Variance of Numeric Columns:\n", result)
# Output:
# Population Variance of Numeric Columns:
# shape: (1, 3)
# ┌───────┬──────────┬──────┐
# │ A     ┆ B        ┆ C    │
# │ ---   ┆ ---      ┆ ---  │
# │ f64   ┆ f64      ┆ f64  │
# ╞═══════╪══════════╪══════╡
# │ 97.25 ┆ 198.6875 ┆ 78.0 │
# └───────┴──────────┴──────┘
Here,
- The population variance is calculated by setting ddof=0. In this case, the formula divides by n (the total number of elements) instead of n-1 (used in sample variance).
- The result gives you the variance of each numeric column (A, B, C), treating the data as the entire population.
Variance on Floating-Point Numbers
Polars handles floating-point numbers seamlessly with the var() method, just like it does with integers. The method computes the variance using the float values as-is, which is especially useful in real-world data scenarios like measurements, scores, prices, etc.
import polars as pl
# DataFrame with floating-point numbers
data = {
    'X': [2.5, 3.7, 1.8, 4.1],
    'Y': [7.2, 6.8, 8.9, 5.3],
}
df = pl.DataFrame(data)
# Compute sample variance (default ddof=1)
result = df.var()
print("Sample Variance (Floating-Point Data):\n", result)
# Output:
# Sample Variance (Floating-Point Data):
# shape: (1, 2)
# ┌──────────┬──────┐
# │ X        ┆ Y    │
# │ ---      ┆ ---  │
# │ f64      ┆ f64  │
# ╞══════════╪══════╡
# │ 1.129167 ┆ 2.19 │
# └──────────┴──────┘
Here,
- The result columns are of type f64, the same type var() returned for the integer columns in the earlier examples.
- X and Y are both float columns, and the variance is computed with the sample formula (dividing by n-1).
Including a Non-Numeric Column
When you apply the var() method to a DataFrame that includes non-numeric columns, Polars calculates the variance only for the numeric columns. The non-numeric columns remain in the result, with null in place of a variance.
import polars as pl
# Creating a sample DataFrame with a non-numeric column
data = {
    'A': [15, 38, 13, 24],
    'B': [32, 21, 49, 11],
    'C': [12, 22, 36, 18],
    'Category': ['X', 'Y', 'Z', 'W']  # Non-numeric column
}
df = pl.DataFrame(data)
# Calculating the variance of numeric columns
result = df.var()
print("Variance of Numeric Columns:\n", result)
# Output:
# Variance of Numeric Columns:
# shape: (1, 4)
# ┌────────────┬────────────┬───────┬──────────┐
# │ A          ┆ B          ┆ C     ┆ Category │
# │ ---        ┆ ---        ┆ ---   ┆ ---      │
# │ f64        ┆ f64        ┆ f64   ┆ str      │
# ╞════════════╪════════════╪═══════╪══════════╡
# │ 129.666667 ┆ 264.916667 ┆ 104.0 ┆ null     │
# └────────────┴────────────┴───────┴──────────┘
Here,
- The non-numeric column Category (with values like 'X', 'Y', etc.) has no variance, so its entry in the result is null and its type stays str.
- The numeric columns A, B, and C appear in the output with their respective variances.
Variance After Filtering Rows
You can filter rows based on conditions and then compute the variance on the filtered subset of the DataFrame using filter() followed by var().
import polars as pl
# Sample DataFrame
df = pl.DataFrame({
    'City': ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi"],
    'Temperature': [28.5, 30.2, 33.1, 29.8, 31.0],
    'Humidity': [65.0, 70.2, 75.1, 69.8, 68.0]
})
# Filter for only rows where City is "Delhi"
filtered_df = df.filter(pl.col("City") == "Delhi")
# Compute variance on filtered rows
result = filtered_df.select(pl.exclude("City").var())
print("Variance for 'Delhi':\n", result)
# Output:
# Variance for 'Delhi':
# shape: (1, 2)
# ┌─────────────┬──────────┐
# │ Temperature ┆ Humidity │
# │ ---         ┆ ---      │
# │ f64         ┆ f64      │
# ╞═════════════╪══════════╡
# │ 1.63        ┆ 6.813333 │
# └─────────────┴──────────┘
Here,
- We filter the DataFrame to only include rows where City == "Delhi".
- We exclude the non-numeric column City using pl.exclude("City") before calling var().
- Then, var() is applied to the remaining numeric columns.
Variance with Null Values
When a column contains null (missing) values, Polars automatically skips them while computing the variance and uses only the valid numeric values.
import polars as pl
# DataFrame with some null (None) values
data = {
    'A': [10, 20, None, 30],
    'B': [5, None, 15, 25],
    'C': [None, None, None, None]  # All nulls
}
df = pl.DataFrame(data)
# Compute variance
result = df.var()
print("Variance with Null Values:\n", result)
# Output:
# Variance with Null Values:
# shape: (1, 3)
# ┌───────┬───────┬──────┐
# │ A     ┆ B     ┆ C    │
# │ ---   ┆ ---   ┆ ---  │
# │ f64   ┆ f64   ┆ null │
# ╞═══════╪═══════╪══════╡
# │ 100.0 ┆ 100.0 ┆ null │
# └───────┴───────┴──────┘
Here,
- Columns A and B contain some nulls, so Polars computes the variance using the non-null values only.
- Column C is all null, so Polars returns null for its variance because there is no valid data to compute from.
Conclusion
In conclusion, the var() method in Polars is a convenient way to compute the variance of the numeric columns in a DataFrame. By default, it calculates the sample variance (ddof=1), and setting ddof=0 calculates the population variance instead. Null values are skipped, and non-numeric columns come back as null.
Happy Learning!!