Polars DataFrame median() Usage & Examples

  • Post category: Polars
  • Post last modified: February 25, 2025

In Polars, the median() function computes the median (the middle value of the sorted data) of each column in a DataFrame. Null values are skipped, and non-numeric columns such as strings produce null in the result rather than a median. Because the median depends on the middle of the sorted data rather than on every value, it is a useful measure of central tendency, especially when the data contains outliers.

In this article, I will explain the Polars DataFrame median() function, covering its syntax, parameters, and usage, and show how to generate a new DataFrame with a single row of median values.

Key Points –

  • The median() function calculates the median for numerical columns in a DataFrame.
  • Numeric columns get a median value, while non-numeric columns such as strings yield null in the result.
  • Calling df.median() requires no arguments and processes every column of the DataFrame.
  • The result of df.median() is a new DataFrame with a single row containing median values.
  • median() automatically skips null (None) values when computing medians. NaN is a regular floating-point value in Polars, not a missing value, so it is not skipped in the same way.
  • For integer input columns, the output is floating-point (f64).

Polars DataFrame median() Introduction

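Let’s look at the syntax of the Polars DataFrame.median() method and the related polars.median() expression function.


# Syntax of DataFrame.median()
DataFrame.median() → DataFrame

# Syntax of polars.median()
polars.median(*columns: str) → Expr

Parameters of the Polars DataFrame.median()

DataFrame.median() takes no parameters. The related polars.median() function accepts one parameter.

  • *columns – One or more column names whose median should be computed. The resulting expression is used inside select() or a similar context.

Return Value

DataFrame.median() returns a new DataFrame with a single row containing the median of each column. Non-numeric columns such as strings remain in the result with a null value. polars.median() returns an expression (Expr) that produces the median when evaluated.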

Usage of Polars DataFrame median() Function

The median() function calculates the middle value of each numeric column in a Polars DataFrame. If a column has an odd number of values, it returns the middle value; if even, it returns the average of the two middle values.

Now, let’s create a Polars DataFrame and learn how to use the median with an example.


import polars as pl

df = pl.DataFrame({
    'A': [12, 24, 46, 18, 30],
    'B': [15, 42, 26, 55, 89],
    'C': [67, 16, 71, 29, 53]
})
print("Original DataFrame:\n", df)

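Yields below output.

# Output:
# Original DataFrame:
# shape: (5, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 12  ┆ 15  ┆ 67  │
│ 24  ┆ 42  ┆ 16  │
│ 46  ┆ 26  ┆ 71  │
│ 18  ┆ 55  ┆ 29  │
│ 30  ┆ 89  ┆ 53  │
└─────┴─────┴─────┘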

You can calculate the median of every column in a Polars DataFrame using the median() method. In this example all three columns are numeric, so each one gets a median (middle) value.


# Compute median for all numeric columns
df2 = df.median()
print("Median of each numeric column:\n", df2)

Here,

  • The median() function calculates the median for each column:
    • A – 24.0 (sorted: [12, 18, 24, 30, 46], median = 24)
    • B – 42.0 (sorted: [15, 26, 42, 55, 89], median = 42)
    • C – 53.0 (sorted: [16, 29, 53, 67, 71], median = 53)
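  • The output is a new DataFrame with the median values in floating-point (f64) format.

# Output:
# Median of each numeric column:
# shape: (1, 3)
┌──────┬──────┬──────┐
│ A    ┆ B    ┆ C    │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╡
│ 24.0 ┆ 42.0 ┆ 53.0 │
└──────┴──────┴──────┘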

Computing Median for Specific Columns Using pl.median()

You can compute the median for a specific column in a Polars DataFrame by passing the column name to pl.median() inside the select() method.


# Compute median for column "A"
df2 = df.select(pl.median("A"))
print("Median of column A:\n", df2)

# Output:
# Median of column A:
# shape: (1, 1)
┌──────┐
│ A    │
│ ---  │
│ f64  │
╞══════╡
│ 24.0 │
└──────┘

Here,

  • We create a Polars DataFrame with three columns (A, B, C).
  • We use df.select(pl.median("A")) to compute the median of column "A".
  • The output is a new DataFrame with a single row containing the median value.

Computing Median for Multiple Columns

You can compute the median for multiple columns in a Polars DataFrame using pl.col().median() inside the select() method, or by passing several column names to pl.median().


# Compute median for multiple columns
df2 = df.select(pl.col(["A", "B"]).median())
print("Median of columns A and B:\n", df2)

# Compute median for columns "A" and "B"
df2 = df.select(pl.median("A", "B"))
print("Median of columns A and B:\n", df2)

# Output:
# Median of columns A and B:
# shape: (1, 2)
┌──────┬──────┐
│ A    ┆ B    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 24.0 ┆ 42.0 │
└──────┴──────┘

Here,

  • pl.col(["A", "B"]).median() selects multiple columns and calculates their median values.
  • select() returns a new DataFrame containing only the median values.
  • The median is computed independently for each column.

Compute Median for a Filtered DataFrame

You can compute the median after filtering specific rows based on a condition using filter() before applying pl.median(). This is useful when you want to compute the median for a subset of your data.


# Compute median for the filtered DataFrame
filtered_df = df.filter(pl.col("A") > 30)
df2 = filtered_df.select(pl.all().median())
print("Median values for filtered DataFrame:\n", df2)

# Filter rows where column "A" is greater than 30 and compute median
df2 = df.filter(df["A"] > 30).select(pl.median("A", "B", "C"))
print("Median values for filtered DataFrame:\n", df2)

# Output:
# Median values for filtered DataFrame:
# shape: (1, 3)
┌──────┬──────┬──────┐
│ A    ┆ B    ┆ C    │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╡
│ 46.0 ┆ 26.0 ┆ 71.0 │
└──────┴──────┴──────┘

Here,

  • filter(pl.col("A") > 30) filters rows where column A is greater than 30.
  • select(pl.all().median()) computes the median for all remaining columns after filtering.

Handling Missing Values (None or NaN)

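Polars’ median() function automatically ignores null (None) values. NaN is different: Polars treats it as a regular floating-point value rather than as missing data, so convert NaN to null first (for example with fill_nan(None)) if it should be skipped. You can also handle missing values explicitly in other ways.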


import polars as pl

df = pl.DataFrame({
    'A': [12, None, 46, 18, 30],
    'B': [None, 42, 26, 55, 89],
    'C': [67, 16, None, 29, 53]
})

# Handling missing values
df2 = df.median()
print("Median ignoring missing values:\n", df2)

# Output:
# Median ignoring missing values:
# shape: (1, 3)
┌──────┬──────┬──────┐
│ A    ┆ B    ┆ C    │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╡
│ 24.0 ┆ 48.5 ┆ 41.0 │
└──────┴──────┴──────┘

Here,

  • None values are automatically ignored when computing the median.
  • The median is calculated only from existing (non-null) values in each column.

You can also fill missing values before computing the median using the fill_null() method. Note that the fill value takes part in the calculation, so the result can differ from the null-skipping median shown above.


# Fill missing values before computing median
df2 = df.fill_null(0).median()
print("Median after filling missing values with 0:\n", df2)

# Output:
# Median after filling missing values with 0:
# shape: (1, 3)
┌──────┬──────┬──────┐
│ A    ┆ B    ┆ C    │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╡
│ 18.0 ┆ 42.0 ┆ 29.0 │
└──────┴──────┴──────┘

Here,

  • fill_null(0) replaces all None values with 0 before computing the median.
  • median() then computes the median with the filled zeros included, which lowers the result for each column here (for example, A drops from 24.0 to 18.0).

Compute Median with Negative Numbers

When computing the median in Polars with negative numbers, the process remains the same as with positive numbers. The median is simply the middle value when the numbers are sorted.


import polars as pl

df = pl.DataFrame({
    'A': [-12, -24, 46, 18, 30],
    'B': [15, -42, 26, -55, 89],
    'C': [67, 16, -71, 29, -53]
})

df2 = df.median()
print("Median with negative numbers:\n", df2)

# Output:
# Median with negative numbers:
# shape: (1, 3)
┌──────┬──────┬──────┐
│ A    ┆ B    ┆ C    │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╡
│ 18.0 ┆ 15.0 ┆ 16.0 │
└──────┴──────┴──────┘

Conclusion

In summary, the median() function in Polars computes the middle value of the columns in a DataFrame. It works the same way for positive and negative numbers and skips null (None) values automatically. NaN values are not treated as missing and need to be converted to null first if they should be ignored.

Happy Learning!!
