In Polars, the describe() function computes summary statistics for the columns of a DataFrame, similar to pandas.DataFrame.describe(). For numerical columns it gives a quick overview of key metrics such as count, null count, mean, standard deviation, minimum, maximum, and percentiles (25%, 50%, 75%). For string columns it reports the count, null count, minimum, and maximum; statistics that do not apply, such as the mean, are shown as null.
In this article, I will explore the Polars DataFrame describe() function, covering its syntax, parameters, usage, and how to retrieve summary statistics exclusively for numerical columns.
Key Points –
- Provides a statistical summary of numerical and categorical columns in a DataFrame.
- Computes count, mean, standard deviation, min, max, and percentiles (25%, 50%, 75%).
- Allows specifying custom percentiles using a sequence of float values.
- Provides multiple interpolation methods (nearest, lower, higher, midpoint, linear) for percentiles.
- Designed for performance; on large datasets it is often faster than the pandas equivalent, though the difference depends on the data and workload.
- Does not modify the original DataFrame, returning a new summary table instead.
- Handles missing values, showing null_count for each column.
- Returns a DataFrame containing summary statistics, making it easy to use in further analysis.
Syntax of Polars DataFrame describe() Function
Following is the syntax of the Polars DataFrame describe() function.
# Syntax of describe()
DataFrame.describe(
    percentiles: Sequence[float] | float | None = (0.25, 0.5, 0.75),
    *,
    interpolation: RollingInterpolationMethod = 'nearest',
) -> DataFrame
Parameters of the DataFrame describe()
Following are the parameters of the Polars DataFrame describe() function.
- percentiles (Sequence[float] | float | None, default = (0.25, 0.5, 0.75)) – Specifies which percentiles to compute.
  - Can be a single float (e.g., 0.5 for the median) or a list of floats (e.g., [0.1, 0.5, 0.9]). Each value must lie between 0 and 1.
  - If None, no percentiles are computed.
- interpolation (str, default = "nearest") – Defines how to compute a percentile when it falls between two data points.
  - "nearest" (default) – Uses the closest data point.
  - "lower" – Uses the lower of the two surrounding data points.
  - "higher" – Uses the higher of the two surrounding data points.
  - "midpoint" – Averages the two surrounding data points.
  - "linear" – Linearly interpolates between the two surrounding data points.
Return Value
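- This function returns a new Polars DataFrame with a statistic column followed by one column per original column. Statistics that do not apply to a column's data type (for example, the mean of a string column) are null.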
Usage of Polars DataFrame describe() Function
The describe() function in Polars generates a quick statistical summary of a DataFrame, which makes it useful for data exploration and analysis. For numerical columns it reports the count, null count, mean, standard deviation, minimum, maximum, and percentiles; for string columns it reports only the statistics that apply, such as count, minimum, and maximum.
To run some examples of the Polars DataFrame describe() function, let’s create a Polars DataFrame.
import polars as pl

# Creating a new Polars DataFrame
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [1000, 1300, 2000, 1500, 2500]
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)
Yields below output.
To get summary statistics for every column in the Polars DataFrame, call the describe() function.
# Get summary statistics
df2 = df.describe()
print("Summary Statistics:\n", df2)
Here,
- For numerical columns (Fee, Discount), it displays the count, number of null values, mean, standard deviation (std), minimum, percentiles (25%, 50%, 75%), and maximum values.
- For string columns (Courses), it shows the count, number of null values, minimum, and maximum values; the remaining statistics are null.
Summary Statistics for a Specific Column
To get summary statistics for a specific column in a Polars DataFrame, use the select() method combined with the describe() function.
# Get summary statistics for a specific column (e.g., 'Fee')
df2 = df.select("Fee").describe()
print("Summary Statistics for 'Fee' Column:\n", df2)
# Output:
# Summary Statistics for 'Fee' Column:
# shape: (9, 2)
┌────────────┬───────────┐
│ statistic ┆ Fee │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════════╪═══════════╡
│ count ┆ 5.0 │
│ null_count ┆ 0.0 │
│ mean ┆ 25200.0 │
│ std ┆ 3114.4823 │
│ min ┆ 22000.0 │
│ 25% ┆ 23000.0 │
│ 50% ┆ 25000.0 │
│ 75% ┆ 26000.0 │
│ max ┆ 30000.0 │
└────────────┴───────────┘
Here,
- Use select("Fee") to keep only the required column before applying describe().
- The output includes count, null_count, mean, std, min, max, and percentiles (25%, 50%, 75%).
- All other columns, including the non-numeric "Courses" column, are excluded.
Summary Statistics for Only Numerical Columns
To get summary statistics for only numerical columns in a Polars DataFrame, use the select() method to keep just the numeric columns before applying the describe() function.
# Select only numerical columns and get summary statistics
# Note: pl.NUMERIC_DTYPES may emit a deprecation warning in recent Polars releases
df2 = df.select(pl.col(pl.NUMERIC_DTYPES)).describe()
print("Summary Statistics for Numerical Columns:\n", df2)
# Output:
# Summary Statistics for Numerical Columns:
# shape: (9, 3)
┌────────────┬───────────┬────────────┐
│ statistic ┆ Fee ┆ Discount │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞════════════╪═══════════╪════════════╡
│ count ┆ 5.0 ┆ 5.0 │
│ null_count ┆ 0.0 ┆ 0.0 │
│ mean ┆ 25200.0 ┆ 1660.0 │
│ std ┆ 3114.4823 ┆ 594.138031 │
│ min ┆ 22000.0 ┆ 1000.0 │
│ 25% ┆ 23000.0 ┆ 1300.0 │
│ 50% ┆ 25000.0 ┆ 1500.0 │
│ 75% ┆ 26000.0 ┆ 2000.0 │
│ max ┆ 30000.0 ┆ 2500.0 │
└────────────┴───────────┴────────────┘
Here,
- The select(pl.col(pl.NUMERIC_DTYPES)) call ensures only numerical columns are considered.
- The "Courses" column (which is non-numeric) is excluded from the summary.
- The describe() function computes count, null_count, mean, std, min, max, and percentiles.
Customizing Percentiles with Linear Interpolation
Polars’ describe() function generates statistical summaries for numerical columns, including percentiles. By default, it displays the 25th, 50th (median), and 75th percentiles, but you can customize these values and choose an interpolation method for cases where a requested percentile falls between two data points. One such method, linear interpolation, calculates percentile values by applying a weighted average between the closest data points.
# Customizing percentiles with linear interpolation
df2 = df.describe(
percentiles=[0.1, 0.3, 0.5, 0.7, 0.9], # Custom percentiles
interpolation="linear" # Explicitly setting interpolation method
)
print("Customizing percentiles with linear interpolation:\n", df2)
# Output:
# Customizing percentiles with linear interpolation:
# shape: (11, 4)
┌────────────┬─────────┬───────────┬────────────┐
│ statistic ┆ Courses ┆ Fee ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞════════════╪═════════╪═══════════╪════════════╡
│ count ┆ 5 ┆ 5.0 ┆ 5.0 │
│ null_count ┆ 0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ null ┆ 25200.0 ┆ 1660.0 │
│ std ┆ null ┆ 3114.4823 ┆ 594.138031 │
│ min ┆ Hadoop ┆ 22000.0 ┆ 1000.0 │
│ … ┆ … ┆ … ┆ … │
│ 30% ┆ null ┆ 23400.0 ┆ 1340.0 │
│ 50% ┆ null ┆ 25000.0 ┆ 1500.0 │
│ 70% ┆ null ┆ 25800.0 ┆ 1900.0 │
│ 90% ┆ null ┆ 28400.0 ┆ 2300.0 │
│ max ┆ Spark ┆ 30000.0 ┆ 2500.0 │
└────────────┴─────────┴───────────┴────────────┘
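Here,
- percentiles=[0.1, 0.3, 0.5, 0.7, 0.9] replaces the default 25%, 50%, and 75% rows with the 10%, 30%, 50%, 70%, and 90% percentiles.
- With interpolation="linear", a percentile that falls between two data points is computed as a weighted average of them. For example, the 30% value for Fee (23400.0) lies between the data points 23000 and 25000.
- The printed table is truncated (the … row), so the 10% row is computed but not displayed.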
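Excluding Null Values in Summary Statistics
The describe() function skips missing values (null) when computing statistics such as mean and std, and reports how many there are in the null_count row. Note that Polars treats null and NaN differently: NaN is a regular floating-point value and is not counted as missing. If you want the summary to be based only on complete rows, you can drop the rows containing nulls before calling the describe() function.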
import polars as pl

# Creating a Polars DataFrame with null (missing) values
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, None, 26000, 23000, 30000],  # Contains a missing value
    'Discount': [1000, 1300, None, 1500, 2500]  # Contains a missing value
}
df = pl.DataFrame(technologies)

# Drop every row that contains a null before describing
df2 = df.drop_nulls().describe()
print("Describe excluding nan values:\n", df2)
# Output:
# Describe excluding nan values:
# shape: (9, 4)
┌────────────┬─────────┬─────────────┬─────────────┐
│ statistic ┆ Courses ┆ Fee ┆ Discount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞════════════╪═════════╪═════════════╪═════════════╡
│ count ┆ 3 ┆ 3.0 ┆ 3.0 │
│ null_count ┆ 0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ null ┆ 26000.0 ┆ 1666.666667 │
│ std ┆ null ┆ 3605.551275 ┆ 763.762616 │
│ min ┆ Pandas ┆ 23000.0 ┆ 1000.0 │
│ 25% ┆ null ┆ 25000.0 ┆ 1500.0 │
│ 50% ┆ null ┆ 25000.0 ┆ 1500.0 │
│ 75% ┆ null ┆ 30000.0 ┆ 2500.0 │
│ max ┆ Spark ┆ 30000.0 ┆ 2500.0 │
└────────────┴─────────┴─────────────┴─────────────┘
Here,
- drop_nulls() removes every row that contains at least one null value, so only three of the five rows remain.
- The statistics (mean, standard deviation, min, max, etc.) are computed on the remaining complete rows.
- The null_count row confirms that no null values remain.
Conclusion
In conclusion, the describe() function in Polars is a powerful tool for quickly summarizing the columns of a DataFrame. It generates key statistics like count, mean, standard deviation, minimum, maximum, and percentiles, and it reports missing values through the null_count row.
Happy Learning!!