Quantile Regression In Machine Learning

1. What is Quantile Regression?

In machine learning, quantile regression is a statistical technique that models the relationship between a dependent variable and one or more independent variables by estimating conditional quantiles of the dependent variable.

Unlike traditional linear regression, which models only the conditional mean of the dependent variable, quantile regression can model any chosen quantile of its conditional distribution. This allows us to analyze the effects of the independent variables on different parts of the distribution, such as the lower or upper tails.

Quantile regression is useful in situations where the relationship between the variables of interest may not be symmetric or when outliers may exist. It also provides a more complete picture of the relationship between variables than traditional regression methods.

2. The Equation for Quantile Regression

The equation for quantile regression in machine learning is:


#The equation for quantile regression
Q_q(y | x) = xβ(q)

where,

  • Q_q(y | x) is the q-th quantile of the conditional distribution of y given x
  • β(q) is the vector of regression coefficients for the q-th quantile, and
  • x is the vector of independent variables.

In this equation, q is the desired quantile level between 0 and 1, such as 0.1, 0.25, 0.5, 0.75, or 0.9 (the 10th, 25th, 50th, 75th, or 90th percentile). The coefficient vector β(q) describes the effect of the independent variables on the q-th quantile of the dependent variable.
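The text above does not say how β(q) is found. One standard approach is to minimize the pinball loss, which can be written as a linear program. The sketch below is a minimal illustration under assumptions: the helper name fit_quantile_lp and the synthetic data are hypothetical, and SciPy's linprog is used as the solver.

```python
import numpy as np
from scipy.optimize import linprog


def fit_quantile_lp(X, y, q):
    """Estimate beta(q) by solving the quantile-regression linear program.

    Minimizes q * sum(u) + (1 - q) * sum(v) subject to X @ beta + u - v = y,
    where u, v >= 0 are the positive and negative parts of the residuals.
    """
    n, p = X.shape
    # Decision vector: [beta (free), u (>= 0), v (>= 0)]
    c = np.concatenate([np.zeros(p), q * np.ones(n), (1 - q) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]


# Synthetic data (hypothetical): the noise grows with x, so the quantile lines fan out
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + 0.5 * x)
X = np.column_stack([np.ones_like(x), x])  # intercept column plus one feature

betas = {q: fit_quantile_lp(X, y, q) for q in (0.1, 0.5, 0.9)}
for q, beta in betas.items():
    share_below = np.mean(y < X @ beta - 1e-6)  # share of points under the fitted line
    print(f"q={q}: intercept={beta[0]:.3f}, slope={beta[1]:.3f}, share below={share_below:.3f}")
```

For each q, roughly a fraction q of the observations should fall below the fitted line, which is what "the q-th conditional quantile" means in practice.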

3. Why the name Quantile Regression?

The name “quantile regression” comes from the fact that this type of regression estimates the conditional quantiles of the dependent variable, rather than the conditional mean, which is typically estimated in traditional linear regression.

In statistics, a quantile is a cut point below which a given proportion of a distribution lies. For example, the median is the 0.5 quantile (the 50th percentile), meaning that 50% of the observations in the distribution fall below that value.
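As a small illustration of this definition, the sketch below computes a few sample quantiles with NumPy and checks what share of the sample lies at or below each one. The sample (the integers 1 to 100) is a made-up example.

```python
import numpy as np

# Hypothetical sample: the integers 1..100
values = np.arange(1, 101)

for q in (0.25, 0.5, 0.75):
    cut = np.quantile(values, q)          # value at quantile level q
    share_below = np.mean(values <= cut)  # fraction of observations at or below it
    print(f"q={q}: quantile={cut}, share at or below={share_below}")
```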

Because it can estimate several conditional quantiles rather than a single conditional mean, quantile regression can also provide insights into the shape and variability of the distribution of the dependent variable.

4. Evaluation Metrics for Quantile Regression

There are several evaluation metrics that can be used to assess the performance of a quantile regression model. Some of the commonly used metrics include:

  • Mean Absolute Error (MAE): This measures the average absolute difference between the predicted and actual values of the dependent variable.
  • Mean Squared Error (MSE): This measures the average squared difference between the predicted and actual values of the dependent variable.
  • Root Mean Squared Error (RMSE): This measures the square root of the average squared difference between the predicted and actual values of the dependent variable.
  • Pinball Loss: This metric is specific to quantile regression. For a quantile q, it charges q times the error when the model under-predicts and (1 − q) times the error when it over-predicts, then averages over the observations. When several quantiles are of interest, the per-quantile losses are often averaged or summed.
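To make the pinball loss concrete, here is a minimal NumPy sketch. The function name pinball_loss and the four data points are illustrative assumptions, not part of any library. At q = 0.5 the loss is exactly half the MAE, and at high quantiles under-predictions cost more than over-predictions.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss at level q, with 0 < q < 1."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-predictions (diff > 0) cost q per unit; over-predictions cost (1 - q)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


# Made-up values for illustration
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.0, 16.0, 17.0])

mae = float(np.mean(np.abs(y_true - y_pred)))
print("MAE:", mae)
print("Pinball (q=0.5):", pinball_loss(y_true, y_pred, 0.5))
print("Pinball (q=0.9):", pinball_loss(y_true, y_pred, 0.9))
print("Pinball (q=0.1):", pinball_loss(y_true, y_pred, 0.1))
```

In this sample the largest error is an under-prediction, so the loss at q = 0.9 comes out higher than at q = 0.1.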

4.1 Code


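# Import necessary modules
import numpy as np
import pandas as pd
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_absolute_error, mean_pinball_loss, mean_squared_error
from sklearn.model_selection import train_test_split

# load Boston Housing dataset
# (load_boston was removed in scikit-learn 1.2; this reads the same data from its
# original source, following the snippet in scikit-learn's deprecation notice)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit quantile regression model
q = 0.5 # example quantile (the median)
# `quantile` sets the target quantile; `alpha` is the L1 penalty strength, not the quantile
model = QuantileRegressor(quantile=q, alpha=0.0, solver="highs")
model.fit(X_train, y_train)

# make predictions on test set
y_pred = model.predict(X_test)

# compute evaluation metrics for the median predictions
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# compute pinball loss: fit one model per quantile, then take a weighted average
quantiles = [0.1, 0.5, 0.9] # example quantiles
weights = [1, 2, 1] # example weights
total_loss = 0
for q_i, w in zip(quantiles, weights):
    model_q = QuantileRegressor(quantile=q_i, alpha=0.0, solver="highs").fit(X_train, y_train)
    total_loss += w * mean_pinball_loss(y_test, model_q.predict(X_test), alpha=q_i)
pinball = total_loss / sum(weights)

# print evaluation metrics
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("Pinball Loss:", pinball)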

# Output (exact values depend on the scikit-learn version and solver)
MAE: 3.107693208614685
MSE: 19.109698966683236
RMSE: 4.369178006574176
Pinball Loss: 2.579571985038757

5. Quantile Regression Example

The following example fits a quantile regression model in Python using statsmodels.

Data Set Link: https://github.com/Narenderbeniwal/Spark-By-Example


# Import necessary modules
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import mean_pinball_loss
from sklearn.model_selection import train_test_split

# Load data (the path must resolve to the raw CSV file; use a local copy if it does not)
data = pd.read_csv('https://github.com/Narenderbeniwal/Spark-By-Example/BostonHousing.csv')

# Define features and target variable
X = data.drop(['medv'], axis=1)
y = data['medv']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Add the intercept column once and reuse it
X_train_c = sm.add_constant(X_train)
X_test_c = sm.add_constant(X_test, has_constant='add')

# Fit quantile regression model
q = 0.5 # example quantile
model = sm.QuantReg(y_train, X_train_c)
result = model.fit(q=q)

# Make predictions on test set
y_pred = result.predict(X_test_c)

# Print coefficients and summary of model
print(result.summary())

# Print quantile-specific coefficients: refit at each decile and show the intercept
# (the remaining entries of .params can be inspected the same way)
print("Quantile-Specific Coefficients:")
for i in np.arange(0.1, 1, 0.1):
    q_i = round(i, 1)
    print("Quantile:", q_i)
    print(model.fit(q=q_i).params[['const']])

# Compute evaluation metrics: score each quantile with its own fit, then average
quantiles = [0.1, 0.5, 0.9] # example quantiles
weights = [1, 2, 1] # example weights
total_loss = 0
for q_i, w in zip(quantiles, weights):
    pred_q = model.fit(q=q_i).predict(X_test_c)
    total_loss += w * mean_pinball_loss(y_test, pred_q, alpha=q_i)
pinball = total_loss / sum(weights)

# Print evaluation metrics
print("Pinball Loss:", pinball)

This example yields the following output.


# Output
                          QuantReg Regression Results                          
==============================================================================
Dep. Variable:                   medv   Pseudo R-squared:               0.6009
Model:                       QuantReg   Bandwidth:                       3.515
Method:                 Least Squares   Sparsity:                        47.27
Date:                Wed, 24 Mar 2023   No. Observations:                  404
Time:                        12:00:00   Df Residuals:                      392
                                        Df Model:                           11
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             43.2632      6.113      7.076      0.000      31.248      55.278
crim              -0.1004      0.043     -2.349      0.019      -0.184      -0.016
zn                 0.0441      0.017      2.625      0.009       0.011       0.077
indus              0.0389      0.074      0.527      0.598      -0.106       0.184
chas               2.1359      0.946      2.258      0.024       0.279       3.993
nox              -16.1033      3.777     -4.263      0.000     -23.528      -8.679
rm                 3.6827      0.472      7.802      0.000       2.756       4.610
age                0.0132      0.016      0.833      0.405      -0.018       0.044
dis               -1.4013      0.242     -5.785      0.000      -1.878      -0.924
rad                0.3277      0.082      4.009      0.000       0.167       0.488
tax               -0.0128      0.005     -2.466      0.014      -0.023      -0.003
ptratio           -0.9310      0.157     -5.931      0.000      -1.239      -0.623
==============================================================================

The condition number is large, 1.19e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Quantile-Specific Coefficients:
Quantile: 0.1
0.1    8.014420
dtype: float64
Quantile: 0.2
0.2    15.905219
dtype: float64
Quantile: 0.3
0.3    21.624567
dtype: float64
Quantile: 0.4
0.4    24.204847
dtype: float64
Quantile: 0.5
0.5    43.263233
dtype: float64
Quantile: 0.6
0.6    45.557582
dtype: float64
Quantile: 0.7
0.7    47.237014
dtype: float64
Quantile: 0.8
0.8    47.961809

6. Benefits of Quantile Regression

Quantile regression offers several benefits over traditional mean regression methods:

  • Handles skewed distributions: Ordinary least squares summarizes the response by its conditional mean, and its standard inference assumes normally distributed errors, which is not always the case. Quantile regression makes no distributional assumption about the errors, so it can describe skewed distributions more faithfully.
  • Robustness to outliers: Quantile regression is also robust to outliers in the response, since it minimizes a sum of (asymmetrically weighted) absolute deviations instead of the sum of squared deviations.
  • Flexibility: Quantile regression allows modeling different quantiles of the response variable, which can be useful for different applications. For example, if the goal is to predict unusually low or high values of the response variable, quantile regression can be used to model the corresponding quantiles, such as the 0.05 or 0.95 quantile.
  • Interpretability: Quantile regression provides estimates of the conditional quantiles of the response variable, which can be interpreted as the effect of each predictor on different parts of the distribution of the response variable.
  • Useful for risk management: In finance and other fields where risk management is critical, quantile regression can be used to model the lower quantiles of the response variable, which can help in estimating the risk of negative events.
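The robustness point can be seen in the simplest case, a model with no predictors, where least squares returns the mean and median regression returns the median. The sketch below uses made-up numbers and corrupts one observation to show how each estimate reacts.

```python
import numpy as np

# Made-up sample, and the same sample with the last value corrupted
clean = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.array([10.0, 11.0, 12.0, 13.0, 140.0])

# The mean minimizes squared deviations; the median minimizes absolute deviations
print("mean:  ", clean.mean(), "->", with_outlier.mean())
print("median:", np.median(clean), "->", np.median(with_outlier))
```

The mean is pulled strongly toward the outlier while the median does not move. The same mechanism carries over to regression with predictors.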

7. Applications of Quantile Regression

Quantile regression has various applications in many fields. Here are some examples:

  • Economics: Quantile regression is used in economics to analyze the relationship between different variables. For example, it can be used to study the effect of education level on income at different quantiles.
  • Finance: In finance, quantile regression is used to model the risk of financial assets. For example, it can be used to estimate the Value at Risk (VaR) of a portfolio, which is itself a quantile of the return distribution, and as a step toward the Conditional Value at Risk (CVaR).
  • Healthcare: In healthcare, quantile regression can be used to analyze the effect of different variables on patient outcomes at different quantiles. For example, it can be used to study the effect of different treatments on the survival time of cancer patients, for short-term and long-term survivors separately.
  • Environmental science: In environmental science, quantile regression can be used to study the relationship between environmental variables and the distribution of species. For example, it can be used to analyze the effect of temperature and precipitation on the distribution of different plant species.
  • Marketing: In marketing, quantile regression can be used to analyze the relationship between different variables and customer behavior. For example, it can be used to study the effect of price on the purchase probability of different customers at different quantiles.
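The finance example rests on the fact that VaR is a quantile. The sketch below shows only the unconditional case, without predictors, on simulated daily returns. The return distribution, the sample size, and the 95% level are assumptions chosen for illustration. A quantile regression would make the same quantile depend on covariates.

```python
import numpy as np

# Simulated daily returns (hypothetical parameters)
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0005, scale=0.01, size=1000)

q05 = np.quantile(returns, 0.05)          # 5% quantile of returns
var_95 = -q05                             # 95% VaR, reported as a positive loss
cvar_95 = -returns[returns <= q05].mean() # average loss on the worst 5% of days

print(f"95% VaR:  {var_95:.4f}")
print(f"95% CVaR: {cvar_95:.4f}")
```

CVaR is always at least as large as VaR because it averages the losses beyond the VaR threshold.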

8. Conclusion

In conclusion, quantile regression is a statistical method for modeling chosen quantiles of a response variable rather than only its mean. It does not assume that the response is normally distributed, so it can describe skewed data, and its loss function makes it robust to outliers in the response. Because it can target any part of the response distribution, it suits tasks such as risk management and the analysis of customer behavior, with applications across economics, finance, healthcare, environmental science, and marketing. These properties make it a valuable addition to the toolbox of statisticians, researchers, and data analysts.
