Linear Regression is one of the most widely used algorithms in real-life Machine Learning problems, thanks to its simplicity, interpretability, and speed. In the next few minutes, we’ll look at how this algorithm works.
In this article, I will explain Linear Regression with some data, Python code examples, and output.
1. Linear Regression Introduction
What is Linear Regression?
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is a popular technique for predicting the value of the dependent variable based on the values of the independent variables. Linear regression assumes that there is a linear relationship between the dependent variable and the independent variables, which means that a change in one independent variable leads to a proportional change in the dependent variable.
1.1 Simple Linear Regression Equation
The equation for simple linear regression is as follows:
# Simple Linear Regression Equation
Y = β0 + β1X + ε
where:
- Y is the dependent variable,
- X is the independent variable,
- β0 is the intercept,
- β1 is the slope,
- ε is the error term.
β0 is the value of Y when X is equal to zero, and β1 is the change in Y for a unit change in X. ε represents the random error or noise in the data.
The goal of linear regression is to find the best-fit line or hyperplane that explains the relationship between the independent variables and the dependent variable. In simple linear regression, there is only one independent variable, while in multiple linear regression, there are multiple independent variables. The best-fit line or hyperplane is determined by minimizing the sum of squared differences between the predicted values and the actual values of the dependent variable.
Linear regression is widely used in various fields such as economics, finance, biology, social sciences, engineering, and many others. It can be used to make predictions, identify the factors that affect the dependent variable, and understand the underlying relationships between variables. However, it is important to carefully analyze the data and choose the appropriate model to ensure that the results are accurate and meaningful.
1.2 How does linear regression find the best-fit line?
The goal of the linear regression algorithm is to find the values of β0 and β1 that give the best-fit line. The best-fit line is the line with the least error, meaning that the overall difference between the predicted values and the actual values is as small as possible.
1.3 Random Error (Residuals)
In regression, the difference between the actual value of the dependent variable (yi) and the predicted value (ypredicted) is called the residual.
# Equation To Calculate the Random Error
εi = yi - ypredicted
where ypredicted = β0 + β1Xi
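As a small illustration of the residual definition, the sketch below takes each residual as the actual value minus the predicted value. The data points and the coefficient values `b0` and `b1` are assumptions chosen for illustration, not values from this article.

```python
import numpy as np

# Hypothetical data and coefficients, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.5, 6.5, 9.0])
b0, b1 = 1.0, 2.0  # assumed intercept and slope

y_predicted = b0 + b1 * x      # predicted values on the line
residuals = y - y_predicted    # actual minus predicted

print(residuals)
```

A positive residual means the point lies above the line, and a negative residual means it lies below.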
1.4 What is the best fit line?
In simple terms, the best-fit line is the line that fits the given scatter plot in the best way. Mathematically, the best-fit line is obtained by minimizing the Residual Sum of Squares (RSS), which is the sum of the squared residuals: RSS = ∑(yi - ypredicted)^2.
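For simple linear regression, the line that minimizes the RSS can be written in closed form. The sketch below uses the standard least-squares formulas for the slope and intercept. The sample points are hypothetical, and the names `b0`, `b1`, and `rss` are chosen here for illustration.

```python
import numpy as np

# Hypothetical sample points (not from the article)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form least-squares estimates for simple linear regression
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# Residual sum of squares for the fitted line
rss = np.sum((y - (b0 + b1 * x)) ** 2)

print("Intercept:", b0)
print("Slope:", b1)
print("RSS:", rss)
```

Any other choice of intercept and slope would give an RSS at least as large on these points, which is what "best fit" means here.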
2. Evaluation Metrics for Linear Regression
When building a linear regression model, it is important to evaluate its performance to ensure that it accurately predicts the dependent variable. There are several evaluation metrics that can be used to assess the performance of a linear regression model. In this article, we will discuss some of the most common evaluation metrics for linear regression.
Suppose we have a dataset containing information about houses in a particular city. The dataset has the following columns:
- Size (in square feet)
- Number of Bedrooms
- Price (in thousands of dollars)
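Here are the rows of the dataset used in the code below:
Size (sq ft) | Bedrooms | Price (thousands of dollars)
1500 | 3 | 250
2000 | 4 | 350
1200 | 2 | 180
1700 | 3 | 280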
We want to use this dataset to build a linear regression model that can predict the price of a house based on its size and number of bedrooms.
To calculate the evaluation metrics, we first fit the linear regression model, make predictions with it, and then compare those predictions with the actual values.
# Import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression
# Load the dataset: each row of X is [size, number of bedrooms], y is the price
X = [[1500, 3], [2000, 4], [1200, 2], [1700, 3]]
y = [250, 350, 180, 280]
# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)
# Make predictions on the same data
y_pred = model.predict(X)
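Once the model is fitted, its learned parameters can be inspected. The sketch below refits the same four rows and prints the coefficients and intercept through scikit-learn's `coef_` and `intercept_` attributes. The 1800 sq ft, 3-bedroom house is a hypothetical input added for illustration.

```python
from sklearn.linear_model import LinearRegression

# Same four sample rows as above: [size, bedrooms] -> price
X = [[1500, 3], [2000, 4], [1200, 2], [1700, 3]]
y = [250, 350, 180, 280]

model = LinearRegression().fit(X, y)

# One coefficient per feature, in column order: size, bedrooms
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# Predict the price of a hypothetical 1800 sq ft, 3-bedroom house
new_price = model.predict([[1800, 3]])
print("Predicted price:", new_price[0])
```

For these four rows the fitted plane works out to roughly price = -50 + 0.15 × size + 25 × bedrooms, so each extra square foot adds about 0.15 (thousand dollars) and each extra bedroom about 25, holding the other feature fixed.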
2.1 Mean Squared Error (MSE):
Mean squared error (MSE) is a common evaluation metric used in linear regression. It measures the average squared difference between the predicted values and the actual values of the dependent variable. The formula for MSE is as follows:
# Equation For MSE
MSE = (1/n) * ∑(y_pred - y_actual)^2
where:
- n is the number of data points,
- y_pred is the predicted value, and
- y_actual is the actual value.
MSE gives a measure of how well the model fits the data. A lower MSE indicates a better fit between the predicted and actual values.
Python Code For MSE:
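A minimal sketch of this calculation, assuming scikit-learn's `mean_squared_error` helper and the four sample rows from above. The variable `mse_manual` is added here only to show the formula written out by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same four sample rows as above: [size, bedrooms] -> price
X = [[1500, 3], [2000, 4], [1200, 2], [1700, 3]]
y = [250, 350, 180, 280]

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Library helper
mse = mean_squared_error(y, y_pred)

# The same quantity written out from the formula
mse_manual = np.mean((y_pred - np.array(y)) ** 2)

print("MSE:", mse)
```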
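# Output
MSE ≈ 0 (the four sample rows lie exactly on a plane, so the model fits them perfectly up to floating-point rounding)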
2.2 Root Mean Squared Error (RMSE):
Root mean squared error (RMSE) is similar to MSE, but it takes the square root of the MSE to make the units of the error the same as the units of the dependent variable. The formula for RMSE is as follows:
# Equation For RMSE
RMSE = sqrt((1/n) * ∑(y_pred - y_actual)^2)
RMSE is also commonly used as an evaluation metric for linear regression.
Python Code For RMSE:
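A minimal sketch, assuming the same four sample rows and taking the square root of scikit-learn's `mean_squared_error` with NumPy:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same four sample rows as above: [size, bedrooms] -> price
X = [[1500, 3], [2000, 4], [1200, 2], [1700, 3]]
y = [250, 350, 180, 280]

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# RMSE is the square root of the MSE, so it is in the units of y
rmse = np.sqrt(mean_squared_error(y, y_pred))

print("RMSE:", rmse)
```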
# Output
RMSE ≈ 0 (any tiny nonzero value printed here is floating-point rounding error, not real prediction error)
2.3 Mean Absolute Error (MAE):
Mean absolute error (MAE) is another common evaluation metric for linear regression. It measures the average absolute difference between the predicted values and the actual values of the dependent variable. The formula for MAE is as follows:
# Equation For MAE
MAE = (1/n) * ∑|y_pred - y_actual|
MAE gives a measure of how well the model predicts the dependent variable. A lower MAE indicates a better prediction.
Python Code For Mean Absolute Error (MAE):
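A minimal sketch, assuming scikit-learn's `mean_absolute_error` helper and the same four sample rows:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Same four sample rows as above: [size, bedrooms] -> price
X = [[1500, 3], [2000, 4], [1200, 2], [1700, 3]]
y = [250, 350, 180, 280]

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Average of the absolute differences between predictions and actual values
mae = mean_absolute_error(y, y_pred)

print("MAE:", mae)
```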
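# Output
MAE ≈ 0 (the model reproduces all four sample prices, so the absolute errors are only floating-point rounding)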
2.4 R-Squared (R2):
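R-squared (R2) is a metric that measures how well the model explains the variation in the dependent variable. It usually falls between 0 and 1, where 1 indicates a perfect fit and 0 indicates that the model explains none of the variation. On new data it can even be negative when the model predicts worse than simply using the mean. The formula for R2 is as follows: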
# Equation For R-Square
R2 = 1 - (SS_res / SS_tot)
where SS_res is the sum of squared residuals and SS_tot is the total sum of squares.
R2 is commonly used as an evaluation metric for linear regression models. A higher R2 indicates a better fit between the predicted and actual values.
Python Code For R-Squared (R2):
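A minimal sketch, assuming scikit-learn's `r2_score` helper and the same four sample rows. The names `ss_res`, `ss_tot`, and `r2_manual` are added here to mirror the formula.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same four sample rows as above: [size, bedrooms] -> price
X = [[1500, 3], [2000, 4], [1200, 2], [1700, 3]]
y = [250, 350, 180, 280]

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Library helper
r2 = r2_score(y, y_pred)

# The same quantity written out from the formula
y_arr = np.array(y, dtype=float)
ss_res = np.sum((y_arr - y_pred) ** 2)        # sum of squared residuals
ss_tot = np.sum((y_arr - y_arr.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("R-Square:", r2)
```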
# Output
R-Square 1.0
In conclusion, there are several evaluation metrics that can be used to assess the performance of a linear regression model. MSE, RMSE, and MAE measure the size of the prediction errors, while R2 measures how well the model explains the variation in the dependent variable. Because the metrics above were computed on the same data used to fit the model, they describe how well the model fits that data, not how it will perform on unseen houses. It is important to use multiple evaluation metrics, ideally on data the model has not seen, to judge whether the model is accurate and reliable.
3. Key Benefits of Linear Regression
Linear regression is a widely used technique in statistical modeling and has many benefits, including:
3.1 Simplicity:
Linear regression is a simple and easy-to-understand model. The basic idea is to find a line that best fits the data, which makes it intuitive and easy to explain to others.
3.2 Interpretability:
The coefficients in a linear regression model have a clear interpretation. For example, in a simple linear regression model, the coefficient represents the change in the dependent variable for every one-unit increase in the independent variable.
3.3 Flexibility:
Linear regression can be used for both simple and complex models. It can handle multiple independent variables, interactions between variables, and polynomial relationships between variables.
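As a sketch of how a polynomial relationship can be handled, the example below expands a single feature into x and x^2 with scikit-learn's `PolynomialFeatures` and then fits an ordinary `LinearRegression`. The data is hypothetical and generated from y = 1 + 2x + 3x^2 so the recovered coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data following y = 1 + 2x + 3x^2 exactly
x = np.arange(-3, 4, dtype=float).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# Expand x into [x, x^2]; the model stays linear in its coefficients
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

model = LinearRegression().fit(X_poly, y)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```

The curve is nonlinear in x, but the model is still linear in the coefficients it estimates, which is why the same algorithm applies.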
3.4 Predictive power:
Linear regression can be used to make predictions about future outcomes based on historical data. This makes it a valuable tool for forecasting and trend analysis.
3.5 Efficiency:
Linear regression is a computationally efficient model, which means it can handle large datasets quickly and easily.
3.6 Assumptions and limitations:
Linear regression has clear assumptions and limitations, which can help identify potential issues and ensure that the model is appropriate for the data. This can lead to more accurate predictions and better insights.
4. Assumptions of Linear Regression
Linear regression is a powerful and widely used statistical method for modeling relationships between dependent and independent variables. However, linear regression models make certain assumptions about the data that must be met for the model to be valid. Here are the key assumptions for linear regression:
4.1 Linearity:
The relationship between the dependent variable and each independent variable should be linear. This means that a one-unit change in an independent variable should produce the same change in the dependent variable, no matter where in its range that change happens.
4.2 Independence:
The observations should be independent of each other. This means that the error (residual) of one observation should not be related to the error of any other observation, as can happen with data collected over time.
4.3 Homoscedasticity:
The variance of the residuals should be constant across all levels of the independent variables. In other words, the spread of the residuals should be the same throughout the range of the independent variables.
4.4 Normality:
The residuals should be normally distributed, that is, they should follow a bell-shaped distribution centered on a mean of zero.
4.5 No multicollinearity:
There should be no high correlation between the independent variables. Multicollinearity can lead to unstable estimates of the regression coefficients and can make it difficult to interpret the model.
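One common way to check for multicollinearity is the variance inflation factor (VIF), a diagnostic this article does not otherwise cover. The sketch below computes it as 1 / (1 - R^2), where R^2 comes from regressing one feature on the others. The helper `vif` and the synthetic features are hypothetical and written only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"VIF of feature {j}: {vif(X, j):.1f}")
```

A VIF close to 1 suggests a feature is not explained by the others, while a very large value signals that it is nearly redundant.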
5. Applications of Linear Regression
Linear regression is a widely used statistical technique that has numerous applications in various fields. Some of the applications of linear regression are:
5.1 Economics:
In economics, linear regression is used to study the relationship between variables such as supply and demand, price and quantity, and unemployment and inflation.
5.2 Finance:
Linear regression is widely used in finance to model the relationship between stock prices and other financial variables such as interest rates, GDP growth rates, and inflation rates.
5.3 Marketing:
Linear regression is used in marketing to predict the sales of a product based on various factors such as advertising expenditure, price, and consumer demographics.
5.4 Medical research:
Linear regression is used in medical research to study the relationship between various medical conditions and factors such as age, gender, and lifestyle habits.
5.5 Environmental science:
Linear regression is used in environmental science to model the relationship between various environmental factors such as temperature, precipitation, and air pollution.
5.6 Education:
In education, linear regression is used to study the relationship between various factors such as student grades, attendance, and socioeconomic status.
5.7 Sports:
Linear regression is used in sports to predict the outcome of games based on various factors such as team performance, player statistics, and game conditions.
Overall, linear regression is a powerful and versatile statistical technique that has numerous applications in various fields. Its simplicity, interpretability, and flexibility make it a popular choice for analyzing and modeling relationships between variables.
6. Challenges and Limitations of Linear Regression
Linear regression is a widely used statistical technique for modeling the relationship between a dependent variable and one or more independent variables. Despite its popularity and usefulness, linear regression also has several limitations and challenges that should be considered. Here are some of the key challenges and limitations of linear regression:
6.1 Linearity assumption:
The basic assumption of linear regression is that the relationship between the dependent variable and the independent variables is linear. However, in real-world situations, this assumption may not always hold, and the relationship between variables may be more complex.
6.2 Outliers:
Linear regression is sensitive to outliers, which are data points that lie far away from the majority of the data. Outliers can have a significant impact on the regression line and can distort the results of the analysis.
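The effect of a single outlier can be seen in a small sketch. The points below are hypothetical and lie exactly on y = 2x, then one value is corrupted. The fit uses NumPy's `polyfit` with degree 1, which performs an ordinary least-squares line fit.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x
x = np.arange(10, dtype=float)
y = 2 * x

slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Corrupt a single observation to act as an outlier
y_outlier = y.copy()
y_outlier[-1] = 100.0

slope_out, intercept_out = np.polyfit(x, y_outlier, 1)

print("Slope without outlier:", slope_clean)
print("Slope with outlier:   ", slope_out)
```

Changing one point out of ten more than triples the estimated slope here, because squaring the errors gives large deviations a large influence on the fit.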
6.3 Overfitting:
Overfitting occurs when the model is too complex and captures noise in the data rather than the underlying relationship between variables. This can result in a model that fits the training data very well but performs poorly on new, unseen data.
6.4 Underfitting:
Underfitting occurs when the model is too simple and does not capture the underlying relationship between variables. This can result in a model that is unable to explain the data well, even in the training set.
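Underfitting can be sketched with hypothetical data that follows y = x^2. A straight line cannot capture this shape even on the training data, while the same algorithm given an extra x^2 column can. The variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with a purely quadratic relationship
x = np.linspace(-3, 3, 50)
y = x ** 2

# A straight line is too simple for this shape
X_line = x.reshape(-1, 1)
r2_line = LinearRegression().fit(X_line, y).score(X_line, y)

# Adding an x^2 column lets the model capture the curve
X_quad = np.column_stack([x, x ** 2])
r2_quad = LinearRegression().fit(X_quad, y).score(X_quad, y)

print("Training R^2, straight line:", r2_line)
print("Training R^2, with x^2 term:", r2_quad)
```

The straight line explains almost none of the variation in this data, while the model with the squared term fits it essentially perfectly.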
6.5 Multicollinearity:
Multicollinearity occurs when there is a high degree of correlation between the independent variables in the model. This can make it difficult to determine the individual effects of each variable on the dependent variable.
6.6 Limited to linear relationships:
Linear regression with the original features is only suitable for modeling linear relationships between variables. If the relationship is nonlinear, the features may need to be transformed, for example with polynomial terms as in polynomial regression, or a nonlinear regression method may be more appropriate.
6.7 Lack of robustness:
Linear regression estimates can be sensitive to changes in the data, especially with small samples, outliers, or highly correlated independent variables, and the results of the analysis may change significantly if the data is modified or if new data is added.
6.8 Assumes normality:
Linear regression assumes that the residuals (the differences between the actual and predicted values) are normally distributed. If this assumption is violated, confidence intervals and significance tests based on the model may be unreliable.
In conclusion, linear regression is a powerful tool for modeling the relationship between variables, but it has several limitations and challenges that must be taken into account when using it. Careful consideration of these limitations and proper analysis of the results can help ensure that linear regression is used appropriately and effectively.