Simple and Multiple Linear Regression Formulas Explained

An easy-to-read mathematical explanation of Simple and Multiple Linear Regression Formulas

Gabriel Furnieles
8 min read · Aug 23, 2022

Most people know about Linear Regression and its applications, but only a few have gone deeper and asked themselves where the equations and mathematical formulas they use actually come from (yes, I am one of them).
Since this is not a very popular topic on the internet, and many articles and videos don't take the time to explain these concepts, I have set out to do it in this post.

But first, for those who have just landed here and have no idea about Linear Regression, let's do a brief introduction (if you already know Linear Regression and just want to know where the formulas come from, feel free to skip this section and go straight to the second and third ones).

Note. Throughout this article I am using the book An Introduction to Statistical Learning as a reference. If you are more interested in Linear Regression or Machine Learning, I encourage you to have a look at it; it is really worth it.
All the plots and formulas have been generated by me.

1. What is Linear Regression?

Linear regression is a Machine Learning algorithm whose purpose is to fit a set of data points using a linear model (a straight line in 2D) so that afterward we can make predictions or draw inferences from the data.

In a Linear Regression problem, we have a set of predictor variables X₁, X₂, …, Xp and a unique response variable Y, and the aim is to explain the response variable with the predictors using a linear model.
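Written out, the linear model has the form

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon,$$

where ε is an error term that accounts for whatever the linear part cannot capture.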

Linear equation. The task of Linear Regression is to calculate the parameters β0, β1, …, βp so that the linear model is the best fit for the data points. Notation: β0 is called the intercept term, and β1, β2, …, βp are the coefficients.

The difference between Simple and Multiple Linear Regression is the number of predictors:
- 1 predictor (X): Simple Linear Regression
- 2 or more predictors (X₁, X₂, …, Xp): Multiple Linear Regression

To understand this better, let’s introduce a simple example where we just have 1 predictor variable (Simple Linear Regression).
Imagine we have gathered some data about the performance of 100 data science students in a statistics exam. We are studying the grade obtained by each student as a function of the number of hours they spent studying, and we would like to draw the straight line that best fits the data, so that we can determine whether there is a relationship between the grade and the hours of study.

Results obtained after the exam for the 100 students.

In this scenario, the predictor X is the hours of study and the response variable Y is the grade obtained by the student.
Since we only have 1 predictor, our linear model has the following shape:
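$$Y = \beta_0 + \beta_1 X + \varepsilon$$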

Line equation. The task of Simple Linear Regression is to calculate the parameters β0 and β1 so that the line is the best fit to the data points

Now we can develop a Simple Regression Analysis and obtain the following line:

Regression line obtained after the analysis

Finally, the results obtained for β0 and β1 are:
- β0 = 1.95361788
- β1 = 0.29338499
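Substituting these estimates into the line equation gives

$$\widehat{\text{Grade}} = 1.95361788 + 0.29338499 \cdot \text{Hours of study}$$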

Regression Line equation for Grade over Hours of study

This means that if we don't study at all, i.e. we study 0 hours (X = 0), our average grade will be 1.95361788.
And that for every hour of study our grade will increase by an average of 0.29338499 points.
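As a quick illustration, here is roughly how such a fit could be reproduced in Python with scikit-learn. The data below is simulated, since the students dataset from the example is not included here, so the coefficients will come out slightly different.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated stand-in for the exam example: 100 students,
# hours of study (X) and the grade they obtained (Y).
rng = np.random.default_rng(0)
hours = rng.uniform(0, 20, size=100)
grade = 1.95 + 0.29 * hours + rng.normal(0, 0.8, size=100)

# Fit the simple linear regression Y = beta_0 + beta_1 * X
model = LinearRegression().fit(hours.reshape(-1, 1), grade)
print("beta_0 (intercept):", model.intercept_)
print("beta_1 (slope):", model.coef_[0])
```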

At this point, we are in a position to extract some additional inferences from the data (see next section), but I will limit myself to listing them, since they require other concepts and statistical tools of Linear Regression that are beyond the scope of this article.

(It really gets on my nerves when people say that, because it feels like they are hiding information from you, but otherwise the article would be as long as a book, and we don't want that. If you are more interested in Linear Regression I encourage you to read An Introduction to Statistical Learning, which I am using as a reference book for this article, or you can just follow me for more statistics and data science content.)

Why and when should we use Linear Regression?

Linear regression is not used so much for prediction as for inference (obtaining useful information and conclusions about the data), since it offers a relatively inflexible fit.

Left: A flexible model fits the data points better, yielding better predictions (when x = 4.5 the model predicts y = 4.375). Right: The Linear Regression line can capture the data trend but returns worse predictions (when x = 4.5 it predicts y = 14.994).

Note that we are not forcing the line to pass through the points (what in mathematics is called interpolation) — since it wouldn’t be possible to do it with a single straight line — but we are looking for the line that passes closest to them.

However, Linear Regression can also be very useful when analyzing data. Among the inferences we can extract using Linear Regression are the following:

  • Find out whether there is a relationship between one variable (or group of variables) and the response, and calculate how strong that relationship is.
  • Compute the effect of each predictor variable on the response and determine which predictor contributes the most.
  • Find out whether the relationship is linear and how accurately we can make future predictions of the response using a linear model.
  • Find out whether there is synergy (interaction) between the predictors and how we can improve our linear model to make better predictions.
  • Compute the trend of the data.

2. Simple Linear Regression

After a not-so-brief introduction to Linear Regression (I apologize), it is time to get to the true theme and purpose of this article: THE FORMULAS (I promise to be brief and stick to the point).

Simple Linear Regression is used when we have only 1 predictor variable X that we want to use to explain the response variable Y.
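$$Y = \beta_0 + \beta_1 X + \varepsilon$$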

Simple Linear Model. We want to estimate β0 and β1

But before we start generating random values for β0 and β1, we need a selection criterion to decide which line is better than the others.

So how do we choose the best fitting line? RSS (Residual Sum of Squares)

In order to pick the best fitting line, we need to establish a fitting measurement that tells us how well or how poorly our line fits the data. (The measurements that quantify the performance of a model are called loss functions.)

In our linear regression problem, a good fitting measurement is to take the difference between the value Ŷ predicted by the fitted line and the true value Y from our data, square the result so that we only get positive values, and sum over all the data points.

As seen in the plot, the residuals are the distances between the data points (sky blue) and the values predicted by the regression line. We then square them, so that negative values cannot cancel out and hide the true loss, and add them all up.

This measurement is called Residual Sum of Squares (RSS) and mathematically it is expressed by the formula:
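$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$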

Residual Sum of Squares formula. ŷi represents the value predicted by the linear model (ŷi = β0 + β1·xi) and yi the true value from the data. The sum runs over all the data points.

The formula

We have just defined how we are going to measure which line is the best fitting one (great!), but how do we get to the simple, fast-to-compute formulas?

Objective: Minimize the loss function RSS with respect to the parameters β0 and β1 (i.e. find the β0 and β1 that minimize RSS).

1. Developing the expression:
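Substituting ŷi = β0 + β1·xi, the RSS becomes a function of the two parameters:

$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$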

Developing RSS expression

2. Now, if we take the partial derivatives with respect to the parameters and set them equal to zero:
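$$\frac{\partial\,\mathrm{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0$$

$$\frac{\partial\,\mathrm{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0$$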

Partial derivative of RSS with respect to β0
Partial derivative of RSS with respect to β1

Thus obtaining the following system:
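$$n\,\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$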

Linear system that describes the minimum point of RSS for β0 and β1

Note. How do we know this is a minimum and not a maximum?
Intuitively, we know that we are trying to minimize a loss function that describes how good our model is. Therefore, there is no upper limit for the loss: our model can always be worse and lead to higher losses. But there is a lower limit, where the error is as close as possible to 0 and cannot go lower (the model has its limitations, and in general it is not possible for the loss to be exactly 0).

Mathematically, the point (β0, β1) is called a stationary point in multivariable calculus, and we can classify it by computing the second partial derivatives; see [2].

Finally, we also know this is a global minimum and not just a local one: RSS is a convex quadratic function of the parameters, so the unique stationary point (β0, β1) obtained from the first derivatives (the system has a unique solution) is the global minimum.

3. To solve the system we can write the expression in matrix form:
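$$\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}$$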

The above system in matrix form

And therefore we obtain the final equation for Simple Linear Regression:
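$$\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}$$

Solving this 2×2 system explicitly gives the familiar textbook formulas

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x},$$

where x̄ and ȳ are the sample means of X and Y.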

The final equation for Simple Linear Regression
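A minimal numpy sketch of this computation (the function name and example data are illustrative, not from the article):

```python
import numpy as np

def simple_ols(x, y):
    """Estimate beta_0 and beta_1 by solving the 2x2 normal-equation system."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    A = np.array([[n,       x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    b = np.array([y.sum(), (x * y).sum()])
    beta_0, beta_1 = np.linalg.solve(A, b)
    return beta_0, beta_1

# Hypothetical check: recover the intercept and slope of y = 2 + 3x from noiseless data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(simple_ols(x, 2 + 3 * x))  # -> (2.0, 3.0) up to floating-point error
```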

3. Multiple Linear Regression

Multiple Linear Regression is an extension of the Simple model for more than 1 predictor. In this case, we have a set of predictor variables X₁, X₂, …, Xp that we want to use to explain the response variable Y.

Multiple Linear Model. Now the task of Linear Regression is to calculate the parameters β0, β1, …, βp that best fit the data.

The procedure is the same as in the Simple model. To simplify the mathematical notation I will explain the formula for 2 predictors, X₁ and X₂, but the procedure is the same for more predictors (3, 4, …).
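$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$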

Our linear model for two predictors

Objective: Same as in the Simple model: minimize the loss function RSS with respect to the parameters β0, β1, and β2 (i.e. find the β0, β1, and β2 that minimize RSS).

1. Developing the expression:
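Substituting ŷi = β0 + β1·x1i + β2·x2i, where x1i and x2i are the i-th observations of X₁ and X₂:

$$\mathrm{RSS}(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i} \right)^2$$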

Note that now the predicted y (ŷi) is different because it includes β0, β1, and β2, and the two predictors X₁ and X₂

2. Take the partial derivatives with respect to the parameters (all three of them! β0, β1, and β2) and set them equal to zero:
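$$\frac{\partial\,\mathrm{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i} \right) = 0$$

$$\frac{\partial\,\mathrm{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_{1i} \left( y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i} \right) = 0$$

$$\frac{\partial\,\mathrm{RSS}}{\partial \beta_2} = -2 \sum_{i=1}^{n} x_{2i} \left( y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i} \right) = 0$$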

Partial derivative of RSS with respect to β0
Partial derivative of RSS with respect to β1
Partial derivative of RSS with respect to β2

Rearranging the terms, we obtain the following system:
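$$n\,\beta_0 + \beta_1 \sum_{i=1}^{n} x_{1i} + \beta_2 \sum_{i=1}^{n} x_{2i} = \sum_{i=1}^{n} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_{1i} + \beta_1 \sum_{i=1}^{n} x_{1i}^2 + \beta_2 \sum_{i=1}^{n} x_{1i} x_{2i} = \sum_{i=1}^{n} x_{1i} y_i$$

$$\beta_0 \sum_{i=1}^{n} x_{2i} + \beta_1 \sum_{i=1}^{n} x_{1i} x_{2i} + \beta_2 \sum_{i=1}^{n} x_{2i}^2 = \sum_{i=1}^{n} x_{2i} y_i$$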

Linear system for Multiple regression formula

3. And write the expression in matrix form:
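$$\begin{pmatrix} n & \sum x_{1i} & \sum x_{2i} \\ \sum x_{1i} & \sum x_{1i}^2 & \sum x_{1i}x_{2i} \\ \sum x_{2i} & \sum x_{1i}x_{2i} & \sum x_{2i}^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_{1i} y_i \\ \sum x_{2i} y_i \end{pmatrix}$$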

The above system in matrix form

And therefore we obtain the final equation for Multiple Linear Regression:
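$$\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} n & \sum x_{1i} & \sum x_{2i} \\ \sum x_{1i} & \sum x_{1i}^2 & \sum x_{1i}x_{2i} \\ \sum x_{2i} & \sum x_{1i}x_{2i} & \sum x_{2i}^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y_i \\ \sum x_{1i} y_i \\ \sum x_{2i} y_i \end{pmatrix}$$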

The final equation for Multiple Linear Regression using 2 predictors

Finally, we can generalize the above system to p predictors:
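If we collect the observations in a design matrix X with n rows and p + 1 columns, whose first column is all ones (for the intercept) and whose remaining columns contain the observed values of X₁, …, Xₚ, the system becomes XᵀXβ = Xᵀy and its solution is

$$\boldsymbol{\beta} = \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{y}$$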

The generalized equation for Multiple Linear Regression using p predictors
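For completeness, here is a small numpy sketch of this general formula. The function name and example data are illustrative, and in practice np.linalg.lstsq or a library such as scikit-learn is usually preferred for numerical stability.

```python
import numpy as np

def multiple_ols(X, y):
    """Ordinary least squares: returns (beta_0, beta_1, ..., beta_p).

    X is an (n, p) array of predictor observations and y an (n,) array of responses.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta_0 acts as the intercept.
    X_design = np.column_stack([np.ones(len(X)), X])
    # Solve the normal equations (X^T X) beta = X^T y.
    return np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Hypothetical example with 2 predictors: y = 1 + 2*x1 - 3*x2 (noiseless).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]
print(multiple_ols(X, y))  # -> approximately [ 1.  2. -3.]
```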

References

[1] An Introduction to Statistical Learning. G. James, D. Witten, T. Hastie and R. Tibshirani


[2] Max/min for functions of two variables. S. Zelik
