# Multiple Linear Regression: Explained, Coded Special Cases

*This article was first published by IBM Developer at developer.ibm.com, but authored by Casper Hansen. Here is the Direct link.*

Linear Regression is famously known for being a simple algorithm and a good baseline to compare more complex models to. In this article, we explore the algorithm and turn the math into code, and then we run the code on a dataset, to get predictions on new data.

# Table Of Contents (Click To Scroll)

- What Is Linear Regression?
- Multiple Linear Regression
- Special Case 1: Simple Linear Regression
- Special Case 2: Polynomial Regression

## What Is Linear Regression?

The Linear Regression model consists of one equation of linearly increasing *variables* (also called *parameters* or *features*), along with a coefficient estimation algorithm called least squares, which attempts to figure out the best possible coefficient given a variable.

Linear regression models are known to be simple and easy to implement, because there is no advanced mathematical knowledge needed, except for a bit of linear algebra. For this reason, many people choose to use a linear regression model as a baseline model, to compare if another model can outperform such a simple model.

## Multiple Linear Regression

Multiple linear regression is a model that can capture the a linear relationship between multiple variables/features – assuming that there is one. The general formula for multiple linear regression looks like the following:

y = beta_0 + beta_1 x_1 + beta_2 x_2 + … + beta_i x_i + varepsilon

$$

- $beta_0$ is known as the intercept
- $beta_1$ to $beta_i$ are known as coefficients
- $x_1$ to $x_i$ are the features of our dataset
- $varepsilon$ are the residual terms

We can also represent the formula for linear regression in vector notation. When representing the formula in vector notation, we have the advantage of using some operations from linear algebra, which in turn makes it easier to code.

mathbf {y} =

{begin{bmatrix}

y_{1}\y_{2}\vdots \y_{n}

end{bmatrix}},quad

{displaystyle X={

begin{bmatrix}

mathbf {x} _{0}^{mathsf {T}}\

mathbf {x} _{1}^{mathsf {T}}\

vdots \

mathbf {x} _{n}^{mathsf {T}}

end{bmatrix}}

,quad }

{displaystyle {boldsymbol {beta }}={

begin{bmatrix}

beta _{0}\

beta _{1}\

beta _{2}\

vdots \

beta _{n}

end{bmatrix}},quad

{boldsymbol {varepsilon }}={

begin{bmatrix}

varepsilon _{1}\

varepsilon _{2}\

vdots \

varepsilon _{n}

end{bmatrix}}}

$$

Note that the $x_0^T$ vector contains just a series of 1’s: `[1, 1, ..., 1]`

. This turns our equation into something much more compact, where all our terms are now represented as matrices. The following equation shows that we can compute the output value for all *y*, given that we have an estimation of the coefficients $boldsymbol{beta }$.

mathbf{y} = mathbf{X} boldsymbol{beta } + boldsymbol{varepsilon}

$$

But how exactly can we estimate the coefficient values?

### Ordinary Least Squares

Linear least squares (LLS) is the main algorithm for estimating coefficients of the one formula just presented. In Machine Learning language, this is known as fitting your model to the dataset. We will focus on the most popular variant called Ordinary Least Squares (OLS).

The OLS algorithm minimizes *the sum of squares of residuals*.

One residual is the distance from one data point to the line in 2D or plane in 3D, so when we minimize the total sum of the squared residuals, we are minimizing the average distance from a data point to the line we are fitting to.

The formula for the algorithm can be quite intimidating, if one is not familiar with linear algebra: permuted matrices, dimensionality, dot product and the inverse of a matrix. This is not a topic for this article, but read Inverting A Matrix and the full derivation of the formula by Arthur S. Goldberger, also called the normal equation, if you want to be more knowledgeable on those topics.

The following formula ensures that the resulting coefficients defines a minimum for the normal equation, which means that the result is the minimized total sum of squared residuals:

{displaystyle {hat {boldsymbol {beta }}}=(mathbf {X} ^{rm {T}}mathbf {X} )^{-1}mathbf {X} ^{rm {T}}mathbf {y}}

$$

The formula can be coded in one line of code, because it’s just a few operations. We will see that later on in the coding section.

The outcome of the algorithm, *beta hat* $boldsymbol{hat{beta}}$, is a vector containing all the coefficients, that can be used to make predictions using the formula presented in the beginning for multiple linear regression. To give an example in 3D, we could have this set of coefficients `[2.1, 5.3, 9.2]`

, which can be plugged into the equation for multiple linear regression:

y = 2.1 + 5.3 times x_1 + 9.2 times x_2

$$

This means the intercept is at 2.1, while there are two variables that have their estimated coefficients as 5.3 and 9.2. To estimate an output variable *y*, one would need to input two variables *x1* and *x2* into the equation, and then we have made a prediction.

We will see examples for this in the coding section, though there is one important thing missing; assuming we have a number of predictions for some observed data, how can we measure how well the model predicted the ground truth?

### R Squared: Accuracy Of The Multiple Linear Regression Model

We can measure the accuracy of how well the multiple linear regression model performs by a metric called *r squared*.

R^2 = 1 – frac

{ sum_{i=0}^{n} (y_i – hat{y})^2 }

{ sum_{i=0}^{n} (y_i – bar{y})^2 }

$$

- $y_i$ is one value of y at index i.
- $hat{y}$ is pronounced as y hat, and it is the predicted values of y
- $bar{y}$ is pronounced as y bar, and it is the average of y

The metric measures the relationship between the residual sum of squares (RSS) and the total sum of squares (TSS). The RSS is computed as the ground truth minus the predicted ground truth, while the TSS is computed as the ground truth minus the average of the ground truth.

### Code For Multiple Linear Regression

We have made a single class called `MultipleLinearRegression`

in a `multiple_linear_regression.py`

file, that represents all of what we have talked about until this point.

```
import numpy as np
import copy
class MultipleLinearRegression():
def __init__(self):
self.coefficients = None
self.intercept = None
def fit(self, x, y):
# prepare x and y values for coefficient estimates
x = self._transform_x(x)
y = self._transform_y(y)
betas = self._estimate_coefficients(x, y)
# intercept becomes a vector of ones
self.intercept = betas[0]
# coefficients becomes the rest of the betas
self.coefficients = betas[1:]
def predict(self, x):
'''
y = b_0 + b_1*x + ... + b_i*x_i
'''
predictions = []
for index, row in x.iterrows():
values = row.values
pred = np.multiply(values, self.coefficients)
pred = sum(pred)
pred += self.intercept
predictions.append(pred)
return predictions
def r2_score(self, y_true, y_pred):
'''
r2 = 1 - (rss/tss)
rss = sum_{i=0}^{n} (y_i - y_hat)^2
tss = sum_{i=0}^{n} (y_i - y_bar)^2
'''
y_values = y_true.values
y_average = np.average(y_values)
residual_sum_of_squares = 0
total_sum_of_squares = 0
for i in range(len(y_values)):
residual_sum_of_squares += (y_values[i] - y_pred[i])**2
total_sum_of_squares += (y_values[i] - y_average)**2
return 1 - (residual_sum_of_squares/total_sum_of_squares)
def _transform_x(self, x):
x = copy.deepcopy(x)
x.insert(0, 'ones', np.ones( (x.shape[0], 1) ))
return x.values
def _transform_y(self, y):
y = copy.deepcopy(y)
return y.values
def _estimate_coefficients(self, x, y):
'''
β = (X^T X)^-1 X^T y
Estimates both the intercept and all coefficients.
'''
xT = x.transpose()
inversed = np.linalg.inv( xT.dot(x) )
coefficients = inversed.dot( xT ).dot(y)
return coefficients
```

We needed a dataset to put our new multiple linear regression algorithm to use, so we loaded in a dataset called Boston Housing Prices from Scikit-Learn. We split the dataset into inputs *x* and outputs *y*, and we further split *x* and *y* into data we train on and data we test on. This is in a `load_dataset.py`

file.

```
from sklearn.datasets import load_boston
import pandas as pd
from sklearn.model_selection import train_test_split
def sklearn_to_df(data_loader):
X_data = data_loader.data
X_columns = data_loader.feature_names
X = pd.DataFrame(X_data, columns=X_columns)
y_data = data_loader.target
y = pd.Series(y_data, name='target')
return x, y
x, y = sklearn_to_df(load_boston())
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.2, random_state=42)
```

The following is the start of the `main.py`

file for running our multiple linear regression.

- Load all dependencies and data
- Initialize the class
`MultipleLinearRegression()`

as an object - Fit the object to our data by
`mlr.fit(x_train, y_train)`

- Make predictions on a test set that is unseen to the algorithm by
`mlr.predict(x_test)`

. - Compute the r squared score by
`mlr.r2_score(y_test, pred)`

.

```
from load_dataset import x_train, x_test, y_train, y_test
from multiple_linear_regression import MultipleLinearRegression
from sklearn.linear_model import LinearRegression
# # # # # # # # # # # # # # # # #
# Our Multiple Linear Regression #
# # # # # # # # # # # # # # # # #
mlr = MultipleLinearRegression()
# fit our LR to our data
mlr.fit(x_train, y_train)
# make predictions and score
pred = mlr.predict(x_test)
# calculate r2_score
score = mlr.r2_score(y_test, pred)
print(f'Our Final R^2 score: {score}')
```

The output score turns out to be:

```
Our Final R^2 score: 0.6687594935355938
```

Although we did just implement multiple linear regression, let’s compare how well our implementation works in comparison to Scikit-Learn’s linear regression:

```
# # # # # # # # # # # # # # # # # #
# Scikit-Learn's Linear Regression #
# # # # # # # # # # # # # # # # # #
sk_mlr = LinearRegression()
# fit scikit-learn's LR to our data
sk_mlr.fit(x_train, y_train)
# predicts and scores
sk_score = sk_mlr.score(x_test, y_test)
print(f'Scikit-Learn's Final R^2 score: {sk_score}')
```

The output from their model to ours is almost identical, except for after the 13th decimal:

```
Scikit-Learn's Final R^2 score: 0.6687594935356329
```

## Special Case 1: Simple Linear Regression

Simple Linear Regression can be expressed in one simple equation

y = text{intercept} + text{coefficient} times x_{value}

$$

The intercept is often known as beta zero $beta_0$ and the coefficient as beta 1 $beta_1$. The equation is equal to the equation for a straight line

y = beta_0 + beta_1 x_1 = mx + b

$$

That is all there is to a simple linear regression equation, though, how do we determine the intercept and coefficient? This is where we introduce the least squares algorithm.

### Ordinary Least Squares

Ordinary Least Squares is known to minimize the *sum of squared residuals (SSR)*. There are two central parts to ordinary least squares in this special case: estimating the coefficients and estimating the intercept.

#### Estimating The Coefficient

We can estimate the coefficient (the slope) by finding the covariance between x and y, and dividing it by the variance of x. It should be noted that this is a special case of linear regression, derived using calculus from the formula as seen in multiple linear regression earlier on in this article.

text{coefficient} = beta_1 =

frac

{cov(x,y)}

{var(x)}

$$

This turns into something a little bit more complex. Any time we have a summation symbol $sum$, the equivalent in programming is summing the elements of an array. We start by adding the first element to the sum where index *i=0*, and we go all the way to the length of the array *n*. The covariance and variance can be expanded to the following formula:

text{coefficient} = beta_1 =

frac

{sum_{i=0}^{n} (x_i – bar{x})(y_i – bar{y})}

{sum_{i=0}^{n}(x_i – bar{x})^2 }

$$

- $x_i$ is one value of
*x*at index*i*. - $y_i$ is one value of
*y*at index*i*. - $bar{x}$ is pronounced as
*x bar*, and it is the average of*x* - $bar{y}$ is pronounced as
*y bar*, and it is the average of y

#### Estimating The Intercept

The estimate of the intercept $beta_0$ should be easier to understand than the estimate of the coefficient $beta_1$.

beta_0 = bar{y} – beta_1 times bar{x}

$$

- $bar{x}$ is pronounced as
*x bar*, and it is the average of*x* - $bar{y}$ is pronounced as
*y bar*, and it is the average of y - $beta_1$ is the coefficient we estimated from earlier

### Code For Simple Linear Regression

We have made a single class called `SimpleLinearRegression`

in a `simple_linear_regression.py`

file, that implements the special case of simple linear regression.

```
import numpy as np
class SimpleLinearRegression():
def __init__(self):
self.coefficient = None
self.intercept = None
def fit(self, x, y):
'''
Given a dataset with 1 input feature x and output feature y,
estimates the coefficient and compute the intercept.
'''
self.coefficient = self._coefficient_estimate(x, y)
self.intercept = self._compute_intercept(x, y)
def predict(self, x):
'''
y = b_0 + b_1*x
'''
x_times_coeff = np.multiply(x, self.coefficient)
return np.add(x_times_coeff, self.intercept)
def r2_score(self, y_true, y_pred):
'''
r2 = 1 - (rss/tss)
rss = sum_{i=0}^{n} (y_i - y_hat)^2
tss = sum_{i=0}^{n} (y_i - y_bar)^2
'''
y_average = np.average(y_true)
residual_sum_of_squares = 0
total_sum_of_squares = 0
for i in range(len(y_true)):
residual_sum_of_squares += (y_true[i] - y_pred[i])**2
total_sum_of_squares += (y_true[i] - y_average)**2
return 1 - (residual_sum_of_squares/total_sum_of_squares)
def _compute_intercept(self, x, y):
'''
intercept = y_bar - coefficient*x_bar
WHERE: y_bar = average(y),
x_bar = average(x),
coefficient = coefficient already estimated
'''
# find the average of the array x
x_average = np.average(x)
# multiply the coefficient and the average of the x values
mul = self.coefficient*x_average
return np.average(y) - mul
def _coefficient_estimate(self, x, y):
'''
coefficient_{x,y} = ∑_{i=0}^{n} (x_i - x_bar) * (y_i - y_bar)
_________________________________________
∑_{i=0}^{n} (x_i - x_bar)^2
'''
numerator = 0
denominator = 0
for i in range(len(x)):
x_value = x[i]
y_value = y[i]
x_average = np.average(x)
y_average = np.average(y)
numerator += (x_value - x_average) * (y_value - y_average)
denominator += (x_value - x_average)**2
return numerator / denominator
```

As shown in the Code For Multiple Linear Regression section, we load in a dataset from Scikit-Learn.

The following is our `main.py`

file that loads the dataset, picks one feature and runs the simple linear regression model. After experimentation, we found that the feature called *LSTAT* performed the best in terms of the r2 score.

```
from load_dataset import x_train, x_test, y_train, y_test
from simple_linear_regression import SimpleLinearRegression
# pick a single feature to estimate y
x_train = x_train['LSTAT'].values
x_test = x_test['LSTAT'].values
y_train = y_train.values
y_test = y_test.values
# fit to data
slr = SimpleLinearRegression()
slr.fit(x_train, y_train)
# make predictions and score
pred = slr.predict(x_test)
score = slr.r2_score(y_test, pred)
print(f'Final R^2 score: {score}')
```

We did conveniently pick the feature that gave us the highest r squared score, but comparatively to the multiple linear regression, our model is not that far off:

```
Final R^2 score: 0.5429180422970383
```

## Special Case 2: Polynomial Regression

Another case of multiple linear regression is polynomial regression, which might look like the following formula:

{displaystyle y_{i},=,beta _{0}+beta _{1}x_{i}+beta _{2}x_{i}^{2}+cdots +beta _{m}x_{i}^{m}+varepsilon _{i} (i=1,2,dots ,n)}

$$

The formula is flexible in the exponents, so that it can be changed to model specific problems better. The code for polynomial regression is the same as for multiple linear regression, except for the predict function. We increase the degree of the exponent by one for each feature and use those as our values instead.

```
def predict_polynomial(self, x):
'''
y = β0 + β1*x + β2*x^2 + ... + βm*x_i^m
'''
predictions = []
for index, row in x.iterrows():
# treating each feature as a variable that needs to be raised to the power of m
polynomial_values = [feature**i+1 for i, feature in enumerate(row.values)]
pred = np.multiply(polynomial_values, self.coefficients)
pred = sum(pred)
pred += self.intercept
predictions.append(pred)
return predictions
```

Though, one of the drawbacks of polynomial regression is that you have to find out which formula might work for you, and that gets quite hard as the number of variables grow, since we don’t really have much intuition for what a 4-, 7- or 100-dimensional space looks like.

Perhaps a naive solution to the problem, is finding the best polynomial by brute force; trying all different permutations that you can think of, e.g. with a varying degree from 1 to 5.

## References

- One very in-depth explanation of the mathematics by Martina Bremer from Cornell University – recommended if you are somewhat fluent in Linear Algebra.
- Thumbnail/GIF credits to assessingpsyche.

Source: https://mlfromscratch.com/linear-regression-from-scratch/