Univariate Linear Regression

Dataset

We'll see how linear regression works, but first we need a dataset. Experiment with the values below to generate one.

Problem Statement

Given a set of $m$ training examples $\left(x^{(i)}, y^{(i)}\right)$ for all $i\in[1,\ldots,m]$, a linear regression model is a supervised learning regression model which expresses $f(x^{(i)})=\hat{y}^{(i)}$ as a linear function of $x^{(i)}$ given parameters $w$ and $b$ as follows:

$$\hat{y}^{(i)}=w\cdot x^{(i)}+b$$

The model is a supervised learning model because the training set contains the "right" or expected output value for the target variable for every $x^{(i)}$, and a regression model because the model outputs a continuous value. Linear regression with one variable is known as univariate linear regression, i.e., $x^{(i)}\in\mathbb{R}$, while multivariate linear regression is one wherein $x^{(i)}\in\mathbb{R}^{n}$ for $n>1$ features.

The goal of the linear regression model, then, is to find optimal parameters $w$ and $b$ such that for every training example $\left(x^{(i)}, y^{(i)}\right)$ for $i\in[1,\ldots,m]$,

$$f(x^{(i)})\approx y^{(i)}$$
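As a concrete illustration, here is a minimal sketch of the model as a function, assuming hypothetical NumPy arrays and example parameter values rather than the interactive dataset above:

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear model: y_hat = w * x + b, applied element-wise."""
    return w * x + b

# Hypothetical inputs and parameters, for illustration only.
x = np.array([1.0, 2.0, 3.0])
print(predict(x, w=2.0, b=0.5))  # [2.5 4.5 6.5]
```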

Cost Function

It helps to have a precise measure of how far off the predicted outputs are from the expected outputs. This can be formalized as a cost function $J$. A useful cost function for linear regression is the mean squared error,

$$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

One of the benefits of the $\frac{1}{m}$ is that without it, the value of the cost function grows as $m$ grows. Averaging therefore gives a cost function that does not change significantly just because the size of the dataset increases. The $2$ in $2m$ simplifies the calculation of the gradient later on, given that the minimum of $f(x)$ is also the minimum of $\frac{1}{2}f(x)$. Given the cost function, the goal is to find the optimal parameters $w^*$ and $b^*$ which minimize $J(w,b)$.
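The cost above translates directly into a short function. A minimal sketch, assuming hypothetical NumPy data where the true relationship is $y=2x+1$:

```python
import numpy as np

def mse_cost(x, y, w, b):
    """J(w, b) = (1 / (2m)) * sum((y_hat - y)^2)."""
    m = x.shape[0]
    residuals = (w * x + b) - y
    return np.sum(residuals ** 2) / (2 * m)

# Hypothetical noiseless data: y = 2x + 1, so the cost at (w=2, b=1) is exactly 0.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(mse_cost(x, y, w=2.0, b=1.0))  # 0.0
print(mse_cost(x, y, w=1.0, b=0.0))  # > 0, any other guess incurs additional loss
```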

Take a look at the contour plot of the loss function below. Observe that the plot looks somewhat convex, with the deepest point lying wherever the true coefficient and intercept were set above. This is precisely the point that minimizes the mean squared error cost function.

TODO: PLOT

The convex nature of the function is more apparent and easier to visualize in the 3-dimensional plot below. As before, the lowest point corresponds to whatever true values you may have set above. Any other guess by the model will incur additional loss.

TODO: PLOT

Why Use Mean Squared Error?

The mean squared error cost function comes from a statistical method known as maximum likelihood estimation. Recall that the underlying assumption for linear regression is that the data is accurately modelled by a linear function. The error terms are assumed to be $\text{i.i.d.}$ draws from a normal distribution with mean $0$ and constant variance $\sigma^2$; when the model uses the true parameters, $\hat{y}^{(i)}=w_\text{true}x^{(i)}+b_\text{true}$, so for all $i\in[1,\ldots,m]$,

$$\begin{aligned} y^{(i)} &= w_\text{true}\,x^{(i)}+b_\text{true}+\epsilon^{(i)} \quad\text{where }\epsilon^{(i)}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\sigma^2)\\ y^{(i)} &= \hat{y}^{(i)}+\epsilon^{(i)}\\ \epsilon^{(i)} &= y^{(i)}-\hat{y}^{(i)} \end{aligned}$$

Thus the probability density function of the error term $\epsilon^{(i)}$ for each sample $i\in[1,\ldots,m]$ is a Gaussian or normal distribution with parameters $\mu=0$ and $\theta_2=\sigma^2$ as follows:

$$f(\epsilon^{(i)};0,\theta_2)=\frac{1}{\sqrt{\theta_2}\sqrt{2\pi}}\exp\left[-\frac{1}{2}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2}\right]$$

The likelihood represents the joint probability density function of observing the data that was gathered. Assuming that all samples are independent, this is simply the product of the individual probability density functions of the $\epsilon^{(i)}$:

$$\begin{aligned} L(\theta_2) &= f(\epsilon^{(1)};0,\theta_2)\cdot f(\epsilon^{(2)};0,\theta_2) \cdot\dots\cdot f(\epsilon^{(m)};0,\theta_2)\\ &= \prod_{i=1}^{m}f\left(\hat{y}^{(i)}-y^{(i)};0,\theta_2\right) \end{aligned}$$

The goal is to find the value of $\theta_2$ which maximizes the likelihood of seeing the data that was gathered given these parameters. In this case, given the probability density function of the normally distributed error terms,

$$\begin{aligned} L(\theta_2) &=\prod_{i=1}^{m}\left(\frac{1}{\sqrt{\theta_2}\sqrt{2\pi}}\exp\left[-\frac{1}{2}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2}\right]\right)\\ &=\theta_2^{-\frac{m}{2}}(2\pi)^{-\frac{m}{2}}\exp\left[\sum_{i=1}^{m}-\frac{1}{2}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2}\right] \end{aligned}$$

Given that $\log(x)$ is a strictly increasing monotonic function, i.e., for every $x_1 < x_2$, $\log(x_1) < \log(x_2)$, the maximizer of $L(\theta_2)$ is also a maximizer of $\log\left(L(\theta_2)\right)$. This makes the derivation more convenient thanks to the properties of $\log$:

$$\begin{aligned} \log\left(L(\theta_2)\right) &=\log\left(\theta_2^{-\frac{m}{2}}(2\pi)^{-\frac{m}{2}}\exp\left[\sum_{i=1}^{m}-\frac{1}{2}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2}\right]\right)\\ &=-\frac{m}{2}\log(\theta_2)-\frac{m}{2}\log(2\pi)-\frac{1}{2}\sum_{i=1}^{m}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2} \end{aligned}$$

Taking the partial derivative w.r.t. $\theta_2$ and setting it to $0$,

$$\begin{aligned} \frac{\partial\log\left(L(\theta_2)\right)}{\partial \theta_2} &=-\frac{m}{2\theta_2}+\sum_{i=1}^{m}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{2\theta_2^2}\\ 0\times 2\theta_2^2&=\left(-\frac{m}{2\theta_2}+\frac{1}{2\theta_2^2}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\right)\times 2\theta_2^2\\ 0&=-\theta_2 m+\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2 \end{aligned}$$

Setting the derivative to $0$ allows us to solve for $\theta_2$, i.e., the maximum likelihood estimator of the variance of the error terms.

$$\begin{aligned} 0 &= -\theta_2 m+\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\\ \theta_2 &= \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2 \end{aligned}$$

The resulting estimator is simply the average of the squared residuals. Likewise, for a fixed $\theta_2$, maximizing the log-likelihood with respect to $w$ and $b$ amounts to minimizing $\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2$, which is exactly the mean squared error cost up to the constant factor $\frac{1}{2m}$.
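As a quick numerical sanity check of this estimator, here is a sketch assuming simulated data with hypothetical values $w_\text{true}=2$, $b_\text{true}=1$ and $\sigma=0.5$ (not the article's interactive dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters and noise level for the simulation.
w_true, b_true, sigma = 2.0, 1.0, 0.5
m = 100_000

x = rng.uniform(0.0, 10.0, size=m)
eps = rng.normal(0.0, sigma, size=m)      # i.i.d. N(0, sigma^2) error terms
y = w_true * x + b_true + eps             # observed targets

y_hat = w_true * x + b_true               # predictions under the true parameters
theta_2_hat = np.mean((y_hat - y) ** 2)   # MLE of the error variance

print(theta_2_hat, sigma ** 2)            # both approximately 0.25
```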

Gradient Descent

Gradient descent is an iterative algorithm that allows a linear regression model (and many more complex deep learning models) to arrive at a minimum of the loss function.

The idea is that given $w$ and $b$, initialized in any manner, the algorithm repeatedly updates these parameters in the direction of steepest descent (the negative of the gradient) towards a minimum of the cost function. This iteration is repeated until convergence.

Note that since the mean squared error cost function is convex, any local minimum is also the global minimum. This is not necessarily true for other cost functions, which may have multiple local minima; the minimum reached can then depend on the initial starting point.

The parameters $w$ and $b$ on the $k$-th iteration are updated (simultaneously) as follows,

$$\begin{aligned} w^{[k+1]} & \leftarrow w^{[k]}-\alpha\frac{\partial J(w^{[k]},b^{[k]})}{\partial w^{[k]}}\\ b^{[k+1]} & \leftarrow b^{[k]}-\alpha\frac{\partial J(w^{[k]},b^{[k]})}{\partial b^{[k]}} \end{aligned}$$

where $\alpha$ is the learning rate, a hyperparameter.
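A minimal sketch of the update rule is shown below. To keep it independent of the gradient formulas derived in the next section, it approximates the gradient numerically with central differences; the data, learning rate, and iteration count are hypothetical choices for illustration:

```python
import numpy as np

def cost(x, y, w, b):
    """Mean squared error cost J(w, b) from the previous section."""
    return np.sum((w * x + b - y) ** 2) / (2 * x.shape[0])

def numeric_grad(x, y, w, b, h=1e-6):
    """Central-difference approximation of (dJ/dw, dJ/db)."""
    dJ_dw = (cost(x, y, w + h, b) - cost(x, y, w - h, b)) / (2 * h)
    dJ_db = (cost(x, y, w, b + h) - cost(x, y, w, b - h)) / (2 * h)
    return dJ_dw, dJ_db

# Hypothetical data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

w, b, alpha = 0.0, 0.0, 0.1    # arbitrary initialization, illustrative learning rate
for _ in range(2000):
    dJ_dw, dJ_db = numeric_grad(x, y, w, b)
    w, b = w - alpha * dJ_dw, b - alpha * dJ_db   # simultaneous update of both parameters
print(round(w, 3), round(b, 3))                   # converges towards w = 2, b = 1
```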

Calculating the Gradient

Given the cost function $J(w,b)$ which has been previously defined as,

$$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2$$

The gradient of the cost function w.r.t. the weight $w$ is

$$\nabla_{w}J(w,b)=\frac{\partial J(w,b)}{\partial w}$$

The components of this gradient (where $w_j=w$ in the univariate case, and $j\in[1,\ldots,n]$ in the multivariate case) can be calculated as,

$$\begin{aligned} \frac{\partial J(w,b)}{\partial w} &=\frac{\partial}{\partial w}\left[\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\right]\\[1em] &=\frac{1}{2m}\sum_{i=1}^{m}\left[\frac{\partial}{\partial w}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\right]\\[1em] &=\frac{1}{2m}\sum_{i=1}^{m}\left[2\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot\frac{\partial}{\partial w}\left(w\cdot x^{(i)}+b-y^{(i)}\right)\right]\\[1em] &=\frac{1}{2m}\sum_{i=1}^{m}\left[2\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot x^{(i)}\right]\\[1em] &=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot x^{(i)} \end{aligned}$$

Likewise, $\frac{\partial J(w,b)}{\partial b}$ can be calculated as

$$\frac{\partial J(w,b)}{\partial b}=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)$$
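These two expressions translate directly into code. A sketch with the same hypothetical data as before; the analytic values should agree with the finite-difference estimates used in the earlier gradient descent sketch:

```python
import numpy as np

def gradients(x, y, w, b):
    """Analytic gradients of the MSE cost: dJ/dw and dJ/db."""
    m = x.shape[0]
    err = (w * x + b) - y         # y_hat - y for every training example
    dJ_dw = np.sum(err * x) / m   # (1/m) * sum((y_hat - y) * x)
    dJ_db = np.sum(err) / m       # (1/m) * sum(y_hat - y)
    return dJ_dw, dJ_db

# Hypothetical data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(gradients(x, y, w=0.0, b=0.0))  # (-8.5, -4.0), matching the numeric approximation
```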

Training the Model

Experiment with the values below to modify the parameters of the linear regression model. Click on the 'Train Model' button to start training the linear model.

Analytical Solution

Given that the mean squared error cost function is a convex function, the minimum can be found directly by writing the partial derivatives as a system of two equations in two unknowns and finding the solution of

$$\nabla J(w,b)=\begin{bmatrix}0 \\ 0\end{bmatrix}$$

which, after dropping the constant factor $\frac{1}{m}$, gives

$$\begin{aligned} \sum_{i=1}^{m}\left[(w\cdot x^{(i)}+b-y^{(i)})\cdot x^{(i)}\right]&=0\\ \sum_{i=1}^{m}(w\cdot x^{(i)}+b-y^{(i)}) &=0 \end{aligned}$$

Distributing the summation and moving the constant terms to the right-hand side,

$$\begin{alignat*}{3} &(1)\quad w\sum_{i=1}^{m}(x^{(i)})^2&&+b\sum_{i=1}^{m}x^{(i)}&&=\sum_{i=1}^{m}y^{(i)}x^{(i)}\\ &(2)\quad w\sum_{i=1}^{m}x^{(i)}&&+bm&&=\sum_{i=1}^{m}y^{(i)} \end{alignat*}$$
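Before eliminating variables by hand, note that equations $(1)$ and $(2)$ form a $2\times 2$ linear system that can be solved directly. A sketch assuming hypothetical noisy data around $y=2x+1$; the steps below then derive the explicit formulas:

```python
import numpy as np

# Hypothetical noisy data around y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)
m = x.shape[0]

# Normal equations (1) and (2) as a 2x2 linear system A @ [w, b] = c.
A = np.array([[np.sum(x * x), np.sum(x)],
              [np.sum(x),     m        ]])
c = np.array([np.sum(y * x), np.sum(y)])

w, b = np.linalg.solve(A, c)
print(w, b)   # close to the hypothetical true values 2 and 1
```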

Solving for $w$

To eliminate the $b$ term, multiply equation $(2)$ by $-\frac{\sum_{i=1}^{m}x^{(i)}}{m}$:

$$\begin{aligned} w\sum_{i=1}^{m}(x^{(i)})^2+b\sum_{i=1}^{m}x^{(i)} &=\sum_{i=1}^{m}y^{(i)}x^{(i)}\\ -\frac{\sum_{i=1}^{m}x^{(i)}}{m}\left(w\sum_{i=1}^{m}x^{(i)}+bm\right) &=-\frac{\sum_{i=1}^{m}x^{(i)}}{m}\sum_{i=1}^{m}y^{(i)} \end{aligned}$$

$$\begin{aligned} w\sum_{i=1}^{m}(x^{(i)})^2+b\sum_{i=1}^{m}x^{(i)} &=\sum_{i=1}^{m}y^{(i)}x^{(i)}\\ -w\frac{\left(\sum_{i=1}^{m}x^{(i)}\right)^2}{m}-b\sum_{i=1}^{m}x^{(i)}&=-\frac{\sum_{i=1}^{m}y^{(i)}\sum_{i=1}^{m}x^{(i)}}{m} \end{aligned}$$

Adding the two equations and isolating $w$,

$$w\left[\frac{m\sum_{i=1}^{m}(x^{(i)})^2-\left(\sum_{i=1}^{m}x^{(i)}\right)^2}{m}\right]=\frac{m\sum_{i=1}^{m}y^{(i)}x^{(i)}-\sum_{i=1}^{m}y^{(i)}\sum_{i=1}^{m}x^{(i)}}{m}$$

$$w=\frac{m\sum_{i}y^{(i)}x^{(i)}-\sum_{i}y^{(i)}\sum_{i}x^{(i)}}{m\sum_{i}(x^{(i)})^2-\left(\sum_{i}x^{(i)}\right)^2}$$

Solving for $b$

Solve for $w$ from equation $(2)$ and substitute into equation $(1)$:

$$w=\frac{\sum_{i}y^{(i)}-bm}{\sum_{i}x^{(i)}}$$

Solving for $b$,

$$\begin{aligned} \left(\frac{\sum_{i}y^{(i)}-bm}{\sum_{i}x^{(i)}}\right)\sum_{i}(x^{(i)})^2+b\sum_{i}x^{(i)}&=\sum_{i}y^{(i)}x^{(i)}\\[1em] \frac{\sum_{i}y^{(i)}\sum_{i}(x^{(i)})^2-mb\sum_{i}(x^{(i)})^2+b\left(\sum_{i}x^{(i)}\right)^2}{\sum_{i}x^{(i)}}&=\sum_{i}y^{(i)}x^{(i)}\\[1em] \sum_{i}y^{(i)}\sum_{i}(x^{(i)})^2-mb\sum_{i}(x^{(i)})^2+b\left(\sum_{i}x^{(i)}\right)^2&=\sum_{i}y^{(i)}x^{(i)}\cdot\sum_{i}x^{(i)}\\[1em] -mb\sum_{i}(x^{(i)})^2+b\left(\sum_{i}x^{(i)}\right)^2&=\sum_{i}y^{(i)}x^{(i)}\cdot\sum_{i}x^{(i)}-\sum_{i}y^{(i)}\cdot\sum_{i}(x^{(i)})^2\\[1em] mb\sum_{i}(x^{(i)})^2-b\left(\sum_{i}x^{(i)}\right)^2&=\sum_{i}y^{(i)}\cdot\sum_{i}(x^{(i)})^2-\sum_{i}y^{(i)}x^{(i)}\cdot\sum_{i}x^{(i)} \end{aligned}$$

$$b=\frac{\sum_{i}y^{(i)}\cdot\sum_{i}(x^{(i)})^2-\sum_{i}y^{(i)}x^{(i)}\cdot\sum_{i}x^{(i)}}{m\sum_{i}(x^{(i)})^2-\left(\sum_{i}x^{(i)}\right)^2}$$
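These closed-form expressions for $w$ and $b$ can be computed directly. A minimal sketch, assuming the same kind of hypothetical noisy data as above; `np.polyfit` with `deg=1` solves the same least-squares problem and serves as a cross-check:

```python
import numpy as np

def closed_form(x, y):
    """Closed-form least-squares solution for univariate linear regression."""
    m = x.shape[0]
    sx, sy = np.sum(x), np.sum(y)
    sxx, sxy = np.sum(x * x), np.sum(x * y)
    denom = m * sxx - sx ** 2
    w = (m * sxy - sx * sy) / denom
    b = (sy * sxx - sxy * sx) / denom
    return w, b

# Hypothetical noisy data around y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)

print(closed_form(x, y))         # close to (2, 1)
print(np.polyfit(x, y, deg=1))   # [slope, intercept], should agree
```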