Univariate Linear Regression

Dataset

We'll see how linear regression works, but first we need a dataset. Experiment with the values below to generate one.

Problem Statement

Given a set of $m$ training examples $\left(x^{(i)}, y^{(i)}\right)$ for all $i\in[1,\ldots,m]$, a linear regression model is a supervised learning regression model which expresses $f(x^{(i)})=\hat{y}^{(i)}$ as a linear function of $x^{(i)}$ given parameters $w$ and $b$ as follows:

$$\hat{y}^{(i)}=w\cdot x^{(i)}+b$$

The model is a supervised learning model because the training set contains the "right" or expected output value for the target variable for every $x^{(i)}$, and a regression model because the model outputs a continuous value. Linear regression with one variable is known as univariate linear regression, i.e., $x^{(i)}\in\mathbb{R}$, while multivariate linear regression is one wherein $x^{(i)}\in\mathbb{R}^{n}$ for $n>1$ features.

The goal of the linear regression model, then, is to find optimal parameters $w$ and $b$ such that for every training example $\left(x^{(i)}, y^{(i)}\right)$ for $i\in[1,\ldots,m]$,

$$f(x^{(i)})\approx y^{(i)}$$
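As a concrete illustration, here is a minimal sketch of the model as a function, assuming hypothetical NumPy arrays and example parameter values rather than the interactive dataset above:

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear model: y_hat = w * x + b, applied element-wise."""
    return w * x + b

# Hypothetical inputs and parameters, for illustration only.
x = np.array([1.0, 2.0, 3.0])
print(predict(x, w=2.0, b=0.5))  # [2.5 4.5 6.5]
```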

Cost Function

It helps to have a precise measure of how far off the predicted outputs are from the expected outputs. This can be formalized as a cost function $J$. A useful cost function for linear regression is the mean squared error,

$$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

One of the benefits of the $\frac{1}{m}$ is that without it, the value of the cost function grows as $m$ grows. Averaging therefore gives a cost function that does not change significantly just because the size of the dataset increases. The $2$ in $2m$ simplifies the calculation of the gradient later on, given that the minimum of $f(x)$ is also the minimum of $\frac{1}{2}f(x)$. Given the cost function, the goal is to find the optimal parameters $w^*$ and $b^*$ which minimize $J(w,b)$.
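The cost above translates directly into a short function. A minimal sketch, assuming hypothetical NumPy data where the true relationship is $y=2x+1$:

```python
import numpy as np

def mse_cost(x, y, w, b):
    """J(w, b) = (1 / (2m)) * sum((y_hat - y)^2)."""
    m = x.shape[0]
    residuals = (w * x + b) - y
    return np.sum(residuals ** 2) / (2 * m)

# Hypothetical noiseless data: y = 2x + 1, so the cost at (w=2, b=1) is exactly 0.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(mse_cost(x, y, w=2.0, b=1.0))  # 0.0
print(mse_cost(x, y, w=1.0, b=0.0))  # > 0, any other guess incurs additional loss
```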

Take a look at the contour plot of the loss function below. Observe that the plot looks somewhat convex, with the deepest point lying wherever the true coefficient and intercept were set above. This is precisely the point that minimizes the mean squared error cost function.

TODO: PLOT

The convex nature of the function is more apparent and easier to visualize in the 3-dimensional plot below. As before, the lowest point corresponds to whatever true values you may have set above. Any other guess by the model will incur additional loss.

TODO: PLOT

Why Use Mean Squared Error?

The mean squared error cost function comes from a statistical method known as maximum likelihood estimation. Recall that the underlying assumption for linear regression is that the data is accurately modelled by a linear function. The error terms are assumed to be $\text{i.i.d.}$ draws from a normal distribution with mean $0$ and constant variance $\sigma^2$; when the model uses the true parameters, $\hat{y}^{(i)}=w_\text{true}x^{(i)}+b_\text{true}$, so for all $i\in[1,\ldots,m]$,

$$\begin{aligned} y^{(i)} &= w_\text{true}\,x^{(i)}+b_\text{true}+\epsilon^{(i)} \quad\text{where }\epsilon^{(i)}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,\sigma^2)\\ y^{(i)} &= \hat{y}^{(i)}+\epsilon^{(i)}\\ \epsilon^{(i)} &= y^{(i)}-\hat{y}^{(i)} \end{aligned}$$

Thus the probability density function of the error term $\epsilon^{(i)}$ for each sample $i\in[1,\ldots,m]$ is a Gaussian or normal distribution with parameters $\mu=0$ and $\theta_2=\sigma^2$ as follows:

$$f(\epsilon^{(i)};0,\theta_2)=\frac{1}{\sqrt{\theta_2}\sqrt{2\pi}}\exp\left[-\frac{1}{2}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2}\right]$$

The likelihood represents the joint probability density function of observing the data that was gathered. Assuming that all samples are independent, this is simply the product of the individual probability density functions of the $\epsilon^{(i)}$:

$$\begin{aligned} L(\theta_2) &= f(\epsilon^{(1)};0,\theta_2)\cdot f(\epsilon^{(2)};0,\theta_2) \cdot\dots\cdot f(\epsilon^{(m)};0,\theta_2)\\ &= \prod_{i=1}^{m}f\left(\hat{y}^{(i)}-y^{(i)};0,\theta_2\right) \end{aligned}$$

The goal is to find the value of $\theta_2$ which maximizes the likelihood of seeing the data that was gathered given these parameters. In this case, given the probability density function of the normally distributed error terms,

$$\begin{aligned} L(\theta_2) &=\prod_{i=1}^{m}\left(\frac{1}{\sqrt{\theta_2}\sqrt{2\pi}}\exp\left[-\frac{1}{2}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2}\right]\right)\\ &=\theta_2^{-\frac{m}{2}}(2\pi)^{-\frac{m}{2}}\exp\left[\sum_{i=1}^{m}-\frac{1}{2}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2}\right] \end{aligned}$$

Given that $\log(x)$ is a strictly increasing monotonic function, i.e., for every $x_1 < x_2$, $\log(x_1) < \log(x_2)$, the maximizer of $L(\theta_2)$ is also a maximizer of $\log\left(L(\theta_2)\right)$. This makes the derivation more convenient thanks to the properties of $\log$:

$$\begin{aligned} \log\left(L(\theta_2)\right) &=\log\left(\theta_2^{-\frac{m}{2}}(2\pi)^{-\frac{m}{2}}\exp\left[\sum_{i=1}^{m}-\frac{1}{2}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2}\right]\right)\\ &=-\frac{m}{2}\log(\theta_2)-\frac{m}{2}\log(2\pi)-\frac{1}{2}\sum_{i=1}^{m}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{\theta_2} \end{aligned}$$

Taking the partial derivative w.r.t. $\theta_2$ and setting it to $0$,

$$\begin{aligned} \frac{\partial\log\left(L(\theta_2)\right)}{\partial \theta_2} &=-\frac{m}{2\theta_2}+\sum_{i=1}^{m}\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^2}{2\theta_2^2}\\ 0\times 2\theta_2^2&=\left(-\frac{m}{2\theta_2}+\frac{1}{2\theta_2^2}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\right)\times 2\theta_2^2\\ 0&=-\theta_2 m+\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2 \end{aligned}$$

Setting the derivative to $0$ allows us to solve for $\theta_2$, i.e., the maximum likelihood estimator of the variance of the error terms.

$$\begin{aligned} 0 &= -\theta_2 m+\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\\ \theta_2 &= \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2 \end{aligned}$$

The resulting estimator is simply the average of the squared residuals. Likewise, for a fixed $\theta_2$, maximizing the log-likelihood with respect to $w$ and $b$ amounts to minimizing $\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2$, which is exactly the mean squared error cost up to the constant factor $\frac{1}{2m}$.
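As a quick numerical sanity check of this estimator, here is a sketch assuming simulated data with hypothetical values $w_\text{true}=2$, $b_\text{true}=1$ and $\sigma=0.5$ (not the article's interactive dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters and noise level for the simulation.
w_true, b_true, sigma = 2.0, 1.0, 0.5
m = 100_000

x = rng.uniform(0.0, 10.0, size=m)
eps = rng.normal(0.0, sigma, size=m)      # i.i.d. N(0, sigma^2) error terms
y = w_true * x + b_true + eps             # observed targets

y_hat = w_true * x + b_true               # predictions under the true parameters
theta_2_hat = np.mean((y_hat - y) ** 2)   # MLE of the error variance

print(theta_2_hat, sigma ** 2)            # both approximately 0.25
```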

Gradient Descent

Gradient descent is an iterative algorithm that allows a linear regression model (and many more complex deep learning models) to arrive at a minimum of the loss function.

The idea is that given $w$ and $b$, initialized in any manner, the algorithm repeatedly updates these parameters in the direction of steepest descent (the negative of the gradient) towards a minimum of the cost function. This iteration is repeated until convergence.

Note that since the mean squared error cost function is convex, any local minimum is also the global minimum. This is not necessarily true for other cost functions, which may have multiple local minima; the minimum reached can then depend on the initial starting point.

The parameters $w$ and $b$ on the $k$-th iteration are updated (simultaneously) as follows,

$$\begin{aligned} w^{[k+1]} & \leftarrow w^{[k]}-\alpha\frac{\partial J(w^{[k]},b^{[k]})}{\partial w^{[k]}}\\ b^{[k+1]} & \leftarrow b^{[k]}-\alpha\frac{\partial J(w^{[k]},b^{[k]})}{\partial b^{[k]}} \end{aligned}$$

where $\alpha$ is the learning rate, a hyperparameter.
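A minimal sketch of the update rule is shown below. To keep it independent of the gradient formulas derived in the next section, it approximates the gradient numerically with central differences; the data, learning rate, and iteration count are hypothetical choices for illustration:

```python
import numpy as np

def cost(x, y, w, b):
    """Mean squared error cost J(w, b) from the previous section."""
    return np.sum((w * x + b - y) ** 2) / (2 * x.shape[0])

def numeric_grad(x, y, w, b, h=1e-6):
    """Central-difference approximation of (dJ/dw, dJ/db)."""
    dJ_dw = (cost(x, y, w + h, b) - cost(x, y, w - h, b)) / (2 * h)
    dJ_db = (cost(x, y, w, b + h) - cost(x, y, w, b - h)) / (2 * h)
    return dJ_dw, dJ_db

# Hypothetical data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

w, b, alpha = 0.0, 0.0, 0.1    # arbitrary initialization, illustrative learning rate
for _ in range(2000):
    dJ_dw, dJ_db = numeric_grad(x, y, w, b)
    w, b = w - alpha * dJ_dw, b - alpha * dJ_db   # simultaneous update of both parameters
print(round(w, 3), round(b, 3))                   # converges towards w = 2, b = 1
```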

Calculating the Gradient

Given the cost function $J(w,b)$ which has been previously defined as,

$$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2$$

The gradient of the cost function w.r.t. the weight $w$ is

$$\nabla_{w}J(w,b)=\frac{\partial J(w,b)}{\partial w}$$

The components of this gradient (where $w_j=w$ in the univariate case, and $j\in[1,\ldots,n]$ in the multivariate case) can be calculated as,

$$\begin{aligned} \frac{\partial J(w,b)}{\partial w} &=\frac{\partial}{\partial w}\left[\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\right]\\[1em] &=\frac{1}{2m}\sum_{i=1}^{m}\left[\frac{\partial}{\partial w}\left(\hat{y}^{(i)}-y^{(i)}\right)^2\right]\\[1em] &=\frac{1}{2m}\sum_{i=1}^{m}\left[2\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot\frac{\partial}{\partial w}\left(w\cdot x^{(i)}+b-y^{(i)}\right)\right]\\[1em] &=\frac{1}{2m}\sum_{i=1}^{m}\left[2\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot x^{(i)}\right]\\[1em] &=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)\cdot x^{(i)} \end{aligned}$$

Likewise, $\frac{\partial J(w,b)}{\partial b}$ can be calculated as

$$\frac{\partial J(w,b)}{\partial b}=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)$$
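These two expressions translate directly into code. A sketch with the same hypothetical data as before; the analytic values should agree with the finite-difference estimates used in the earlier gradient descent sketch:

```python
import numpy as np

def gradients(x, y, w, b):
    """Analytic gradients of the MSE cost: dJ/dw and dJ/db."""
    m = x.shape[0]
    err = (w * x + b) - y         # y_hat - y for every training example
    dJ_dw = np.sum(err * x) / m   # (1/m) * sum((y_hat - y) * x)
    dJ_db = np.sum(err) / m       # (1/m) * sum(y_hat - y)
    return dJ_dw, dJ_db

# Hypothetical data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(gradients(x, y, w=0.0, b=0.0))  # (-8.5, -4.0), matching the numeric approximation
```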

Training the Model

Experiment with the values below to modify the parameters of the linear regression model. Click on the 'Train Model' button to start training the linear model.

Analytical Solution

Given that the mean squared error cost function is a convex function, the minimum can be found directly by writing the partial derivatives as a system of two equations in two unknowns and finding the solution of

$$\nabla J(w,b)=\begin{bmatrix}0 \\ 0\end{bmatrix}$$

which, after dropping the constant factor $\frac{1}{m}$, gives

$$\begin{aligned} \sum_{i=1}^{m}\left[(w\cdot x^{(i)}+b-y^{(i)})\cdot x^{(i)}\right]&=0\\ \sum_{i=1}^{m}(w\cdot x^{(i)}+b-y^{(i)}) &=0 \end{aligned}$$

Distributing the summation and moving the constant terms to the right-hand side,

$$\begin{alignat*}{3} &(1)\quad w\sum_{i=1}^{m}(x^{(i)})^2&&+b\sum_{i=1}^{m}x^{(i)}&&=\sum_{i=1}^{m}y^{(i)}x^{(i)}\\ &(2)\quad w\sum_{i=1}^{m}x^{(i)}&&+bm&&=\sum_{i=1}^{m}y^{(i)} \end{alignat*}$$
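Before eliminating variables by hand, note that equations $(1)$ and $(2)$ form a $2\times 2$ linear system that can be solved directly. A sketch assuming hypothetical noisy data around $y=2x+1$; the steps below then derive the explicit formulas:

```python
import numpy as np

# Hypothetical noisy data around y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)
m = x.shape[0]

# Normal equations (1) and (2) as a 2x2 linear system A @ [w, b] = c.
A = np.array([[np.sum(x * x), np.sum(x)],
              [np.sum(x),     m        ]])
c = np.array([np.sum(y * x), np.sum(y)])

w, b = np.linalg.solve(A, c)
print(w, b)   # close to the hypothetical true values 2 and 1
```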

Solving for $w$

To eliminate the $b$ term, multiply equation $(2)$ by $-\frac{\sum_{i=1}^{m}x^{(i)}}{m}$:

$$\begin{aligned} w\sum_{i=1}^{m}(x^{(i)})^2+b\sum_{i=1}^{m}x^{(i)} &=\sum_{i=1}^{m}y^{(i)}x^{(i)}\\ -\frac{\sum_{i=1}^{m}x^{(i)}}{m}\left(w\sum_{i=1}^{m}x^{(i)}+bm\right) &=-\frac{\sum_{i=1}^{m}x^{(i)}}{m}\sum_{i=1}^{m}y^{(i)} \end{aligned}$$

$$\begin{aligned} w\sum_{i=1}^{m}(x^{(i)})^2+b\sum_{i=1}^{m}x^{(i)} &=\sum_{i=1}^{m}y^{(i)}x^{(i)}\\ -w\frac{\left(\sum_{i=1}^{m}x^{(i)}\right)^2}{m}-b\sum_{i=1}^{m}x^{(i)}&=-\frac{\sum_{i=1}^{m}y^{(i)}\sum_{i=1}^{m}x^{(i)}}{m} \end{aligned}$$

Adding the two equations and isolating $w$,

$$w\left[\frac{m\sum_{i=1}^{m}(x^{(i)})^2-\left(\sum_{i=1}^{m}x^{(i)}\right)^2}{m}\right]=\frac{m\sum_{i=1}^{m}y^{(i)}x^{(i)}-\sum_{i=1}^{m}y^{(i)}\sum_{i=1}^{m}x^{(i)}}{m}$$

$$w=\frac{m\sum_{i}y^{(i)}x^{(i)}-\sum_{i}y^{(i)}\sum_{i}x^{(i)}}{m\sum_{i}(x^{(i)})^2-\left(\sum_{i}x^{(i)}\right)^2}$$

Solving for $b$

Solve for $w$ from equation $(2)$ and substitute into equation $(1)$:

$$w=\frac{\sum_{i}y^{(i)}-bm}{\sum_{i}x^{(i)}}$$

Solving for $b$,

$$\begin{aligned} \left(\frac{\sum_{i}y^{(i)}-bm}{\sum_{i}x^{(i)}}\right)\sum_{i}(x^{(i)})^2+b\sum_{i}x^{(i)}&=\sum_{i}y^{(i)}x^{(i)}\\[1em] \frac{\sum_{i}y^{(i)}\sum_{i}(x^{(i)})^2-mb\sum_{i}(x^{(i)})^2+b\left(\sum_{i}x^{(i)}\right)^2}{\sum_{i}x^{(i)}}&=\sum_{i}y^{(i)}x^{(i)}\\[1em] \sum_{i}y^{(i)}\sum_{i}(x^{(i)})^2-mb\sum_{i}(x^{(i)})^2+b\left(\sum_{i}x^{(i)}\right)^2&=\sum_{i}y^{(i)}x^{(i)}\cdot\sum_{i}x^{(i)}\\[1em] -mb\sum_{i}(x^{(i)})^2+b\left(\sum_{i}x^{(i)}\right)^2&=\sum_{i}y^{(i)}x^{(i)}\cdot\sum_{i}x^{(i)}-\sum_{i}y^{(i)}\cdot\sum_{i}(x^{(i)})^2\\[1em] mb\sum_{i}(x^{(i)})^2-b\left(\sum_{i}x^{(i)}\right)^2&=\sum_{i}y^{(i)}\cdot\sum_{i}(x^{(i)})^2-\sum_{i}y^{(i)}x^{(i)}\cdot\sum_{i}x^{(i)} \end{aligned}$$

$$b=\frac{\sum_{i}y^{(i)}\cdot\sum_{i}(x^{(i)})^2-\sum_{i}y^{(i)}x^{(i)}\cdot\sum_{i}x^{(i)}}{m\sum_{i}(x^{(i)})^2-\left(\sum_{i}x^{(i)}\right)^2}$$
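These closed-form expressions for $w$ and $b$ can be computed directly. A minimal sketch, assuming the same kind of hypothetical noisy data as above; `np.polyfit` with `deg=1` solves the same least-squares problem and serves as a cross-check:

```python
import numpy as np

def closed_form(x, y):
    """Closed-form least-squares solution for univariate linear regression."""
    m = x.shape[0]
    sx, sy = np.sum(x), np.sum(y)
    sxx, sxy = np.sum(x * x), np.sum(x * y)
    denom = m * sxx - sx ** 2
    w = (m * sxy - sx * sy) / denom
    b = (sy * sxx - sxy * sx) / denom
    return w, b

# Hypothetical noisy data around y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)

print(closed_form(x, y))         # close to (2, 1)
print(np.polyfit(x, y, deg=1))   # [slope, intercept], should agree
```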