Introduction to Numerical Methods/Regression
Objectives
- define linear and non-linear regression models.
- reason about goodness of fit criteria.
- derive coefficients of linear regression models.
- derive coefficients of non-linear regression models.
Regression differs from interpolation in that it allows us to approximate an overdetermined system, one that has more equations than unknowns. This is useful when the exact solution is too expensive to obtain or unnecessary because of errors in the data, such as measurement errors or random noise.
Linear Regression
Linear regression finds a linear function that most nearly passes through the given data points, i.e. the regression line that best fits the data. We must first define a metric for measuring the goodness of fit. If all data points lie on the function the fit is perfect; otherwise the function representation deviates from the data, and we can measure those deviations (errors) at the data points. As the following example shows, neither the sum of the errors nor the sum of the absolute errors is a good metric. The data consist of four points: (2, 4), (3, 6), (2, 6), and (3, 8). We use a straight line to fit the data; two possible solutions are shown in the iPython notebook (Example 1).
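To see why, consider this sketch (assumed Python, not necessarily the notebook's Example 1 code): it evaluates two quite different candidate lines against the four points above.

```python
# Example data points.
xs = [2, 3, 2, 3]
ys = [4, 6, 6, 8]

# Return the sum of errors, sum of absolute errors, and sum of
# squared errors for the line y = a0 + a1*x.
def metrics(a0, a1):
    errors = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
    return sum(errors), sum(abs(e) for e in errors), sum(e * e for e in errors)

# Candidate 1: the line y = 1 + 2x; candidate 2: the horizontal line y = 6.
for a0, a1 in [(1, 2), (6, 0)]:
    total, total_abs, total_sq = metrics(a0, a1)
    print(f"y = {a0} + {a1}x: sum = {total}, sum |e| = {total_abs}, sum e^2 = {total_sq}")
```

Both lines yield a sum of errors of 0 and a sum of absolute errors of 4, so neither metric can rank them; the sum of squared errors (4 versus 8) does distinguish the better fit, which is why it is the standard criterion.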
Straight Line (one variable)
Let's look at the example of fitting a straight line to data, i.e. finding a linear regression model with one variable that represents the data. The function is

- $f(x) = a_0 + a_1 x$

and the (cost) sum of squares of errors function to be minimized is

- $S(a_0, a_1) = \sum_{i=1}^{n} \left( y_i - (a_0 + a_1 x_i) \right)^2$.
We can minimize the function by setting the gradient to zero. Because the function has two parameters (variables) there are two gradient equations:

- $\dfrac{\partial S}{\partial a_0} = -2 \sum_{i=1}^{n} \left( y_i - (a_0 + a_1 x_i) \right) = 0$,
- $\dfrac{\partial S}{\partial a_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - (a_0 + a_1 x_i) \right) = 0$.
The partial derivative represents the rate of change of the function value with respect to one independent variable while all other variables are held fixed. When a partial derivative becomes zero the change has stopped, which implies that we have reached a minimum or a maximum; which one can be determined by checking the sign of the second derivative.
Let's consider the two gradient equations rewritten as the pair of normal equations

- $n a_0 + a_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$

and

- $a_0 \sum_{i=1}^{n} x_i + a_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$,

from which we can derive

- $a_1 = \dfrac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$.

Given the following definitions of the sample means:

- $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, $\quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$,

we get

- $a_0 = \bar{y} - a_1 \bar{x}$.
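As a sanity check, here is a short sketch of these formulas in Python (variable names are illustrative), applied to the four example points used throughout this section:

```python
# Example data points.
xs = [2, 2, 3, 3]
ys = [4, 6, 6, 8]
n = len(xs)

# The sums that appear in the derived formulas.
sum_x = sum(xs)
sum_y = sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Slope from the closed-form expression, intercept from the sample means.
a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a0 = sum_y / n - a1 * sum_x / n
print(a0, a1)  # 1.0 2.0, i.e. the best-fit line is y = 1 + 2x
```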
Multi-linear Regression
A linear regression line models the relationship between independent variables (predictors) and a response variable (the dependent variable). Once the model is built it can be used for prediction. Almost all real-world regression models involve multiple predictors to capture the multiple factors that affect the response of the system; such models are known as multi-linear models.
Normal Equation
Another way to find the parameters of a regression model that minimize the sum of squares of errors is to solve the corresponding normal equation. Given a matrix equation $A x = b$, the normal equation is that which minimizes the sum of the squared differences between the left and right sides:

- $A^{\mathsf T} A x = A^{\mathsf T} b$.
Given a hypothesized regression model and a dataset we can construct the left side to express the values our model would predict using the data. $A$ should be a matrix in which each row represents one of the data points (samples) and each column of the row holds the multiplier of one of the parameters. $x$ is a column vector of the unknown parameters. The right side $b$ should be a column vector that stores the corresponding observed values for the data points.
Recall the example that fits a straight line (the model $y = a_0 + a_1 x$) to the following data points.
| x | y |
|---|---|
| 2 | 4 |
| 2 | 6 |
| 3 | 6 |
| 3 | 8 |
We can construct the left side of the matrix equation as:

- $A x = \begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 3 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}$,

which should result in a column vector of the values our model would predict:

- $\begin{bmatrix} a_0 + 2 a_1 \\ a_0 + 2 a_1 \\ a_0 + 3 a_1 \\ a_0 + 3 a_1 \end{bmatrix}$.

The right side should be a vector of the corresponding values from the data points:

- $b = \begin{bmatrix} 4 \\ 6 \\ 6 \\ 8 \end{bmatrix}$.
Minimizing the sum of squared differences between the left and the right sides is equivalent to solving the following normal equation:

- $A^{\mathsf T} A x = A^{\mathsf T} b$.
The normal equation for our example is:

- $\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 2 & 3 & 3 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 3 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 2 & 3 & 3 \end{bmatrix} \begin{bmatrix} 4 \\ 6 \\ 6 \\ 8 \end{bmatrix}$,

which simplifies to

- $\begin{bmatrix} 4 & 10 \\ 10 & 26 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} 24 \\ 62 \end{bmatrix}$.
If $A^{\mathsf T} A$ is invertible, the solution vector is unique and gives us the values of the parameters $a_0$ and $a_1$ that minimize the difference between the model and the data, i.e. the model that best fits the data. For this example the solution is $a_0 = 1$, $a_1 = 2$, i.e. the line $y = 1 + 2x$.
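Under that invertibility assumption, the normal equation can be solved directly; a minimal NumPy sketch for this example:

```python
import numpy as np

# Design matrix A (one row per data point, columns multiply a0 and a1)
# and observation vector b for the four example points.
A = np.array([[1, 2], [1, 2], [1, 3], [1, 3]], dtype=float)
b = np.array([4, 6, 6, 8], dtype=float)

# Solve (A^T A) x = (A^T b); assumes A^T A is invertible.
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # [1. 2.] -> a0 = 1, a1 = 2, the line y = 1 + 2x
```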
When we determine a regression line of the form $y = a_0 + a_1 x$ that fits four data points, we have four equations that can be written as $A x = b$, or

- $\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}$.

To solve the model we are effectively minimizing $\lVert b - A x \rVert^2$.
The method introduced in the previous section boils down to solving a system of linear equations with two unknowns in the following form:

- $\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$.
It is not hard to imagine that with $m$ independent variables and $n$ data points we can derive a similar system of linear equations with $m + 1$ unknowns. Numerical methods, such as Gaussian elimination, can then be used to solve for the parameters. We could also use the normal equation and matrix operations to solve for the parameters, as sketched below.
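The sketch below illustrates the multi-predictor case with two hypothetical independent variables u and v and made-up measurements; np.linalg.lstsq computes the same least-squares minimizer (internally it uses an orthogonal factorization rather than Gaussian elimination on the normal equations):

```python
import numpy as np

# Made-up samples of two predictors and a response generated
# roughly as y = 1 + 2u + 1.5v plus small noise.
u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([6.0, 6.6, 12.9, 13.4, 18.6])

# Each row of A holds the multipliers (1, u_i, v_i) of (a0, a1, a2).
A = np.column_stack([np.ones_like(u), u, v])

params, *_ = np.linalg.lstsq(A, y, rcond=None)
print(params)  # roughly [1, 2, 1.5]
```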
Gradient Descent
Gradient descent is a method for finding a local minimum of a function. It is based on the concept of the gradient, which generalizes the derivative of a function of one variable (the slope of the tangent) to functions of several variables. For a function of $n$ variables, the gradient at a particular point is the vector whose components are the partial derivatives of the function at that point.
Because the gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction, we can start at a random point and take steps proportional to the negative of the gradient at the current point to find a local minimum. This is known as gradient descent or steepest descent.
Recall that the gradient in multiple dimensions consists of the partial derivatives of the cost function with respect to the parameters, one per dimension:

- $\nabla S = \left( \dfrac{\partial S}{\partial a_0}, \dfrac{\partial S}{\partial a_1}, \ldots, \dfrac{\partial S}{\partial a_m} \right)$.

Each step updates every parameter against its partial derivative, scaled by a step size (learning rate) $\alpha$: $a_j := a_j - \alpha \dfrac{\partial S}{\partial a_j}$.
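A minimal sketch of this procedure for the straight-line model on the example data (the step size and iteration count are arbitrary choices that happen to converge here):

```python
# Example data points.
xs = [2.0, 2.0, 3.0, 3.0]
ys = [4.0, 6.0, 6.0, 8.0]

a0, a1 = 0.0, 0.0   # starting point
alpha = 0.01        # learning rate (step size)

for _ in range(5000):
    # Partial derivatives of S = sum (y_i - a0 - a1*x_i)^2.
    g0 = sum(-2 * (y - a0 - a1 * x) for x, y in zip(xs, ys))
    g1 = sum(-2 * x * (y - a0 - a1 * x) for x, y in zip(xs, ys))
    # Step against the gradient.
    a0 -= alpha * g0
    a1 -= alpha * g1

print(a0, a1)  # converges toward a0 = 1, a1 = 2
```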
Gradient descent can be used to solve non-linear regression models.
Non-linear Regression
Non-linear regression models the relationship in observational data by a function that is a non-linear combination of the model parameters and depends on one or more independent variables. Some non-linear regression problems can be transformed into a linear domain. For example, fitting $y = a e^{b x}$ is equivalent to fitting the straight line $\ln y = a_0 + a_1 x$, where $a_0 = \ln a$ and $a_1 = b$.
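A sketch of this transformation (the data are made up, roughly following y = e^x):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 8.2, 19.8])

# Fit the straight line ln y = c0 + c1*x by linear least squares.
A = np.column_stack([np.ones_like(x), x])
c0, c1 = np.linalg.lstsq(A, np.log(y), rcond=None)[0]

# Recover the original non-linear parameters: a = e^c0, b = c1.
a, b = np.exp(c0), c1
print(a, b)  # roughly a ≈ 1.1, b ≈ 1
```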
Here is an example of a gradient descent solution to a non-linear regression model.
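The following is a minimal sketch (not the original notebook), applying gradient descent directly to the model $y = a e^{b x}$ on the same made-up data; non-linear cost surfaces are sensitive to the starting guess and step size, so both are chosen conservatively here:

```python
import math

# Made-up data, roughly following y = e^x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 8.2, 19.8]

a, b = 1.0, 0.9     # starting guess near the solution
alpha = 1e-4        # small step size to avoid divergence

for _ in range(20000):
    # Partial derivatives of S = sum (y_i - a*e^(b*x_i))^2.
    ga = sum(-2 * math.exp(b * x) * (y - a * math.exp(b * x))
             for x, y in zip(xs, ys))
    gb = sum(-2 * a * x * math.exp(b * x) * (y - a * math.exp(b * x))
             for x, y in zip(xs, ys))
    a -= alpha * ga
    b -= alpha * gb

print(a, b)  # settles near a ≈ 1.1, b ≈ 0.97
```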