Statistics Ground Zero/Regression

Regression

Regression analysis is the process of building a model of the relationship between variables in the form of mathematical equations. The general purpose is to explain how one variable, the dependent variable, is systematically related to the values of one or more independent variables. The independent variable is so called because we imagine its value varying freely across its range, while the dependent variable depends on the values taken by the independent variable. The mathematical function is expressed in terms of the values of the independent variable and a number of parameters, the coefficients of the equation. The coefficients are numeric constants that either multiply a variable value in the equation or are added to it to determine the unknown. A simple example is the equation of a straight line

${\displaystyle y=mx+b}$

Here, by convention, x and y are the variables of interest in our data, with y the unknown or dependent variable and x the known or independent variable. The constant m is the slope of the line and b is the y-intercept, the value at which the line crosses the y axis. So m and b are the coefficients of the equation.

If we can build a robust regression model of known data, then we can use the equation to predict values for unobserved cases. Regression also involves estimating the strength of the association between the dependent and independent variables, most often by computing the correlation coefficient, which as noted above is itself part of a linear model of the data. The correlation coefficient is squared when reporting a regression analysis, and we call the result, perhaps rather obviously, R squared.
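As a minimal sketch of these two ideas, the Python below fits a line to the age and height measurements tabulated later in this section, reports R squared, and predicts a height for an age that was not measured. It assumes ordinary least squares as the fitting method, which the text has not specified; the function name `fit_line` and the choice of 15 months are illustrative only.

```python
# Age (months) and height (cm) measurements from the table in this section.
ages = [0, 3, 6, 9, 12, 18, 24]
heights = [53.0, 59.5, 66.0, 71.5, 76.0, 85.0, 90.0]


def fit_line(xs, ys):
    """Return slope m, intercept b and correlation r for y = m*x + b.

    Uses ordinary least squares (an assumption of this sketch).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return m, b, r


m, b, r = fit_line(ages, heights)
r_squared = r ** 2

# Use the fitted equation to predict a height at an unobserved age.
predicted = m * 15 + b

print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
print(f"R squared = {r_squared:.3f}")
print(f"predicted height at 15 months = {predicted:.1f} cm")
```

Because 15 months lies inside the range of observed ages, this prediction is an interpolation; predictions far outside the observed range deserve more caution.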

Models are often only approximately like the observed data. Most often, some error in the data will mean that no mathematical function produces exactly the data observed and only that data. Therefore we are explicitly involved in estimation, and our models involve recognition of error. It is for this reason that the equation for the line is often given as

${\displaystyle {\hat {y}}=mx+b\pm \epsilon {}}$

Where ε quantifies the error in our data.
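To make the error term concrete, the sketch below compares the observed heights from the table in this section with the values given by a candidate line. The coefficients m = 1.5 and b = 56 are hypothetical round numbers chosen for illustration, not fitted estimates, and `residuals` is an illustrative helper name.

```python
# Age (months) and height (cm) measurements from the table in this section.
ages = [0, 3, 6, 9, 12, 18, 24]
heights = [53.0, 59.5, 66.0, 71.5, 76.0, 85.0, 90.0]


def residuals(xs, ys, m, b):
    """Observed y minus the value the line m*x + b gives at each x."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]


# Hypothetical coefficients, chosen only to illustrate the idea.
errors = residuals(ages, heights, m=1.5, b=56.0)

for age, err in zip(ages, errors):
    print(f"age {age:>2} months: observed minus line = {err:+.1f} cm")
```

None of the differences is zero, which is the sense in which the line only approximates the data.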

Linear regression

In linear regression the model consists of linear equations. A linear equation is an equation involving only constants or single variable values multiplied by a constant. The variable values in a linear equation must be of the first power; that is, they cannot involve raising a value to a power other than one, so they cannot, for example, be squared or cubed. Any value raised to the power of one is just the original value: ${\displaystyle x^{1}=x}$.

To carry out a linear regression, first make a scatter plot of your data.

Let us look again at the following measurements of baby height against age:

Age (months)    Height (cm)
0               53.0
3               59.5
6               66.0
9               71.5
12              76.0
18              85.0
24              90.0

This data can be visualised in this scatterplot:
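As a dependency-free sketch, a rough version of such a plot can be drawn as text with Python. The function name `text_scatter` and the grid size are illustrative choices; in practice a plotting library would normally be used instead.

```python
# Age (months) and height (cm) measurements from the table above.
ages = [0, 3, 6, 9, 12, 18, 24]
heights = [53.0, 59.5, 66.0, 71.5, 76.0, 85.0, 90.0]


def text_scatter(xs, ys, width=49, height=15):
    """Return a crude character-grid scatter plot as a list of strings.

    Each point is scaled to a cell of a width-by-height grid and marked
    with '*'. The first string is the top row (largest y values).
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # Row 0 of the grid is the top of the plot, so flip vertically.
        grid[height - 1 - row][col] = "*"
    return ["".join(cells) for cells in grid]


rows = text_scatter(ages, heights)

print("Height (cm)")
for line in rows:
    print("|" + line)
print("+" + "-" * len(rows[0]) + " Age (months)")
```

The points rise from the lower left to the upper right, which suggests that a straight line may be a reasonable first model for these data.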