Statistics/Curve fitting

When evaluating collected data, patterns often appear, such as a slope of -1 in a scatter plot of $1/d_{i}$ against $1/d_{o}$ in ray optics, where $1/d_{i}=1/f-1/d_{o}$. The goal is often to find a mathematical function that "fits" the data, that is, a function whose values are close to the observed values of the dependent variable at the corresponding values of the independent variable. The usual way of doing this is called the method of "least squares", for reasons explained later.

Sales example

A store sells whatsits at P=3.49 each, and the average number of whatsits sold per day (the volume) is V=100. The total money received per day is therefore T = P times V = 349.00. If the price is reduced then, maybe, more whatsits will be sold, but T may turn out to be more or less. Obviously if P=0 then T will also be zero. Trying two lower prices gave the following results:

        P          V           T
      2.99        130        388.70
      3.29        123        404.67
      3.49        100        349.00

Obviously the "best" price is somewhere between 2.99 and 3.49. Curve fitting provides an equation for T versus P for each of the many models that are available, so that the models can be compared.
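The totals in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic T = P times V; the names `observations` and `receipts` are made up for the illustration.

```python
# Price (P) and average daily volume (V) from the table above.
observations = [(2.99, 130), (3.29, 123), (3.49, 100)]

# Total daily receipts T = P * V, rounded to cents.
receipts = [(p, v, round(p * v, 2)) for p, v in observations]

for p, v, t in receipts:
    print(f"{p:5.2f} {v:5d} {t:8.2f}")

# The highest total among the prices that were actually tried.
best_price, _, best_total = max(receipts, key=lambda row: row[2])
print("highest observed T:", best_total, "at P =", best_price)
```

This only picks the best of the three prices tried; finding a better price in between is what the fitted curve is for.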

Linear model

The linear model is based on the "best" straight line. Using a calculator that can do regression, we find that the line closest to the above data on a graph of T versus P is

T = 605.268605263 - 68.9289473684 * P, with a correlation coefficient of about -0.605 (a magnitude of about 60%) for this model.
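The correlation figure can be checked without a regression calculator. The sketch below computes the Pearson correlation coefficient r from deviations about the means; it assumes that this is the quantity the calculator reports, which the text does not state, and the variable names are chosen for the illustration only.

```python
from math import sqrt

prices = [2.99, 3.29, 3.49]
totals = [388.70, 404.67, 349.00]

n = len(prices)
mean_p = sum(prices) / n
mean_t = sum(totals) / n

# Sums of squared deviations and of cross deviations about the means.
s_pp = sum((p - mean_p) ** 2 for p in prices)
s_tt = sum((t - mean_t) ** 2 for t in totals)
s_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(prices, totals))

r = s_pt / sqrt(s_pp * s_tt)  # Pearson correlation coefficient
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

The sign of r is negative because the fitted line slopes downward. Its square is the fraction of the variation in T that the line accounts for.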

Let us examine it in more detail:

   P     Actual T    Calculated T     Difference    Difference²

  2.99    388.70    399.1710526316  -10.4710526316  109.642943214
  3.29    404.67    378.4923684211   26.1776315789  685.268395081
  3.49    349.00    364.7065789474  -15.7065789474  246.696622231

Adding the differences, we find that their sum is zero apart from rounding error. This is true of every least-squares line, because the positive and negative differences cancel, so the sum of the differences by itself cannot measure how well the line fits. Squaring a number never gives a negative result, so the SUM OF SQUARES of the differences gives us an indication of the GOODNESS OF FIT. Here the SUM OF SQUARES is 1041.60796053. We can compute it for each of the different models and finally select the model that has the LEAST SQUARES.
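The following sketch recomputes the differences and the sum of squares for the fitted line. For comparison it does the same for a horizontal line through the average of T, a second candidate that is not in the text and is used here only as an illustration. Both lines have differences that add to zero, but their sums of squares are not the same.

```python
a, b = 605.268605263, -68.9289473684  # intercept and slope quoted in the text
data = [(2.99, 388.70), (3.29, 404.67), (3.49, 349.00)]


def residuals(intercept, slope):
    """Actual T minus calculated T for every data point."""
    return [t - (intercept + slope * p) for p, t in data]


fitted = residuals(a, b)

# Illustrative comparison: a flat line at the average of T.
mean_t = sum(t for _, t in data) / len(data)
flat = residuals(mean_t, 0.0)

for name, diffs in (("fitted line", fitted), ("flat line", flat)):
    total = sum(diffs)
    squares = sum(d * d for d in diffs)
    print(f"{name}: sum = {total:.6f}, sum of squares = {squares:.4f}")
```

The comparison suggests why the squares are used: the plain sum cannot tell the two lines apart, while the sum of squares is smaller for the fitted line.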

If you do NOT have a calculator or a computer that can do regression, the line can be calculated by hand.

Calculation of the least-squares line to fit the given points:

We are looking for a and b in the equation of the straight line y=a+b*x, where x is the price P and y is the total T.

In the above example we have:

    x       x²      y       y²             xy

  2.99   8.9401  388.70  151087.69     1162.213
  3.29  10.8241  404.67  163757.8089   1331.3643
  3.49  12.1801  349.00  121801        1218.01
  ----  ------- -------  -----------   ---------
  9.77  31.9443 1142.37  436646.4989   3711.5873

We have: n = number of points = 3
ax = average of x = 9.77/3 = 3.2566667
ay = average of y = 1142.37/3 = 380.79
x1 = sum of x = 9.77
x2 = sum of x² = 31.9443
y1 = sum of y = 1142.37
y2 = sum of y² = 436646.4989
s1 = sum of xy = 3711.5873
z1 = s1-(x1*y1/n) = 3711.5873-(9.77*1142.37/3) = -8.731
z2 = x2-(x1²/n) = 31.9443-9.77²/3 = 0.1266667
b = z1/z2 = -68.9289473684

a = ay-b*ax = 380.79-(-68.9289473684)*3.2566667 = 605.268605263

The intermediate values ax and z2 are shown rounded. Carry them at full precision in the calculation: rounding z2 to 0.126, for example, would change b to about -69.29.

Thus we have y=605.268605263-68.9289473684*x as the best line to fit the given points of this example.
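The hand calculation translates directly into a short function. This is a minimal sketch that reuses the names x1, x2, y1, s1, z1 and z2 from the steps above; the function name `least_squares_line` is made up for the illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    x1 = sum(xs)                                # sum of x
    x2 = sum(x * x for x in xs)                 # sum of x squared
    y1 = sum(ys)                                # sum of y
    s1 = sum(x * y for x, y in zip(xs, ys))     # sum of x*y
    z1 = s1 - x1 * y1 / n
    z2 = x2 - x1 * x1 / n
    b = z1 / z2                                 # slope
    a = y1 / n - b * x1 / n                     # intercept = ay - b*ax
    return a, b


a, b = least_squares_line([2.99, 3.29, 3.49], [388.70, 404.67, 349.00])
print(f"y = {a:.6f} + ({b:.6f})*x")
```

Because the function keeps every intermediate value at full floating-point precision, it avoids the loss of digits that comes from writing down rounded values of ax and z2.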

Parabolic model

If we have n points with distinct values of the independent variable, then a polynomial of degree n-1 will fit these n points exactly. In this example we are given 3 points, so a polynomial of the 2nd degree (a parabola) should give us an exact fit. The calculator provides the equation
T = (-663.1666666653)*P² + 4217.91999999*P - 6294.10448332, giving us

   P    Actual T   Calculated T       Difference

  2.99   388.70   388.6999999956      4.4E-9 = zero plus rounding error
  3.29   404.67   404.6699999951      4.9E-9 = zero plus rounding error
  3.49   349.00   348.999999995       5.0E-9 = zero plus rounding error

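That is a perfect fit: the SUM OF SQUARES is zero apart from rounding error, so the LEAST SQUARES criterion indicates that this model be used. Keep in mind that a parabola always fits 3 points exactly, so the perfect fit by itself says little about how well the model would predict T at other prices.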
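The parabola can also be found without a calculator. The sketch below uses Newton's divided differences, which is one of several ways to pass a parabola through three points and is not necessarily what the calculator does. It then locates the top of the parabola as a rough estimate of the "best" price; that step is an illustration added here, not a result stated in the text.

```python
def parabola_through(points):
    """Return (c2, c1, c0) of c2*x**2 + c1*x + c0 through three points."""
    (x0, y0), (x1, y1), (x2, y2) = points
    d01 = (y1 - y0) / (x1 - x0)        # first divided differences
    d12 = (y2 - y1) / (x2 - x1)
    c2 = (d12 - d01) / (x2 - x0)       # second divided difference
    c1 = d01 - c2 * (x0 + x1)
    c0 = y0 - d01 * x0 + c2 * x0 * x1
    return c2, c1, c0


data = [(2.99, 388.70), (3.29, 404.67), (3.49, 349.00)]
c2, c1, c0 = parabola_through(data)
print(f"T = ({c2:.6f})*P**2 + {c1:.6f}*P + ({c0:.6f})")

# Top of the parabola (c2 is negative, so this is a maximum).
p_peak = -c1 / (2 * c2)
t_peak = c2 * p_peak ** 2 + c1 * p_peak + c0
print(f"peak near P = {p_peak:.2f}")
```

The peak lies between 2.99 and 3.49, in line with the earlier observation about the "best" price, but with only three data points it should be read as a guess to be tested and not as a prediction.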

Other models

Some of the many other models are based on the exponential function, logarithms, and various transformations of the independent and/or the dependent variable. The "best fit" is usually taken to be the one that provides the LEAST SQUARES. The data can also be weighted when some points on a graph are more important than others (end points, for example).
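Weighting can be sketched for the linear model as follows. Each squared difference is multiplied by a weight before the squares are added, and the line that minimises this weighted sum is found. The weights used below (end points counted twice) are purely hypothetical, and the function name `weighted_line` is made up for the illustration.

```python
def weighted_line(xs, ys, ws):
    """Return (a, b) minimising the sum of w * (y - a - b*x)**2."""
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    b = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    a = (swy - b * swx) / sw
    return a, b


prices = [2.99, 3.29, 3.49]
totals = [388.70, 404.67, 349.00]

# Equal weights reproduce the ordinary least-squares line.
equal = weighted_line(prices, totals, [1, 1, 1])

# Hypothetical weights: the two end points count twice as much.
ends_heavy = weighted_line(prices, totals, [2, 1, 2])

print("equal weights:     a = %.4f, b = %.4f" % equal)
print("end points heavy:  a = %.4f, b = %.4f" % ends_heavy)
```

Only the ratios between the weights matter: multiplying all of them by the same number leaves the line unchanged.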

Caution: For curve fitting, some calculators may require consecutive, equally spaced values of the independent variable. Always compare the original graph with the "fitted" graph.