
LINEAR REGRESSION

Descriptive Statistics

Descriptive statistics are most often selected to represent (1) the location of the center of the distribution of the data and (2) the degree of spread of the data set.

Measure of Location
The most common measure of central tendency is the arithmetic mean ($\bar{y}$). The arithmetic mean of a sample is defined as the sum of the individual data points ($y_i$) divided by the number of points ($n$):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Measures of Spread
The most common measure of spread for a sample is the
standard deviation ($s_y$) about the mean:

$$s_y = \sqrt{\frac{S_t}{n-1}}, \qquad S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Thus, if the individual measurements are spread out widely around the mean, $S_t$ (and, consequently, $s_y$) will be large. If they are grouped tightly, the standard deviation will be small. A final statistic that has utility in quantifying the spread of data is the coefficient of variation (c.v.). This statistic is the ratio of the standard deviation to the mean:

$$\text{c.v.} = \frac{s_y}{\bar{y}} \times 100\%$$

If a quantity is normally distributed, the range defined by $\bar{y} \pm s_y$ will encompass approximately 68% of the total measurements, and the range defined by $\bar{y} \pm 2s_y$ will encompass approximately 95%.

Example:
Compute the mean, standard deviation, and coefficient of
variation for the data in the following table.

Measurements of the coefficient of thermal expansion of structural steel:

6.495  6.615  6.485  6.665  6.435  6.715
6.755  6.715  6.655  6.595  6.635  6.555
6.505  6.625  6.655  6.625  6.575  6.605
6.515  6.395  6.685  6.565  6.555  6.775
Solution:
i    yi      (yi − ȳ)²
1    6.495   0.011
2    6.755   0.024
3    6.505   0.009
4    6.515   0.007
5    6.615   0.000
6    6.715   0.013
7    6.625   0.001
8    6.395   0.042
9    6.485   0.013
10   6.655   0.003
11   6.655   0.003
12   6.685   0.007
13   6.665   0.004
14   6.595   0.000
15   6.625   0.001
16   6.565   0.001
17   6.435   0.027
18   6.635   0.001
19   6.575   0.001
20   6.555   0.002
21   6.715   0.013
22   6.555   0.002
23   6.605   0.000
24   6.775   0.031
Σ    158.400 0.217

$$\bar{y} = \frac{158.4}{24} = 6.60$$

$$S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 0.217$$

$$s_y = \sqrt{\frac{S_t}{n-1}} = \sqrt{\frac{0.217}{24-1}} = 0.097133$$

$$\text{c.v.} = \frac{0.097133}{6.6} \times 100\% = 1.47\%$$
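These numbers can be checked numerically. The following minimal sketch (Python with NumPy assumed) recomputes the mean, $S_t$, $s_y$, and c.v. from the 24 measurements in the table above.

```python
import numpy as np

# The 24 thermal-expansion measurements from the table above
y = np.array([6.495, 6.755, 6.505, 6.515, 6.615, 6.715, 6.625, 6.395,
              6.485, 6.655, 6.655, 6.685, 6.665, 6.595, 6.625, 6.565,
              6.435, 6.635, 6.575, 6.555, 6.715, 6.555, 6.605, 6.775])

y_bar = y.mean()                       # arithmetic mean        -> 6.6
S_t = np.sum((y - y_bar) ** 2)         # sum of squared devs    -> 0.217
s_y = np.sqrt(S_t / (len(y) - 1))      # sample std deviation   -> ~0.0971
cv = s_y / y_bar * 100                 # coefficient of variation, percent -> ~1.47
print(y_bar, S_t, s_y, cv)
```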
Linear Least-Squares Regression

Where substantial error is associated with data, the best curve-fitting strategy is to derive an approximating function that fits the shape or general trend of the data without necessarily matching the individual points.

To remove subjectivity from the fitting process, some criterion must be devised to establish a basis for the fit. One way to do this is to derive a curve that minimizes the discrepancy between the data points and the curve. To do this, we must first quantify the discrepancy.

The simplest example is fitting a straight line to a set of paired observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
The mathematical expression for the straight line is

$$y = a_0 + a_1 x + e$$

where $a_0$ and $a_1$ are coefficients representing the intercept and the slope, respectively, and $e$ is the error, or residual, between the model and the observations, which can be represented by rearranging the equation as

$$e = y - a_0 - a_1 x$$

One strategy for fitting a best line through the data would be to minimize the sum of the squared residual errors for all the available data:

$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$
In order to get the best fit, we differentiate $S_r$ with respect to each coefficient and set the result equal to zero:

$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i) = 0$$

$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)\, x_i = 0$$

Rearranging these equations yields

$$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a_0 - \sum_{i=1}^{n} a_1 x_i = 0$$

$$\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} a_0 x_i - \sum_{i=1}^{n} a_1 x_i^2 = 0$$

Then, noting that $\sum_{i=1}^{n} a_0 = n a_0$, these can be expressed as a pair of simultaneous linear equations for the coefficients:

$$\begin{bmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{bmatrix}$$

Solving these equations gives

$$a_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$

$$a_0 = \frac{\sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} = \bar{y} - a_1 \bar{x}$$
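As a concrete illustration of these formulas, here is a minimal Python/NumPy sketch that evaluates the sums and returns the two coefficients (the function name linear_least_squares is only for illustration):

```python
import numpy as np

def linear_least_squares(x, y):
    """Return (a0, a1) for the line y = a0 + a1*x that minimizes S_r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sx, sy_sum = x.sum(), y.sum()
    sxy, sxx = np.sum(x * y), np.sum(x * x)
    a1 = (n * sxy - sx * sy_sum) / (n * sxx - sx ** 2)   # slope
    a0 = y.mean() - a1 * x.mean()                        # intercept: ybar - a1*xbar
    return a0, a1
```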

A standard deviation for the regression line can be determined as

$$s_{y/x} = \sqrt{\frac{S_r}{n-2}}$$

where $s_{y/x}$ is called the standard error of the estimate. $s_{y/x}$ quantifies the spread around the regression line, in contrast to the standard deviation $s_y$, which quantifies the spread around the mean:

$$s_y = \sqrt{\frac{S_t}{n-1}}, \qquad S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The difference between the two quantities, $S_t - S_r$, quantifies the improvement or error reduction due to describing the data in terms of a straight line rather than as an average value. Because the magnitude of this quantity is scale-dependent, the difference is normalized to $S_t$ to yield

$$r^2 = \frac{S_t - S_r}{S_t}$$

where $r^2$ is called the coefficient of determination and $r$ is the correlation coefficient. For a perfect fit, $S_r = 0$ and $r^2 = 1$, signifying that the line explains 100% of the variability of the data. For $r^2 = 0$, $S_r = S_t$ and the fit represents no improvement. However, an $r^2$ close to 1 does not mean that the fit is necessarily good. For example, it is possible to obtain a relatively high value of $r^2$ when the underlying relationship between $y$ and $x$ is not even linear.

A nice example was developed by Anscombe (1973). As shown in the next figure, he came up with four data sets consisting of 11 data points each. Although their graphs are very different, all have the same best-fit equation, y = 3 + 0.5x, and the same coefficient of determination, $r^2$ = 0.67. This example dramatically illustrates why developing plots is so valuable.

Example:
In this application, force is the dependent variable (y) and velocity is the independent variable (x).

xi 10 20 30 40 50 60 70 80
yi 25 70 380 550 610 1220 830 1450

a. Find the linear least-squares fit F = a0 + a1v, and calculate the standard error of the estimate and the coefficient of determination for the fit.
b. Find the linear least-squares fit F = a1v, and calculate the standard error of the estimate and the coefficient of determination for the fit.
c. Sketch the graphs for both problems (a) and (b).
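A possible numerical sketch of parts (a) and (b) follows (Python with NumPy assumed). For part (a), np.polyfit is used; for the one-parameter fit in (b), using n − 1 degrees of freedom in the standard error is an assumption. The plots requested in (c) are not shown.

```python
import numpy as np

v = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
F = np.array([25, 70, 380, 550, 610, 1220, 830, 1450], dtype=float)
n = len(v)
S_t = np.sum((F - F.mean()) ** 2)          # spread around the mean

# (a) F = a0 + a1*v  (np.polyfit returns coefficients highest power first)
a1, a0 = np.polyfit(v, F, 1)
S_r = np.sum((F - (a0 + a1 * v)) ** 2)
s_yx = np.sqrt(S_r / (n - 2))              # standard error of the estimate
r2 = (S_t - S_r) / S_t                     # coefficient of determination
print("(a)", a0, a1, s_yx, r2)

# (b) F = a1*v (no intercept): minimizing sum(F - a1*v)^2 gives a1 = sum(v*F)/sum(v^2)
a1_b = np.sum(v * F) / np.sum(v ** 2)
S_r_b = np.sum((F - a1_b * v) ** 2)
s_yx_b = np.sqrt(S_r_b / (n - 1))          # n - 1: only one fitted parameter (assumed)
r2_b = (S_t - S_r_b) / S_t
print("(b)", a1_b, s_yx_b, r2_b)
```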

Linearization of Nonlinear Relationships

Transformations can be used to express the data in a form that is compatible with linear regression. One example is the exponential model

$$y = a_1 e^{b_1 x}$$

where $a_1$ and $b_1$ are constants. Another example of a nonlinear model is the simple power equation

$$y = a_2 x^{b_2}$$

where $a_2$ and $b_2$ are constant coefficients. A third example of a nonlinear model is the saturation-growth-rate equation

$$y = \frac{a_3 x}{b_3 + x}$$

where $a_3$ and $b_3$ are constant coefficients. Nonlinear regression techniques are available to fit these equations to experimental data directly. However, a simpler alternative is to use mathematical manipulations to transform the equations into a linear form. Then linear regression can be employed to fit the equations to data.

The equation $y = a_1 e^{b_1 x}$ can be linearized by taking its natural logarithm to yield

$$\ln y = \ln a_1 + b_1 x$$

Thus, a plot of $\ln y$ versus $x$ will yield a straight line with a slope of $b_1$ and an intercept of $\ln a_1$.
The equation $y = a_2 x^{b_2}$ is linearized by taking its base-10 logarithm to give

$$\log y = \log a_2 + b_2 \log x$$

Thus, a plot of $\log y$ versus $\log x$ will yield a straight line with a slope of $b_2$ and an intercept of $\log a_2$. Note that any base logarithm can be used to linearize this model. However, as done here, the base-10 logarithm is most commonly employed.

The equation $y = \dfrac{a_3 x}{b_3 + x}$ is linearized by inverting it to give

$$\frac{1}{y} = \frac{1}{a_3} + \frac{b_3}{a_3} \frac{1}{x}$$

Thus, a plot of $1/y$ versus $1/x$ will be linear, with a slope of $b_3/a_3$ and an intercept of $1/a_3$.
Example:
Fit the following data using a logarithmic transformation. The general relationship between force and velocity is described by

$$F = a v^b$$

Calculate the standard error of the estimate and the coefficient of determination for the fit.

v, m/s 10 20 30 40 50 60 70 80
F, N 25 70 380 550 610 1220 830 1450
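A possible numerical sketch of this transformation, again in Python/NumPy, is shown below. Note that the error measures here are computed in the log10-transformed coordinates; back-transforming and measuring the error in the original force units would be an alternative choice.

```python
import numpy as np

v = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
F = np.array([25, 70, 380, 550, 610, 1220, 830, 1450], dtype=float)
n = len(v)

# Linearize F = a*v^b  ->  log10(F) = log10(a) + b*log10(v)
X, Y = np.log10(v), np.log10(F)
b, log10_a = np.polyfit(X, Y, 1)       # slope = b, intercept = log10(a)
a = 10.0 ** log10_a

# Error measures evaluated in the transformed (log10) coordinates
S_r = np.sum((Y - (log10_a + b * X)) ** 2)
S_t = np.sum((Y - Y.mean()) ** 2)
s_yx = np.sqrt(S_r / (n - 2))          # standard error of the estimate
r2 = (S_t - S_r) / S_t                 # coefficient of determination
print(a, b, s_yx, r2)
```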
POLYNOMIAL REGRESSION
The least-squares procedure can be readily extended to fit the
data to a higher-order polynomial. For example, suppose that
we fit a second-order polynomial or quadratic:

$$y = a_0 + a_1 x + a_2 x^2 + e$$

For this case the sum of the squares of the residuals is

$$S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)^2$$

To generate the least-squares fit, we take the derivative of this equation with respect to each of the unknown coefficients of the polynomial:

$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2) = 0$$

$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)\, x_i = 0$$

$$\frac{\partial S_r}{\partial a_2} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)\, x_i^2 = 0$$

or, in matrix form,

$$\begin{bmatrix} n & \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 \\ \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 & \sum_{i=1}^{n} x_i^4 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} x_i^2 y_i \end{bmatrix}$$
Example:
Fit a second-order polynomial to the data

x 0 1 2 3 4 5
y 2.1 7.7 13.6 27.2 40.9 61.1
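As a minimal sketch of how the normal equations above could be assembled and solved for this data set (Python with NumPy assumed; np.polyfit is included only as a cross-check):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 7.7, 13.6, 27.2, 40.9, 61.1])
n = len(x)

# Assemble the 3x3 normal equations for y = a0 + a1*x + a2*x^2
A = np.array([[n,             x.sum(),        np.sum(x**2)],
              [x.sum(),       np.sum(x**2),   np.sum(x**3)],
              [np.sum(x**2),  np.sum(x**3),   np.sum(x**4)]])
b = np.array([y.sum(), np.sum(x * y), np.sum(x**2 * y)])

a0, a1, a2 = np.linalg.solve(A, b)
print(a0, a1, a2)

# Cross-check with NumPy's built-in fit (coefficients returned highest power first)
print(np.polyfit(x, y, 2))
```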

MULTIPLE LINEAR REGRESSION


Another useful extension of linear regression is the case where y is a linear function of two or more independent variables. For example, y might be a linear function of $x_1$ and $x_2$, as in

$$y = a_0 + a_1 x_1 + a_2 x_2 + e$$

For this case the sum of the squares of the residuals is

$$S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i})^2$$

and differentiating with respect to each of the unknown coefficients:

$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i}) = 0$$

$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i})\, x_{1,i} = 0$$

$$\frac{\partial S_r}{\partial a_2} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i})\, x_{2,i} = 0$$

or, in matrix form,

$$\begin{bmatrix} n & \sum_{i=1}^{n} x_{1,i} & \sum_{i=1}^{n} x_{2,i} \\ \sum_{i=1}^{n} x_{1,i} & \sum_{i=1}^{n} x_{1,i}^2 & \sum_{i=1}^{n} x_{1,i} x_{2,i} \\ \sum_{i=1}^{n} x_{2,i} & \sum_{i=1}^{n} x_{1,i} x_{2,i} & \sum_{i=1}^{n} x_{2,i}^2 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_{1,i} y_i \\ \sum_{i=1}^{n} x_{2,i} y_i \end{bmatrix}$$
Example:
Use multiple linear regression to fit this data

x1 0 2 2.5 1 4 7
x2 0 1 2 3 6 2
y 5 10 9 0 3 27
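As a sketch, the 3 × 3 normal equations above can be assembled and solved numerically for these data (Python with NumPy assumed):

```python
import numpy as np

x1 = np.array([0, 2, 2.5, 1, 4, 7])
x2 = np.array([0, 1, 2, 3, 6, 2])
y = np.array([5, 10, 9, 0, 3, 27], dtype=float)
n = len(y)

# Assemble the 3x3 normal equations for y = a0 + a1*x1 + a2*x2
A = np.array([[n,         x1.sum(),          x2.sum()],
              [x1.sum(),  np.sum(x1**2),     np.sum(x1 * x2)],
              [x2.sum(),  np.sum(x1 * x2),   np.sum(x2**2)]])
b = np.array([y.sum(), np.sum(x1 * y), np.sum(x2 * y)])

a0, a1, a2 = np.linalg.solve(A, b)
print(a0, a1, a2)   # for these data the fit is exact: a0 = 5, a1 = 4, a2 = -3
```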

The foregoing two-dimensional case can be easily extended to m dimensions, as in

$$y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m + e$$

Although there may be certain cases where a variable is linearly related to two or more other variables, multiple linear regression has additional utility in the derivation of power equations of the general form

$$y = a_0 x_1^{a_1} x_2^{a_2} x_3^{a_3} \cdots x_m^{a_m}$$

To use multiple linear regression, the equation is transformed by taking its logarithm to yield

$$\log y = \log a_0 + a_1 \log x_1 + a_2 \log x_2 + \cdots + a_m \log x_m$$
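As an illustrative sketch of this transformation (the helper name fit_power_model is hypothetical), multiple linear regression on the log10-transformed variables can be carried out with a standard least-squares solver:

```python
import numpy as np

def fit_power_model(X, y):
    """Fit y = a0 * x1^a1 * ... * xm^am via least squares on log10-transformed data.

    X is an (n, m) array of strictly positive predictors; y is a length-n array
    of strictly positive responses. (Hypothetical helper, for illustration only.)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (for log10 a0) plus log10 of each predictor
    A = np.column_stack([np.ones(len(y)), np.log10(X)])
    coeffs, *_ = np.linalg.lstsq(A, np.log10(y), rcond=None)
    a0 = 10.0 ** coeffs[0]
    return a0, coeffs[1:]   # a0 and the exponents a1..am
```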
