
LINEAR REGRESSION

Descriptive Statistics

Descriptive statistics are most often selected to represent (1) the location of the center of the distribution of the data and (2) the degree of spread of the data set.

Measure of Location
The most common measure of central tendency is the arithmetic mean ($\bar{y}$). The arithmetic mean of a sample is defined as the sum of the individual data points ($y_i$) divided by the number of points ($n$):

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Measures of Spread
The most common measure of spread for a sample is the
standard deviation ($s_y$) about the mean:

$$s_y = \sqrt{\frac{S_t}{n-1}}, \qquad S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Thus, if the individual measurements are spread out widely around the mean, $S_t$ (and, consequently, $s_y$) will be large. If they are grouped tightly, the standard deviation will be small. A final statistic that has utility in quantifying the spread of data is the coefficient of variation (c.v.). This statistic is the ratio of the standard deviation to the mean:

$$\text{c.v.} = \frac{s_y}{\bar{y}} \times 100\%$$

If a quantity is normally distributed, the range defined by $\bar{y} \pm s_y$ will encompass approximately 68% of the total measurements, and the range defined by $\bar{y} \pm 2s_y$ will encompass approximately 95%.

Example:
Compute the mean, standard deviation, and coefficient of
variation for the data in the following table.

Measurements of the coefficient of thermal expansion of structural steel:

6.495  6.615  6.485  6.665  6.435  6.715
6.755  6.715  6.655  6.595  6.635  6.555
6.505  6.625  6.655  6.625  6.575  6.605
6.515  6.395  6.685  6.565  6.555  6.775
Solution:
i    yi      (yi − ȳ)²
1    6.495   0.011
2    6.755   0.024
3    6.505   0.009
4    6.515   0.007
5    6.615   0.000
6    6.715   0.013
7    6.625   0.001
8    6.395   0.042
9    6.485   0.013
10   6.655   0.003
11   6.655   0.003
12   6.685   0.007
13   6.665   0.004
14   6.595   0.000
15   6.625   0.001
16   6.565   0.001
17   6.435   0.027
18   6.635   0.001
19   6.575   0.001
20   6.555   0.002
21   6.715   0.013
22   6.555   0.002
23   6.605   0.000
24   6.775   0.031
Σ    158.400 0.217

$$\bar{y} = \frac{158.4}{24} = 6.60$$

$$S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 0.217$$

$$s_y = \sqrt{\frac{S_t}{n-1}} = \sqrt{\frac{0.217}{24-1}} = 0.097133$$

$$\text{c.v.} = \frac{0.097133}{6.6} \times 100\% = 1.47\%$$
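These numbers can be checked numerically. The following minimal sketch (Python with NumPy assumed) recomputes the mean, $S_t$, $s_y$, and c.v. from the 24 measurements in the table above.

```python
import numpy as np

# The 24 thermal-expansion measurements from the table above
y = np.array([6.495, 6.755, 6.505, 6.515, 6.615, 6.715, 6.625, 6.395,
              6.485, 6.655, 6.655, 6.685, 6.665, 6.595, 6.625, 6.565,
              6.435, 6.635, 6.575, 6.555, 6.715, 6.555, 6.605, 6.775])

y_bar = y.mean()                       # arithmetic mean        -> 6.6
S_t = np.sum((y - y_bar) ** 2)         # sum of squared devs    -> 0.217
s_y = np.sqrt(S_t / (len(y) - 1))      # sample std deviation   -> ~0.0971
cv = s_y / y_bar * 100                 # coefficient of variation, percent -> ~1.47
print(y_bar, S_t, s_y, cv)
```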
Linear Least-Squares Regression

Where substantial error is associated with data, the best curve-fitting strategy is to derive an approximating function that fits the shape or general trend of the data without necessarily matching the individual points.

To remove subjectivity from the fitting process, some criterion must be devised to establish a basis for the fit. One way to do this is to derive a curve that minimizes the discrepancy between the data points and the curve. To do this, we must first quantify the discrepancy.

The simplest example is fitting a straight line to a set of paired observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
The mathematical expression for the straight line is

$$y = a_0 + a_1 x + e$$

where $a_0$ and $a_1$ are coefficients representing the intercept and the slope, respectively, and $e$ is the error, or residual, between the model and the observations, which can be represented by rearranging the equation as

$$e = y - a_0 - a_1 x$$

One strategy for fitting a best line through the data would be to minimize the sum of the squared residual errors for all the available data:

$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$
In order to get the best fit, we differentiate $S_r$ with respect to each coefficient and set the result equal to zero:

$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i) = 0$$

$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)\, x_i = 0$$

Rearranging these equations yields

$$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a_0 - \sum_{i=1}^{n} a_1 x_i = 0$$

$$\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} a_0 x_i - \sum_{i=1}^{n} a_1 x_i^2 = 0$$

Then, noting that $\sum_{i=1}^{n} a_0 = n a_0$, these can be expressed as a pair of simultaneous linear equations for the coefficients:

$$\begin{bmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{bmatrix}$$

Solving these equations gives

$$a_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$

$$a_0 = \frac{\sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} = \bar{y} - a_1 \bar{x}$$
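As a concrete illustration of these formulas, here is a minimal Python/NumPy sketch that evaluates the sums and returns the two coefficients (the function name linear_least_squares is only for illustration):

```python
import numpy as np

def linear_least_squares(x, y):
    """Return (a0, a1) for the line y = a0 + a1*x that minimizes S_r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sx, sy_sum = x.sum(), y.sum()
    sxy, sxx = np.sum(x * y), np.sum(x * x)
    a1 = (n * sxy - sx * sy_sum) / (n * sxx - sx ** 2)   # slope
    a0 = y.mean() - a1 * x.mean()                        # intercept: ybar - a1*xbar
    return a0, a1
```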

A standard deviation for the regression line can be determined as

$$s_{y/x} = \sqrt{\frac{S_r}{n-2}}$$

where $s_{y/x}$ is called the standard error of the estimate. $s_{y/x}$ quantifies the spread around the regression line, in contrast to the standard deviation $s_y$, which quantifies the spread around the mean:

$$s_y = \sqrt{\frac{S_t}{n-1}}, \qquad S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The difference between the two quantities, $S_t - S_r$, quantifies the improvement or error reduction due to describing the data in terms of a straight line rather than as an average value. Because the magnitude of this quantity is scale-dependent, the difference is normalized to $S_t$ to yield

$$r^2 = \frac{S_t - S_r}{S_t}$$

where $r^2$ is called the coefficient of determination and $r$ is the correlation coefficient. For a perfect fit, $S_r = 0$ and $r^2 = 1$, signifying that the line explains 100% of the variability of the data. For $r^2 = 0$, $S_r = S_t$ and the fit represents no improvement. However, an $r^2$ close to 1 does not mean that the fit is necessarily good. For example, it is possible to obtain a relatively high value of $r^2$ when the underlying relationship between $y$ and $x$ is not even linear.

A nice example was developed by Anscombe (1973). As shown in the next figure, he came up with four data sets consisting of 11 data points each. Although their graphs are very different, all have the same best-fit equation, y = 3 + 0.5x, and the same coefficient of determination, $r^2$ = 0.67. This example dramatically illustrates why developing plots is so valuable.

Example:
In this application, force is the dependent variable (y) and velocity is the independent variable (x).

xi 10 20 30 40 50 60 70 80
yi 25 70 380 550 610 1220 830 1450

a. Find the linear least-squares fit F = a0 + a1v, and calculate the standard error of the estimate and the coefficient of determination for the fit.
b. Find the linear least-squares fit F = a1v, and calculate the standard error of the estimate and the coefficient of determination for the fit.
c. Sketch the graphs for both problems (a) and (b).
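A possible numerical sketch of parts (a) and (b) follows (Python with NumPy assumed). For part (a), np.polyfit is used; for the one-parameter fit in (b), using n − 1 degrees of freedom in the standard error is an assumption. The plots requested in (c) are not shown.

```python
import numpy as np

v = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
F = np.array([25, 70, 380, 550, 610, 1220, 830, 1450], dtype=float)
n = len(v)
S_t = np.sum((F - F.mean()) ** 2)          # spread around the mean

# (a) F = a0 + a1*v  (np.polyfit returns coefficients highest power first)
a1, a0 = np.polyfit(v, F, 1)
S_r = np.sum((F - (a0 + a1 * v)) ** 2)
s_yx = np.sqrt(S_r / (n - 2))              # standard error of the estimate
r2 = (S_t - S_r) / S_t                     # coefficient of determination
print("(a)", a0, a1, s_yx, r2)

# (b) F = a1*v (no intercept): minimizing sum(F - a1*v)^2 gives a1 = sum(v*F)/sum(v^2)
a1_b = np.sum(v * F) / np.sum(v ** 2)
S_r_b = np.sum((F - a1_b * v) ** 2)
s_yx_b = np.sqrt(S_r_b / (n - 1))          # n - 1: only one fitted parameter (assumed)
r2_b = (S_t - S_r_b) / S_t
print("(b)", a1_b, s_yx_b, r2_b)
```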

Linearization of Nonlinear Relationships

Transformations can be used to express the data in a form that is compatible with linear regression. One example is the exponential model

$$y = a_1 e^{b_1 x}$$

where $a_1$ and $b_1$ are constants. Another example of a nonlinear model is the simple power equation

$$y = a_2 x^{b_2}$$

where $a_2$ and $b_2$ are constant coefficients. A third example of a nonlinear model is the saturation-growth-rate equation

$$y = \frac{a_3 x}{b_3 + x}$$

where $a_3$ and $b_3$ are constant coefficients. Nonlinear regression techniques are available to fit these equations to experimental data directly. However, a simpler alternative is to use mathematical manipulations to transform the equations into a linear form. Then linear regression can be employed to fit the equations to data.

The equation $y = a_1 e^{b_1 x}$ can be linearized by taking its natural logarithm to yield

$$\ln y = \ln a_1 + b_1 x$$

Thus, a plot of $\ln y$ versus $x$ will yield a straight line with a slope of $b_1$ and an intercept of $\ln a_1$.
The equation $y = a_2 x^{b_2}$ is linearized by taking its base-10 logarithm to give

$$\log y = \log a_2 + b_2 \log x$$

Thus, a plot of $\log y$ versus $\log x$ will yield a straight line with a slope of $b_2$ and an intercept of $\log a_2$. Note that any base logarithm can be used to linearize this model. However, as done here, the base-10 logarithm is most commonly employed.

The equation $y = \dfrac{a_3 x}{b_3 + x}$ is linearized by inverting it to give

$$\frac{1}{y} = \frac{1}{a_3} + \frac{b_3}{a_3} \frac{1}{x}$$

Thus, a plot of $1/y$ versus $1/x$ will be linear, with a slope of $b_3/a_3$ and an intercept of $1/a_3$.
Example:
Fit the following data using a logarithmic transformation. The general relationship between force and velocity is described by

$$F = a v^b$$

Calculate the standard error of the estimate and the coefficient of determination for the fit.

v, m/s 10 20 30 40 50 60 70 80
F, N 25 70 380 550 610 1220 830 1450
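A possible numerical sketch of this transformation, again in Python/NumPy, is shown below. Note that the error measures here are computed in the log10-transformed coordinates; back-transforming and measuring the error in the original force units would be an alternative choice.

```python
import numpy as np

v = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
F = np.array([25, 70, 380, 550, 610, 1220, 830, 1450], dtype=float)
n = len(v)

# Linearize F = a*v^b  ->  log10(F) = log10(a) + b*log10(v)
X, Y = np.log10(v), np.log10(F)
b, log10_a = np.polyfit(X, Y, 1)       # slope = b, intercept = log10(a)
a = 10.0 ** log10_a

# Error measures evaluated in the transformed (log10) coordinates
S_r = np.sum((Y - (log10_a + b * X)) ** 2)
S_t = np.sum((Y - Y.mean()) ** 2)
s_yx = np.sqrt(S_r / (n - 2))          # standard error of the estimate
r2 = (S_t - S_r) / S_t                 # coefficient of determination
print(a, b, s_yx, r2)
```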
POLYNOMIAL REGRESSION
The least-squares procedure can be readily extended to fit the
data to a higher-order polynomial. For example, suppose that
we fit a second-order polynomial or quadratic:

$$y = a_0 + a_1 x + a_2 x^2 + e$$

For this case the sum of the squares of the residuals is

$$S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)^2$$

To generate the least-squares fit, we take the derivative of this equation with respect to each of the unknown coefficients of the polynomial:

$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2) = 0$$

$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)\, x_i = 0$$

$$\frac{\partial S_r}{\partial a_2} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)\, x_i^2 = 0$$

or, in matrix form,

$$\begin{bmatrix} n & \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 \\ \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i^3 & \sum_{i=1}^{n} x_i^4 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \\ \sum_{i=1}^{n} x_i^2 y_i \end{bmatrix}$$
Example:
Fit a second-order polynomial to the data

x 0 1 2 3 4 5
y 2.1 7.7 13.6 27.2 40.9 61.1
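As a minimal sketch of how the normal equations above could be assembled and solved for this data set (Python with NumPy assumed; np.polyfit is included only as a cross-check):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 7.7, 13.6, 27.2, 40.9, 61.1])
n = len(x)

# Assemble the 3x3 normal equations for y = a0 + a1*x + a2*x^2
A = np.array([[n,             x.sum(),        np.sum(x**2)],
              [x.sum(),       np.sum(x**2),   np.sum(x**3)],
              [np.sum(x**2),  np.sum(x**3),   np.sum(x**4)]])
b = np.array([y.sum(), np.sum(x * y), np.sum(x**2 * y)])

a0, a1, a2 = np.linalg.solve(A, b)
print(a0, a1, a2)

# Cross-check with NumPy's built-in fit (coefficients returned highest power first)
print(np.polyfit(x, y, 2))
```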

MULTIPLE LINEAR REGRESSION


Another useful extension of linear regression is the case where y is a linear function of two or more independent variables. For example, y might be a linear function of $x_1$ and $x_2$, as in

$$y = a_0 + a_1 x_1 + a_2 x_2 + e$$

For this case the sum of the squares of the residuals is

$$S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i})^2$$

and differentiating with respect to each of the unknown coefficients:

$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i}) = 0$$

$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i})\, x_{1,i} = 0$$

$$\frac{\partial S_r}{\partial a_2} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i})\, x_{2,i} = 0$$

or, in matrix form,

$$\begin{bmatrix} n & \sum_{i=1}^{n} x_{1,i} & \sum_{i=1}^{n} x_{2,i} \\ \sum_{i=1}^{n} x_{1,i} & \sum_{i=1}^{n} x_{1,i}^2 & \sum_{i=1}^{n} x_{1,i} x_{2,i} \\ \sum_{i=1}^{n} x_{2,i} & \sum_{i=1}^{n} x_{1,i} x_{2,i} & \sum_{i=1}^{n} x_{2,i}^2 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_{1,i} y_i \\ \sum_{i=1}^{n} x_{2,i} y_i \end{bmatrix}$$
Example:
Use multiple linear regression to fit this data

x1 0 2 2.5 1 4 7
x2 0 1 2 3 6 2
y 5 10 9 0 3 27
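As a sketch, the 3 × 3 normal equations above can be assembled and solved numerically for these data (Python with NumPy assumed):

```python
import numpy as np

x1 = np.array([0, 2, 2.5, 1, 4, 7])
x2 = np.array([0, 1, 2, 3, 6, 2])
y = np.array([5, 10, 9, 0, 3, 27], dtype=float)
n = len(y)

# Assemble the 3x3 normal equations for y = a0 + a1*x1 + a2*x2
A = np.array([[n,         x1.sum(),          x2.sum()],
              [x1.sum(),  np.sum(x1**2),     np.sum(x1 * x2)],
              [x2.sum(),  np.sum(x1 * x2),   np.sum(x2**2)]])
b = np.array([y.sum(), np.sum(x1 * y), np.sum(x2 * y)])

a0, a1, a2 = np.linalg.solve(A, b)
print(a0, a1, a2)   # for these data the fit is exact: a0 = 5, a1 = 4, a2 = -3
```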

The foregoing two-dimensional case can be easily extended to m dimensions, as in

$$y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m + e$$

Although there may be certain cases where a variable is linearly related to two or more other variables, multiple linear regression has additional utility in the derivation of power equations of the general form

$$y = a_0 x_1^{a_1} x_2^{a_2} x_3^{a_3} \cdots x_m^{a_m}$$

To use multiple linear regression, the equation is transformed by taking its logarithm to yield

$$\log y = \log a_0 + a_1 \log x_1 + a_2 \log x_2 + \cdots + a_m \log x_m$$
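As an illustrative sketch of this transformation (the helper name fit_power_model is hypothetical), multiple linear regression on the log10-transformed variables can be carried out with a standard least-squares solver:

```python
import numpy as np

def fit_power_model(X, y):
    """Fit y = a0 * x1^a1 * ... * xm^am via least squares on log10-transformed data.

    X is an (n, m) array of strictly positive predictors; y is a length-n array
    of strictly positive responses. (Hypothetical helper, for illustration only.)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (for log10 a0) plus log10 of each predictor
    A = np.column_stack([np.ones(len(y)), np.log10(X)])
    coeffs, *_ = np.linalg.lstsq(A, np.log10(y), rcond=None)
    a0 = 10.0 ** coeffs[0]
    return a0, coeffs[1:]   # a0 and the exponents a1..am
```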
