
MH3510/MTH352 Regression Analysis

Chapter 2: Simple linear regression

Simple linear regression uses a straight line to describe the relationship between two variables: the response variable and the predictor variable.

1 Formal model formulation


In this chapter, we consider the simple case with just one predictor variable X:

Y : response variable, X: predictor variable.

A sample of n observations is observed in the form of (xi , yi ), i = 1, . . . , n.


X    x_1    x_2    ...    x_n
Y    y_1    y_2    ...    y_n

DEFINITION 1 A simple linear regression model assumes:

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,   i = 1, 2, \ldots, n,    (1.1)

where

x_1, \ldots, x_n are known non-random constants, and

the errors \varepsilon_1, \ldots, \varepsilon_n are independent and identically distributed (i.i.d.) with

E(\varepsilon_i) = 0 and var(\varepsilon_i) = \sigma^2.

The parameters \beta_0 and \beta_1 are called regression coefficients. \beta_0 and \beta_1 are,
respectively, the intercept and the slope of the regression line. Their values are
unknown to us. There exists a huge literature about how to estimate them. In this
semester, the least squares estimate (LSE) is mainly considered. Under the simple
linear regression model, the responses y_i are expected to fluctuate about their
means \beta_0 + \beta_1 x_i by the random errors \varepsilon_i. A plot of y_i against x_i is expected to
exhibit a linear trend, subject to such random errors. Otherwise, we need some
transformation first, or must refer to more complicated methods.

EXAMPLE 1 (Soybean Yield and Fertilizer) Suppose that soybean yield is determined by the model

yield = \beta_0 + \beta_1 \cdot fertilizer + \varepsilon.

Here y = soybean yield of each plot, x = amount of fertilizer applied to each plot, and \varepsilon contains other factors such as land quality, rainfall, and so on.

EXAMPLE 2 (A Simple Wage Equation) A model relating a person's wage to observed education and other unobserved factors is

wage = \beta_0 + \beta_1 \cdot educ + \varepsilon.

Hence, y = dollars per hour, x = years of education, and \varepsilon includes labor experience, innate ability, tenure with current employer, work ethic, and other things affecting a person's wage level.

2 Least squares estimates


We desire a strategy for estimating the slope and intercept parameters in the
model (1.1). A reasonable way to fit a line is to minimize the amount by which
the fitted value differs from the actual value. This amount is called the residual.
Fitting a simple linear regression model to the data pairs is often called regressing
Y on X, i.e. finding estimates \hat{\beta}_0, \hat{\beta}_1 of \beta_0, \beta_1 such that the fitted regression line

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

best fits or explains the observations.


The residual, ei , for observation i is

e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i.

There are n such residuals.


Before proceeding, we list some important facts about the summation opera-
tor.
• Fact 1: if \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, then

\sum_{i=1}^n (x_i - \bar{x}) = 0.    (2.1)

• Fact 2:

\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i - \bar{x}) x_i = \sum_{i=1}^n x_i^2 - n \bar{x}^2.    (2.2)
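These identities are easy to verify numerically. A minimal Python sketch (using numpy, with arbitrary illustrative numbers, not data from these notes):

    import numpy as np

    x = np.array([2.0, 5.0, 7.0, 11.0])
    x_bar = x.mean()

    # Fact 1: deviations from the sample mean sum to zero
    print(np.isclose(np.sum(x - x_bar), 0.0))

    # Fact 2: three equivalent expressions for the same sum of squares
    lhs = np.sum((x - x_bar) ** 2)
    mid = np.sum((x - x_bar) * x)
    rhs = np.sum(x ** 2) - len(x) * x_bar ** 2
    print(np.isclose(lhs, mid), np.isclose(mid, rhs))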

DEFINITION 2 (Least squares (LS) method) Choose \hat{\beta}_0 and \hat{\beta}_1 to minimize the
sum of squared residuals, that is,

find \hat{\beta}_0, \hat{\beta}_1 to minimize Q = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 with respect to \beta_0, \beta_1.

From calculus, the first order necessary conditions for (\hat{\beta}_0, \hat{\beta}_1) to minimize Q
are given by

\frac{\partial Q}{\partial \beta_0} = \sum_{i=1}^n 2(y_i - (\beta_0 + \beta_1 x_i))(-1),

\frac{\partial Q}{\partial \beta_1} = \sum_{i=1}^n 2(y_i - (\beta_0 + \beta_1 x_i))(-x_i).

Setting these derivatives to zero yields the normal equations


\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0,    (2.3)

\sum_{i=1}^n x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.    (2.4)

The LSE of \beta_0 and \beta_1 are found by solving the normal equations for \hat{\beta}_0, \hat{\beta}_1:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}   and   \hat{\beta}_1 = S_{xy} / S_{xx},

where the sample mean \bar{y} is defined as \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i and the sums of squares are defined by

S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2,   S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).

A more efficient formula for the calculation of \hat{\beta}_1 is

\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n \bar{x} \bar{y}}{S_{xx}}.

The fitted regression line is

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i   or   \hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x}).
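As an illustration of the formulas above, the following Python sketch computes the least squares estimates directly from the definitions of S_{xx} and S_{xy}. The data and variable names are hypothetical, chosen only for the example:

    import numpy as np

    # illustrative data (not from the notes)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

    x_bar, y_bar = x.mean(), y.mean()
    S_xx = np.sum((x - x_bar) ** 2)
    S_xy = np.sum((x - x_bar) * (y - y_bar))

    beta1_hat = S_xy / S_xx                 # slope: Sxy / Sxx
    beta0_hat = y_bar - beta1_hat * x_bar   # intercept: y_bar - beta1_hat * x_bar

    y_fit = beta0_hat + beta1_hat * x       # fitted values on the regression line
    print(beta0_hat, beta1_hat)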

THEOREM 1 (Gauss-Markov Theorem) Among all estimators that are linear
combinations of y_1, \ldots, y_n and unbiased, the LSE has the smallest variance.

Figure 1: Residual for the ith observation.

EXAMPLE 3 Example 1.1 (cont'd)

Y: cholesterol level, X: age:

\bar{x} = 39.417, \bar{y} = 3.354, S_{xx} = 4139.834, S_{xy} = 217.858,

\hat{\beta}_0 = 1.280, \hat{\beta}_1 = 0.0526.

Fitted regression line:

\hat{y} = 1.280 + 0.0526 x,

which is shown in Figure 1.1(b) for reference.
Note: Graphically, the regression line is chosen to minimize the sum of squared
vertical departures of all observations from the line.

3 Properties of fitted regression line


There are some important algebraic properties of the fitted least squares regression
line. These properties hold for any sample of data. First, the sum of the LS
residuals is zero, i.e.

\sum_{i=1}^n e_i = 0,

in view of (2.3). Second, the sum of the observed values equals the sum of the
fitted values:

\sum_{i=1}^n y_i = \sum_{i=1}^n \hat{y}_i.

Third, the sample covariance between the regressor values and the LS residuals is zero;
mathematically, by (2.4),

\sum_{i=1}^n x_i e_i = 0.

Lastly, the regression line always goes through the point (\bar{x}, \bar{y}).
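These four properties can be checked numerically for any data set. A short Python sketch, re-using the same hypothetical data as in the earlier example:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar

    y_fit = beta0_hat + beta1_hat * x
    e = y - y_fit                                    # LS residuals

    print(np.isclose(e.sum(), 0.0))                  # residuals sum to zero
    print(np.isclose(y.sum(), y_fit.sum()))          # observed and fitted sums agree
    print(np.isclose((x * e).sum(), 0.0))            # residuals orthogonal to the regressor
    print(np.isclose(beta0_hat + beta1_hat * x_bar, y_bar))  # line passes through (x_bar, y_bar)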

4 Estimation of \sigma^2 in SLR
It seems reasonable to assume that the greater the variability of the random
error \varepsilon (which is measured by its variance \sigma^2), the greater will be the errors in
the estimation of the model parameters \beta_0 and \beta_1, and in the error of prediction
when \hat{y} is used to predict y for some value of x. Consequently, you should
not be surprised, as we proceed through this chapter, to find that \sigma^2 appears in the
formulas for all confidence intervals and test statistics that we use.
In most practical situations, \sigma^2 will be unknown and we must use the data to
estimate its value.
Recall the way of estimating \sigma^2 from an i.i.d. sample Y_1, \ldots, Y_n with E(Y_i) = \mu and
var(Y_i) = \sigma^2.

• Find

\sum_{i=1}^n (Y_i - \widehat{E(Y_i)})^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2.

Square the difference between each observation and the estimate of its
mean.

• Divide by the degrees of freedom:

s^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2.

Under the SLR model, E(y_i) = \beta_0 + \beta_1 x_i and var(y_i) = \sigma^2; the responses are
independent but not identically distributed. Let us do the same two steps.

• Find

\sum_{i=1}^n (y_i - \widehat{E(y_i)})^2 = \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 = SSE.

Square the difference between each observation and the estimate of its
mean. Here SSE denotes the error sum of squares.

• Divide by the degrees of freedom:

s^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 = \frac{SSE}{n-2}.

s^2 is an unbiased estimate of \sigma^2, that is,

E(s^2) = \sigma^2.
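Continuing the hypothetical numerical sketch, s^2 can be computed directly from the residuals:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar

    resid = y - (beta0_hat + beta1_hat * x)
    SSE = np.sum(resid ** 2)     # error sum of squares
    s2 = SSE / (n - 2)           # unbiased estimate of sigma^2 (note the n - 2)
    print(SSE, s2)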

5 Inference about regression parameters


In SLR analysis, the regression parameters \beta_0, \beta_1 are of our main interest. The
LSE \hat{\beta}_0 and \hat{\beta}_1 provide point estimates for the parameters, and would vary from
sample to sample. To assess the accuracy of the LSE, we ought to investigate the
sampling distributions of \hat{\beta}_0 and \hat{\beta}_1.
To set up interval estimates and make tests we need to make assumptions
about the distribution of \varepsilon_i. Throughout this section and the remaining sections
we assume the random errors \varepsilon_i in model (1.1) are independent N(0, \sigma^2). This
implies that the responses {y_i} are independent random variables with

y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2).    (5.1)
Fact 3: If y_i \sim N(\mu_i, \sigma_i^2), the y_i's are independent, and a_1, \ldots, a_n are known
constants, then

\sum_{i=1}^n a_i y_i \sim N\left( \sum_{i=1}^n a_i \mu_i, \; \sum_{i=1}^n a_i^2 \sigma_i^2 \right).    (5.2)

Thus, a linear combination of independent normal random variables is itself a
normal random variable.

THEOREM 2 The LSE estimators \hat{\beta}_0 and \hat{\beta}_1 are linear combinations of the y_i's. That
is, we can write

\hat{\beta}_1 = \sum_{i=1}^n k_i y_i,   and   \hat{\beta}_0 = \sum_{i=1}^n l_i y_i,

where the k_i's and l_i's are known constants.

PROOF Note that

\hat{\beta}_1 = \frac{1}{S_{xx}} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{S_{xx}} \left[ \sum_{i=1}^n (x_i - \bar{x}) y_i - \bar{y} \sum_{i=1}^n (x_i - \bar{x}) \right]

= \frac{1}{S_{xx}} \sum_{i=1}^n (x_i - \bar{x}) y_i = \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{S_{xx}} \right) y_i

= \sum_{i=1}^n k_i y_i   with   k_i = \frac{x_i - \bar{x}}{S_{xx}},

where the third step uses (2.1).

By the argument for \hat{\beta}_1 we have

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{1}{n} \sum_{i=1}^n y_i - \bar{x} \sum_{i=1}^n k_i y_i

= \sum_{i=1}^n \left( \frac{1}{n} - k_i \bar{x} \right) y_i = \sum_{i=1}^n l_i y_i   with   l_i = \frac{1}{n} - k_i \bar{x}.

Thus Theorem 2 is true and hence \hat{\beta}_0 and \hat{\beta}_1 are normal random variables by Fact 3.
What about their means and variances?

THEOREM 3 Under the SLR model with normal errors:

\hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right),   and   \hat{\beta}_0 \sim N\left( \beta_0, \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n S_{xx}} \right).

PROOF By Fact 3 and (5.1) we get

E(\hat{\beta}_1) = \sum_{i=1}^n k_i E(y_i) = \sum_{i=1}^n k_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum_{i=1}^n k_i + \beta_1 \sum_{i=1}^n k_i x_i = \beta_1,

because via (2.1) and (2.2)

\sum_{i=1}^n k_i = \sum_{i=1}^n \frac{x_i - \bar{x}}{S_{xx}} = \frac{1}{S_{xx}} \sum_{i=1}^n (x_i - \bar{x}) = 0

and

\sum_{i=1}^n k_i x_i = \sum_{i=1}^n \frac{x_i - \bar{x}}{S_{xx}} x_i = \frac{S_{xx}}{S_{xx}} = 1.

We also have

Var(\hat{\beta}_1) = Var\left( \sum_{i=1}^n k_i y_i \right) = \sum_{i=1}^n k_i^2 Var(y_i) = \sigma^2 \sum_{i=1}^n k_i^2 = \frac{\sigma^2}{S_{xx}},

where

\sum_{i=1}^n k_i^2 = \sum_{i=1}^n \frac{(x_i - \bar{x})^2}{S_{xx}^2} = \frac{1}{S_{xx}}.

Proving the remaining part of the theorem is basically the same.
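Theorem 3 can also be illustrated by simulation. The Python sketch below (with arbitrarily chosen true parameters, purely for illustration) repeatedly generates data from the SLR model with normal errors and compares the empirical mean and variance of \hat{\beta}_1 with \beta_1 and \sigma^2 / S_{xx}:

    import numpy as np

    rng = np.random.default_rng(0)

    # hypothetical design points and true parameters
    x = np.linspace(1.0, 10.0, 20)
    beta0, beta1, sigma = 2.0, 0.5, 1.0
    S_xx = np.sum((x - x.mean()) ** 2)

    estimates = []
    for _ in range(5000):
        y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
        estimates.append(b1)

    estimates = np.array(estimates)
    print(estimates.mean(), beta1)             # empirical mean vs true slope
    print(estimates.var(), sigma ** 2 / S_xx)  # empirical variance vs sigma^2 / Sxx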

5.1 Confidence intervals and tests for \beta_1


The key fact is \hat{\beta}_1 \sim N(\beta_1, \sigma^2 / S_{xx}). Thus

\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1).

But this is not directly useful because we do not know \sigma^2. Replacing \sigma^2 with the
estimator s^2, we obtain

\frac{\hat{\beta}_1 - \beta_1}{\sqrt{s^2 / S_{xx}}} \sim t(n-2).

In what follows, \alpha is the type I error probability \alpha = P(reject H_0 | H_0 true), always
between 0 and 1 and usually set at 0.01, 0.05 or 0.10. The level-\alpha hypothesis tests
concerning \beta_1 are classified as follows:

• A. Two-sided test: H_0: \beta_1 = c vs H_1: \beta_1 \neq c

• B. One-sided test: H_0: \beta_1 \geq c vs H_1: \beta_1 < c

• C. One-sided test: H_0: \beta_1 \leq c vs H_1: \beta_1 > c

The test statistic is

t^* = \frac{\hat{\beta}_1 - c}{\sqrt{s^2 / S_{xx}}}.
The rejection rules are

• A. Reject H_0 if |t^*| > t_{n-2}^{(\alpha/2)}.

• B. Reject H_0 if t^* < -t_{n-2}^{(\alpha)}.

• C. Reject H_0 if t^* > t_{n-2}^{(\alpha)}.

Alternatively, for case A, reject H_0 if the p-value = P(|t_{n-2}| > |T_{observed}|) is smaller than
\alpha. Moreover, the 100(1-\alpha)% confidence interval corresponding to case A is

\hat{\beta}_1 \pm s \, S_{xx}^{-1/2} \, t_{n-2}^{(\alpha/2)}.
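A sketch of how the t statistic, p-value and confidence interval might be computed in Python (hypothetical data; scipy.stats supplies the t quantiles):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n = len(x)

    x_bar, y_bar = x.mean(), y.mean()
    S_xx = np.sum((x - x_bar) ** 2)
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / S_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    s2 = np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2)

    se_beta1 = np.sqrt(s2 / S_xx)
    t_star = (beta1_hat - 0.0) / se_beta1              # test H0: beta1 = 0 (c = 0)
    p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)    # two-sided p-value

    t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)       # alpha = 0.05
    ci = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
    print(t_star, p_value, ci)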

EXAMPLE 4 Example 1.1 (cont'd)

Recall that \hat{\beta}_1 = 0.0526, s = \sqrt{MSE} = \sqrt{0.11158} = 0.3340.
Suppose a 95% confidence interval for \beta_1 is required. The 2.5% upper quantile
of t_{22} is t_{22}^{(0.025)} = 2.074. Thus the interval is 0.0526 \pm \frac{0.3340}{\sqrt{4139.833}} \times 2.074, i.e.
[0.0418, 0.0634]. Suppose we wish to know whether \beta_1 = 0 (i.e. no linear relationship
between age and cholesterol level). Note that the above interval misses the
value 0, which suggests that \beta_1 is significantly different from 0 (at the 5% level)
and that there is a significant linear relationship between age and cholesterol level.

Alternatively, we can perform a formal size 5% t test of the hypothesis that
\beta_1 = 0. The test statistic is T = \frac{\sqrt{4139.833}}{0.3340} (0.0526 - 0) = 10.133. The critical value is
t_{22}^{(0.025)} = 2.074. Here T > t_{22}^{(0.025)} and we should reject the hypothesis.
[ Again, the conclusion is consistent with what we found before. ]

5.2 Confidence intervals for mean response


Let x_0 denote the level of x for which we wish to estimate the mean response

E(y) = \beta_0 + \beta_1 x_0.

The point estimate is

\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0.

Note that from Theorem 2, with \hat{\beta}_0 = \sum l_i y_i and \hat{\beta}_1 = \sum k_i y_i,

\hat{y}_0 = \sum l_i y_i + x_0 \sum k_i y_i = \sum_{i=1}^n (l_i + k_i x_0) y_i.    (5.3)

Thus \hat{y}_0 is normally distributed and its mean and variance are

E(\hat{y}_0) = \beta_0 + \beta_1 x_0,   Var(\hat{y}_0) = \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \sigma^2.

It follows that

\frac{\hat{y}_0 - E(\hat{y}_0)}{\sqrt{s^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)}} \sim t_{n-2}.

Correspondingly, a 100(1-\alpha)% confidence interval for E(y) at x = x_0 is

\hat{y}_0 \pm s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \; t_{n-2}^{(\alpha/2)}.
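A minimal Python sketch of the mean-response interval at a chosen x_0 (hypothetical data and x_0, scipy.stats for the t quantile):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n, x0, alpha = len(x), 4.5, 0.05

    x_bar, y_bar = x.mean(), y.mean()
    S_xx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / S_xx
    b0 = y_bar - b1 * x_bar
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

    y0_hat = b0 + b1 * x0
    half = s * np.sqrt(1 / n + (x0 - x_bar) ** 2 / S_xx) * stats.t.ppf(1 - alpha / 2, n - 2)
    print(y0_hat - half, y0_hat + half)   # CI for E(y) at x = x0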

EXAMPLE 5 Example 1.1 (cont'd)
Suppose you are thinking of estimating the mean cholesterol level of patients
whose age is 60. Then the estimated value is \hat{\beta}_0 + 60 \hat{\beta}_1 = 4.436. The 95%
confidence interval for the mean response becomes

4.436 \pm 0.3340 \left( \frac{1}{24} + \frac{(60 - 39.417)^2}{4139.833} \right)^{1/2} \times 2.074,

i.e. [4.173, 4.699].

5.3 Prediction interval for ynew


After collecting the data we might be interested in predicting a new observation
whose x value is x_0.
Attention: in the previous subsection we estimated the mean of the distribution of y;
we now predict an individual outcome drawn from the distribution of y. Clearly,

y_{new} = \beta_0 + \beta_1 x_0 + \varepsilon_{new}.

It is natural to predict y_{new} by

\hat{y}_{new} = \hat{\beta}_0 + \hat{\beta}_1 x_0.

We then consider the difference

y_{new} - \hat{y}_{new} = y_{new} - \sum_{i=1}^n (l_i + k_i x_0) y_i

as in (5.3). Note that the future response y_{new} is independent of \hat{y}_{new}, which depends
only on the past observations y_1, \ldots, y_n. Since the difference is a linear combination of
independent normal random variables, it is a normal random variable with mean

E(y_{new}) - E(\hat{y}_{new}) = 0

and variance

Var(y_{new} - \hat{y}_{new}) = Var(y_{new}) + Var(\hat{y}_{new}) = \sigma^2 + \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right).
Therefore

\frac{y_{new} - \hat{y}_{new}}{s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t(n-2).

A 100(1-\alpha)% prediction interval for y_{new} is given by

\hat{y}_{new} \pm s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \; t_{n-2}^{(\alpha/2)}.
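The prediction interval differs from the mean-response interval only by the extra "1 +" inside the square root. A sketch with the same hypothetical data as above:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n, x0, alpha = len(x), 4.5, 0.05

    x_bar, y_bar = x.mean(), y.mean()
    S_xx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / S_xx
    b0 = y_bar - b1 * x_bar
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

    y_new_hat = b0 + b1 * x0
    half = s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / S_xx) * stats.t.ppf(1 - alpha / 2, n - 2)
    print(y_new_hat - half, y_new_hat + half)   # prediction interval at x = x0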

EXAMPLE 6 Example 1.1 (cont'd)
Suppose we wish to predict the cholesterol level of a future patient whose age is
60, say. Then the predicted value is

\hat{\beta}_0 + 60 \hat{\beta}_1 = 4.436.

The 95% prediction interval is

4.436 \pm 0.3340 \left( 1 + \frac{1}{24} + \frac{(60 - 39.417)^2}{4139.833} \right)^{1/2} \times 2.074,

i.e. [3.695, 5.177].

6 Analysis of variance
An analysis of variance is a formal method, tabulating the results in an analysis
of variance (ANOVA) table, for checking whether the fitted model is adequate.
It provides a different way of looking at what we have already done.
Consider the identity

y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i).

Is there a quadratic analogue? To see this, we define the following sums of
squares.

• Total sum of squares: the total variation in the y_i's,

S_{yy} = \sum (y_i - \bar{y})^2.

• Regression sum of squares: the variation in the y_i's explained by the regression on x,

SSR = \sum (\hat{y}_i - \bar{y})^2.

• Error sum of squares: the variation of the y_i's around the regression line,

SSE = \sum (y_i - \hat{y}_i)^2.

Note that \hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x}) and y_i - \hat{y}_i = y_i - \bar{y} - \hat{\beta}_1 (x_i - \bar{x}). Hence,

\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^n (x_i - \bar{x}) [y_i - \bar{y} - \hat{\beta}_1 (x_i - \bar{x})]

= \hat{\beta}_1 [S_{xy} - \hat{\beta}_1 S_{xx}] = 0

and

S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2

= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + 2 \sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})

= SSE + SSR.

6.1 ANOVA table


In view of the above decomposition of S_{yy}, the ANOVA table for SLR is given as
follows.

Source of variation    df       Sum of squares (SS)                    MS       F             p-value
Regression             1        SSR = \sum (\hat{y}_i - \bar{y})^2     MSReg    MSReg / s^2
Residual               n - 2    SSE = \sum (y_i - \hat{y}_i)^2         s^2
Total                  n - 1    S_{yy} = \sum (y_i - \bar{y})^2

Here df denotes degrees of freedom and MS denotes mean squares (= SS/df). The
degrees of freedom indicate how many independent pieces of information, involving
the n independent numbers y_1, \ldots, y_n, are needed to calculate the sum of
squares. For example, the quantity S_{yy} = \sum (y_i - \bar{y})^2 needs (n - 1) independent
pieces of information, since the sum of y_1 - \bar{y}, \ldots, y_n - \bar{y} is equal to zero; the
quantity SSR = \sum (\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2 \sum (x_i - \bar{x})^2 needs only one independent
piece of information.
The R^2 statistic is a useful statistic for checking how good the regression fit is, and
has the value

R^2 = \frac{SSR}{S_{yy}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r_{xy}^2.
Hence, from the viewpoint of linear regression, it measures the

proportion of total variation about the mean \bar{y} explained by the regression;

while from the viewpoint of correlation coefficients, it measures the

strength of the linear relationship between the two variables X and Y.

It measures how strong the linear relationship between x and y is: the higher
the R^2, the stronger the linear relationship.
Based on the ANOVA table there is another way to test

H_0: \beta_1 = 0   vs   H_1: \beta_1 \neq 0.

Under the assumption that \beta_1 = 0, the F-ratio

F = MSReg / s^2 \sim F(1, n-2).

Hence, this F-ratio can be used to check whether or not the predictor variable X
contributes significantly to the explanation of the response variable Y. The rejection
rule is to reject H_0 if F > F^{(\alpha)}(1, n-2). Note that we know the exact distribution
of F under H_0, and hence the p-value can be calculated accordingly. Loosely speaking
(a numerical sketch follows this list),

• Large F-ratio: SSR contributes much to S_{yy}, so X is effective in explaining the
variation in Y;

• Small F-ratio: SSR contributes little to S_{yy}, so X is not effective in explaining the
variation in Y.
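A hypothetical Python sketch of the ANOVA computations (SSR, SSE, the F-ratio, its p-value and R^2), using scipy.stats for the F distribution:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n = len(x)

    x_bar, y_bar = x.mean(), y.mean()
    S_xx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / S_xx
    b0 = y_bar - b1 * x_bar
    y_hat = b0 + b1 * x

    S_yy = np.sum((y - y_bar) ** 2)      # total sum of squares
    SSR = np.sum((y_hat - y_bar) ** 2)   # regression sum of squares
    SSE = np.sum((y - y_hat) ** 2)       # error sum of squares (SSR + SSE = Syy)

    s2 = SSE / (n - 2)                   # mean square error
    F = (SSR / 1) / s2                   # F-ratio with (1, n - 2) degrees of freedom
    p_value = stats.f.sf(F, 1, n - 2)
    R2 = SSR / S_yy
    print(F, p_value, R2)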

EXAMPLE 7 Example 1.1 (cont'd)

ANOVA table:

Source of variation    df    Sum of squares (SS)    MS          F
Regression             1     11.46477               11.46477    102.7473
Residual               22    2.45481                0.11158
Total                  23    13.91958

• Calculating by computer,

p-value = P(F_{1,22} > 102.7473) = 9.4283 \times 10^{-10},

which is extremely small. We conclude that there is a highly significant
linear relationship between age and cholesterol level.

• For a size-\alpha test, we read from standard statistical tables or computer software:

\alpha     Critical value F_{1,22}^{(\alpha)}
5%        4.301
1%        7.945

In either case, the F-ratio 102.7473 is much bigger than the critical value
and we should reject H_0 at the stated significance level.

6.2 R^2 statistic and F statistic
A significant result for an F test of the regression does not necessarily imply a large
R^2. The test is used to judge whether there is evidence of a linear relationship
between Y and X, whereas R^2 is used to measure how good that relationship is.
To illustrate this point we consider some examples below.

EXAMPLE 8 Example 1.1 (cont'd)

R^2 = 11.465/13.920 = 0.824, which shows a reasonably good fit by the SLR model.

EXAMPLE 9 Consider the following 12 observations on (X, Y):

X    1      2      3      4      5      6       7      8      9       10     11     12
Y    0.700  0.193  3.423  6.553  7.680  10.245  1.223  7.338  12.285  6.564  8.711  6.144

Regressing Y on X yields the ANOVA table:

Source of variation    df    Sum of squares (SS)    MS          F
Regression             1     54.8601                54.86010    5.152688
Residual               10    106.4689               10.64689
Total                  11    161.3290

• p-value = P(F_{1,10} > 5.152688) = 0.0466, which is smaller than 0.05. (Alternatively,
check that the F-ratio 5.153 > F_{1,10}^{(0.05)} = 4.965.) Hence the linear
relationship between X and Y is significant at the 5% level.

• The quantity R^2 = 0.3401, which shows that the regression line is far from
being a good fit to the data.

This illustrates the earlier remark that a significant linear relationship is not necessarily
accompanied by a good fit of the regression line.
[ In fact, a significant linear relationship means the SLR model is better than no SLR
model, but does not mean that the SLR model is a good fit. ]
Figure 3 gives the scatterplot of the data, together with the fitted regression line.
Clearly, the observed Y values display a strong upward trend as X increases, but the
observations still suffer from large fluctuations about the fitted line.

Generally, if T \sim t(n-2) then T^2 \sim F(1, n-2). This implies that the F test
rejects if and only if the t test rejects.
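This relationship is easy to confirm numerically; a minimal sketch with the same hypothetical data as above:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n = len(x)

    x_bar, y_bar = x.mean(), y.mean()
    S_xx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / S_xx
    b0 = y_bar - b1 * x_bar
    y_hat = b0 + b1 * x
    s2 = np.sum((y - y_hat) ** 2) / (n - 2)

    t_stat = b1 / np.sqrt(s2 / S_xx)               # t statistic for H0: beta1 = 0
    F_stat = np.sum((y_hat - y_bar) ** 2) / s2     # F-ratio from the ANOVA table
    print(np.isclose(t_stat ** 2, F_stat))         # F equals t squared

    alpha = 0.05                                   # the critical values agree as well
    print(np.isclose(stats.t.ppf(1 - alpha / 2, n - 2) ** 2, stats.f.ppf(1 - alpha, 1, n - 2)))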

Figure 3: Scatterplot of X vs Y.

7 Regression through the Origin


There are cases where a no-intercept restriction is imposed on the linear regression
model, so that the population regression model is

y_i = \beta_1 x_i + \varepsilon_i.

We can use this model when

• experience tells us that there is no intercept;

• the estimated intercept \hat{\beta}_0 is insignificant in the process of estimation;

• both y_1, \ldots, y_n and x_1, \ldots, x_n have been centered.

Relying on the least squares method we obtain

\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}.
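A minimal Python sketch of the no-intercept fit (hypothetical data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.9, 4.1, 6.2, 7.9, 10.1])

    # slope of the regression through the origin
    beta1_hat = np.sum(x * y) / np.sum(x ** 2)
    print(beta1_hat)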
