
ANOVA: Analysis of Variance

The model $Y = X\beta + \varepsilon$ accounts for variation in the responses $Y$, measured by $Y^TY$, in 2 ways:

1. systematic variation explained by the regressor variables $x_1, \ldots, x_k$ in $X$
2. random variation in $\varepsilon$

From the LS estimate, $Y = \hat{Y} + \hat{\varepsilon}$. Therefore
$$Y^TY = (\hat{Y} + \hat{\varepsilon})^T(\hat{Y} + \hat{\varepsilon}) = \hat{Y}^T\hat{Y} + \hat{\varepsilon}^T\hat{\varepsilon} + 2\hat{\varepsilon}^T\hat{Y}$$
But $\hat{\varepsilon} \perp \hat{Y}$ (i.e. $\hat{\varepsilon}^T\hat{Y} = 0$):
$$\hat{\varepsilon}^T\hat{Y} = ((I - H)Y)^T HY = Y^T(H - H^2)Y = 0$$

ANOVA continued
Therefore
$$Y^TY = \hat{Y}^T\hat{Y} + \hat{\varepsilon}^T\hat{\varepsilon} = \mathrm{SSmod} + \mathrm{SSE}$$
where $\mathrm{SSmod} = \hat{Y}^T\hat{Y} = Y^THY$, the model sum of squares, and $\mathrm{SSE} = \hat{\varepsilon}^T\hat{\varepsilon} = Y^T(I - H)Y$, the residual sum of squares.

Note that $\hat{Y}^T\hat{Y}$ is the squared length of the vector $\hat{Y}$, and similarly for $\hat{\varepsilon}^T\hat{\varepsilon}$.

Computing SSmod and SSE


(1) use IML to compute $Y^THY$ (or $\hat{Y}^T\hat{Y}$) and $Y^T(I - H)Y$ (see the sketch after the output below)

(2) use REG where the default intercept is suppressed (/ noint) and a regressor $x_0$ is included with $x_0 = 1$ for all observations.

Code (quadratic model: $x_1 = x$, $x_2 = x^2$)

  proc reg data = ...;
    model y = x0 x1 x2 / noint;
  quit;

Output

  Source              DF   Sum of Squares   Mean Square
  Model                3        851.71686     283.90562
  Error                2          5.73314       2.86657
  Uncorrected Total    5        857.45000
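For option (1), a minimal PROC IML sketch of the computation (the data set name work.ex is an assumption; x0, x1, x2 are the columns described in option (2)):

  proc iml;
    use work.ex;                       /* hypothetical data set               */
    read all var {y} into Y;           /* response vector                     */
    read all var {x0 x1 x2} into X;    /* x0 = 1 for every observation        */
    close work.ex;
    H     = X * inv(X`*X) * X`;        /* perpendicular projection onto C(X)  */
    n     = nrow(Y);
    SSmod = Y` * H * Y;                /* model sum of squares, Y'HY          */
    SSE   = Y` * (I(n) - H) * Y;       /* residual sum of squares, Y'(I-H)Y   */
    print SSmod SSE;
  quit;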

Example 1: SSmod for the model $E(y) = \beta_0$

For n observations from the model $E(y) = \beta_0$,
$$X = (1, 1, \ldots, 1)^T$$
so
$$X^TX = n, \qquad XX^T = J \text{ (an } n \times n \text{ matrix of 1s)}, \qquad (X^TX)^{-1} = \tfrac{1}{n}$$
and
$$H = X(X^TX)^{-1}X^T = \tfrac{1}{n}J$$
Therefore
$$\mathrm{SSmod} = Y^THY = \tfrac{1}{n}(Y^TJY) = \tfrac{1}{n}(X^TY)^T(X^TY) = \tfrac{1}{n}\Big(\sum y_i\Big)^2 = n\bar{y}^2$$
= the correction for the mean

Example 1 continued
SSmod for the model $E(y) = \beta_0$ again! For n observations from the model $E(y) = \beta_0$ it is easy to check that $\hat{\beta}_0 = \bar{y}$, so $\hat{Y} = \bar{y}X$.

Therefore
$$\hat{Y}^T\hat{Y} = \bar{y}^2\, X^TX = n\bar{y}^2$$
as before.

Example 2: one-way ANOVA (t means)

$E(y_{ij}) = \mu_i$ for $i = 1, \ldots, t$; $j = 1, \ldots, n_i$. It is not too difficult to show that H is an $n \times n$ block diagonal matrix consisting of blocks $\tfrac{1}{n_i}J_{n_i}$, where $J_{n_i}$ is a square matrix of 1s and $n = \sum_{i=1}^{t} n_i$. Hence
$$\mathrm{SSmod} = \sum_{i=1}^{t} n_i\,\bar{y}_i^2$$
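A small numerical sketch of this structure in PROC IML, with hypothetical group sizes n1 = 2, n2 = 3 and made-up responses (the BLOCK and J functions assemble the block diagonal H directly):

  proc iml;
    H = block( J(2,2,1)/2, J(3,3,1)/3 );   /* H for t = 2 groups              */
    y = {4, 6, 7, 9, 11};                  /* hypothetical responses (column) */
    SSmod = y` * H * y;                    /* Y'HY                            */
    check = 2*(5##2) + 3*(9##2);           /* n1*ybar1^2 + n2*ybar2^2,        */
                                           /* with ybar1 = 5, ybar2 = 9       */
    print SSmod check;                     /* both equal 293                  */
  quit;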

Comparing Models

Scientific questions of interest can often be formulated in terms of whether particular terms are required in modelling E(y).

Also, in attempting to explain variation we look for the simplest model that will adequately fit the data.

However, diagnostic tests are essential to check that model assumptions are not violated, for example by omitting important terms.

Examples of Comparing Models

In Example 2 (two slides previous) the hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_t$ is tested by comparing the fit of the model in Example 2 with that in Example 1. Model 1 is a special case of Model 2, and if Model 1 fits nearly as well as Model 2 then $H_0$ is not rejected.

Example 3
Fitting the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$, we can ask: does the rate at which E(y) changes with $x_1$ depend on the value of $x_2$? (i.e. does the simpler model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ adequately fit the data?)

The answer to the question is to test $H_0: \beta_3 = 0$ v $H_a: \beta_3 \neq 0$. This could be done if we knew the distributional properties of $\hat{\beta}_3$. So far we know the first two moments of this distribution, but we need further assumptions to ensure the distribution is normal.

Example 4

For the model
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2$$
the question whether E(y) depends on $x_2$ involves comparing this model to
$$E(y) = \beta_0 + \beta_1 x_1$$
i.e. testing $H_0: \beta_2 = \beta_3 = 0$.

General Formulation
Assuming the model
$$Y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I) \qquad (1)$$
we wish to know if some simpler model
$$Y = X_0\gamma + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I), \quad C(X_0) \subset C(X) \qquad (2)$$
gives an acceptable fit to the data. Models (1) & (2) can be expressed as
$$(1)\; E(Y) \in C(X); \qquad (2)\; E(Y) \in C(X_0)$$
respectively. Often the subspace $C(X_0) \subset C(X)$ is described by a set of r linear equations $L\beta = 0$, where L is an $r \times p$ matrix of constants.

Examples
Example 3 (above): $H_0: \beta_3 = 0$ is $L\beta = 0$ with the single row
$$L = \begin{pmatrix} 0 & 0 & 0 & 1 \end{pmatrix}$$

Example 4 (above): $H_0: \beta_2 = \beta_3 = 0$ requires
$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Example 1 (with n = 4): the 3 independent equations required are $\mu_1 = \mu_2$, $\mu_2 = \mu_3$, $\mu_3 = \mu_4$, i.e.
$$L = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}$$

Comparing models using SS


We assume Model (1) is true, so that E(Y) is estimated by $HY$. If Model (2) is true, E(Y) is also estimated by $H_0Y$, where $H_0$ is the perpendicular projection onto $C(X_0)$. In this case the difference between the two estimates, $(H - H_0)Y$, has expected value 0 and should be small.

An obvious measure of the size of $(H - H_0)Y$ is its squared length $Y^T(H - H_0)Y$, since applying $Y^TY = \mathrm{SSmod} + \mathrm{SSE}$ to each model gives
$$Y^T(H - H_0)Y = \mathrm{SSmod}(1) - \mathrm{SSmod}(2) = \mathrm{SSE}(2) - \mathrm{SSE}(1)$$

Exercise: $Y^T(H - H_0)Y$ is the squared length of $(H - H_0)Y$ since $H - H_0$ is idempotent. Prove this!

Comparing models using SS continued

1. The smaller SSE (equivalently, the larger SSmod), the better the model, in that the regressors explain more of the variation in Y.

2. Comparing 2 models (a full and a reduced model), if the difference in SSE (equivalently, in SSmod) is small then the reduced model may be considered adequate.

3. From the 2 models we get two sets of estimates, $\hat{Y}_{\text{full}}$ & $\hat{Y}_{\text{reduced}}$, and the difference in SSmod is the squared distance between these 2 estimates.

Example 2 (one-way ANOVA)


For testing $H_0: \mu_1 = \mu_2 = \cdots = \mu_t$ in Example 2 the change in model SS is:
$$\sum n_i\,\bar{y}_i^2 - n\bar{y}^2 = \sum n_i(\bar{y}_i - \bar{y})^2$$
the familiar treatment sum of squares in a one-way ANOVA.

Exercise: Prove the equality above.

We need to know whether the change in error SS is large enough to exclude the simpler model.


This requires some distributional results that follow from an additional assumption: $\varepsilon \sim N(0, \sigma^2 I)$.

First: some notation to describe the SS that appear in computer output.

R(· | ·) notation
To emphasize the role of the regressors in $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$ put:
$$R(\beta_0, \beta_1, \ldots, \beta_k) = \mathrm{SSmod}$$
To see if the x's affect E(y), compare this model to the constant mean model $E(y) = \beta_0$, for which
$$R(\beta_0) = n\bar{y}^2 = \text{the correction for the mean (CM)}$$
The difference in SSmod for the 2 models is denoted by $R(\beta_1, \beta_2, \ldots, \beta_k \mid \beta_0)$, i.e.
$$R(\beta_1, \beta_2, \ldots, \beta_k \mid \beta_0) = R(\beta_0, \beta_1, \ldots, \beta_k) - R(\beta_0)$$

R(· | ·) notation continued
When testing for a subset model obtained by omitting terms from a larger model, say testing
$$H_0: \beta_{r+1} = \beta_{r+2} = \cdots = \beta_k = 0$$
where $E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \beta_{r+1} x_{r+1} + \cdots + \beta_k x_k$, the difference in model (error) sums of squares is
$$R(\beta_0, \beta_1, \ldots, \beta_k) - R(\beta_0, \beta_1, \ldots, \beta_r)$$
and is denoted by $R(\beta_{r+1}, \beta_{r+2}, \ldots, \beta_k \mid \beta_0, \beta_1, \ldots, \beta_r)$. i.e. for coefficients $\beta$ and $\gamma$,
$$R(\beta \mid \gamma) = R(\beta, \gamma) - R(\gamma)$$
which measures the effect (decrease in SSE) of adding $\beta$ to a model already containing $\gamma$.
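These sequential differences are what SAS reports as Type I sums of squares: in PROC GLM each term's Type I SS is its $R(\cdot \mid \cdot)$ increment given all terms listed before it. A minimal sketch (the data set and variable names are assumptions):

  proc glm data = ...;
    /* Type I SS rows: R(b1 | b0), R(b2 | b0, b1), R(b3 | b0, b1, b2) */
    model y = x1 x2 x3;
  run; quit;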

SSreg
$$Y^TY = \mathrm{SSmod} + \mathrm{SSE} = R(\beta_0, \beta_1, \ldots, \beta_k) + \mathrm{SSE} = R(\beta_0) + R(\beta_1, \ldots, \beta_k \mid \beta_0) + \mathrm{SSE}$$
i.e.
$$Y^TY = R(\beta_0) + \mathrm{SSreg} + \mathrm{SSE}$$
where
$$\mathrm{SSreg} = R(\beta_1, \beta_2, \ldots, \beta_k \mid \beta_0)$$

SSreg is the amount by which the regressor variables reduce SSE compared to a constant mean.

Since $R(\beta_0) = \tfrac{1}{n}(\sum y_i)^2$ and $\sum y_i^2 - \tfrac{1}{n}(\sum y_i)^2 = \sum(y_i - \bar{y})^2$, we get a partition of the total corrected SS:
$$\mathrm{SSTo} = \mathrm{SSreg} + \mathrm{SSE}$$

Computing SSreg using REG procedure


The ANOVA table contains SSreg when no variable representing an intercept is explicitly specified in the model statement. Without the option noint, REG assumes an intercept is intended, and the table entry labeled Model is in fact SSreg. The total is consequently the corrected total SS.

Code (quadratic model: $x_1 = x$, $x_2 = x^2$)

  proc reg data = ...;
    model y = x1 x2;
  quit;

Output

  Source            DF   Sum of Squares   Mean Square
  Model              2         90.33886      45.16943
  Error              2          5.73314       2.86657
  Corrected Total    4         96.07200
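Comparing this output with the earlier noint fit of the same data illustrates the partition numerically:
$$R(\beta_0) = Y^TY - \mathrm{SSTo} = 857.450 - 96.072 = 761.378$$
$$\mathrm{SSmod} = R(\beta_0) + \mathrm{SSreg} = 761.378 + 90.33886 = 851.71686$$
while SSE = 5.73314 is the same in both tables.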

ANOVA tables
  Source                SS                      df
  Mean                  Y^T (1/n)J Y            1
  Regressors            Y^T (H - (1/n)J) Y      p - 1
  Residual              Y^T (I - H) Y           n - p
  Total (uncorrected)   Y^T Y                   n

This is often presented as a decomposition of SSTo:

  Source              SS                      df
  Regressors          Y^T (H - (1/n)J) Y      p - 1
  Residual            Y^T (I - H) Y           n - p
  Total (corrected)   Y^T (I - (1/n)J) Y      n - 1

Mean squares, MS, and F values are given by e.g.
$$\mathrm{MSreg} = \frac{\mathrm{SSreg}}{p - 1}, \qquad \mathrm{MSE} = \frac{\mathrm{SSE}}{n - p}, \qquad F = \frac{\mathrm{MSreg}}{\mathrm{MSE}}$$

E(SS)
A first step in evaluating the effect on SSE of enlarging the regression space from $C(X_0)$ to $C(X)$ is to calculate expected sums of squares $E(Y^T(H - H_0)Y)$. The following are obtained by applying the formula for $E(Y^TAY)$. In general:
$$E(Y^T(H - H_0)Y) = (p - p_0)\sigma^2 + \beta^TX^T(H - H_0)X\beta$$
so
$$E(\mathrm{SSE}) = (n - p)\sigma^2 \qquad (1)$$
$$E(\mathrm{SSmod}) = p\sigma^2 + \beta^T(X^TX)\beta \qquad (2)$$
$$E(\mathrm{SSreg}) = (p - 1)\sigma^2 + \beta^TX^T(I - \tfrac{1}{n}J)X\beta \qquad (3)$$

Notes: (1) has been proved already. In (3), $\beta^TX^T(I - \tfrac{1}{n}J)X\beta$ does not involve the intercept and so is a quadratic form in the $k = p - 1$ regression coefficients $\beta_1, \ldots, \beta_k$.

F tests
For the usual inferences we need to assume normal error terms, i.e. $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$.

If e.g. E(y) does not depend on the regressor variables then $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ is true and $E(\mathrm{MSreg}) = \sigma^2$. If $H_0$ is false then $E(\mathrm{MSreg}) > \sigma^2$. Hence, to test $H_0$ use
$$F = \frac{\mathrm{MSreg}}{\mathrm{MSE}}$$
rejecting $H_0$ if $F \gg 1$. A precise test is obtained knowing that the normal error assumption implies that under $H_0$,
$$F \sim F(p - 1, n - p)$$

The General Linear Hypothesis (GLH)


Suppose $C(X_0) \subset C(X)$, with dimensions $p_0 < p$, correspond to the reduced and full models respectively. With $X$ and $X_0$ having linearly independent columns, $p$ and $p_0$ are the numbers of parameters in the models.

The GLH enables us to test $H_0: E(Y) \in C(X_0)$, assuming that $C(X)$ is large enough to ensure $E(Y) \in C(X)$.

The projection matrices are H (full model) and $H_0$ (reduced model), and the model SS are
$$Y^THY \;\; \text{(full model)} \qquad Y^TH_0Y \;\; \text{(reduced model)}$$

The GL Hypothesis continued


The appropriate F-value is
$$F = \frac{Y^T(H - H_0)Y/(p - p_0)}{Y^T(I - H)Y/(n - p)}$$
and the p-value for testing $H_0$, given an observed value $F_{obs}$, is
$$p = \Pr(F > F_{obs}), \qquad F \sim F(p - p_0, n - p)$$
Equivalently, at significance level $\alpha$, $H_0$ is rejected if $F_{obs} > F_{1-\alpha}(p - p_0, n - p)$, where $F_{1-\alpha}(\nu_1, \nu_2)$ is the $100(1 - \alpha)$ percentage point of an F distribution with $(\nu_1, \nu_2)$ df.
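Both the p-value and the critical value are easy to compute in a SAS data step with the built-in PROBF and FINV functions; a sketch, using the observed F and (6, 10) df that appear in the Draper & Smith example later in these notes:

  data glh;
    Fobs  = 10.867;                     /* observed F value              */
    df1   = 6;                          /* numerator df, p - p0          */
    df2   = 10;                         /* denominator df, n - p         */
    p     = 1 - probf(Fobs, df1, df2);  /* p = Pr(F > Fobs)              */
    Fcrit = finv(0.95, df1, df2);       /* F_{1-alpha} for alpha = 0.05  */
  run;

  proc print data=glh; run;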

Example: Testing whether all $\beta_i = 0$

This is the overall F test of whether or not there is a regression relation between the response variable Y and the set of X variables.
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$$
$$H_A: \text{not all } \beta_i \; (i = 1, \ldots, p-1) \text{ equal } 0$$
Hence, to test $H_0$ use
$$F = \frac{R(\beta_1, \beta_2, \ldots, \beta_{p-1} \mid \beta_0)/(p - 1)}{\mathrm{MSE}}$$
Decision rule: if $F \le F(1-\alpha;\, p-1,\, n-p)$, conclude $H_0$. If $F > F(1-\alpha;\, p-1,\, n-p)$, conclude $H_A$.

Example: Testing whether a single $\beta_i = 0$

Do we need $x_i$ in the model? $H_0: \beta_i = 0$ v $H_a: \beta_i \neq 0$.

The t test uses
$$t = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)}$$
The F test uses
$$F = \frac{R(\beta_i \mid \beta_0, \ldots, \beta_{i-1}, \beta_{i+1}, \ldots, \beta_{p-1})/1}{\mathrm{MSE}}$$
Decision rule: if $F \le F(1-\alpha;\, 1,\, n-p)$, conclude $H_0$. If $F > F(1-\alpha;\, 1,\, n-p)$, conclude $H_A$. In fact $F = t^2$.

Example: Testing whether some $\beta_i = 0$

$$H_0: \beta_q = \beta_{q+1} = \cdots = \beta_{p-1} = 0$$
$H_a$: not all of the $\beta_i$ in $H_0$ equal zero.

The F test:
$$F = \frac{R(\beta_q, \ldots, \beta_{p-1} \mid \beta_0, \beta_1, \ldots, \beta_{q-1})/(p - q)}{\mathrm{MSE}}$$
Decision rule: if $F \le F(1-\alpha;\, p-q,\, n-p)$, conclude $H_0$. If $F > F(1-\alpha;\, p-q,\, n-p)$, conclude $H_A$.

Example: Do we need a group of terms? For example, with $E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_5 x_5$ we could test $H_0: \beta_3 = \beta_4 = \beta_5 = 0$. The F test has $(3, n-6)$ df and
$$F = \frac{R(\beta_3, \beta_4, \beta_5 \mid \beta_0, \beta_1, \beta_2)/3}{\mathrm{MSE}}$$
A SAS sketch of this joint test is given below.
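In PROC REG the same joint test can be requested with a TEST statement; a minimal sketch (the data set name and variables are assumptions):

  proc reg data = ...;
    model y = x1 x2 x3 x4 x5;
    Group: test x3, x4, x5;   /* joint F test of H0: beta3 = beta4 = beta5 = 0 */
  quit;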

More complicated example

For the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$ we can test
$$\beta_1 = \beta_2, \qquad \beta_3 = 2\beta_4$$
The reduced model is
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_1 x_2 + \beta_3 x_3 + 0.5\beta_3 x_4 = \beta_0 + \beta_1(x_1 + x_2) + \beta_3(x_3 + 0.5x_4)$$
We can now perform a regression (with an intercept) on the variables $x_1 + x_2$ and $x_3 + 0.5x_4$. Both routes are sketched below.
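A sketch of both routes in SAS (the data set name mydata is an assumption): fit the reduced model on derived variables, or ask PROC REG for the constrained test directly.

  data reduced;
    set mydata;
    z1 = x1 + x2;        /* common coefficient for x1 and x2 */
    z2 = x3 + 0.5*x4;    /* enforces beta4 = 0.5*beta3       */
  run;

  proc reg data=reduced;
    model y = z1 z2;     /* reduced-model SSE for the GLH F test */
  quit;

  proc reg data=mydata;
    model y = x1 x2 x3 x4;
    Constr: test x1 - x2 = 0, x3 - 2*x4 = 0;   /* same F in one step */
  quit;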

Example (Draper & Smith pg 254)


A pilot plant investigation headed "Predicting Stoichiometric CaHPO4·2H2O" was carried out and 3 variables, $x_1, x_2, x_3$, were investigated to check their ability to predict the response

y = yield as a percentage of predicted yield

from
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{33} x_3^2 + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3$$
with the usual normality and independence assumptions.

Draper & Smith continued

Reduced models to consider. Two obvious hypotheses are:

1. Are the 2nd order terms required?
2. Is $x_2$ required? This is suggested by the fact that coefficients of terms involving $x_2$ are of borderline significance.

ANOVA tables for the full model and the 2 reduced models are obtained from the SAS procedure GLM, as sketched below.
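A sketch of the three fits (the data set name pilot is an assumption; squares and cross-products can be written directly in a GLM model statement):

  proc glm data=pilot;   /* full second-order model */
    model y = x1 x2 x3 x1*x1 x2*x2 x3*x3 x1*x2 x1*x3 x2*x3;
  run;

  proc glm data=pilot;   /* linear terms only */
    model y = x1 x2 x3;
  run;

  proc glm data=pilot;   /* all terms involving x2 dropped */
    model y = x1 x3 x1*x1 x3*x3 x1*x3;
  run;
  quit;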

Draper & Smith continued


Full Model

  Source            DF   Sum of Squares   Mean Square
  Model              9      2643.335755    293.703973
  Error             10       124.773745     12.477375
  Corrected Total   19      2768.109500

Linear Terms Only

  Source            DF   Sum of Squares   Mean Square
  Model              3      1829.799437    609.933146
  Error             16       938.310063     58.644379
  Corrected Total   19      2768.109500

Drop x2 Terms

  Source            DF   Sum of Squares   Mean Square
  Model              5      2626.025281    525.205056
  Error             14       142.084219     10.148873
  Corrected Total   19      2768.109500

Draper & Smith continued


Understanding ANOVA tables. Consider the table for the Full Model.

1. The table decomposes the corrected total SSTo, and not $Y^TY$. Hence what SAS calls the Model SS is in fact
$$\mathrm{SSreg} = R(\beta_1, \beta_2, \beta_3, \beta_{11}, \beta_{22}, \beta_{33}, \beta_{12}, \beta_{23}, \beta_{31} \mid \beta_0)$$

2. Degrees of freedom are labelled DF. For the Model row this is the rank of the matrix consisting of the columns for the regressor variables (excluding the intercept). Since the columns are linearly independent this matrix has full rank 9.

3. The F statistic is a quotient of 2 terms, each of the form SS/DF. Such a term is called a Mean Square, so
$$\mathrm{MS} = \frac{\mathrm{SS}}{\mathrm{DF}}$$

4. The DF for the Corrected Total is
$$\mathrm{rank}\big(I - \tfrac{1}{n}J\big) = n - 1$$
Draper & Smith continued

Answering the questions

1. Comparing model $C(X_0)$ with $C(X)$ we can use any of:
   the increase in SSmod,
   the increase in SSreg,
   the decrease in SSE,
   to calculate the numerator of F.

2. The denominator, MSE, is based on SSE for the full model, $C(X)$.

Draper & Smith continued


Continuing to answer the questions

1. Linear adequate?
$$F = \frac{(2643.34 - 1829.80)/(9 - 3)}{124.77/10} = \frac{135.59}{12.477} = 10.867$$
$$\Pr(F \ge 10.867 \mid F \sim F(6, 10)) = 0.0007$$
so there is strong evidence to include the 2nd order terms.

2. Terms in $x_2$?
$$F = \frac{(2643.34 - 2626.03)/(9 - 5)}{124.77/10} = \frac{4.3275}{12.477} = 0.347$$
$$\Pr(F \ge 0.347 \mid F \sim F(4, 10)) = 0.84$$
so there is no evidence that terms in $x_2$ are required.

Draper & Smith continued


Fitted Equation
$$\hat{y} = 76.420 + 5.503\,x_1 + 10.207\,x_3 + 0.667\,x_1^2 - 1.463\,x_1x_3 - 7.343\,x_3^2$$

[Figure omitted: the points marked are the experimental points in this designed experiment.]
