
MATH2831/2931
Linear Models / Higher Linear Models

October 16, 2013

Week 9 Lecture 3 - This lecture:

Handling categorical predictor variables in the linear model.

Categorical variables with two levels.

Examples: inflation rates and central bank independence, growth of mussels.

Categorical variables with more than two levels.

Example: real estate prices.

Week 9 Lecture 3 - Categorical predictors

A categorical predictor variable is one which takes values which are labels distinguishing groups among the observations.

Example: data set on inflation rates and central bank independence. Variables:

1. INF (inflation rate)
2. QUES (a questionnaire measure of independence)
3. LEGAL (a legal measure of independence)
4. DEV (a variable which is 1 for developed economies, 0 for developing economies)

The variable DEV is an example of a categorical variable. The values 0 and 1 are arbitrary: the numeric values of DEV have no meaning, but are labels to distinguish two groups.

We could just as easily have defined the variable DEV to have the value A for developed economies and B for developing economies.

Week 9 Lecture 3 - Categorical predictors

Another example: trial of a new drug for patients with high blood pressure.

For a group of patients with high blood pressure, assign each patient to either a treatment group (receive the new drug) or a control group (receive the standard treatment for blood pressure, an older drug perhaps).

Measure the change in blood pressure for each patient after one month (response variable).

Categorical predictor with values treatment and control dividing the observations into two groups, with perhaps additional quantitative predictors to be adjusted for (age, for instance).

Week 9 Lecture 3 - Categorical variables

We could associate the different levels of a categorical variable with numbers and introduce the resulting quantitative variable into a linear model.

But if the actual numeric values are meaningless and just labels, is this sensible?

Consider the simplest case: response y, single quantitative predictor x, single categorical variable z.

Say z can take on the values A or B.

Week 9 Lecture 3 - Categorical variables

Define a binary (zero/one) variable w from z as
$$ w_i = \begin{cases} 1 & \text{if } z_i \text{ is A} \\ 0 & \text{if } z_i \text{ is B.} \end{cases} $$

Now consider the multiple linear regression model
$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 w_i + \varepsilon_i, \quad i = 1, \ldots, n, $$
where $\beta_0$, $\beta_1$ and $\beta_2$ are unknown parameters and $\varepsilon_i$, $i = 1, \ldots, n$, is a collection of zero mean errors, uncorrelated, with a common variance $\sigma^2$.

What is the mean structure within the two different groups for this model?
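Aside (not part of the original lecture): a minimal sketch of how such a model could be fitted in Python with statsmodels. The data frame, column names and values below are hypothetical, used only to illustrate coding w from z and reading off the partial t statistic for $\beta_2$.

```python
# Illustrative sketch only (hypothetical data, not the lecture's): code the
# two-level factor z as a zero/one dummy w and fit y = b0 + b1*x + b2*w + error.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y": [3.1, 4.0, 5.2, 6.1, 7.3, 8.0, 6.5, 9.1],
    "x": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5],
    "z": ["A", "A", "A", "A", "B", "B", "B", "B"],
})
df["w"] = (df["z"] == "A").astype(int)   # w = 1 if z is A, 0 if z is B

fit = smf.ols("y ~ x + w", data=df).fit()
print(fit.summary())   # the row for w gives the partial t statistic for beta_2
```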

Week 9 Lecture 3 - Categorical variables

In group A we have $w_i = 1$, and the model reduces to
$$ y_i = (\beta_0 + \beta_2) + \beta_1 x_i + \varepsilon_i. $$

In group B we have $w_i = 0$, and the model reduces to
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i. $$

So with a binary variable coding for the two levels of the categorical variable, we are postulating a linear relationship between the mean of y and x, with a common slope but a different intercept for the line within each group.

Testing for an effect due to z (differences between the groups after adjusting for x): simply look at the partial t statistic for $\beta_2$.

Week 9 Lecture 3 - Categorical variables

In the model involving x and the binary variable w, there is a common slope for the two lines which describe the relationship between the mean of y and x within groups (although possibly a different intercept).

The group effect does not depend on the level of x: we say there is no interaction between w and x.

More general model: allow a different linear relationship between the mean of y and x (different slope and intercept) for the two groups.

Consider a multiple linear regression model with x and w as predictors, as well as the product of x and w, xw.


Week 9 Lecture 3 - Categorical variables

The model is
$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 w_i + \beta_3 x_i w_i + \varepsilon_i. $$

What happens now within the two groups A and B?

In group A ($w_i = 1$), we have
$$ y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_i + \varepsilon_i. $$

In group B ($w_i = 0$), we have
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i. $$

So by doing a multiple linear regression with predictors x, w and xw, we have separate linear relationships between the mean of y and x within the two groups.

Test for interaction (i.e. are parallel lines a good description of the relationship between the mean of y and x for the two groups?): simply look at the partial t statistic for $\beta_3$.
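Continuing the hypothetical sketch from earlier (not from the lecture), the interaction model can be fitted by adding the product term; the partial t statistic on the x:w row plays the role of the test for $\beta_3$.

```python
# Sketch continuing the hypothetical data frame df above: add the product x*w
# so each group gets its own slope; the x:w coefficient corresponds to beta_3.
import statsmodels.formula.api as smf

fit_int = smf.ols("y ~ x + w + x:w", data=df).fit()
print(fit_int.params)              # Intercept, x, w, x:w
print(fit_int.pvalues["x:w"])      # partial t test for the interaction
```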


Week 9 Lecture 3 - Categorical variables

Testing for an effect due to the categorical variable in the presence of x: test
$$ H_0: \beta_2 = \beta_3 = 0 $$
against the alternative
$$ H_1: \text{not both of } \beta_2, \beta_3 \text{ are zero.} $$

Use an F test to do this, as in previous lectures.

How is fitting the model just described different from fitting separate simple linear regressions between y and x within the two groups?

In the model with predictors x, w and xw there is a separate linear relationship between the mean of y and x for the two groups, but an assumption of a common error variance.

If a common error variance for the two groups is not reasonable, then we can fit separate linear regressions within the groups.
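A brief sketch (again with the hypothetical data frame df from above, not the lecture's data) of how this F test can be carried out as a comparison of nested models; anova_lm computes the extra-sum-of-squares F statistic on 2 and n - 4 degrees of freedom.

```python
# Sketch with the hypothetical df above: F test of H0: beta_2 = beta_3 = 0 by
# comparing the reduced model (x only) with the full model (x, w and x*w).
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

reduced = smf.ols("y ~ x", data=df).fit()
full = smf.ols("y ~ x + w + x:w", data=df).fit()
print(anova_lm(reduced, full))     # F statistic and p-value on 2 and n - 4 df
```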


Week 9 Lecture 3 - Example: inflation and central bank independence

Recall the model to examine the relationship between inflation and central bank independence.

Response y is the inflation rate (INF).

Predictor: QUES, a questionnaire measure of independence.
Predictor: LEGAL, a legal measure of independence.
Predictor: DEV, a binary indicator discriminating developing/developed economies.

First we fit a model with predictors QUES and DEV.

Note: here we postulate a linear relationship between the expected inflation rate and QUES for both developed and developing economies, with the same slope but a different intercept within the two groups.


Week 9 Lecture 3 - Example: inflation and central bank independence

The output which results after fitting this model is shown below.

The regression equation is

INF = 59.6 - 4.07 QUES - 22.8 DEV

Predictor   Coef     StDev    T       P
Constant    59.62    14.32    4.16    0.001
QUES        -4.072   2.753    -1.48   0.156
DEV         -22.79   12.13    -1.88   0.076

S = 22.65   R-Sq = 43.0%   R-Sq(adj) = 37.0%


Week 9 Lecture 3 - Example: inflation and central bank independence

Analysis of Variance

Source          DF   SS        MS       F      P
Regression      2    7345.2    3672.6   7.16   0.005
Residual Error  19   9750.0    513.2
Total           21   17095.3

Source   DF   Seq SS
QUES     1    5533.4
DEV      1    1811.8

Assuming that this model is adequate, is there evidence for a different intercept?

The partial p-value for $H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$ is 0.076.

i.e. the result is uncertain: it is not clear whether separate intercepts are required.


Week 9 Lecture 3 - Example: inflation and central bank independence

Now consider a completely separate linear relationship within each DEV group.

i.e. fit a model with INF as the response and QUES, DEV and QUES*DEV as predictors.

The output is shown below.

The regression equation is

INF = 66.8 - 5.64 QUES - 57.1 DEV + 5.31 QUES*DEV

Predictor   Coef     StDev    T       P
Constant    66.80    16.58    4.03    0.001
QUES        -5.641   3.300    -1.71   0.105
DEV         -57.06   41.03    -1.39   0.181
QUES*DEV    5.314    6.074    0.87    0.393

S = 22.79   R-Sq = 45.3%   R-Sq(adj) = 36.2%


Week 9 Lecture 3 - Example: inflation and central bank independence

Analysis of Variance

Source          DF   SS        MS       F      P
Regression      3    7742.9    2581.0   4.97   0.011
Residual Error  18   9352.3    519.6
Total           21   17095.3

Source     DF   Seq SS
QUES       1    5533.4
DEV        1    1811.8
QUES*DEV   1    397.7

Assuming this model is an appropriate one, is there an interaction between QUES and DEV?

The p-value for testing $H_0: \beta_3 = 0$ against the alternative $H_1: \beta_3 \neq 0$ is 0.393.

i.e. there seems to be no real evidence of any interaction.


Week 9 Lecture 3 - Example: inflation and central bank independence

We can test whether the level of DEV seems to have any relationship to the inflation rate in the presence of QUES by testing
$$ H_0: \beta_2 = \beta_3 = 0 $$
against the alternative
$$ H_1: \beta_2, \beta_3 \text{ not both zero.} $$

The value of the test statistic is
$$ F = \frac{(1811.8 + 397.7)/2}{519.6} = 2.13, $$
which under $H_0$ is an F random variable with 2 and $n - 4 = 18$ degrees of freedom.

The upper 5% point is approximately 3.55, so DEV does not seem to be related to the inflation rate in the presence of QUES.
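The arithmetic above can be checked with a few lines of Python (not part of the lecture; the sums of squares are taken from the output on the previous slides):

```python
# Check of the F statistic and its 5% critical value using the sums of squares above.
from scipy import stats

F = ((1811.8 + 397.7) / 2) / 519.6
print(round(F, 2))                          # approximately 2.13
print(round(stats.f.ppf(0.95, 2, 18), 2))   # upper 5% point of F(2, 18), about 3.55
print(round(stats.f.sf(F, 2, 18), 3))       # p-value of the observed F
```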


Week 9 Lecture 3 - Example: inflation and central bank independence

In all these hypothesis tests we are assuming that the model we have fitted is appropriate.

This is in fact very doubtful here.

[Figure: standardized residuals against fitted values for the model involving QUES, DEV and QUES*DEV.]


Week 9 Lecture 3 - Example: inflation and central bank independence

[Figure: standardized residuals against DEV for the model involving QUES, DEV and QUES*DEV.]

Constancy of error variance is a problem: the error variance also appears to depend on DEV.

A better model may be to fit completely independent regression lines within the two groups.
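A sketch of this alternative (with made-up numbers standing in for the lecture's inflation data): fit a separate simple linear regression of INF on QUES within each DEV group, so each group has its own intercept, slope and residual variance.

```python
# Sketch only: hypothetical data, not the central bank independence data set.
import pandas as pd
import statsmodels.formula.api as smf

cbi = pd.DataFrame({
    "INF":  [55.0, 40.0, 30.0, 20.0, 12.0, 8.0, 6.0, 4.0],
    "QUES": [2.0, 4.0, 6.0, 8.0, 3.0, 5.0, 7.0, 9.0],
    "DEV":  [0, 0, 0, 0, 1, 1, 1, 1],
})

# one simple linear regression per DEV group, each with its own error variance
for level, group in cbi.groupby("DEV"):
    fit_g = smf.ols("INF ~ QUES", data=group).fit()
    print("DEV =", level, fit_g.params.to_dict(), "residual variance:", fit_g.mse_resid)
```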
