
QMDS 202 Data Analysis and Modeling

Chapter 17 Multiple Regression


Model and Required Conditions
For k independent variables (predictor variables) x1, x2, ..., xk, the multiple linear
regression model is represented by the following equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

where β1, β2, ..., βk are the population regression coefficients of x1, x2, ..., xk
respectively, β0 is the constant term, and ε (the Greek letter epsilon) represents the
random term (also called the error variable): the difference between the actual value
of Y and the estimated value of Y based on the values of the independent variables.
The random term thus accounts for all other independent variables that are not
included in the model.
Required Conditions for the Error Variable:
1. The probability distribution of the error variable is normal.
2. The mean of the error variable is 0.
3. The standard deviation of ε is σ_ε, which is constant for each value of x.
4. The errors are independent.
The general form of the sample regression equation is expressed as follows:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

where b1, b2, ..., bk are the sample linear regression coefficients of x1, x2, ..., xk
respectively and b0 is the constant of the equation.
For k = 2, the sample regression equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$, where b0, b1, and b2
can be found by solving a system of three normal equations:

$$\begin{aligned}
\sum y &= b_0 n + b_1 \sum x_1 + b_2 \sum x_2 \\
\sum x_1 y &= b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \\
\sum x_2 y &= b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2
\end{aligned}$$

Example 1

       x1     x2      y    x1·y     x2·y    x1·x2    x1²       x2²        ŷ
        1    200    100     100    20000      200      1     40000     82.99
        5    700    300    1500   210000     3500     25    490000    305.20
        8    800    400    3200   320000     6400     64    640000    394.73
        6    400    200    1200    80000     2400     36    160000    241.55
        3    100    100     300    10000      300      9     10000     95.92
       10    600    400    4000   240000     6000    100    360000    379.61
Sum    33   2800   1500   10300   880000    18800    235   1700000   1500.00

(n = 6; the ŷ column shows the fitted values from the regression equation found below.)

$$\begin{aligned}
1500 &= 6b_0 + 33b_1 + 2800b_2 \\
10300 &= 33b_0 + 235b_1 + 18800b_2 \\
880000 &= 2800b_0 + 18800b_1 + 1700000b_2
\end{aligned}$$
By solving the above system of normal equations, we should find the following:

b0 = 6.397, b1 = 20.492, b2 = 0.280

The sample multiple linear regression equation is:

$$\hat{y} = 6.397 + 20.492 x_1 + 0.280 x_2$$
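As a numerical check, the sketch below (assuming Python with NumPy; not part of the original handout) solves the three normal equations and reproduces these coefficients:

```python
# Minimal sketch: solve the normal equations for b0, b1, b2 with NumPy,
# using the column totals from the Example 1 table.
import numpy as np

# Coefficient matrix and right-hand side of the normal equations:
#   1500   =    6*b0 +    33*b1 +    2800*b2
#   10300  =   33*b0 +   235*b1 +   18800*b2
#   880000 = 2800*b0 + 18800*b1 + 1700000*b2
A = np.array([[6.0,    33.0,    2800.0],
              [33.0,   235.0,   18800.0],
              [2800.0, 18800.0, 1700000.0]])
rhs = np.array([1500.0, 10300.0, 880000.0])

b0, b1, b2 = np.linalg.solve(A, rhs)
print(b0, b1, b2)   # approximately 6.397, 20.492, 0.280
```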

Interpretation of the Regression Coefficients


b1: the approximate change in y if x1 is increased by 1 unit and x2 is held constant.
b2: the approximate change in y if x2 is increased by 1 unit and x1 is held constant.
In Example 1, if x1 is increased by 1 unit and x2 is held constant, the approximate
change in y will be 20.492 units.
Point Estimate
In Example 1, suppose x1 = 4 and x2 = 500; then the point estimate of y equals:

$$\hat{y} = 6.397 + 20.492(4) + 0.280(500) = 228.61$$
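As a quick check (a sketch in Python, not part of the handout), plugging the rounded coefficients into the fitted equation gives nearly the same value; the handout's 228.61 uses the unrounded coefficients:

```python
# Point estimate at x1 = 4, x2 = 500 with the rounded coefficients.
y_hat = 6.397 + 20.492 * 4 + 0.280 * 500
print(y_hat)   # 228.365; the handout's 228.61 comes from unrounded coefficients
```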

The Standard Error of Estimate in the Multiple Regression Model


$$s_\varepsilon = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}}$$

where y_i = the observed y value in the sample
ŷ_i = the estimated y value calculated from the multiple regression equation
In Example 1,

$$\sum (y_i - \hat{y}_i)^2 = (17.01)^2 + (-5.2)^2 + (5.27)^2 + (-41.55)^2 + (4.08)^2 + (20.39)^2 = 2502.954$$

$$s_\varepsilon = \sqrt{\frac{2502.954}{6 - 2 - 1}} = 28.88$$

Note: s_ε is the point estimate of σ_ε (the standard deviation of the error variable ε).
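A minimal sketch of this computation, assuming Python with NumPy (the variable names are illustrative, not from the handout):

```python
# Standard error of estimate from the observed and fitted y values above.
import numpy as np

y     = np.array([100, 300, 400, 200, 100, 400], dtype=float)
y_hat = np.array([82.99, 305.20, 394.73, 241.55, 95.92, 379.61])

n, k = len(y), 2                    # 6 observations, 2 independent variables
sse  = np.sum((y - y_hat) ** 2)     # about 2502.95
s_e  = np.sqrt(sse / (n - k - 1))   # about 28.88
print(sse, s_e)
```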
Testing the Validity of the Model

I. The Analysis of Variance (ANOVA) Test

Let's consider a simple linear regression model:

[Figure: scatter plot of y against x with the fitted regression line and the mean line ȳ = Σy / n, showing how each total deviation splits into an explained part and an error part]

$$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$

y_i − ȳ = total deviation
ŷ_i − ȳ = deviation of the estimated value from the mean
y_i − ŷ_i = error deviation = e_i = the residual of the ith data point

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

SST = SSR + SSE


SST = total sum of squared deviations = total variation
SSR = sum of squares resulting from regression = explained variation
SSE = sum of squares resulting from sampling error = unexplained variation
The ANOVA table of a regression model:

Source        df           SS     MS                          F
Regression    k            SSR    MSR = SSR / k               MSR / MSE
Residual      n − k − 1    SSE    MSE = SSE / (n − k − 1)
Total         n − 1        SST

Note: MSE = s_ε²

The ANOVA test of the regression model in Example 1:

(Refer to the associated computer output of this example)


H0: The regression model is not significant (β1 = β2 = ... = βk = 0)
H1: The regression model is significant (at least one βi ≠ 0)
α = 0.05
df1 = k = 2
df2 = n − k − 1 = 6 − 2 − 1 = 3
Critical value = 9.55
Test statistic F = 55.44 > 9.55 → Reject H0
We can also use the p-value provided by the output to arrive at the conclusion:
p-value = 0.0043 < α = 0.05 → Reject H0
The regression model is significant. (There is at least one independent variable that
can explain Y.)
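A sketch of the same F test in Python with SciPy (assumed, not part of the handout); the critical value and p-value match the ones quoted above:

```python
# ANOVA F test for Example 1.
from scipy import stats

sse, sst, n, k = 2502.954, 95000.0, 6, 2
ssr = sst - sse                                   # explained variation, about 92497
f_stat = (ssr / k) / (sse / (n - k - 1))          # MSR / MSE, about 55.4

crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)    # about 9.55
p_value = stats.f.sf(f_stat, k, n - k - 1)        # about 0.004
print(f_stat, crit, p_value)                      # f_stat > crit, so reject H0
```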
II. The t-Tests for Regression Coefficients (Slopes)
A t-test is used to determine if there is a meaningful relationship between the
dependent variable and one of the independent variables.
In Example 1, the t-test for X1 (again refer to the computer output of this example):
H0: X1 is not a significant independent variable (β1 = 0)
H1: X1 is a significant independent variable (β1 ≠ 0)
α = 0.05
α/2 = 0.025
df = n − k − 1 = 6 − 2 − 1 = 3
Critical values = ±3.182
Reject H0 if TS < −3.182 or TS > 3.182

$$TS = \frac{b_1 - (\beta_1)_0}{s_{b_1}}$$ where s_{b_1} = estimated standard deviation of b1

$$TS = \frac{20.492 - 0}{5.882} = 3.48 > 3.182 \;\Rightarrow\; \text{Reject } H_0$$

p-value approach:
p-value = 0.04 < α = 0.05 → Reject H0
The slope β1 is significant; that is, there is a meaningful relationship between X1
and Y.
The t-test for X2:
H0: X2 is not a significant independent variable (β2 = 0)
H1: X2 is a significant independent variable (β2 ≠ 0)
α = 0.05
α/2 = 0.025
df = n − k − 1 = 6 − 2 − 1 = 3
Critical values = ±3.182
Reject H0 if TS < −3.182 or TS > 3.182

$$TS = \frac{b_2 - (\beta_2)_0}{s_{b_2}} = \frac{0.280 - 0}{0.069} = 4.089 > 3.182 \;\Rightarrow\; \text{Reject } H_0$$

where s_{b_2} = estimated standard deviation of b2

p-value approach:
p-value = 0.026 < α = 0.05 → Reject H0
X2 is also a significant independent variable.
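The handout reads s_b1 and s_b2 off the computer output. One standard way to obtain them is from the diagonal of s²(X'X)⁻¹; the sketch below (assuming Python with NumPy and SciPy, not part of the handout) reproduces both t statistics:

```python
# t statistics for b1 and b2 in Example 1, with coefficient standard
# errors computed from the diagonal of s^2 * (X'X)^-1.
import numpy as np
from scipy import stats

x1 = np.array([1, 5, 8, 6, 3, 10], dtype=float)
x2 = np.array([200, 700, 800, 400, 100, 600], dtype=float)
y  = np.array([100, 300, 400, 200, 100, 400], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2])      # design matrix with intercept
b = np.linalg.solve(X.T @ X, X.T @ y)                # b0, b1, b2

n, k = len(y), 2
resid = y - X @ b
s2 = resid @ resid / (n - k - 1)                     # MSE, about 834.3
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))   # s_b0, s_b1, s_b2

t_stats  = b / se                                    # about 3.48 and 4.09 for b1, b2
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)
print(t_stats[1:], p_values[1:])                     # compare with +/- 3.182
```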
In case there are some insignificant independent variables in the model (the p-values
of some regression coefficients are greater than α), we should remove the most
insignificant variable from the model (the one with the highest p-value) and run the
regression once again using only the remaining variables. We then observe the
p-values of the coefficients in this new model and repeat the same procedure (if
necessary) until all the p-values are less than α. A sketch of this procedure appears below.
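This sketch assumes Python with the pandas and statsmodels libraries (neither appears in the handout; the function name is illustrative):

```python
# Backward elimination by p-value: drop the least significant variable
# and refit until every remaining slope has p-value <= alpha.
# Assumes at least one variable survives the elimination.
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y, alpha=0.05):
    X = X.copy()
    while True:
        model = sm.OLS(y, sm.add_constant(X)).fit()
        pvals = model.pvalues.drop("const")   # slope p-values only
        worst = pvals.idxmax()                # most insignificant variable
        if pvals[worst] <= alpha:
            return model                      # all remaining slopes significant
        X = X.drop(columns=[worst])           # remove it and refit
```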
The Coefficient of Multiple Determination (R2)
$$R^2 = \frac{SSR}{SST} = \frac{\text{explained variation}}{\text{total variation}}$$

In Example 1,

$$R^2 = \frac{92497}{92497 + 2503} = 0.974$$

We can conclude that 97.4% of the variation in Y is explained by using X1 and X2 as
independent variables.
The Adjusted R2
The adjusted R2 modifies R2 to take into account the sample size and the number
of independent variables. The rationale for this statistic is that, if the number of
independent variables k is large relative to the sample size n, the unadjusted R2 value
may be unrealistically high.
$$R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$$

If n is considerably larger than k, the actual and adjusted R2 values will be similar. But
if SSE is quite different from 0 and k is large compared to n, the actual and adjusted
values of R2 will differ substantially.
$$R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} = 1 - (1 - R^2)\frac{n-1}{n-k-1}$$

In Example 1,

$$R^2_{adj} = 1 - \frac{2502.636/3}{95000/5} = 1 - \frac{834.212}{19000} = 0.956$$
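Both statistics for Example 1 in a few lines of Python (a sketch, not part of the handout):

```python
# R^2 and adjusted R^2 for Example 1.
sse, sst, n, k = 2502.954, 95000.0, 6, 2

r2     = 1 - sse / sst                              # about 0.974
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))  # about 0.956
print(r2, r2_adj)
```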

The Multicollinearity Problem in the Multiple Regression Model

Multicollinearity is the name given to the situation in which two independent
variables (e.g. Xi and Xj) are closely correlated. If this is the case, the values of the
two regression coefficients (bi and bj) tend to be unreliable, and an estimate made with
an equation that uses these values also tends to be unreliable. This is because, if Xi
and Xj are closely correlated, values of Xj don't necessarily remain constant while Xi
changes. If two independent variables are closely correlated, that is, if their
correlation coefficient (r) is close to ±1, a simple solution to the
multicollinearity problem is to use just one of them in a multiple regression model.
As a rule of thumb, if the absolute value of r of Xi and Xj is bigger than or equal to 0.8, then we
should drop one of them from the regression model.
In Example 1,

r of X1 and X2 = 0.741 is not bigger than 0.8 → X1 and X2 can be used together in the model.
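A sketch of the rule-of-thumb check in Python with NumPy (assumed, not part of the handout):

```python
# Correlation between X1 and X2 for the multicollinearity check.
import numpy as np

x1 = np.array([1, 5, 8, 6, 3, 10], dtype=float)
x2 = np.array([200, 700, 800, 400, 100, 600], dtype=float)

r = np.corrcoef(x1, x2)[0, 1]   # about 0.741
print(abs(r) >= 0.8)            # False, so both variables can stay in the model
```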

Interval Estimates for Population Regression Coefficients


The confidence interval of βi is found by:

$$b_i \pm t_{\alpha/2}\, s_{b_i}, \qquad df = n - k - 1$$

In Example 1, the 95% confidence interval of β1 is:

$$20.492 \pm 3.182 \times 5.882 = (1.77 \text{ to } 39.21)$$
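The same interval in Python with SciPy (a sketch, not part of the handout):

```python
# 95% confidence interval for beta_1 in Example 1.
from scipy import stats

b1, s_b1, n, k = 20.492, 5.882, 6, 2
t_crit = stats.t.ppf(0.975, df=n - k - 1)        # about 3.182
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)    # about (1.77, 39.21)
```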
