
Linear Regression Analysis

Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R² and Adjusted R²

Overall Validity of the Model (F test)
Testing for individual regressor (t test)
Problem of Multicollinearity

Smoking and Lung Capacity
Suppose, for example, we want to investigate the
relationship between cigarette smoking and lung
capacity
We might ask a group of people about their smoking
habits, and measure their lung capacities
Cigarettes (X) Lung Capacity (Y)
0 45
5 42
10 33
15 31
20 29
Scatter plot of the data
(Scatter plot of cigarettes smoked vs. lung capacity)
We can see that as smoking goes up, lung capacity tends to go down.
The two variables change in opposite directions.
Height and Weight
Consider the following data of heights and weights of 5
women swimmers:
Height (inch): 62 64 65 66 68
Weight (pounds): 102 108 115 128 132
We can observe that weight is also increasing with
height.


(Scatter plot of height vs. weight)
Sometimes two variables are related to each
other.
The values of both of the variables are paired.
Change in the value of one affects the value of the other.
Usually these two variables are two attributes of
each member of the population
For Example:
Height Weight
Advertising Expenditure Sales Volume
Unemployment Crime Rate
Rainfall Food Production
Expenditure Savings
We have already studied one measure of relationship between two variables: covariance.
Covariance between two random variables X and Y is given by
Cov(X, Y) = σ_XY = E(XY) - E(X)E(Y)
For paired observations on variables X and Y,
Cov(X, Y) = σ_XY = (1/n) Σ_{i=1}^{n} (xᵢ - x̄)(yᵢ - ȳ)
Properties of Covariance:
Cov(X+a, Y+b) = Cov(X, Y) [not affected by change in location]
Cov(aX, bY) = ab Cov(X, Y) [affected by change in scale]
Covariance can take any value from -∞ to +∞.
Cov(X, Y) > 0 means X and Y change in the same direction.
Cov(X, Y) < 0 means X and Y change in the opposite direction.
If X and Y are independent, Cov(X, Y) = 0 [the converse need not be true].
It is not unit free.
So it is not a good measure of relationship between two variables.
A better measure is the correlation coefficient.
It is unit free and takes values in [-1, +1].
Correlation
Karl Pearson's correlation coefficient is given by
r_XY = Corr(X, Y) = Cov(X, Y) / √[Var(X)·Var(Y)]
When the joint distribution of X and Y is known:
Cov(X, Y) = E(XY) - E(X)E(Y)
Var(X) = E(X²) - [E(X)]²,  Var(Y) = E(Y²) - [E(Y)]²
When observations on X and Y are available:
Cov(X, Y) = (1/n) Σ_{i=1}^{n} (xᵢ - x̄)(yᵢ - ȳ)
Var(X) = (1/n) Σ_{i=1}^{n} (xᵢ - x̄)²,  Var(Y) = (1/n) Σ_{i=1}^{n} (yᵢ - ȳ)²
Properties of Correlation Coefficient

Corr(aX+b, cY+d) = Corr(X, Y) when a and c have the same sign (in general it equals ±Corr(X, Y)).
It is unit free.
It measures the strength of relationship on a
scale of -1 to +1.
So, it can be used to compare the relationships of
various pairs of variables.
Values close to 0 indicate little or no correlation
Values close to +1 indicate very strong positive
correlation.
Values close to -1 indicate very strong negative
correlation.
Scatter Diagram
(Scatter diagrams illustrating positively correlated, negatively correlated, weakly correlated, strongly correlated, and uncorrelated data)
Correlation Coefficient measures the strength of
linear relationship.
r = 0 does not necessarily imply that there is no relationship between X and Y.
A relationship may exist, but it is not a linear one.
(Scatter plots of nonlinear relationships for which r is close to 0)
x      y     x-x̄     y-ȳ    (x-x̄)²   (y-ȳ)²   (x-x̄)(y-ȳ)
1.25   125   -0.90    45     0.8100   2025     -40.50
1.75   105   -0.40    25     0.1600    625     -10.00
2.25    65    0.10   -15     0.0100    225      -1.50
2.00    85   -0.15     5     0.0225     25      -0.75
2.50    75    0.35    -5     0.1225     25      -1.75
2.25    80    0.10     0     0.0100      0       0.00
2.70    50    0.55   -30     0.3025    900     -16.50
2.50    55    0.35   -25     0.1225    625      -8.75
17.20  640    0        0     1.560    4450     -79.75
                            (SSX)     (SSY)    (SSXY)

r = Cov(X, Y) / √[Var(X)·Var(Y)] = SSXY / √(SSX·SSY) = -79.75 / √(1.56 × 4450) = -0.957
x      y      x²       y²      x·y
1.25   125    1.5625   15625   156.25
1.75   105    3.0625   11025   183.75
2.25    65    5.0625    4225   146.25
2.00    85    4.0000    7225   170.00
2.50    75    6.2500    5625   187.50
2.25    80    5.0625    6400   180.00
2.70    50    7.2900    2500   135.00
2.50    55    6.2500    3025   137.50
17.20  640   38.54     55650  1296.25

Alternative Formulas for Sum of Squares
SSX = Σx² - (Σx)²/n,  SSY = Σy² - (Σy)²/n,  SSXY = Σxy - (Σx)(Σy)/n
SSX = 1.56,  SSY = 4450,  SSXY = -79.75

r = Cov(X, Y) / √[Var(X)·Var(Y)] = SSXY / √(SSX·SSY) = -79.75 / √(1.56 × 4450) = -0.957
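As a quick numerical check, the sums of squares and r above can be reproduced with a short numpy script (a sketch added here for illustration, not part of the original slides; it assumes numpy is available):

```python
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)
n = len(x)

# Alternative (computational) formulas for the sums of squares
SSX = np.sum(x**2) - np.sum(x)**2 / n              # ~ 1.56
SSY = np.sum(y**2) - np.sum(y)**2 / n              # ~ 4450
SSXY = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # ~ -79.75

r = SSXY / np.sqrt(SSX * SSY)                      # ~ -0.957
print(SSX, SSY, SSXY, r)
print(np.corrcoef(x, y)[0, 1])                     # same value from numpy directly
```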
Smoking and Lung Capacity Example
Cigarettes (X)   Lung Capacity (Y)   X²    XY     Y²
 0                45                   0      0   2025
 5                42                  25    210   1764
10                33                 100    330   1089
15                31                 225    465    961
20                29                 400    580    841
50               180                 750   1585   6680
r_XY = [n ΣXY - (ΣX)(ΣY)] / √{[n ΣX² - (ΣX)²][n ΣY² - (ΣY)²]}
     = [(5)(1585) - (50)(180)] / √{[(5)(750) - 50²][(5)(6680) - 180²]}
     = (7925 - 9000) / √[(3750 - 2500)(33400 - 32400)]
     = -1075 / √(1250 × 1000) = -0.9615
Regression Analysis
Having determined the correlation between X and Y, we
wish to determine a mathematical relationship between
them.
Dependent variable: the variable you wish to explain
Independent variables: the variables used to explain the
dependent variable
Regression analysis is used to:
Predict the value of dependent variable based on the
value of independent variable(s)
Explain the impact of changes in an independent
variable on the dependent variable

Types of Relationships
(Scatter plots illustrating linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship between X and Y)
Simple Linear Regression Analysis
The simplest mathematical relationship is
Y = a + bX + error (linear)
Changes in Y are related to the changes in X
What are the most suitable values of
a (intercept) and b (slope)?

(Plot of the line y = a + bX, showing the intercept a and the slope b)
Method of Least Squares
(Scatter plot of the points (xᵢ, yᵢ) and the line a + bX; the vertical distance of each point from the line, yᵢ - (a + bxᵢ), is the error)
The best-fitted line would be the one for which all the ERRORS are minimum.
We want to fit a line for which all the errors are
minimum.
We want to obtain such values of a and b in
Y = a + bX + error for which all the errors are
minimum.
To minimize all the errors together we minimize
the sum of squares of errors (SSE).





SSE = Σ_{i=1}^{n} (Yᵢ - a - bXᵢ)²
To get the values of a and b which minimize SSE, we
proceed as follows:








∂SSE/∂a = 0  ⟹  -2 Σ_{i=1}^{n} (Yᵢ - a - bXᵢ) = 0  ⟹  ΣYᵢ = na + b ΣXᵢ        ...(1)
∂SSE/∂b = 0  ⟹  -2 Σ_{i=1}^{n} (Yᵢ - a - bXᵢ)Xᵢ = 0  ⟹  ΣXᵢYᵢ = a ΣXᵢ + b ΣXᵢ²   ...(2)
Eq. (1) and (2) are called normal equations.
Solve the normal equations to get a and b.
Solving the above normal equations, we get
b = [n ΣXᵢYᵢ - (ΣXᵢ)(ΣYᵢ)] / [n ΣXᵢ² - (ΣXᵢ)²] = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)² = SSXY / SSX
a = Ȳ - b X̄
The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b.
Thus, the LSE of a and b are given by
b = SSXY / SSX,  a = Ȳ - b X̄
Also, the correlation coefficient between X and Y is
r_XY = Cov(X, Y) / √[Var(X)·Var(Y)] = SSXY / √(SSX·SSY) = (SSXY / SSX) · √(SSX / SSY) = b √(SSX / SSY)
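The least squares estimates for the worked example can be checked with a minimal numpy sketch (added for illustration; it assumes numpy is available and uses the same 8-point data set):

```python
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

SSX = np.sum((x - x.mean())**2)
SSXY = np.sum((x - x.mean()) * (y - y.mean()))

b = SSXY / SSX                 # slope,     ~ -51.12
a = y.mean() - b * x.mean()    # intercept, ~ 189.91

y_hat = a + b * x              # fitted values on the regression line
print(a, b)
print(np.polyfit(x, y, 1))     # numpy's own fit returns [slope, intercept]
```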
For the example data considered earlier, we had SSX = 1.56, SSY = 4450, SSXY = -79.75, with X̄ = 2.15 and Ȳ = 80.
r = SSXY / √(SSX·SSY) = -0.957
b = SSXY / SSX = -79.75 / 1.56 = -51.12
a = Ȳ - b X̄ = 80 - (-51.12)(2.15) = 189.91
Fitted Line is Ŷ = 189.91 - 51.12 X
(Scatter plot of the data with the fitted line)


189.91 is the estimated mean value of Y when the value of X is zero.
-51.12 is the change in the average value of Y as a result of a one-unit increase in X.
We can predict the value of Y for a given value of X.
For example, at X = 2.15 the predicted value of Y is 189.91 - 51.12 × 2.15 = 80.002.
Fitted Line is Ŷ = 189.91 - 51.12 X

Residuals: eᵢ = Yᵢ - Ŷᵢ
The residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of the residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model to the given data.
Coefficient of Determination
(Diagram decomposing Yᵢ - Ȳ into (Ŷᵢ - Ȳ) and (Yᵢ - Ŷᵢ))
Total Sum of Squares:       SST = Σ_{i=1}^{n} (Yᵢ - Ȳ)²
Regression Sum of Squares:  SSR = Σ_{i=1}^{n} (Ŷᵢ - Ȳ)²
Error Sum of Squares:       SSE = Σ_{i=1}^{n} (Yᵢ - Ŷᵢ)²
Also, SST = SSR + SSE
The fraction of SST explained by regression is given by R²:
R² = SSR / SST = 1 - (SSE / SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² will be close to 1.
This means that the regression explains most of the variability in Y. (Fit is good.)
When SSE is close to SST, R² will be close to 0.
This means that the regression does not explain much variability in Y. (Fit is not good.)
R² is the square of the correlation coefficient between X and Y. (proof omitted)
R² = 1 (r = +1 or r = -1): perfect linear relationship; 100% of the variation in Y is explained by X.
0 < R² < 1: weaker linear relationship; some, but not all, of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.








X      Y     Ŷ       Y-Ȳ    Y-Ŷ     Ŷ-Ȳ     (Y-Ȳ)²   (Y-Ŷ)²   (Ŷ-Ȳ)²
1.25   125   126.0    45    -1.0    46.0    2025       1.00   2116.00
1.75   105   100.5    25     4.5    20.5     625      20.25    420.25
2.25    65    74.9   -15    -9.9    -5.1     225      98.00     26.01
2.00    85    87.7     5    -2.2     7.7      25       4.84     59.29
2.50    75    62.1    -5    12.9   -17.7      25     166.41    313.29
2.25    80    74.9     0     5.1    -5.1       0      26.01     26.01
2.70    50    51.9   -30    -1.9   -28.1     900       3.61    789.61
2.50    55    62.1   -25    -7.1   -17.9     625      50.41    320.41
Sums: ΣX = 17.20, ΣY = 640, Σ(Y-Ȳ)² = 4450, Σ(Y-Ŷ)² = 370.54, Σ(Ŷ-Ȳ)² = 4079.46

Coefficient of Determination: R² = (4450 - 370.5) / 4450 = 0.916
Correlation Coefficient: r = -0.957
Coefficient of Determination = (Correlation Coefficient)²
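A minimal sketch (assuming numpy is available) that verifies the coefficient of determination for the fitted line Ŷ = 189.91 - 51.12 X reported above:

```python
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

y_hat = 189.91 - 51.12 * x           # fitted values from the reported line

SST = np.sum((y - y.mean())**2)      # total sum of squares, ~ 4450
SSE = np.sum((y - y_hat)**2)         # error sum of squares
SSR = SST - SSE                      # regression sum of squares

R2 = SSR / SST                       # ~ 0.916
print(SST, SSE, R2)
print(np.corrcoef(x, y)[0, 1]**2)    # equals R2 up to rounding
```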
Example:
Watching television also reduces the amount of physical exercise,
causing weight gains.
A sample of fifteen 10-year old children was taken.
The number of pounds each child was overweight was recorded
(a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded. These data are listed here.


Calculate the sample regression line and describe what the
coefficients tell you about the relationship between the two
variables.
Y = -24.709 + 0.967 X and R² = 0.768

TV 42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight 18 6 0 -1 13 14 7 7 -9 8 8 5 3 14 -7
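A short numpy sketch (added for illustration, assuming numpy is available) that fits the TV-watching / overweight data; it should reproduce approximately the reported line Y = -24.709 + 0.967 X and R² = 0.768:

```python
import numpy as np

tv = np.array([42, 34, 25, 35, 37, 38, 31, 33, 19, 29, 38, 28, 29, 36, 18], dtype=float)
ow = np.array([18, 6, 0, -1, 13, 14, 7, 7, -9, 8, 8, 5, 3, 14, -7], dtype=float)

slope, intercept = np.polyfit(tv, ow, 1)   # ~ 0.967 and ~ -24.709
y_hat = intercept + slope * tv

R2 = 1 - np.sum((ow - y_hat)**2) / np.sum((ow - ow.mean())**2)   # ~ 0.768
print(intercept, slope, R2)
```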
(Plot of observed Y and predicted Y for the 15 children)
Standard Error
Consider a dataset.
The observations cannot all be exactly equal to the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
S_YX = √[SSE / (n - 2)] = √[ Σ_{i=1}^{n} (Yᵢ - Ŷᵢ)² / (n - 2) ]
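A minimal sketch (assuming numpy is available) computing the standard error of the estimate for the earlier example:

```python
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

b, a = np.polyfit(x, y, 1)            # slope and intercept of the fitted line
residuals = y - (a + b * x)

SSE = np.sum(residuals**2)
S_YX = np.sqrt(SSE / (len(x) - 2))    # standard error of the estimate, ~ 7.9 here
print(S_YX)
```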
Assumptions
The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (homoscedasticity): Var(eᵢ) = σ², where eᵢ = Yᵢ - Ŷᵢ and E(eᵢ) = 0.
No distributional assumption about the errors is required for the least squares method.
Linearity
(Residual plots against X: when the relationship is linear, the residuals show no systematic pattern; a curved pattern in the residuals indicates a non-linear relationship)
Independence
(Residual plots against X: independent errors show no systematic pattern; a trend or cycle in the residuals indicates that the errors are not independent)
Equal Variance
(Residual plots against X: equal-variance (homoscedastic) errors show a constant spread; a spread that widens or narrows with X indicates unequal variance (heteroscedastic) errors)
TV Watching and Weight Gain Example
(Scatter plot of X and Y; scatter plot of X and residuals)
The Multiple Linear Regression Model
In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
We assume that Y is regressed on only one regressor variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X₁, X₂, X₃, ...).
For example:
Cost -> Labor cost, Electricity cost, Raw material cost
Salary -> Education, Experience
Sales -> Cost, Advertising Expenditure
Example:
A distributor of frozen dessert pies wants to
evaluate factors which influence the demand

Dependent variable:
Y: Pie sales (units per week)
Independent variables:
X₁: Price (in $)
X₂: Advertising Expenditure ($100s)
Data are collected for 15 weeks.
Week   Pie Sales   Price ($)   Advertising ($100s)
1 350 5.50 3.3
2 460 7.50 3.3
3 350 8.00 3.0
4 430 8.00 4.5
5 350 6.80 3.0
6 380 7.50 4.0
7 430 4.50 3.0
8 470 6.40 3.7
9 450 7.00 3.5
10 490 5.00 4.0
11 340 7.20 3.5
12 300 7.90 3.2
13 440 5.90 4.0
14 450 5.00 3.5
15 300 7.00 2.7
Using the given data, we wish to fit a linear function of the form
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ,   i = 1, 2, ..., 15,
where
Y: Pie sales (units per week)
X₁: Price (in $)
X₂: Advertising Expenditure ($100s)
Fitting means we want to obtain the values of the regression coefficients, denoted by β.
The original values of the βs are not known.
We estimate them using the given data.
The Multiple Linear Regression Model
Examine the linear relationship between one dependent variable (Y) and two or more independent variables (X₁, X₂, ..., X_k).
Multiple linear regression model with k independent variables:
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + β_k X_kᵢ + εᵢ,   i = 1, 2, ..., n
(β₀ is the intercept, β₁, ..., β_k are the slopes, and εᵢ is the random error.)
Multiple Linear Regression Equation
The intercept and slopes are estimated using the observed data.
Multiple linear regression equation with k independent variables:
Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + ... + b_k X_kᵢ,   i = 1, 2, ..., n
(Ŷᵢ is the estimated value, b₀ is the estimate of the intercept, and b₁, ..., b_k are the estimates of the slopes.)
Multiple Regression Equation
Example with two independent variables:
Ŷ = b₀ + b₁X₁ + b₂X₂
(The fitted equation is a plane over the (X₁, X₂) space.)
Estimating Regression Coefficients
The multiple linear regression model is
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + β_k X_kᵢ + εᵢ,   i = 1, 2, ..., n.
In matrix notation,
Y = Xβ + ε,
where
Y = (Y₁, Y₂, ..., Y_n)′ is the n×1 vector of responses,
X is the n×(k+1) matrix whose i-th row is (1, Xᵢ₁, Xᵢ₂, ..., X_ik),
β = (β₀, β₁, ..., β_k)′ is the vector of regression coefficients, and
ε = (ε₁, ε₂, ..., ε_n)′ is the vector of random errors.
Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (homoscedasticity): Var(εᵢ) = σ².
In the long run, the mean effect of the random errors is zero: E(εᵢ) = 0.
No assumption on the distribution of the random errors is required for the least squares method.
In order to find the estimate of β, we minimize
S(β) = Σ_{i=1}^{n} εᵢ² = ε′ε = (Y - Xβ)′(Y - Xβ) = Y′Y - 2β′X′Y + β′X′Xβ
We differentiate S(β) with respect to β and equate it to zero, i.e., ∂S/∂β = 0.
This gives
b = (X′X)⁻¹X′Y
b is called the least squares estimator of β.
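A minimal numpy sketch of the matrix formula b = (X′X)⁻¹X′Y for the pie sales data (added for illustration; it should reproduce approximately the estimates b₀ = 306.53, b₁ = -24.98, b₂ = 74.13 reported in the next slide):

```python
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])   # first column of 1s for the intercept
Y = sales

b = np.linalg.inv(X.T @ X) @ (X.T @ Y)   # (X'X)^{-1} X'Y
print(b)                                 # [b0, b1, b2]
# In practice, np.linalg.lstsq(X, Y, rcond=None) is the numerically safer route.
```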
Example: Consider the pie example.
We want to fit the model Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ.
The variables are
Y: Pie sales (units per week)
X₁: Price (in $)
X₂: Advertising Expenditure ($100s)
Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:
LSE of intercept β₀: Intercept (b₀) = 306.53
LSE of slope β₁: Price (b₁) = -24.98
LSE of slope β₂: Advertising (b₂) = 74.13
Pie Sales = 306.53 - 24.98 Price + 74.13 Adv. Expend.

Sales = 306.53 - 24.98 (X₁) + 74.13 (X₂)
b₁ = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
b₂ = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while the selling price is kept fixed.
Prediction:
Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350:
Sales = 306.53 - 24.98 X₁ + 74.13 X₂
      = 306.53 - 24.98 (5.50) + 74.13 (3.5)
      = 428.62
Predicted sales is 428.62 pies.
Note that advertising is in $100s, so X₂ = 3.5.
Y      X₁     X₂    Predicted Y   Residuals
350 5.5 3.3 413.77 -63.80
460 7.5 3.3 363.81 96.15
350 8.0 3.0 329.08 20.88
430 8.0 4.5 440.28 -10.31
350 6.8 3.0 359.06 -9.09
380 7.5 4.0 415.70 -35.74
430 4.5 3.0 416.51 13.47
470 6.4 3.7 420.94 49.03
450 7.0 3.5 391.13 58.84
490 5.0 4.0 478.15 11.83
340 7.2 3.5 386.13 -46.16
300 7.9 3.2 346.40 -46.44
440 5.9 4.0 455.67 -15.70
450 5.0 3.5 441.09 8.89
300 7.0 2.7 331.82 -31.85
Fitted model: Ŷ = 306.52619 - 24.97509 X₁ + 74.13096 X₂
(Plot of observed Y and predicted Y for the 15 weeks)
Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression.
Total Sum of Squares:       SST = Σ_{i=1}^{n} (Yᵢ - Ȳ)²
Regression Sum of Squares:  SSR = Σ_{i=1}^{n} (Ŷᵢ - Ȳ)²
Error Sum of Squares:       SSE = Σ_{i=1}^{n} (Yᵢ - Ŷᵢ)²
R² = SSR / SST = 1 - (SSE / SST)
R² is the proportion of variation in Y explained by the regression.
Also, SST = SSR + SSE.
Since SST = SSR + SSE and all three quantities are non-negative,
0 ≤ SSR ≤ SST, so 0 ≤ SSR/SST ≤ 1, or 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good, and the X variables do not contribute to explaining the variability in Y.
When R² is close to 1, the linear fit is good.
In the previously discussed example, R² = 0.5215.
If we consider Y and X₁ only, R² = 0.1965.
If we consider Y and X₂ only, R² = 0.3095.
Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase occurs regardless of the contribution of the newly added regressor.
So, an adjusted value of R² is defined, which is called the adjusted R²:
Adj R² = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)]
This adjusted R² will only increase if the additional variable contributes to explaining the variation in Y.
For our example, Adjusted R² = 0.4417.
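A small sketch (added for illustration) computing adjusted R² for the pie sales model from the values reported above (R² = 0.5215, n = 15 weeks, k = 2 regressors):

```python
# Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), equivalent to the SSE/SST form above
R2, n, k = 0.5215, 15, 2
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)   # ~ 0.4417
print(adj_R2)
```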
F-Test for Overall Significance
We check if there is a linear relationship between all the regressors (X₁, X₂, ..., X_k) and the response (Y).
Use the F test statistic to test:
H₀: β₁ = β₂ = ... = β_k = 0 (no regressor is significant)
H₁: at least one βᵢ ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
Assumptions:
n > k, Var(εᵢ) = σ², E(εᵢ) = 0.
The εᵢ's are independent. This implies that Corr(εᵢ, εⱼ) = 0 for i ≠ j.
The εᵢ's have a Normal distribution: εᵢ ~ N(0, σ²). [NEW ASSUMPTION]
The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE), where
SST = Σ_{i=1}^{n} (Yᵢ - Ȳ)²
SSE = Σ_{i=1}^{n} eᵢ² = Σ_{i=1}^{n} (Yᵢ - Ŷᵢ)²
SSR = SST - SSE
The eᵢ's are called the residuals.


Analysis of Variance Table
Source              df      SS    MS    F_c
Regression          k       SSR   MSR   MSR/MSE
Residual or Error   n-k-1   SSE   MSE
Total               n-1     SST
Test Statistic: F_c = MSR / MSE ~ F(k, n-k-1)
For the previous example, we wish to test
H₀: β₁ = β₂ = 0 against H₁: at least one βᵢ ≠ 0
ANOVA Table
Source              df   SS         MS         F        F(2,12)(0.05)
Regression          2    29460.03   14730.01   6.5386   3.89
Residual or Error   12   27033.31   2252.78
Total               14   56493.33
Thus H₀ is rejected at the 5% level of significance.
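A sketch (added for illustration, assuming scipy is available) that reproduces the F test from the ANOVA figures above (SSR = 29460.03, SSE = 27033.31, k = 2, n = 15):

```python
from scipy import stats

SSR, SSE, k, n = 29460.03, 27033.31, 2, 15

MSR = SSR / k                              # ~ 14730.01
MSE = SSE / (n - k - 1)                    # ~ 2252.78
F_c = MSR / MSE                            # ~ 6.54

F_crit = stats.f.ppf(0.95, k, n - k - 1)   # ~ 3.89 at the 5% level
p_value = stats.f.sf(F_c, k, n - k - 1)
print(F_c, F_crit, p_value)                # F_c > F_crit, so H0 is rejected
```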
Individual Variables: Tests of Hypothesis
We test if there is a linear relationship between a particular regressor Xⱼ and Y.
Hypotheses:
H₀: βⱼ = 0 (no linear relationship)
H₁: βⱼ ≠ 0 (a linear relationship exists between Xⱼ and Y)
We use a two-tailed t-test.
If H₀: βⱼ = 0 is accepted, this indicates that the variable Xⱼ can be deleted from the model.
Test Statistic:
T_c = bⱼ / √(σ̂² Cⱼⱼ),  where σ̂² = MSE
T_c ~ Student's t with (n - k - 1) degrees of freedom
bⱼ is the least squares estimate of βⱼ.
Cⱼⱼ is the (j, j)th element of the matrix (X′X)⁻¹.
(MSE is obtained from the ANOVA table.)
In our example, σ̂² = MSE = 2252.7755, and the diagonal elements of (X′X)⁻¹ are
C₀₀ = 5.7946, C₁₁ = 0.0521, C₂₂ = 0.2993.
To test H₀: β₁ = 0 against H₁: β₁ ≠ 0, T_c = -2.3057.
To test H₀: β₂ = 0 against H₁: β₂ ≠ 0, T_c = 2.8548.
Two-tailed critical values of t at 12 d.f. are
3.0545 for 1% level of significance
2.6810 for 2% level of significance
2.1788 for 5% level of significance
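A sketch (added for illustration, assuming numpy is available) reproducing the t statistics from the estimates, MSE, and the diagonal elements of (X′X)⁻¹ given above:

```python
import numpy as np

b = np.array([306.53, -24.98, 74.13])          # b0, b1, b2
MSE = 2252.7755
C_diag = np.array([5.7946, 0.0521, 0.2993])    # diagonal of (X'X)^{-1}

t_stats = b / np.sqrt(MSE * C_diag)            # ~ [2.68, -2.31, 2.85]
print(t_stats)
# Compare |t| with the two-tailed critical value 2.1788 at 12 d.f. and the 5% level.
```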
Standard Error
Consider a dataset.
The observations cannot all be exactly equal to the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
S_YX = √[SSE / (n - k - 1)] = √[ Σ_{i=1}^{n} (Yᵢ - Ŷᵢ)² / (n - k - 1) ]
Assumption of Linearity
(Residual plots: when the relationship is linear, the residuals show no systematic pattern; a curved pattern in the residuals indicates non-linearity)


Assumption of Equal Variance
We assume that Var(εᵢ) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷᵢ against the residuals eᵢ = Yᵢ - Ŷᵢ.
Residual Analysis for Equal Variance
(Residual plots against Ŷ: equal variance shows a constant spread of residuals; unequal variance shows a spread that changes with Ŷ)


Assumption of Uncorrelated Residuals
The Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation.
It is given by
d = Σ_{i=2}^{n} (eᵢ - eᵢ₋₁)² / Σ_{i=1}^{n} eᵢ²
The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values of d (< 2) indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d greater than 3 or less than 1 are alarming.
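A minimal sketch (assuming numpy is available) of the Durbin–Watson statistic computed from a vector of residuals, here the pie sales residuals listed earlier:

```python
import numpy as np

e = np.array([-63.80, 96.15, 20.88, -10.31, -9.09, -35.74, 13.47, 49.03,
              58.84, 11.83, -46.16, -46.44, -15.70, 8.89, -31.85])

# d = sum of squared successive differences divided by sum of squared residuals
d = np.sum(np.diff(e)**2) / np.sum(e**2)
print(d)   # values near 2 suggest no autocorrelation
```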
Residual Analysis for Independence (Uncorrelated Errors)
(Residual plots against Ŷ: independent errors show no systematic pattern; a trend or cycle in the residuals indicates that the errors are not independent)


Assumption of Normality
When we use the F test or t test, we assume that ε₁, ε₂, ..., ε_n are normally distributed.
This assumption can be examined by a histogram of the residuals.
(Histograms of residuals: normal vs. not normal)
Normality can also be examined using a Q-Q plot or normal probability plot.
(Normal probability plots: normal vs. not normal)
Standardized Regression Coefficient
In a multiple linear regression, we may like to know
which regressor contributes more.
We obtain standardized estimates of regression
coefficients.
For that, first we standardize the observations.
Ȳ = (1/n) Σ_{i=1}^{n} Yᵢ,   s_Y² = [1/(n-1)] Σ_{i=1}^{n} (Yᵢ - Ȳ)²
X̄₁ = (1/n) Σ_{i=1}^{n} X₁ᵢ,  s_X₁² = [1/(n-1)] Σ_{i=1}^{n} (X₁ᵢ - X̄₁)²
X̄₂ = (1/n) Σ_{i=1}^{n} X₂ᵢ,  s_X₂² = [1/(n-1)] Σ_{i=1}^{n} (X₂ᵢ - X̄₂)²
Standardize all Y, X₁ and X₂ values as follows:





Fit the regression in the standardized data and obtain
the least squares estimate of regression coefficients.
These coefficients are dimensionless or unit-free and
can be compared.
Look for the regression coefficient having the highest
magnitude.
Corresponding regressor contributes the most.
Standardized Yᵢ = (Yᵢ - Ȳ) / s_Y
Standardized X₁ᵢ = (X₁ᵢ - X̄₁) / s_X₁,  Standardized X₂ᵢ = (X₂ᵢ - X̄₂) / s_X₂

Ŷ = 0 - 0.461 X₁ + 0.570 X₂
Since |-0.461| < 0.570, X₂ contributes the most.
Standardized Data
Week   Pie Sales   Price ($)   Advertising ($100s)
1      -0.78       -0.95       -0.37
2       0.96        0.76       -0.37
3      -0.78        1.18       -0.98
4       0.48        1.18        2.09
5      -0.78        0.16       -0.98
6      -0.30        0.76        1.06
7       0.48       -1.80       -0.98
8       1.11       -0.18        0.45
9       0.80        0.33        0.04
10      1.43       -1.38        1.06
11     -0.93        0.50        0.04
12     -1.56        1.10       -0.57
13      0.64       -0.61        1.06
14      0.80       -1.38        0.04
15     -1.56        0.33       -1.60
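A sketch (added for illustration, assuming numpy is available) that standardizes the pie sales data and refits; the slopes should come out near -0.461 and 0.570 as reported above:

```python
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)   # (value - mean) / sample standard deviation

Z = np.column_stack([np.ones(len(sales)),
                     standardize(price),
                     standardize(adv)])
beta_std, *_ = np.linalg.lstsq(Z, standardize(sales), rcond=None)
print(beta_std)   # intercept ~ 0, then the two standardized (unit-free) slopes
```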
Note that:
Adj R² = 1 - (1 - R²)(n - 1) / (n - k - 1)
F_c = (n - k - 1) R² / [k (1 - R²)]
Adjusted R² can be negative.
Adjusted R² is always less than or equal to R².
Inclusion of the intercept term is not necessary; it depends on the problem, and the analyst may decide on this.
Example: The following data were collected on sales, the number of advertisements published, and advertising expenditure for 12 weeks. Fit a regression model to predict the sales.
Sales (0,000 Rs) Ads (Nos.) Adv Ex (000 Rs)
43.6 12 13.9
38.0 11 12
30.1 9 9.3
35.3 7 9.7
46.4 12 12.3
34.2 8 11.4
30.2 6 9.3
40.7 13 14.3
38.5 8 10.2
22.6 6 8.4
37.6 8 11.2
35.2 10 11.1
ANOVA (Dependent Variable: Sales; Predictors: (Constant), Ex_Adv, No_Adv)
Model        Sum of Squares   df   Mean Square   F       Sig.
Regression   309.986          2    154.993       9.741   .006
Residual     143.201          9    15.911
Total        453.187          11

Coefficients (Dependent Variable: Sales)
Model        B       Std. Error   Beta   t       Sig.
(Constant)   6.584   8.542               .771    .461
No_Adv       .625    1.120        .234   .558    .591
Ex_Adv       2.139   1.470        .611   1.455   .180

ANOVA p-value < 0.05; H₀ is rejected; not all βs are zero.
All t-test p-values > 0.05; no H₀ is rejected: β₀ = 0, β₁ = 0, β₂ = 0.
CONTRADICTION
Multicollinearity
We assume that the regressors are independent variables.
When we regress Y on regressors X₁, X₂, ..., X_k, we assume that all regressors X₁, X₂, ..., X_k are statistically independent of each other.
All the regressors affect the values of Y.
One regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met.
We face the problem of multicollinearity.
The correlated variables contribute redundant information to the model.
Including two highly correlated independent variables can adversely affect the regression results.
It can lead to unstable coefficients.
Some indications of strong multicollinearity:
Coefficient signs may not match prior expectations.
A large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes insignificant when a new independent variable is added.
The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
Standard errors are large while the corresponding regressors are still significant.
MSE is very high and/or R² is very small.
EXAMPLES IN WHICH THIS MIGHT HAPPEN:
Miles per gallon vs. horsepower and engine size
Income vs. age and experience
Sales vs. number of advertisements and advertising expenditure
Variance Inflationary Factor:
VIFⱼ is used to measure the multicollinearity generated by variable Xⱼ.
It is given by
VIFⱼ = 1 / (1 - Rⱼ²)
where Rⱼ² is the coefficient of determination of a regression model that uses Xⱼ as the dependent variable and all other X variables as the independent variables.
If VIFⱼ > 5, Xⱼ is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X′X is singular.
The matrix X′X becomes singular when the columns of X have exact linear dependence, i.e., if any eigenvalue of X′X is zero.
Thus, a near-zero eigenvalue is also an indication of multicollinearity.
Methods of dealing with multicollinearity:
Collecting additional data
Variable elimination
Coefficients (Dependent Variable: Sales)
Model        B       Std. Error   Beta   t       Sig.   Tolerance   VIF
(Constant)   6.584   8.542               .771    .461
No_Adv       .625    1.120        .234   .558    .591   .199        5.022
Ex_Adv       2.139   1.470        .611   1.455   .180   .199        5.022

Collinearity Diagnostics (Dependent Variable: Sales)
Dimension   Eigenvalue   Condition Index   Variance Proportions (Constant, No_Adv, Ex_Adv)
1           2.966        1.000             .00, .00, .00
2           .030         9.882             .33, .17, .00
3           .003         30.417            .67, .83, 1.00

VIF is greater than 5 (Tolerance = 1/VIF); the condition index is large and the smallest eigenvalue is negligible.
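A sketch (added for illustration, assuming numpy is available) of the variance inflation factor VIFⱼ = 1/(1 - Rⱼ²), where Rⱼ² comes from regressing Xⱼ on the remaining regressors; with the two regressors of this example it should match the VIF of about 5.02 shown in the output above:

```python
import numpy as np

no_adv = np.array([12, 11, 9, 7, 12, 8, 6, 13, 8, 6, 8, 10], dtype=float)
ex_adv = np.array([13.9, 12, 9.3, 9.7, 12.3, 11.4, 9.3, 14.3, 10.2, 8.4, 11.2, 11.1])

def vif(xj, x_other):
    # Regress X_j on the other regressor(s) and return 1 / (1 - R_j^2)
    X = np.column_stack([np.ones(len(xj)), x_other])
    fitted = X @ np.linalg.lstsq(X, xj, rcond=None)[0]
    r2 = 1 - np.sum((xj - fitted)**2) / np.sum((xj - xj.mean())**2)
    return 1 / (1 - r2)

print(vif(no_adv, ex_adv), vif(ex_adv, no_adv))
```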
We may use the method of variable elimination.
In practice, if Corr(X₁, X₂) is more than 0.7 or less than -0.7, we eliminate one of them.
Techniques:
Stepwise (based on ANOVA)
Forward Inclusion (based on correlation)
Backward Elimination (based on correlation)
Stepwise Regression
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅ + ε
Step 1: Run 5 simple linear regressions:
Y = β₀ + β₁X₁
Y = β₀ + β₂X₂
Y = β₀ + β₃X₃
Y = β₀ + β₄X₄   <==== has lowest p-value (ANOVA) < 0.05
Y = β₀ + β₅X₅
Step 2: Run 4 two-variable linear regressions:
Y = β₀ + β₄X₄ + β₁X₁
Y = β₀ + β₄X₄ + β₂X₂
Y = β₀ + β₄X₄ + β₃X₃   <= has lowest p-value (ANOVA) < 0.05
Y = β₀ + β₄X₄ + β₅X₅
Step 3: Run 3 three-variable linear regressions:
Y = β₀ + β₃X₃ + β₄X₄ + β₁X₁
Y = β₀ + β₃X₃ + β₄X₄ + β₂X₂
Y = β₀ + β₃X₃ + β₄X₄ + β₅X₅
Suppose none of these models have p-values < 0.05.
STOP.
The best model is the one with X₃ and X₄ only.
Example (revisited): Recall the data on sales, number of advertisements published, and advertising expenditure for the 12 weeks given earlier. Fit a regression model to predict the sales; the three candidate regression outputs follow.
Summary Output 1: Sales vs. No_Adv
Model Summary: R = .781, R Square = .610, Adjusted R Square = .571, Std. Error of the Estimate = 4.20570
ANOVA (Predictors: (Constant), No_Adv; Dependent Variable: Sales)
Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   276.308          1    276.308       15.621   .003
Residual     176.879          10   17.688
Total        453.187          11
Coefficients (Dependent Variable: Sales)
Model        B        Std. Error   Beta   t       Sig.
(Constant)   16.937   4.982               3.400   .007
No_Adv       2.083    .527         .781   3.952   .003
Summary Output 2: Sales vs. Ex_Adv
Model Summary: R = .820, R Square = .673, Adjusted R Square = .640, Std. Error of the Estimate = 3.84900
ANOVA (Predictors: (Constant), Ex_Adv; Dependent Variable: Sales)
Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   305.039          1    305.039       20.590   .001
Residual     148.148          10   14.815
Total        453.187          11
Coefficients (Dependent Variable: Sales)
Model        B       Std. Error   Beta   t       Sig.
(Constant)   4.173   7.109               .587    .570
Ex_Adv       2.872   .633         .820   4.538   .001
Summary Output 3: Sales vs. No_Adv & Ex_Adv
Model Summary: R = .827, R Square = .684, Adjusted R Square = .614, Std. Error of the Estimate = 3.98888
ANOVA (Predictors: (Constant), Ex_Adv, No_Adv; Dependent Variable: Sales)
Model        Sum of Squares   df   Mean Square   F       Sig.
Regression   309.986          2    154.993       9.741   .006
Residual     143.201          9    15.911
Total        453.187          11
Coefficients (Dependent Variable: Sales)
Model        B       Std. Error   Beta   t       Sig.
(Constant)   6.584   8.542               .771    .461
No_Adv       .625    1.120        .234   .558    .591
Ex_Adv       2.139   1.470        .611   1.455   .180
Qualitative Independent Variables
Johnson Filtration, Inc., provides maintenance
service for water filtration systems throughout
southern Florida.
To estimate the service time and the service cost,
the managers want to predict the repair time
necessary for each maintenance request.
Repair time is believed to be related to two factors:
Number of months since the last maintenance
service
Type of repair problem (mechanical or electrical)
Data for a sample of 10 service calls are given:
Service Call   Months Since Last Service   Type of Repair   Repair Time in Hours
1              2                           electrical       2.9
2              6                           mechanical       3.0
3              8                           electrical       4.8
4              3                           mechanical       1.8
5              2                           electrical       2.9
6              7                           electrical       4.9
7              9                           mechanical       4.2
8              8                           mechanical       4.8
9              4                           electrical       4.4
10             6                           electrical       4.5
Let Y denote the repair time and X₁ denote the number of months since the last maintenance service.
The regression model that uses X₁ only to regress Y is
Y = β₀ + β₁X₁ + ε
Using the least squares method, we fitted the model as
Ŷ = 2.1473 + 0.3041 X₁,  R² = 0.534
At 5% level of significance, we reject
H₀: β₀ = 0 (using the t test)
H₀: β₁ = 0 (using the t and F tests)
X₁ alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable:
X₂ = 0 if the type of repair is mechanical, 1 if the type of repair is electrical
The regression model that uses X₁ and X₂ to regress Y is
Y = β₀ + β₁X₁ + β₂X₂ + ε
Is the new model improved?
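A sketch (added for illustration, assuming numpy is available) of the dummy-variable model for the Johnson Filtration data: X₂ is set to 1 for an electrical repair and 0 for a mechanical one, then Y is regressed on X₁ and X₂ together.

```python
import numpy as np

months = np.array([2, 6, 8, 3, 2, 7, 9, 8, 4, 6], dtype=float)
rtype  = ["electrical", "mechanical", "electrical", "mechanical", "electrical",
          "electrical", "mechanical", "mechanical", "electrical", "electrical"]
hours  = np.array([2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5])

x2 = np.array([1.0 if t == "electrical" else 0.0 for t in rtype])   # dummy variable

X = np.column_stack([np.ones(len(hours)), months, x2])
b, *_ = np.linalg.lstsq(X, hours, rcond=None)
print(b)   # intercept, slope for months, and the shift in repair time for electrical repairs
```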
Summary
Multiple linear regression model: Y = Xβ + ε
Least squares estimate of β is given by b = (X′X)⁻¹X′Y
R² and adjusted R²
Using ANOVA (F test), we examine whether all βs are zero or not.
A t test is conducted for each regressor separately.
Using the t test, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality
