
Assumptions of Classical Linear Regression Models (CLRM)


The CLRM is used to handle the problems of
estimation and prediction (forecasting).
Forecasts depend on the estimated
parameters of the models.
However, these models rest on several
assumptions, fulfilment of which makes the
estimates the Best Linear Unbiased Estimators (BLUE).
A few assumptions are especially important from the
point of view of estimation and forecasting.
Reference: D.N. Gujarati, Basic Econometrics, Ch. 3, 10 & 12
Heteroscedasticity
An important assumption of the CLRM is that the
disturbance terms uᵢ appearing in the
population regression function are
homoscedastic, i.e. all of them have the same
variance.
In other words:
The variance of each disturbance term,
conditional upon the chosen values of the
explanatory variables, is in repeated sampling
some constant number equal to σ².
Each diagonal term in the variance-covariance matrix = σ².
Variance-Covariance Matrix: E(UU′) = σ²I
This is known as homoscedastic variance:

E(UU′) =
| σ²  0   0   0   0   0  |
| 0   σ²  0   0   0   0  |
| 0   0   σ²  0   0   0  |
| 0   0   0   σ²  0   0  |
| 0   0   0   0   σ²  0  |
| 0   0   0   0   0   σ² |
Example: Sample data from a survey. The study
examines monthly expenditure as a function of
income (INR in '000).

Income:  80  100  120  140  160  180  200  220
Expdr:   61   70   90   91  121  110  115  168
         63   74   96   92   95   98  105  109
         71   75   98   85   93  105  108  112
         68   78  108   97  115  128  159  145
         69   89  101  100  120  145  109  170
         70   80  103   98  124  155  126  107
         69   80   78  125   99  100  138  119

Heteroscedasticity Contd….
Reasons for Unequal Variances
Examples & Reference:
Basic Econometrics by D.N. Gujarati
Income-Saving Model
As income grows, people have more
discretionary income and hence more scope
for choices about the disposition of their
income. Hence the variance (σ²ᵢ) for given
values of income in repeated sampling is
likely to increase.
Heteroscedasticity Contd….
Reasons for Unequal Variances
Profit and Dividend Policy
Companies with larger profits show greater
variability in their dividend policies. Growth-
oriented companies may show more
variability in their dividend payout ratios.
R&D Expenditure and Sales
In a functional relationship with R&D
expenditure as a function of sales, inequality
of variance is likely.
Heteroscedasticity Contd….
Reasons for Unequal Variances
Heteroscedasticity can also arise as a result
of the presence of an outlier.
Inclusion or exclusion of such an
observation, especially when the sample size
is small, can substantially alter the results of
the regression analysis and hence the estimation.
Misspecification of a model,
either in the variables included or in the functional form.
Heteroscedasticity Contd….
Reasons for Unequal Variances
Skewness in the distribution of one or more
regressors. For example, the income of
individuals used as a regressor typically has a
highly uneven distribution.
One or more of the above reasons may lead
to the problem of heteroscedasticity in the
estimation process.
The diagonal terms in the variance-covariance
matrix will then not be a constant.

Heteroscedasticity is more common in Cross-
Section Data than in Time Series Data
Reasons:
Cross-section data deal with members of a
sample/population at a given point of time:
data relating to individual consumers, firms,
industries, etc., which may differ in scale or be
heterogeneous in character.
In time series data the variables tend to be of
similar order of magnitude, since the entity is the
same and the data are collected over a period of time.
Heteroscedasticity is thus a rule in cross-section
data rather than an exception.
Consequences of and Tests for the
Presence/Absence of Heteroscedasticity
Estimation of the coefficients in the presence
of heteroscedasticity does not give minimum-
variance estimates of β.
What is the relevance of the above?
When we use cross-section data to estimate a
relationship, we must test for the
presence/absence of heteroscedasticity.
Since the problem relates to the uᵢ and their
distribution, the tests require the estimated
residuals ûᵢ.
Test for the presence/absence of
Heteroscedasticity
There are a few tests for arriving at a conclusion
on the presence/absence of heteroscedasticity
in a regression model.
1. Graphical Method: See Graph 2.
It gives an idea about the presence or absence
of heteroscedasticity.
2. Park Test: ln ûᵢ² = α + β ln Xᵢ + vᵢ
In this auxiliary regression, if β turns out to be
statistically significant, it suggests the
presence of heteroscedasticity in the data
(a sketch follows).
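A minimal sketch of the Park test in Python using statsmodels, following the two-step form of the equation above (y and x are illustrative arrays; x must be positive for the log):

```python
import numpy as np
import statsmodels.api as sm

def park_test(y, x):
    """Park test sketch: regress ln(u_hat^2) on ln(X); a statistically
    significant slope suggests heteroscedasticity."""
    resid = sm.OLS(y, sm.add_constant(x)).fit().resid   # first-stage residuals
    aux = sm.OLS(np.log(resid ** 2), sm.add_constant(np.log(x))).fit()
    return aux.params[1], aux.pvalues[1]                # beta and its p-value
```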
Test for the presence/absence of
Heteroscedasticity
3. Spearman's Rank Correlation Test
4. Goldfeld-Quandt Test (see the sketch below)
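statsmodels ships a Goldfeld-Quandt implementation; a sketch, assuming y and x are the series under study:

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

# Goldfeld-Quandt: order observations by the suspect regressor, drop the
# middle fraction, fit OLS on the two halves, and compare residual
# variances with an F test.
X = sm.add_constant(x)
f_stat, p_value, _ = het_goldfeldquandt(y, X, idx=1, drop=0.2)
print(f"GQ F = {f_stat:.2f}, p = {p_value:.3f}")  # small p -> heteroscedastic
```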
In the presence of heteroscedasticity the usual
procedure is to transform the data to get a set of
revised coefficients which can be used for
estimation.
The data transformation depends on the assumption
made about the heteroscedasticity pattern
observed in the data.
Assumptions on Heteroscedasticity
pattern
Let us assume that the error variance is
proportional to Xᵢ² or to Xᵢ. In the two cases
the data transformation leads to re-
specification of the model as follows:
1. If σᵢ² ∝ Xᵢ²:  Yᵢ/Xᵢ = β₁(1/Xᵢ) + β₂ + uᵢ/Xᵢ = β₁(1/Xᵢ) + β₂ + vᵢ
2. If σᵢ² ∝ Xᵢ:   Yᵢ/√Xᵢ = β₁(1/√Xᵢ) + β₂√Xᵢ + uᵢ/√Xᵢ = β₁(1/√Xᵢ) + β₂√Xᵢ + vᵢ
Example: Estimated equation for R&D
expenditure and sales of 18 industry groups
R&D = 192.99 + 0.0319 Sales
s.e.  (990.99)  (0.0083)
t     (0.1948)  (3.8434)    R² = 0.47
A case of the error variance being proportional
to sales volume is observed.
A square-root transformation has therefore been
applied and the equation re-estimated.
Example (contd.)
R&D/√Sales = −246.68 (1/√Sales) + 0.0368 √Sales
s.e.           (341.13)             (0.0071)
t              (0.6472)             (5.17)     R² = 0.47
Multiplying both sides by √Sales recovers the original
equation.
The estimates of the dependent variable can then be found
using the revised statistics.
Refer to Chapter 11 in Basic Econometrics by D.N.
Gujarati on heteroscedasticity and its relevance in
different situations for data analysis.
(Proofs have not been given here. Students may see
these in Johnston's book.)
Exercise 2.1
Use cross-section data from your MR exercise.
Take the variable having the highest
standardized coefficient.
Use a two-variable regression equation.
Test for the presence of heteroscedasticity.
If it is present, transform the model.
Refer to Chapter 11 of D.N. Gujarati's book; you
may ignore the proofs.
Prepare PPTs for presentation and discussion in
the next class.
MULTICOLLINEARITY
& AUTOCORRELATION
Multicollinearity
• The theory of causation and multiple causation
• Interdependence between the Independent Variables and
variability of Dependent Variables
• Parsimony and Linear Regression
• Theoretical consistency and Parsimony

[Diagram: dependent variable Y with explanatory variables X1–X5, illustrating overlap/interdependence among the regressors]
• One of the assumptions of the CLRM is that
there is no multicollinearity among the
explanatory variables.
• Multicollinearity refers to a perfect or exact
linear relationship among some or all
explanatory variables.
Example:
X1    X2    X*2
10    50    52
15    75    75
18    90    97
24   120   129
30   150   152
X2i = 5X1i, and X*2 was created by adding 2,
0, 7, 9 & 2 (from a random number table) to X2.
Here r(X1, X2) = 1 and r(X2, X*2) = 0.99.
X1 & X2 show perfect multicollinearity;
X2 & X*2 show near-perfect multicollinearity.
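The numerical consequence can be verified directly; a small sketch reproducing the slide's figures:

```python
import numpy as np

x1 = np.array([10., 15., 18., 24., 30.])
x2 = 5 * x1                                    # X2 = 5*X1: perfect collinearity
x2_star = x2 + np.array([2., 0., 7., 9., 2.])  # random-table additions

print(np.corrcoef(x1, x2)[0, 1])               # r = 1.0  (perfect)
print(np.corrcoef(x2, x2_star)[0, 1])          # r ~ 0.99 (near-perfect)

# With perfect collinearity the design matrix loses rank, so X'X is
# singular and the OLS coefficients are not uniquely determined.
X = np.column_stack([np.ones(5), x1, x2])
print(np.linalg.matrix_rank(X))                # 2, although X has 3 columns
```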
• The problem of multicollinearity and its
degree varies with the type of data.
• Overlap between the explanatory variables
indicates its degree.
Example:
Y = a + b1X1 + b2X2 + u
where
Y = Consumption Expenditure, X1 = Income, X2 = Wealth
Consumption expenditure depends on income (X1) and
wealth (X2).
• The estimated equation from a set of data is as follows:
Ŷ = 24.77 + 0.94X1 − 0.04X2
't':    (3.66)   (1.14)   (0.52)
R² = 0.96   adjusted R² = 0.95   F = 92.40
The individual β coefficients are not significant, although the 'F'
value suggests a high degree of association.
The coefficient of X2 has the wrong sign.
The fact that the ‘F’ test is significant but the ‘t’
values of X1 and X2 are individually
insignificant means that the two variables are
so highly correlated that it is impossible to
isolate the individual impact of either income
or wealth on consumption.
Let us regress X2 on X1:
X2 = 7.54 + 10.19X1
't' = (0.25) (62.04),  R² = 0.99
This shows near-perfect multicollinearity
between X2 and X1.
Y on X1:  Ŷ = 24.24 + 0.51X1,  't' = (3.81) (14.24),  R² = 0.96
Y on X2:  Ŷ = 24.41 + 0.05X2,  't' = (3.55) (13.29),  R² = 0.96
Wealth also has a significant impact when taken alone.
Dropping the highly collinear variable has made the
other variable significant.
Sources of Multicollinearity
(Proofs are not given)

• Data collection method employed:
Sampling over a limited range of the values taken by
the regressors in the population.
• Constraints on the model or in the population
being sampled:
Regression of electricity consumption on income and
house size. There is a constraint: families with
higher income may have larger homes.
Sources of Multicollinearity (Contd.)
• Model specification:
Adding polynomial terms to a model when the
range of the X variable is small.
• An overdetermined model:
This happens when the model has more
explanatory variables than observations.
• Use of time series data:
The regressors may share a common trend.
Practical Consequences of
Multicollinearity:
In cases of near-perfect or high multicollinearity one
is likely to encounter the following consequences:
1. The OLS estimators have large variances and
covariances, making precise estimation difficult.
2. (a) Because of '1', the confidence intervals tend to
be much wider, leading to the acceptance of the
"zero" null hypothesis (i.e. that the true population
coefficient is zero) more readily.
(b) Because of '1', the 't' ratios of one or more
coefficients tend to be statistically insignificant.
(Proofs for these may be seen in Johnston's book.)
Detection of Multicollinearity

1. High R² but few significant 't' values.
2. High pairwise correlations among
regressors (seen from the correlation
matrix).
3. Examination of partial correlations.
4. Auxiliary regressions and F-test
(regress each Xᵢ on the remaining Xs,
find the 'F' values and decide).
5. Eigenvalues and the condition index.
Correlation Matrix
      X1    X2    X3    X4
X1   1.00  0.35  0.61  0.32
X2   0.35  1.00  0.95  0.41
X3   0.61  0.95  1.00  0.56
X4   0.32  0.41  0.56  1.00
Remedial Measures
1. A priori information and articulation
2. Dropping a highly collinear variable
3. Transformation of the data
4. Additional information or new data
5. Identifying the purpose and reducing the
degree of multicollinearity; or simply
identifying it, if the purpose is prediction.
AUTOCORRELATION
The assumption E(UU′) = σ²I means:
• Each u distribution has the same variance
(homoscedasticity)
• All disturbances are pairwise uncorrelated
This assumption gives

| Var(u₁)      Cov(u₁,u₂)   ...  Cov(u₁,uₙ) |     | σ²   0   ...  0  |
| Cov(u₂,u₁)   Var(u₂)      ...  Cov(u₂,uₙ) |  =  | 0    σ²  ...  0  |
| ...          ...          ...  ...         |     | ...  ... ...  ... |
| Cov(uₙ,u₁)   Cov(uₙ,u₂)   ...  Var(uₙ)    |     | 0    0   ...  σ² |

i.e. E(uᵢuⱼ) = 0 for all i ≠ j.
This assumption, when violated, leads to:
1. Heteroscedasticity
2. Autocorrelation
Covariance is the measure of how much
two random variables vary together (as
distinct from variance, which measures
how much a single variable varies).
The covariance between two random variables
X and Y is defined as
Cov(X, Y) = E[(X − μX)(Y − μY)]
where μX and μY are the expected values of X
and Y respectively.
If X and Y are independent, their covariance is
zero.
The assumption implies that the
disturbance term relating to any
observation is not influenced by the
disturbance term relating to any other
observation.
For example:
1. Suppose we are dealing with quarterly time
series data involving a regression of the
following specification (Time Series Data):
Output (Q) = f (Labour and Capital inputs)
[Table: quarterly observations Q1.1, Q1.2, ..., Qn.4 with inputs L, K and disturbances u1, u2, ..., un]
If output in one quarter is affected by a labour strike,
there is no reason to believe that this effect will be
carried over to the next quarter.
2. Let
Family Consumption Expenditure = f (Income)
(a regression involving Cross-Section Data)
[Table: families F1, F2, ..., Fn with incomes I1, I2, ..., In and disturbances u1, u2, ..., un]
The effect of an increase in one family's income on its
consumption expenditure is not expected to affect
the consumption expenditure of another family.

The reality in Examples 1 and 2:
1. Disturbances caused by a strike in one quarter may
affect production in another quarter.
2. Consumption expenditure of one family may
influence that of another family, i.e.
"keeping up with the Joneses" (the demonstration effect).
• Autocorrelation is a feature of most time-series data.
• In cross-section data it is referred to as 'spatial
autocorrelation'.
Reasons for autocorrelation
1. Inertia:
A salient feature of most economic series is
inertia.
Time series data such as PCI, price indices,
production, profit, employment, etc. exhibit
cycles. Starting at the bottom of a recession,
when economic recovery begins, most of these
series move upward. In this upswing the value
of the series at one point of time is greater
than its previous value. Thus there is a
"momentum" built into them, and it continues
until something happens to slow them down.
Therefore, in regressions involving time series data,
successive observations are likely to be inter-
dependent, which is reflected in a systematic pattern
of the uᵢs.
2. Specification bias:
Excluded variable(s) or incorrect functional form.
a) When some relevant variables have been excluded
from the model, their influence will show up as a
systematic pattern in the uᵢs.
b) In case of an incorrect functional form, i.e. fitting a
linear function when the true relationship is non-
linear (and vice versa), there will be either over-
estimation or under-estimation of the dependent variable.
Example:
(Correct)   MC = β₁ + β₂ Output + β₃ (Output)² + uᵢ
(Incorrect) MC = b₁ + b₂ Output + vᵢ
where vᵢ = β₃(Output)ᵢ² + uᵢ, and hence vᵢ will catch the
systematic effect of (Output)² on MC, leading to
serial correlation in the uᵢs.
3. Cobweb Phenomenon:
The supply of many agricultural commodities reflects the
so-called "cobweb phenomenon", where supply
reacts to price with a lag of one time period because
supply decisions take time to implement (the gestation
period).
Example: at the beginning of this year's planting of crops,
farmers are influenced by the price prevailing last year.
Suppose at the end of period t the price Pₜ
turns out to be lower than Pₜ₋₁. Therefore,
in period t+1 farmers may decide to
produce less than they did in period t.
Such phenomena are known as cobweb
phenomena, and they give a systematic
pattern to the uᵢs.
Similar problems arise with household expenditure,
share prices, etc. In general, when a lagged variable
that belongs in the model is omitted, the error term
will reflect a systematic pattern.
Consequences: (Proofs are not given)
In the presence of autocorrelation in a
model:
a) The residual variance is likely to under-
estimate the true σ².
b) R² is likely to be over-estimated.
c) 't' tests are not valid and, if applied, are likely
to give misleading conclusions.
The OLS estimators, although linear and unbiased,
do not have minimum variance, leading to invalid 't'
and 'F' tests.
Detection of autocorrelation:
The assumption of the CLRM relates to the
population disturbance terms, which are not directly
observable.
Therefore their proxies, the ûᵢs obtained from OLS,
are examined for the presence/absence of
autocorrelation.
There are various methods. Some of them are:
1. Graphical Method
2. Runs Test (a non-parametric test): examines the
signs of the residuals.
3. DW statistic:
d = Σₜ₌₂ⁿ (ûₜ − ûₜ₋₁)² / Σₜ₌₁ⁿ ûₜ²
Remedial Measures:
Data transformation by:
a) First-difference method (Xₜ − Xₜ₋₁); one
degree of freedom is lost.
b) 'ρ' transformation, with ρ estimated as
ρ̂ = Σ eₜ eₜ₋₁ / Σ eₜ²
The transformed model becomes
(Yₜ − ρYₜ₋₁) = β₁(1 − ρ) + β₂(Xₜ − ρXₜ₋₁) + uₜ*
i.e.  Yₜ* = β₁* + β₂* Xₜ* + uₜ*
This is known as the generalized or quasi-
difference equation.
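A sketch of the ρ transformation in Python, following the formulas above (y and x are illustrative time series arrays):

```python
import numpy as np
import statsmodels.api as sm

# Step 1: estimate rho from OLS residuals:
# rho_hat = sum(e_t * e_{t-1}) / sum(e_t^2)
e = sm.OLS(y, sm.add_constant(x)).fit().resid
rho = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)

# Step 2: quasi-difference the data and re-run OLS.
y_star = y[1:] - rho * y[:-1]       # Y*_t = Y_t - rho * Y_{t-1}
x_star = x[1:] - rho * x[:-1]       # X*_t = X_t - rho * X_{t-1}
gls = sm.OLS(y_star, sm.add_constant(x_star)).fit()

beta2 = gls.params[1]               # slope carries over directly
beta1 = gls.params[0] / (1 - rho)   # transformed intercept = beta1 * (1 - rho)
```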
Exercise 2.2 (Refer to Ch. 10 & 12 of
D.N. Gujarati)
Use time series data in MR.
Find the correlation table.
See the extent of multicollinearity.
Test for autocorrelation.
If it is present, use the 'ρ' transformation.
After addressing both problems, calculate forecast errors
and select the equation which gives the minimum
forecast error.
Both the exercises (2.1 and 2.2) will be presented by the
groups in the next class.
