Abstract
This note presents different orthogonalizations of regressors facing near-collinearity and constrained parameters regressions.
JEL classification: C10, C12
Keywords: Collinearity, Student t Statistic.
1. Introduction
Near-collinearity (or near-multicollinearity, or collinearity) between explanatory variables is defined by high values of the multiple correlations between explanatory variables in a multiple regression. Near-collinearity does not invalidate the assumptions of ordinary least squares as long as the collinearity is not perfect: in that case, the estimator cannot be computed. Near-collinearity may lead to high values of the estimated parameters, high values of the estimated variance of each parameter, and low values of the t statistics. Near-collinearity is a problem for the selection of variables.
As a consequence, a general to specific specification search usually orthogonalizes the regressors in order to select relevant variables. However, there exist different methods for the orthogonalization of regressors, leading to different t-statistics and different inference on the selected variables. This paper deals with two of these orthogonalization methods: the Gram-Schmidt-Choleski hierarchical orthogonalization, where the order of the variables matters in the orthogonalization, and a more "egalitarian" treatment of regressors: standard principal component analysis. Note that Hall and Fomby [2003] and Buse [1984] mention that the parameter of the favoured regressor in the Gram-Schmidt orthogonalization method may be biased and inconsistent.
∗ Centre d'Economie de la Sorbonne (CES), Paris School of Economics, and CEPREMAP, University Paris 1 Pantheon Sorbonne, E-mail: jean-bernard.chatelain@univ-paris1.fr
† ESCE, Graduate School of International Trade, Paris, CEPREMAP, E-mail: kirsten.ralf@esce.fr
On the other hand, Hendry and Nielsen [2007] mention that a hierarchy is explained
by substantive reasons, and propose to use it for quadratic models.
A particular case of models plagued by near-multicollinearity is multiple polynomial models, such as quadratic models with interaction terms (Hendry, Krölzig, etc.). Multiple polynomial models of order 2, 3 or 4 are a local approximation of more general non-linear models. As such, they have been used as a specification to test for non-linear models (White [1980], Castle and Hendry [2009]). Castle and Hendry [2008] propose a low-dimension collinearity-robust test for non-linearity. In an ingenious test, Castle and Hendry [2008] orthogonalize the quadratic and interaction terms using principal component analysis, which greatly reduces the dimensionality of the White [1980] test. A potential extension of their test is to orthogonalize first all the quadratic and interaction terms with respect to all the linear terms (favored by Hendry and Nielsen [2007]) and then go on with these residuals for their tests.
We propose here a non-linearity test where all the possible orthogonalizations, for all the possible hierarchies or orders of the Gram-Schmidt-Choleski procedure and for the principal components, are used: the specification which is chosen is the one with the highest t-statistic. This note thus presents different orthogonalizations of regressors facing near-collinearity and constrained parameters regressions.
To evaluate the procedure, we use a famous data set with four polar cases: Anscombe's quartet. We use Anscombe's first three small samples of 11 observations, which share the same correlation coefficient between the dependent variable and the regressor, and the same correlation coefficient between the regressor and its square (a high near-multicollinearity, but not an exact one, which is the case of Anscombe's fourth sample). The first sample is related to a true linear model, the second to a true quadratic model, the third to a true linear model with an outlier, and we construct a fourth one of a true quadratic model with zero or near-zero correlation of the dependent variable with the regressor and with the square of the regressor (described as a "near-spurious regression" in a companion paper, Chatelain and Ralf [2009]). The outcomes of the procedure are as follows.
- In the first case, a pair of non-orthogonalized near-collinear regressors would be eliminated in a general to specific approach (with a very low power of the t test, because the gain in R2 from adding one regressor with respect to the simple regression is very small), whereas the procedure that we propose would retain only the linear term, which is preferred to the first principal component and to the quadratic-term-only model.
- In the second case, a pair of non-orthogonalized near-collinear regressors is kept in a general to specific approach, as well as both orthogonalized regressors in the three methods (the two Gram-Schmidt orderings and the complete PC-OLS). A regression with near-collinear variables may be very precise, because the root mean square error of the residual is very small, so that orthogonalization is not necessary.
- In the third case, the method selects the square-term-only model (a non-linear model). However, this selection procedure fails to notice that the non-linear model is chosen only because of an outlier. The true model is a linear one with an outlier. A model with the linear term and a dummy for the outlier outperforms the quadratic model by a gain of 32% of variance for the coefficient of determination.
- In the fourth case, the method selects both regressors, with a large power of the t-test (the gain in R2 when adding any of the two regressors with respect to the simple regression is near unity). However, this result presents a substantive paradox. First, each of the regressors explains 0% or 2% of the variance of the dependent variable, whereas together they explain 100% of the variance of the dependent variable. Second, the second principal component, or the residual of one regressor with respect to the other (both new variables are close to the difference of the two near-collinear regressors), which accounts for 0.6% of the total variance of the two regressors, is able to explain 99.4% of the variance of the dependent variable. This challenges the commonly held view that the difference of a pair of near-collinear variables is usually not precisely estimated (e.g. Verbeek [2000], p. 40). That the selected relevant orthogonalized variable is the second axis (with the lowest variance of the regressors and hence a large standard error of its parameter) instead of the first one, as in the other cases (1, 2 and 3), is a particular property of near-spurious regressions. It is then a substantive issue to state whether the difference of these two variables makes sense in such a regression. This sheds light on an old debate with respect to selecting principal components on the basis of the eigenvalues or on the basis of t-statistics (Jolliffe). In a SAS book, the insight proposed was that there were outliers leading to specific effects, a mixture of the problem identified in case 2 (with r12 large) and the near-spurious regression of case 4.
Another section investigates the "ceteris paribus" interpretation. For example, Verbeek [2000] suggests that the "ceteris paribus" interpretation is not valid in case of near-multicollinearity. This section goes further, using the usual interpretation of standardized coefficients when they exceed unity. In this case, a ceteris paribus interpretation over-forecasts the extreme tails of the distribution of the dependent variable. In practice, full-effect simple correlations are reliable to evaluate the effect of near-collinear variables. The logical consequence of Verbeek's point is the following: the ceteris paribus interpretation is only exactly valid with orthogonalized regressors, or with complete simple regression effects, and not with partial correlation effects. The "ceteris paribus" interpretation of standardized coefficients larger than unity is dubious.
Finally, the paper discusses how cases 2 and 4 appear in practice in various literatures: investment and polynomial adjustment costs, and the aid and growth literature.
The paper proceeds as follows. Section 1 presents the orthogonalization of regressors. Section 2 presents various constrained parameters regressions.
2. Near-collinearity: Definition
Let us consider a regression on standardized variables (hence, there is no constant in the model). Bold letters correspond to matrices and vectors.

x1 = Xk β + ε

where x1 is the vector of the N observations of the explained variable, Xk is the matrix whose column i corresponds to the N observations of the regressor xi for 2 ≤ i ≤ k + 1, β is the vector of k parameters to be estimated, and ε is a vector of random disturbances following a normal distribution of mean zero and variance σ²ε.
Let us denote by Rk+1 the block sample correlation matrix between all variables, including the explained variable on the first row and column. The matrix Rk corresponds to the correlation matrix of the regressors. One has r²ij ≤ 1 for 1 ≤ i ≤ k + 1 and 1 ≤ j ≤ k + 1.

Rk+1 = [[1, r1′], [r1, Rk]], with Rk = (1/N) X′k Xk = [rij] for 2 ≤ i, j ≤ k + 1, and r1 = (1/N) X′k x1 = [r1i] for 2 ≤ i ≤ k + 1
Because r1 is a one column matrix, the Schur complement Rk+1 /Rk is a scalar.
It is equal to one minus the coefficient of determination of the multiple correlation, R²1.23...k+1, so that 0 ≤ Rk+1/Rk ≤ 1. A property of Schur complements is (Puntanen and Styan [2005]):

0 ≤ det(Rk+1) = det(Rk+1/Rk) · det(Rk) = (1 − R²1.23...k+1) det(Rk) ≤ det(Rk) ≤ 1
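As a quick numerical illustration (a sketch on simulated data, not taken from the paper; the variable names are ours), the Schur-complement identity linking det(Rk+1), det(Rk) and the coefficient of determination can be checked directly:

```python
import numpy as np

# Illustrative sketch: check det(R_{k+1}) = (1 - R^2_{1.23...k+1}) det(R_k)
# on simulated standardized data with two near-collinear regressors.
rng = np.random.default_rng(0)
N = 200
x2 = rng.standard_normal(N)
x3 = x2 + 0.05 * rng.standard_normal(N)        # near-collinear with x2
x1 = x2 - x3 + 0.1 * rng.standard_normal(N)    # explained variable

R3 = np.corrcoef(np.vstack([x1, x2, x3]))      # block matrix R_{k+1}
R2 = R3[1:, 1:]                                # regressor block R_k
r1 = R3[0, 1:]                                 # correlations with x1

# R^2 of the regression of x1 on (x2, x3), for standardized variables
R_squared = r1 @ np.linalg.solve(R2, r1)

lhs = np.linalg.det(R3)
rhs = (1.0 - R_squared) * np.linalg.det(R2)
print(abs(lhs - rhs) < 1e-10)                  # True
```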
Let us define:
(1) Exact collinearity between regressors for det(Rk) = 0.
(2) Near-collinearity between regressors when 0 < det(Rk) ≤ δ < 1, where δ is relatively small, defined by a rule of thumb such as δ = 0.1.
(3) An exact multiple regression (exact collinearity between the explained variable and its regressors) for det(Rk+1) = 0 and det(Rk) ≠ 0.
Problems related to near-collinearity:
(1) large (or small) estimated parameters.
(2) large (or small) estimated standard errors.
(3) small (or large) t statistics.
(4) small (or large) coefficient of determination.
(5) Sensitivity to the removal or the addition of observations in the sample: possibly large variations of all estimates, possible changes of signs of the parameters, while the parameters remain large in both the positive and the negative case.
(6) Poor out-of-sample forecasting properties, while the model may look good in the sample. This property is also related to some extent to "over-fitting": too many regressors increase the probability of near-collinearity.
(7) In automatic model selection, with near-collinearity between relevant and irrelevant regressors, it becomes difficult to correctly eliminate the irrelevant regressors.
This can however be avoided through orthogonalization. The order of the orthogonalization will matter and results in different selected models, which could be compared using encompassing and evaluated in the substantive context (Hendry and Nielsen [2007], page 297). For example, it has been observed that automatic selection models such as stepwise regression may not select the "best" variables when there is near-collinearity. These inference problems are mentioned as "pre-test bias", "selection bias", or "post-model selection bias" (Hendry and Nielsen [2007], page 113).
How frequent is near-collinearity?
- For time series:
(TS1) Dynamic models with lags: r (xt , xt−1 ) could be large.
(TS2) Time series as linear functions of time (common coefficient = common trend, or not), or non-linear functions of time (cyclical component, seasonal component): two regressors are functions of a third factor (time).
- For time series and cross sections:
(TSCS1) Non linear in the variables model: polynomial models (quadratic, cubic,
quartic):
- r(x, x^k) increases when the mean of the observations is far from zero.
- r(x^(2k+1), x^(2k′+1)) > r(x^(2k+1), x^(2k′)): the correlation of two odd (respectively two even) powers is higher than the correlation of an odd with an even power, when the mean of the sample observations is close to zero.
- Multiple polynomials: interaction terms. r(x1, x1x2) and r(x2, x1x2) are generally high correlations.
(TSCS2) Endogeneity:
- one of the regressors is endogenous and a function of the other: x3 = β̂23 x2 + ε23.
- regressors depend on a common (non-measured) third factor, which is not necessarily "time". For example, indicators measuring with error more or less the same phenomenon.
Estimated frequency: near-multicollinearity is likely to occur in one multiple regression out of five or ten.
The sum of the diagonal elements (the trace) of a correlation matrix is equal to
k because its diagonal elements are all equal to unity. The trace of a matrix is also
equal to the sum of its eigenvalues:
trace(Rk) = k = trace(PΛP⁻¹) = trace(Λ) = Σ_{i=1..k} λi ⇒ (1/k) Σ_{i=1..k} λi = 1
Hence, the average value of the eigenvalues of a correlation matrix is equal to unity.
Because a correlation matrix is positive semi-definite, all its eigenvalues are nonnegative. Hence, if there is one eigenvalue above unity, there necessarily exists another eigenvalue below unity.
The N × k matrix of mutually orthogonal principal components Z_Nk = X_Nk P stands in the following relation to X_Nk:

β̂PC = (Z′_Nk Z_Nk)⁻¹ Z′_Nk Y = (P′ X′_Nk X_Nk P)⁻¹ P′ X′_Nk Y
    = P′ (X′_Nk X_Nk)⁻¹ P P′ X′_Nk Y = P′ (X′_Nk X_Nk)⁻¹ X′_Nk Y = P′ β̂

β̂PC = P′ β̂ ⇔ β̂ = P β̂PC
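The identity β̂PC = P′β̂ and the equality of the fitted values can be illustrated numerically (a sketch with simulated data; the names are ours, not the paper's):

```python
import numpy as np

# Sketch: principal-component regression reproduces the OLS fit,
# with parameters rotated by the eigenvector matrix P.
rng = np.random.default_rng(1)
N = 500
X = rng.standard_normal((N, 2))
X[:, 1] = 0.97 * X[:, 0] + 0.05 * rng.standard_normal(N)  # near-collinear pair
X = (X - X.mean(0)) / X.std(0)                            # standardize
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.standard_normal(N)

R = X.T @ X / N                      # correlation matrix of the regressors
lam, P = np.linalg.eigh(R)           # eigenvalues and eigenvectors
Z = X @ P                            # principal components

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_pc = np.linalg.lstsq(Z, y, rcond=None)[0]

print(np.allclose(beta_pc, P.T @ beta_ols))    # True: beta_pc = P' beta_ols
print(np.allclose(X @ beta_ols, Z @ beta_pc))  # True: identical predictor
```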
The estimated residuals, the RMSE (the estimated standard error of the residuals), the coefficient of determination R² and the predictor x̂1 are identical in both the orthogonal and the non-orthogonal regression (they also exhibit the identical likelihood, see Hendry and Nielsen [2007], p. 106). Hence, the orthogonalization of regressors matters only for inference when selecting the relevant variables (t-test).
The estimated standard error of the estimated parameter for the ith principal component is the square root of the ith diagonal entry of the covariance matrix of parameters. It represents an orthogonal decomposition of the variance of the estimates before the orthogonalization:

V(β̂PC) = σ²ε · (Z′_Nk Z_Nk)⁻¹ = σ²ε · (P′ X′_Nk X_Nk P)⁻¹ = σ²ε · (NΛ)⁻¹
Because the variance of the principal components related to near-zero eigenvalues is very small, the parameters of these principal components are unlikely to be precisely estimated, except when the estimated error variance σ̂²ε is very small (or when det(Rk+1) is very small).
Conversely, for the non-orthogonal initial estimates, one has the variance orthogonal decomposition (provided by the SAS option PROC REG, Model / COLLIN):

σ²(β̂i) = σ²ε · [(X′_Nk X_Nk)⁻¹]ᵢᵢ = σ²ε · [(NPΛP′)⁻¹]ᵢᵢ = (σ²ε/N) Σ_{j=1..k} p²ij/λj

Bounds: Σ_{j=1..k} p²ij = 1, 0 ≤ p²ij ≤ 1
The properties of orthogonal matrices imply that the p²ij are bounded. The parameters will be more precisely estimated when one deletes the components with the smallest λj, which otherwise greatly increase the standard errors in case of near-collinearity.
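The decomposition of each parameter variance over the eigenvalues can be verified numerically (an illustrative sketch; σ²ε is set to an arbitrary value):

```python
import numpy as np

# Sketch: sigma^2(beta_i) = (sigma_eps^2 / N) * sum_j p_ij^2 / lambda_j,
# the diagonal of sigma_eps^2 * (X'X)^{-1} decomposed over eigenvalues.
rng = np.random.default_rng(2)
N = 300
X = rng.standard_normal((N, 3))
X[:, 2] = 0.96 * X[:, 1] + 0.1 * rng.standard_normal(N)   # near-collinearity
X = (X - X.mean(0)) / X.std(0)

R = X.T @ X / N
lam, P = np.linalg.eigh(R)

sigma2_eps = 0.5                                 # arbitrary error variance
cov_beta = sigma2_eps * np.linalg.inv(X.T @ X)   # = sigma2 * (N R)^{-1}
decomposed = (sigma2_eps / N) * (P**2 / lam).sum(axis=1)

print(np.allclose(np.diag(cov_beta), decomposed))   # True
print(np.allclose((P**2).sum(axis=1), 1.0))         # bound: rows sum to 1
```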
The incomplete principal component regression (IPC-OLS) estimate β̂IPC is formed by deleting the effects of certain components in estimating β. This amounts to replacing certain column orthonormal eigenvectors of P with zero vectors, resulting in a new matrix P∗. In this case, if the first regression is the true model, the estimates will be biased and the estimated residuals are different. The gain in the reduction of standard errors for the remaining components is not always granted, for two reasons:
- In some cases with more than 2 variables, p²ij may be very small as well.
- The omitted variable bias on β̂IPC due to the omission of some principal components may increase the residuals, which may increase the mean square error (MSE) used to estimate σ²ε, and determines an omitted variable bias on the estimated standard error of the estimated parameters of the remaining principal components. Then, in some cases (see Anscombe, case 2), omitting the second principal component does not reduce the standard error. This last problem is due to the fact that the deletion decisions are made without information regarding the correlation between the dependent variable and the regressors. In particular, in the trivariate case, there may be a problem when r12 − r13 turns out to be relatively large (for example over 0.03) along with r23 > 0.95.
Let us check that the matrix Z^S_Nk is orthogonal (its column vectors are mutually orthogonal), which is the case if the Gram matrix of the column vectors is diagonal (Gram-Schmidt orthogonalization procedure). If the Gram matrix of the column vectors is equal to the unit matrix, the variance of each column vector is equal to unity: the column vectors are standardized in the sample, due to the term D^(−1/2) (Gram-Schmidt orthonormalization procedure):

(1/N) Z^S′_Nk Z^S_Nk = (1/N) (X_Nk (L′k)⁻¹)′ X_Nk (L′k)⁻¹ = L⁻¹k ((1/N) X′_Nk X_Nk) (L′k)⁻¹ = L⁻¹k Lk L′k (L′k)⁻¹ = Ik
Substituting this equation in the regression equation for X_Nk, the relation between the dependent variable and the regressors is:

x1 = X_Nk β + ε = Z_Nk L′k β + ε = Z_Nk βGS + ε

β̂GS = (Z′_Nk Z_Nk)⁻¹ Z′_Nk Y = (L⁻¹ X′X (L′)⁻¹)⁻¹ L⁻¹ X′ Y = L′ (X′X)⁻¹ L L⁻¹ X′ Y = L′ (X′X)⁻¹ X′ Y = L′ β̂

β̂GS = L′ β̂ and β̂ = (L′)⁻¹ β̂GS
Solving the normal equations for the computation of the OLS estimate, knowing the Choleski factorization (1/N) X′k Xk = LL′, amounts to solving two triangular systems (with inverse matrices easy to compute):

(X′_Nk X_Nk) β̂ = LL′ β̂ = X′_Nk Y

solve L β̂GS = X′_Nk Y ⇒ β̂GS = L⁻¹ X′_Nk Y = Z′ Y

then solve L′ β̂ = β̂GS ⇒ β̂ = (L′)⁻¹ β̂GS
(X′_Nk X_Nk) β̂ = X′_Nk Y ⇒ (ZL′)′ ZL′ β̂ = (ZL′)′ Y ⇒ (LZ′ZL′) β̂ = LZ′Y ⇒ solve: L′ β̂ = β̂GS = Z′ Y
The estimated standard error of the estimated parameter for the ith Gram-Schmidt orthonormalized regressor is the square root of the ith diagonal entry of the covariance matrix of parameters:

V(β̂GS) = σ²ε · (Z^S′k Z^Sk)⁻¹ = σ²ε · (NI)⁻¹
On the other hand, for the non-orthogonal initial estimates, one has the variance orthogonal decomposition:

σ²(β̂i) = σ²ε · [(X′k Xk)⁻¹]ᵢᵢ = σ²ε · [(NLL′)⁻¹]ᵢᵢ = (σ²ε/N) Σ_{j=1..k} u²_{1,ij}/dj
This is a "hierarchical" orthogonalization, in the sense that the full effect (the simple correlation coefficient) is taken into account for the first variable and only the partial effect for the second variable (the in-sample residual of the auxiliary regression). By contrast, the principal component is an egalitarian divide of the information provided by each variable.
This is the method proposed by Hendry and Nielsen (Chapter 7, p. 107): "Orthogonalization requires a hierarchy of the regressors to determine which regressors should be orthogonalized with respect to which. In most applications, econometric models will have a richer structure than theory models. Such a framework could suggest a hierarchy". "The wage schooling relation gives an example, with the levels and squares of schooling being highly correlated. Given the context, it seems more natural to orthogonalize the square with respect to the linear term than vice versa" (Hendry and Nielsen, p. 137). A linear model is simpler than a non-linear model and may be given priority.
By contrast, Buse [1994] mentions that this hierarchical orthogonal model biases the estimate β̂12 of the first variable (by −β̂13 r23) and biases its estimated standard error (a biased and inconsistent estimator), because there is an omitted variable component r23 x2 in the regression. However, the full effect of the first variable is correct for a ceteris paribus interpretation of the parameter β̂1.2.
Incomplete regressions with orthogonal regressors:
(1) Principal component regression: one decides to omit the principal component
axis with the lowest eigenvalue (the lowest share of variance) or with the lowest non
significant t-statistic. This is known as the ”Incomplete Principal Component Regres-
sion”.
(2) With the hierarchic orthogonal regression, one decides to omit one of the vari-
ables, i.e.: the residual of a regression between two near-collinear variables.
(3) One may omit directly one of the two variables.
5. The Trivariate Regression
5.1. Properties of the trivariate regression
Let us now consider the trivariate case (k = 2) for standardized variables. With near-collinearity, precise regressions (relatively small standard errors and large t statistics) are easily found. In these cases, a metric able to evaluate whether coefficients are oversized is needed by authors, referees and journal editors. First, let us consider standardized parameters:

(x1 − x̄1)/σ1 = β̂12 (x2 − x̄2)/σ2 + β̂13 (x3 − x̄3)/σ3 + ε1.23
The interpretation of a standardized parameter is as follows: a deviation from the mean of the regressor x2 by one standard error σ2 of this regressor (that is, (x2 − x̄2)/σ2 = 1) implies a prediction x̂1 which deviates from the mean of the explained variable x1 by β̂^S12 times the standard error of the explained variable σ1. In the case of the simple regression, the standardized parameter is exactly equal to the correlation coefficient r12, which is such that |r12| ≤ 1. In multiple regression with near-collinearity, the standardized parameter can easily exceed unity.
The determinants of the correlation matrices, the coefficient of determination, the standardized coefficients, their standard errors, their t-statistics and the partial correlation coefficient are given by:

0 ≤ det(R3) = 1 − r²12 − r²13 − r²23 + 2 r12 r13 r23 ≤ det(R2) = 1 − r²23 ≤ 1

R²1.23 = 1 − det(R3)/det(R2) = (r²12 + r²13 − 2 r12 r13 r23)/(1 − r²23) = 1 − (1 − r²12)(1 − r²13.2)

(β̂12, β̂13)′ = (1/(1 − r²23)) [[1, −r23], [−r23, 1]] (r12, r13)′ = (1/(1 − r²23)) (r12 − r13 r23, r13 − r12 r23)′

σ̂ε = √MSE/√(N−2) = √(1 − R²1.23)/√(N−2) = (1/√(N−2)) √(det(R3)/det(R2))

(σ̂β̂1, σ̂β̂2)′ = RMSE √(diag((1/det(R2)) [[1, −r23], [−r23, 1]])) = (√det(R3)/√(N−2)) (1/(1 − r²23)) (1, 1)′

(tβ̂1, tβ̂2)′ = (√(N−2)/√det(R3)) (r12 − r13 r23, r13 − r12 r23)′

−1 < r12.3 = (r12 − r13 r23)/√((1 − r²13)(1 − r²23)) = t12/√(t²12 + N − 2) = (√(1 − r²23)/√(1 − r²13)) β̂12 < 1
The relation between the partial correlation coefficient r12.3 (once the influence of the second explanatory variable x3 is removed) and the t12 statistic for the parameter β̂12 is given in Greene [2000], pp. 234-235.
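The closed-form trivariate coefficients can be checked against a direct least-squares fit (an illustrative sketch with simulated standardized data):

```python
import numpy as np

# Sketch: beta_12, beta_13 = (r12 - r13 r23, r13 - r12 r23) / (1 - r23^2)
# for standardized variables, versus a direct OLS fit.
rng = np.random.default_rng(4)
N = 400
x2 = rng.standard_normal(N)
x3 = 0.9 * x2 + 0.3 * rng.standard_normal(N)
x1 = x2 + 0.5 * x3 + rng.standard_normal(N)
x1, x2, x3 = [(v - v.mean()) / v.std() for v in (x1, x2, x3)]

r12, r13, r23 = (x1 @ x2) / N, (x1 @ x3) / N, (x2 @ x3) / N
beta_formula = np.array([r12 - r13 * r23, r13 - r12 * r23]) / (1 - r23**2)
beta_ols = np.linalg.lstsq(np.column_stack([x2, x3]), x1, rcond=None)[0]
print(np.allclose(beta_formula, beta_ols))   # True
```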
In the trivariate case, the three collinearity indicators (the determinant det(R2), the variance inflation factor (VIF) and the condition index CI) depend only on the correlation coefficient r23 between the regressors:

R2 = [[1, r23], [r23, 1]] = P2 Λ2 P2′ = [[1/√2, 1/√2], [−1/√2, 1/√2]] [[1 − r23, 0], [0, 1 + r23]] [[1/√2, 1/√2], [−1/√2, 1/√2]]′   (5.1)

with characteristic polynomial of R2: X² − 2X + 1 − r²23. The matrix of normalized eigenvectors is an orthogonal matrix, such that its inverse is equal to its transpose: P2⁻¹ = P2′. It corresponds here to a rotation of −45 degrees, with the following normalized eigenvectors and their respective eigenvalues (either both eigenvalues are equal to unity, or one eigenvalue is between 1 and 2 and the other one is between zero and one):
Table 1: Eigenvalues, percentage of variance for each principal component and normed eigenvectors of the correlation matrix R2 (λmin + λmax = k = 2):

                 | λmin = 1 − r23 (PC2)               | λmax = 1 + r23 (PC1)
over/below unity | 0 ≤ λmin = 2 − λmax = 1 − r23 ≤ 1  | 1 ≤ λmax = 1 + r23 ≤ 2 (if r23 ≥ 0)
% variance       | λmin/(λmin + λmax) = (1 − r23)/2   | λmax/(λmin + λmax) = (1 + r23)/2
x2               | 1/√2 = .70711 = cos(π/4)           | 1/√2 = .70711 = cos(π/4)
x3               | −1/√2 = −.70711 = −sin(π/4)        | 1/√2 = .70711 = sin(π/4)
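These eigenvalues and the ±45-degree eigenvectors hold for any 2 × 2 correlation matrix, as a quick check shows (an illustrative sketch):

```python
import numpy as np

# Sketch: a 2x2 correlation matrix has eigenvalues 1 - r23 and 1 + r23,
# and its normalized eigenvectors are (1, -1)/sqrt(2) and (1, 1)/sqrt(2)
# (up to sign), whatever the value of r23.
r23 = 0.98818
R2 = np.array([[1.0, r23], [r23, 1.0]])
lam, P = np.linalg.eigh(R2)                    # eigenvalues in ascending order

print(np.allclose(lam, [1 - r23, 1 + r23]))    # True
print(np.allclose(np.abs(P), 1 / np.sqrt(2)))  # True: all entries +/- 1/sqrt(2)
```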
Let us now turn to the matrix R3, taking into account the orthogonalization according to the principal component analysis of the regressors:

R3 = G3 M3 G3′

[ 1    r12   r13 ]   [ 1    0      0    ] [ 1              (r12−r13)/√2   (r12+r13)/√2 ] [ 1    0      0    ]′
[ r12  1     r23 ] = [ 0    1/√2   1/√2 ] [ (r12−r13)/√2   1 − r23        0            ] [ 0    1/√2   1/√2 ]
[ r13  r23   1   ]   [ 0   −1/√2   1/√2 ] [ (r12+r13)/√2   0              1 + r23      ] [ 0   −1/√2   1/√2 ]
As demonstrated in the general case, the regression with all principal components has the same residuals as the initial regression, with the following relation between parameter estimates:

β̂PC = P′ β̂ and β̂ = P β̂PC

(β̂1,PC2, β̂1,PC1)′ = [[1/√2, −1/√2], [1/√2, 1/√2]] (β̂12, β̂13)′ = ((β̂12 − β̂13)/√2, (β̂12 + β̂13)/√2)′
β̂1,PC1 = β̂^S1,PC1 · σx1/σ(x2+x3)/√2 = β̂^S1,PC1 · (1/√(1 + r23)) ⇒ β̂^S1,PC1 = ((r12 + r13)/√2) · (1/√(1 + r23))
The estimated standard error of the estimated parameter for the ith principal component is given by the square root of the ith diagonal entry of:

V(β̂PC) = σ²ε · (Z′k Zk)⁻¹ = σ²ε · (P′ X′k Xk P)⁻¹ = σ²ε · (NΛk)⁻¹

V(β̂^S_PC) = σ²ε · (NIk)⁻¹

t(β̂^S1,PC1) = √N β̂^S1,PC1/σ̂ε = (√(N−2) √(1 − r²23)/√det(R3)) · (r12 + r13)/(√2 √(1 + r23))

t(β̂^S1,PC2) = √N β̂^S1,PC2/σ̂ε = (√(N−2) √(1 − r²23)/√det(R3)) · (r12 − r13)/(√2 √(1 − r23))
The matrix R3, taking into account the Gram-Schmidt orthogonalization of the regressors, can be written as:

R3 = H3 N3 H3′

[ 1    r12   r13 ]   [ 1  0    0 ] [ 1              r12   r13 − r12 r23 ] [ 1  0    0 ]′
[ r12  1     r23 ] = [ 0  1    0 ] [ r12            1     0             ] [ 0  1    0 ]
[ r13  r23   1   ]   [ 0  r23  1 ] [ r13 − r12 r23  0     1 − r²23      ] [ 0  r23  1 ]

With H2 the lower-right 2 × 2 block of H3, the Gram-Schmidt parameters are β̂GS = H2′ β̂:

(β̂GS1, β̂GS2)′ = [[1, r23], [0, 1]] (β̂12, β̂13)′ = (β̂12 + r23 β̂13, β̂13)′ = (r12, (r13 − r12 r23)/(1 − r²23))′   (5.6)
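The full-simple-effect property of the favored regressor in (5.6), β̂GS1 = r12, can be verified numerically (an illustrative sketch with simulated data):

```python
import numpy as np

# Sketch of the hierarchical (Gram-Schmidt) orthogonalization: regress x1
# on x2 and on the residual of x3 regressed on x2; the coefficient of the
# favored regressor x2 then equals its simple correlation r12.
rng = np.random.default_rng(5)
N = 400
x2 = rng.standard_normal(N)
x3 = 0.97 * x2 + 0.2 * rng.standard_normal(N)
x1 = x2 - 0.5 * x3 + 0.3 * rng.standard_normal(N)
x1, x2, x3 = [(v - v.mean()) / v.std() for v in (x1, x2, x3)]

r12, r23 = (x1 @ x2) / N, (x2 @ x3) / N
e3 = x3 - r23 * x2                              # residual of x3 on x2
b = np.linalg.lstsq(np.column_stack([x2, e3]), x1, rcond=None)[0]
print(np.allclose(b[0], r12))                   # True
```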
t(β̂^S_GS1) = √N β̂^S_GS1/σ̂ε = (√(N−2)/√det(R3)) √(1 − r²23) r12

t(β̂^S_GS2) = √N β̂^S_GS2/σ̂ε = (√(N−2) √(1 − r²23)/√det(R3)) · (r13 − r12 r23)/√(1 − r²23)
If the estimated parameter of ε3.2 is small and not significantly different from zero, then one removes this variable. The regression is then a constrained regression, assuming that the parameter of this orthogonal component is equal to zero.
5.4. A Comparison of Critical Regions for t statistics with or without orthogonalization of regressors.
In the four trivariate regressions, the RMSE is identical:

σ̂ε = √MSE/√(N−2) = √(1 − R²1.23)/√(N−2) = (1/√(N−2)) √(det(R3)/det(R2))
But the standardized parameters and the t-statistics are different. The VIF differs from unity when there is no orthogonalization (table 3):

t · √MSE/√(N−k) = β̂^S/√VIF
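For reference, the VIF of each standardized regressor is the corresponding diagonal entry of the inverse correlation matrix; in the bivariate case it is 1/(1 − r²23). A quick check with the r23 of the Anscombe regressors used later (a sketch; the names are ours):

```python
import numpy as np

# Sketch: VIF_i = [(R_2)^{-1}]_{ii}; in the bivariate case both VIFs
# equal 1/(1 - r23^2), about 42.5 for r23 = 0.98818.
r23 = 0.98818
R2 = np.array([[1.0, r23], [r23, 1.0]])
vif = np.diag(np.linalg.inv(R2))

print(np.allclose(vif, 1.0 / (1.0 - r23**2)))   # True
print(round(float(vif[0]), 2))                  # 42.55
```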
Table 3: Standardized parameters and t statistics.

Regression                 | β̂^S                        | √VIF        | √MSE/√(N−k)         | t·√MSE/√(N−k) = β̂^S/√VIF  | R²
A:  x                      | r12                        | 1           | √(1−r²12)/√(N−1)    | r12                        | r²12
B:  x²                     | r13                        | 1           | √(1−r²13)/√(N−1)    | r13                        | r²13
C:  (x+x²)/√2              | (r12+r13)/(√2 √(1+r23))    | 1           | √(1−R²)/√(N−1)      | (r12+r13)/(√2 √(1+r23))    | (r12+r13)²/(2(1+r23))
D:  (x−x²)/√2              | (r12−r13)/(√2 √(1−r23))    | 1           | √(1−R²)/√(N−1)      | (r12−r13)/(√2 √(1−r23))    | (r12−r13)²/(2(1−r23))
E:  (x²−r23 x)/√(1−r²23)   | (r13−r12 r23)/√(1−r²23)    | 1           | √(1−R²)/√(N−1)      | (r13−r12 r23)/√(1−r²23)    | (r13−r12 r23)²/(1−r²23)
F:  (x−r23 x²)/√(1−r²23)   | (r12−r13 r23)/√(1−r²23)    | 1           | √(1−R²)/√(N−1)      | (r12−r13 r23)/√(1−r²23)    | (r12−r13 r23)²/(1−r²23)
G1: x                      | (r12−r13 r23)/(1−r²23)     | 1/√(1−r²23) | √(1−R²1.23)/√(N−2)  | (r12−r13 r23)/√(1−r²23)    | R²1.23
G2: x²                     | (r13−r12 r23)/(1−r²23)     | 1/√(1−r²23) | −                   | (r13−r12 r23)/√(1−r²23)    | −
H1: (x+x²)/√2              | (r12+r13)/(√2 √(1+r23))    | 1           | √(1−R²1.23)/√(N−2)  | (r12+r13)/(√2 √(1+r23))    | R²1.23
H2: (x−x²)/√2              | (r12−r13)/(√2 √(1−r23))    | 1           | −                   | (r12−r13)/(√2 √(1−r23))    | −
I1: x                      | r12                        | 1           | √(1−R²1.23)/√(N−2)  | r12                        | R²1.23
I2: (x²−r23 x)/√(1−r²23)   | (r13−r12 r23)/√(1−r²23)    | 1           | −                   | (r13−r12 r23)/√(1−r²23)    | −
J1: x²                     | r13                        | 1           | √(1−R²1.23)/√(N−2)  | r13                        | R²1.23
J2: (x−r23 x²)/√(1−r²23)   | (r12−r13 r23)/√(1−r²23)    | 1           | −                   | (r12−r13 r23)/√(1−r²23)    | −

(A dash means that the entry is that of the same regression as in the preceding row; in rows C to F, R² refers to the value in the last column of the same row.)
The discussion is with respect to:

A) The Gram-Schmidt procedure (with a hierarchy such that x is the first variable) leads to reject the null hypothesis more frequently than the non-orthogonalized regression when N > 100:

t(G) = t(J) = |(r12 − r13 r23)/((1 − r²23) √(1 − r²23))| < 1.96 √MSE/√(N−k) < t(I) = |r12|

⇒ r12 − r13 r23 < r12 · (1 − r²23)^(3/2) when r12 > 0

⇒ 1.96 √MSE/√(N−k) < r12 < (r23/(1 − (1 − r²23)^(3/2))) · r13 when r23 ≠ 0, and

r23/(1 − (1 − r²23)^(3/2)) ≥ 1 and lim(r23→1) r23/(1 − (1 − r²23)^(3/2)) = 1
B) The complete PC-OLS leads to reject the first principal component null hypothesis more frequently than the non-orthogonalized regression when:

t(G) = t(J) = (r12 − r13 r23)/((1 − r²23) √(1 − r²23)) < 1.96 √MSE/√(N−k) < t(H) = (r12 + r13)/(√2 √(1 + r23))

⇒ r12 − r13 r23 < ((r12 + r13)/√2) · ((1 − r²23)^(3/2)/√(1 + r23))

⇒ r12 < r13 · ((1 − r²23)^(3/2)/(√2 √(1 + r23)) + r23) · 1/(1 − (1 − r²23)^(3/2)/(√2 √(1 + r23)))

check sign.
C) The complete PC-OLS leads to reject the second principal component null hypothesis more frequently than the non-orthogonalized regression when:

t(G) = t(J) = (r12 − r13 r23)/((1 − r²23) √(1 − r²23)) < 1.96 √MSE/√(N−k) < t(H) = (r12 − r13)/(√2 √(1 − r23))

⇒ r12 − r13 r23 < ((r12 − r13)/√2) · ((1 − r²23)^(3/2)/√(1 − r23))

⇒ r12 < r13 · (−(1 − r²23)^(3/2)/(√2 √(1 − r23)) + r23) · 1/(1 − (1 − r²23)^(3/2)/(√2 √(1 − r23)))
Then, orthogonalized regressors can turn out to be significant where the corresponding non-orthogonalized regressors are not significantly different from zero when there is near-collinearity.
Table 3: Anscombe data set correlation matrix

Case | r12(x1, x2) | r13(x1, x2²) | r23(x2, x2²) | det(R2) | det(R3)       | R²1.23 | R²1.2 = r²12
1    | 0.81642     | 0.78466      | 0.98818      | 0.0235  | 0.00734       | 0.687  | 0.666
2    | 0.81624     | 0.71801      | 0.98818      | 0.0235  | ("−3") × 10⁻⁶ | 1      | 0.666
3    | 0.81629     | 0.82742      | 0.98818      | 0.0235  | 0.00741       | 0.685  | 0.666
4    | 0.81652     | 0.81652      | 1            | 0       | 0             | −      | 0.666
In all cases, Anscombe constructed the data sets such that their sample correlation coefficients are nearly identical, r12(x1, x2) = 0.816. Hence, the four simple regressions have the same slope.
In the fourth case, the correlation r23(x2, x2²) between x2 and x2² is equal to one, because x2 has only two distinct numerical values (one corresponds to 10 observations and the other one to 1 observation). The square of a binary variable remains a binary variable. Hence, there is exact collinearity between the two regressors, and one has r12 = r13. As well, the trivariate regression cannot be computed in this case, because det(R2) = 0.
The observations of x2 are identical between the cases one, two and three. Hence,
the correlation coefficients r23 (x2 , x22 ) are identical and equal to 0.98818 > 0.95. There
is near-multicollinearity.
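This r23 can be reproduced from the x values shared by Anscombe's first three data sets (a quick check; the data vector below is the standard Anscombe x):

```python
import numpy as np

# The x values common to Anscombe's first three data sets; the correlation
# between x and x^2 reproduces the r23 = 0.98818 reported above.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
r23 = np.corrcoef(x, x**2)[0, 1]
print(round(r23, 5))   # 0.98818
```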
In cases 1 and 3, the gain in the coefficient of determination R²1.23 − R²1.2 ≈ 0.02 is small when adding the third variable (the quadratic term).
The principal components of the matrix of the regressors are:

Case    | r23(x2, x2²) | det(R2) | λmax = 1 + r23 | % Var | λmin = 1 − r23 | % Var
1, 2, 3 | 0.98818      | 0.0235  | 1.98818        | 99.4% | 0.01182        | 0.6%
4       | 1            | 0       | 2              | 100%  | 0              | 0%
The percentages of variance accounted for by each principal component are 99.4% and 0.6%. However, this does not necessarily mean that the second component does not matter in a regression with x1, as shown in case 2.
The variance of an OLS regression coefficient can be shown to be equal to the residual variance (the MSE) multiplied by the sum of the variance decomposition proportions (VDP) over all eigenvalues. The most common threshold for a high VDP is VDP > 0.50, associated with a high condition index:

CI = √(λmax/λmin) = √((1 + r23)/(1 − r23)) = √((1 + 0.98818)/(1 − 0.98818)) = 12.969 > 10
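The condition index computation can be reproduced directly (a one-line check):

```python
import numpy as np

# Sketch: condition index CI = sqrt(lambda_max / lambda_min) for the
# Anscombe regressors, crossing the usual CI > 10 rule of thumb.
r23 = 0.98818
CI = np.sqrt((1 + r23) / (1 - r23))
print(round(float(CI), 3))   # 12.969
```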
The variance decomposition is:
Let us investigate the variance of the parameters of the incomplete principal components regression. Decomposing the variance of the non-orthogonalized standard error as follows shows that omitting the second principal component axis sharply decreases the standard error of the remaining first principal component. Here, it eliminates 99.4% of the variance decomposition proportion (cf. the proportion of variance of the other principal component) related to the smallest eigenvalue λ2 = 1 − r23 = 0.01182.

σ̂(β̂^S) = (RMSE/√(N−1)) √VIF
        = RMSE · (6.5222/3.1623) = RMSE · 2.0625 with near-multicollinearity
        or RMSE · (1/3.1623) = RMSE · 0.31623 with orthogonalized variables
Regression A is the simple regression. Regression B is the incomplete principal
component regression keeping the first principal component. Regression C is the
trivariate regression with multicollinearity. Regression D is the complete PC-OLS
regression.
- Simple regressions A, B and C are very close. Taking the sum of the two variables (regression C) leads to an averaging of regressions A and B, with a small decrease of R² with respect to the best regression A.
- Comparing regression A with regression C: the standardized parameter of the
variable x is multiplied by 2.1381 (it is over one, so it cannot be interpreted as a
"ceteris paribus" effect), but its standard error is multiplied by 6.6572 = √VIF ·
RMSE(C)/RMSE(A): it is no longer significant. As the RMSE is similar in model
A and model C, the increase of the estimated standard error is due to the variance
inflation factor (42.5 times the variance of a model where the variables are orthogonal;
it multiplies the standard error by √VIF = 6.5222). For the trivariate regression, the
estimated standard errors of the standardized parameters
are the same for both parameters (this is not the case for non-standardized
parameters, where the standard error is multiplied by σ(x1)/σ(xi)). The parameter
of x² is negative (perhaps an unexpected sign), large, and close to the difference for
the parameter of x in the simple regression. Its standard error is relatively large, and
it is not significantly different from zero. The gain in R² is very small.
- Comparing model A with model D: the standardized coefficient is close to that of the
simple regression, with a similar t statistic (t statistics are identical for standardized and
non-standardized parameters). The second principal component is not significantly
different from zero.
- Comparing model E with model A and model D: the standardized coefficient is
identical to that of the simple regression. However, its estimated standard error is identical
to the one of the PC-OLS, so its t-statistic is very small. The standardized orthogonal
residual of the square x² with respect to x has a negative and relatively small
standardized coefficient. It is not significantly different from zero.
A general to specific specification search based on t-tests without orthogonalization,
starting from regression C, would lead to rejecting both explanatory variables at the next
step. By contrast, an orthogonalized regression (D or E) would lead to keeping the first
principal component and eliminating the second principal component.
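The contrast between the two selection paths can be illustrated by simulation; a sketch on artificial data (not the paper's Anscombe data), where the dependent variable loads only on the first principal component of a near-collinear pair:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(4, 5, n)     # raw regressor; on this support x and x**2 are near-collinear
q = x**2

def std(v):
    """Standardize to zero mean and unit variance."""
    return (v - v.mean()) / v.std()

x_s, q_s = std(x), std(q)
r = np.corrcoef(x_s, q_s)[0, 1]          # close to 1

# principal components of the standardized pair, scaled to unit variance
pc1 = (x_s + q_s) / np.sqrt(2 * (1 + r))
pc2 = (x_s - q_s) / np.sqrt(2 * (1 - r))

# the dependent variable loads only on the first principal component
y = pc1 + rng.normal(0, 1, n)

def ols_t(X, y):
    """t-statistics of OLS without intercept (all columns are centered)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b / np.sqrt(s2 * np.diag(XtX_inv))

t_raw = ols_t(np.column_stack([x_s, q_s]), y)  # both small: low power under collinearity
t_pc = ols_t(np.column_stack([pc1, pc2]), y)   # first component strongly significant
```

With this design, the raw t-statistics are both small while the first principal component is strongly significant, reproducing the selection reversal described above.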
Note that the reverse error could be true (cf. Roodman's [2008] example with
aid/GDP and aid/GDP*tropics): the t-statistics could lead to keeping both explanatory
variables, whereas in an orthogonalized regression only one of them, or none of them,
is accepted.
POWER: The problem of near-multicollinearity with respect to the power of t-tests
is the following:
- In regression D, both t-tests have low power, because R²(full model) − R²(restricted
model) is small for both variables.
- By contrast, in hierarchical orthogonalized regressions, the t-test of the favoured vari-
able has high power because R²(full model) − R²(restricted model) is
large. The residual variable has the same power as in the non-orthogonalized
regression D.
- With the complete PC-OLS regression, the power of the t-test of the second principal
component lies between the powers of the residual variables in the hierarchical
models.
6.2. Case 2: Non-linear quadratic model
Table 5: Anscombe's true quadratic model regressions (A to G):

Regression | β̂S | σ̂(β̂S) | VIF | RMSE | t | Power | R²
A: x | 0.81624 | 0.18269 | 1 | 0.57772 | 4.47 | 0.986 | 0.6662
B: x² | 0.71801 | 0.22011 | 1 | 0.69604 | 3.26 | 0.860 | 0.5155
C: (x + x²)/√(2(1 + r23)) | 0.76940 | 0.20200 | 1 | 0.63877 | 3.81 | 0.943 | 0.5920
D: x | 4.53964 | 0.00160 | 42.53846 | 0.00077613 | 2835.93 | >.999 | 1
D: x² | -3.76796 | 0.00160 | 42.53846 | - | -2353.90 | >.999 | -
E: (x + x²)/√(2(1 + r23)) | 0.76940 | 0.00024543 | 1 | 0.00077613 | 3134.85 | - | 1
E: (x − x²)/√(2(1 − r23)) | 0.63877 | 0.00024543 | 1 | - | 2602.60 | >.999 | -
F: x | 0.81624 | 0.00024543 | 1 | 0.00077613 | 3325.68 | - | 1
F: (x² − r23·x)/√(1 − r23²) | -0.57772 | 0.00024543 | 1 | - | -2353.9 | >.999 | -
G: x² | 0.71801 | 0.00024543 | 1 | 0.00077613 | 2925.46 | - | 1
G: (x − r23·x²)/√(1 − r23²) | 0.69603 | 0.00024543 | 1 | - | 2835.93 | >.999 | -
- Simple regressions A, B and C are very close. Taking the sum of the two variables
(regression C) leads to an averaging of regressions A and B, with a small decrease of
R² with respect to the best regression A.
- Comparing regression A with regression D: the standardized parameter of the
variable x is multiplied by 5.5616 (it is over one, so it cannot be interpreted as a "ceteris
paribus" effect). Its standard error is multiplied by 0.00876 = √VIF · RMSE(D)/RMSE(A):
it is highly significantly different from zero. As the RMSE is close to zero in model
D, the estimated standard error of this standardized parameter is close to zero despite
the high variance inflation factor (42.5 times the variance of a model where the variables
are orthogonal; it multiplies the standard error by √VIF = 6.5222). The parame-
ter of x² is negative, large, and close to the difference for the parameter of x in the
simple regression. Its standard error is identical to the one of the other standardized
parameter. It is highly significantly different from zero. The gain in coefficient of
determination R²1.23 − R²1.2 ≈ 0.33 is large when adding the variable x² to the model,
whereas it does not matter in cases 1 and 3. In case 2, the model corresponds to
a nearly exact regression, det(R3) ≈ 0, with negligible residuals, negligible root mean
square error, and a coefficient of determination equal to unity.
- Comparing model A with model E: the standardized coefficient is close to that of the
simple regression, with a very large t statistic (t statistics are identical for standardized
and non-standardized parameters). The second principal component has a relatively
high standardized parameter and it is significantly different from zero.
- Comparing model F with model A and model E: the standardized coefficient is
identical to that of the simple regression. However, its estimated standard error is identical
to the one of the PC-OLS and very small, so its t-statistic is very large. The
standardized orthogonal residual of the square x² with respect to x has a negative
and relatively small standardized coefficient. It is highly significantly different from zero.
A general to specific specification search based on t-tests without orthogonalization,
starting from regression D, would rightly accept both explanatory variables, as
would a complete PC-OLS regression.
Note that including the second principal component is easier to obtain than in-
cluding the Gram Schmidt residual variable, because the information content between
the variables is less asymmetric.
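The two orthogonalization schemes compared in this section can be written down explicitly; a sketch on artificial data (illustrative, not the paper's data):

```python
import numpy as np

def std(v):
    """Standardize to zero mean and unit variance."""
    return (v - v.mean()) / v.std()

rng = np.random.default_rng(1)
x_raw = rng.uniform(4, 5, 500)          # illustrative regressor
x, x2 = std(x_raw), std(x_raw**2)       # standardized variable and its square
r = np.corrcoef(x, x2)[0, 1]            # close to 1

# Gram-Schmidt (hierarchical): keep x, replace x2 by its scaled residual on x
gs = (x2 - r * x) / np.sqrt(1 - r**2)

# principal components: an egalitarian rotation of (x, x2)
pc1 = (x + x2) / np.sqrt(2 * (1 + r))
pc2 = (x - x2) / np.sqrt(2 * (1 - r))
```

Both constructions produce exactly uncorrelated regressors in sample; they differ in that Gram Schmidt keeps x intact, while the principal components treat x and x² symmetrically.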
6.4. Case 4: Quadratic model with near zero bivariate effects and large and
precise trivariate effects
The dependent variable is now the standardized residual x1 − r12 x2 from case 2. This
residual has a zero sample correlation with the variable x2 .
Table 7: Quadratic model with near zero bivariate effects and large and precise
trivariate effects

Regression | β̂S | σ̂(β̂S) | VIF | RMSE | t | R²
A: x | 0 | 0.31623 | 1 | 1.00000 | 0.00 | 0
B: x² | -0.15332 | 0.31249 | 1 | 0.98818 | -0.49 | 0.0235
C: (x + x²)/√(2(1 + r23)) | -0.07689 | 0.31529 | 1 | 0.99704 | -0.24 | 0.0059
D: (x − x²)/√(2(1 − r23)) | 0.99704 | 0.02432 | 1 | 0.07690 | 41.00 | 0.9941
E: (x² − r23·x)/σ | -1 | 0.00040303 | 1 | 0.00127 | -2481.2 | 1
F: (x − r23·x²)/σ′ | 0.98818 | 0.04849 | 1 | 0.15333 | 20.38 | 0.9765
G1: x | 6.44503 | 0.00277 | 42.53846 | 0.00134 | 2326.03 | 1
G2: x² | -6.52215 | 0.00277 | 42.53846 | - | -2353.9 | -
H1: (x + x²)/√(2(1 + r23)) | -0.07689 | 0.00042483 | 1 | 0.00134 | -180.99 | 1
H2: (x − x²)/√(2(1 − r23)) | 0.99704 | 0.00042483 | 1 | - | 2346.89 | -
I1: x | 0 | 0.00042483 | 1 | 0.00134 | 0.00 | 1
I2: (x² − r23·x)/σ | -1 | 0.00042483 | 1 | - | -2353.9 | -
J1: x² | -0.15332 | 0.00042483 | 1 | 0.00134 | -360.90 | 1
J2: (x − r23·x²)/σ′ | 0.98818 | 0.00042483 | 1 | - | 2326.03 | -
In this case, simple regression effects are zero or small. However, because r12 − r13 is
relatively large, so that det(R3) = 0, the trivariate model leads to a perfect correlation.
It nonetheless casts doubt on non-linear models (square or interaction term)
where one of the variables has a near zero correlation with the dependent variable
(r12 ≈ 0), such as in the growth and aid/gdp literature (Doucouliagos and Paldam
[2008]). With data mining trying a dozen interacting variables with x2, it is possible
to find an interaction variable x4 such that x3 = x4·x2 and r12 − r13 > 0.05,
which may be sufficiently large to turn a near zero effect in a bivariate regression into a
large effect in a precise trivariate regression with near-multicollinearity.
- The variable x3 − r23·x2 is exactly negatively correlated with the dependent variable
x1 − r12·x2 (the partial correlation of x3 and x1 is equal to unity in Anscombe's case
2).
- The PC-OLS leads to focus on the second principal component, which represents
the smallest part of the variance (0.6%) of the pair of explanatory variables (x2, x3),
but explains 99.4% of the variance of the dependent variable x1.
- The second term of the hierarchical orthogonalization explains most of the
variance. As r23 is close to unity, x² − r23·x is close to the difference x² − x.
- As a consequence, orthogonalization reveals an odd property: the second compo-
nent turns out to be the most significant effect in the regression.
- The power of the t-tests is very large for both variables in the trivariate regression,
whereas it is close to zero in the bivariate regressions.
However, as seen in the next section, the "ceteris paribus" interpretation is not valid
in the case of multicollinearity.
In this case, the results to be chosen depend on substantive meaning. Does the
difference between the variables, x − x², have a substantive meaning, despite representing
only 0.6% of the variance of the group of explanatory variables? Are the variations of
the variable x − x² due to outliers?
One third of the observations of x2 implies extreme predictions of x̂1 in the 5%
tails of the distribution of x1. These are unreliable "catastrophic" forecasts.
However, the fact that the Ceteris Paribus interpretation is not valid because of
near-collinearity does not mean that the regression is not relevant. Both parameters are
precisely estimated because the residuals, their sum of squares, and the root mean
square error are close to zero: this more than offsets the rise of the variance due to near-
collinearity via the VIF when computing the estimated standard errors of the estimated
parameters.
In this respect, Buse's [1994] omitted-variable-bias interpretation is correct. How-
ever, the ceteris paribus interpretation is not valid, whereas in the Gram Schmidt
orthogonalization the ceteris paribus interpretation of the coefficient of x is correct.
Going back to the original parameters is always possible, even with incomplete PC-OLS.
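Going back to the original standardized parameters is a linear change of basis from the principal-component coefficients; a sketch using the Table 5 values for regression E (the small discrepancies with regression D come from rounding):

```python
import numpy as np

r = 0.98818                  # r23 from Table 5
g1, g2 = 0.76940, 0.63877    # complete PC-OLS coefficients (regression E)

# pc1 = (x + x2)/sqrt(2(1+r)) and pc2 = (x - x2)/sqrt(2(1-r)), so the implied
# coefficients on the original standardized regressors are:
a, b = 1 / np.sqrt(2 * (1 + r)), 1 / np.sqrt(2 * (1 - r))
b_x  = g1 * a + g2 * b       # about  4.54, regression D's coefficient on x
b_x2 = g1 * a - g2 * b       # about -3.77, regression D's coefficient on x squared

# incomplete PC-OLS drops the second component (g2 = 0):
b_x_incomplete = g1 * a      # about 0.386, same implied coefficient on x and x squared
```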
such that: R²1.j > 1%). This refers to Cohen's [1988] classification of effects in his overall
evaluation of the power of tests for cross sections: r1j = 0.1 or r²1j = 1% is a small effect,
r1j = 0.3 or r²1j = 9% a medium-size effect, and r1j = 0.5 or r²1j = 25% a large effect. This
test is based on Fisher's z transform, available in many statistical software packages
computing correlation matrices. For example, using SAS software, the instruction is: "proc corr
data=database fisher (rho0=0.1 lower);". One may also apply Mayo and Spanos's
[2006] severe testing procedure, which is more restrictive.
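The screening rule can be sketched directly from the Fisher z transform (the function name is illustrative; the SAS option above implements the same computation):

```python
import math

def fisher_z_stat(r, n, rho0=0.1):
    """z statistic for testing H0: rho = rho0 against rho > rho0,
    based on Fisher's z transform with standard error 1/sqrt(n - 3)."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)

# example: r = 0.5 in a cross section of n = 50 observations
stat = fisher_z_stat(0.5, 50)   # about 3.08, well above the one-sided 5% normal quantile 1.645
```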
In a second step, one may look for stratification (Hendry and Nielsen [2008]), that
is, finding subsets of observations (defined by a dummy variable 1subset) such that
|r(x1, x2·1subset)| ≥ 0.1 for the excluded variables such that |r(x1, x2)| ≤ 0.1. This
dummy variable for the selection of observations may itself be endogenous and may
depend on variables zj, or on the variable x2 itself (quadratic model), so that interaction
terms may matter as well: |r(x1, x2·zj)| ≥ 0.1. The key point is to exclude x2 from
the general model before the general to specific selection of regressors, so that the
interaction terms x2·zj do not turn out to be significantly different from zero due to
the near-collinearity of the regression.
An applied researcher should evaluate with care whether the selected observations
driving the correlation r(x1, x2·zj) are only a few observations (outliers): in this
case, the initial robust selection of observations is likely to perform much better:
r(x1, x2·1subset) > r(x1, x2·zj). Because it may sometimes be more interesting to
find general results than to state particular cases, one may be tempted to favour regres-
sions with x2·zj because they are "general conditional effects", despite the fact that
they are not robust to the removal of a few outliers. As shown in Roodman's [2008]
example, an editor would favour the general conditional argument "aid is correlated
with growth for non-tropical countries" over "large aid/gdp with high growth rate is
found in Jordan, Egypt and Botswana".
7. Conclusion
This note presented orthogonalized regressors and constrained estimations. Present-
ing the correlation matrix including the explained variable, the VIF, and the confi-
dence interval of each estimated parameter, discussing the plausibility of the size
of the (standardized) parameters, and explaining the data generating process of the
collinearity between regressors is useful.
References
[1] Anscombe F.J. [1973]. Graphs in Statistical Analysis. American Statistician, 27,
pp. 17-21.
[2] Buse A. [1994]. Brickmaking and the collinear arts: a cautionary tale. Canadian
Journal of Economics. Revue Canadienne d’Economie. 27(2), pp.408-414.
[3] Castle J. [2008]. Econometric Model Selection: Nonlinear Techniques and Fore-
casting. Saarbrücken: VDM Verlag. ISBN: 978-3-639-00458-8.
[4] Castle J. and Hendry D. [2008]. A Low-Dimension Portmanteau Test for Non-
linearity. Working paper No. 326, Economics Department, University of Oxford.
[5] Castle J. and Hendry D. [2008]. Extending the Boundaries of Automatic Model
Selection: Non-linear Models. Working paper, Economics Department, Univer-
sity of Oxford.
[6] Cohen J. [1988]. Statistical Power Analysis for the Behavioral Sciences. 2nd edition, Hillsdale, NJ: Lawrence Erlbaum Associates.
[7] Doucouliagos H. and Paldam M. [2009]. The aid effectiveness literature: The
sad results of 40 years of research. Journal of Economic Surveys, forthcoming.
Working paper, M. Paldam’s webpage.
[9] Hendry D.F. and Nielsen B. [2007]. Econometric Modelling. A Likelihood Ap-
proach. Princeton University Press.
[11] Puntanen S. and Styan G.P.H. [2005]. Schur complements in statistics and prob-
ability. In The Schur complement and its applications, F. Zhang editor, Springer
Verlag, pp. 163-226.
[12] Roodman D. [2008]. Through the Looking Glass, and What OLS Found There:
On Growth, Foreign Aid, and Reverse Causality. Center for Global Development
Working Paper Number 137.
[14] Spanos A. [2006]. Revisiting the omitted variables argument: Substantive vs. statis-
tical adequacy. Journal of Economic Methodology, 13(2), pp. 179-218.
[15] Stanley T.D. [2005]. Beyond publication bias. Journal of Economic Surveys, 19,
pp. 309-345.
7.1. Constrained Regressions and Near-Collinearity (Burnside and Dollar)
Not for publication. An answer to near-collinearity is to constrain parameters.
β̂13 = 0, β̂′13 = 0, β̂″13 = 0   (7.1)
(1) Principal component regression: one decides to omit the principal component
axis with the lowest eigenvalue (the lowest share of variance) or with a non-significant
t-statistic. This is known as "Incomplete Principal Component Regression".
(2) With the hierarchical orthogonal regression, one decides to omit one of the vari-
ables, i.e. the residual of a regression between two near-collinear variables.
(3) One may directly omit one of the two variables.
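For standardized variables, the three constrained estimators can be written directly from the correlations; a sketch using the case 2 correlations of Table 5 (illustrative values):

```python
import numpy as np

# correlations from Table 5: r12 = corr(x1, x2), r13 = corr(x1, x3), r23 = corr(x2, x3)
r12, r13, r23 = 0.81624, 0.71801, 0.98818

# unconstrained trivariate OLS on standardized variables (near-collinear):
d = 1 - r23**2
b2_full = (r12 - r23 * r13) / d      # about  4.54
b3_full = (r13 - r23 * r12) / d      # about -3.77

# (1) incomplete PCR: keep only pc1 = (x2 + x3)/sqrt(2(1 + r23))
g1 = (r12 + r13) / np.sqrt(2 * (1 + r23))   # about 0.769
b_pcr = g1 / np.sqrt(2 * (1 + r23))         # implied coefficient on both x2 and x3

# (2) hierarchical orthogonalization: the favoured x2 keeps its simple-regression
# coefficient r12; the residual variable gets
b_resid = (r13 - r23 * r12) / np.sqrt(d)    # about -0.578

# (3) dropping x3 altogether also leaves the coefficient r12 on x2
```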
Let us mention a fourth constraint on parameters leading to unbiased parameter
estimates with a constrained parameter. In a first step, one estimates the regression
with near-collinear variables: