
Near-Collinearity and the Orthogonalization of Regressors:
Anscombe's Quartet, New Insights on Tests for Non-Linearity

Jean-Bernard Chatelain∗ and Kirsten Ralf†

March 30, 2009

Abstract
This note presents different orthogonalizations of regressors facing near-collinearity and constrained-parameter regressions.
JEL classification: C10, C12
Keywords: Collinearity, Student t Statistic.

1. Introduction
Near-collinearity (or near-multicollinearity, or collinearity) between explanatory variables is defined by high values of the multiple correlations between the explanatory variables in a multiple regression. Near-collinearity does not invalidate the assumptions of ordinary least squares as long as the collinearity is not perfect; in the perfect case, the estimator cannot be computed. Near-collinearity may lead to large estimated parameters, large estimated variances of each parameter, and low t statistics. Near-collinearity is therefore a problem for the selection of variables.
As a consequence, a general to specific specification search usually orthogonalizes the regressors in order to select the relevant variables. However, there exist different methods for the orthogonalization of regressors, leading to different t-statistics and different inferences on the selected variables. This paper deals with two of these orthogonalization methods: the Gram-Schmidt-Choleski hierarchical orthogonalization, where the order of the variables matters in the orthogonalization, and a more "egalitarian" treatment of the regressors: the standard principal component analysis. Note that Hall and Fomby [2003] and Buse [1984] mention that the parameter of the favoured regressor in the Gram-Schmidt orthogonalization method may be biased and inconsistent.

∗ Centre d'Economie de la Sorbonne (CES), Paris School of Economics, and CEPREMAP, University Paris 1 Pantheon Sorbonne. E-mail: jean-bernard.chatelain@univ-paris1.fr

† ESCE, Graduate School of International Trade, Paris, and CEPREMAP. E-mail: kirsten.ralf@esce.fr
On the other hand, Hendry and Nielsen [2007] mention that a hierarchy may be justified by substantive reasons, and propose to use it for quadratic models.
A particular case of models plagued by near-multicollinearity is multiple polynomial models, such as quadratic models with interaction terms (Hendry, Krölzig, etc.). Multiple polynomial models of order 2, 3 or 4 are a local approximation of more general non-linear models. As such, they have been used as a specification to test for non-linearity (White [1980], Castle and Hendry [2009]). Castle and Hendry [2008] propose a low-dimension collinearity-robust test for non-linearity. In an ingenious test, Castle and Hendry [2008] orthogonalize the quadratic and interaction terms using principal component analysis, which greatly reduces the dimensionality of the White [1980] test. A potential extension of their test is to first orthogonalize all the quadratic and interaction terms with respect to all the linear terms (as favoured by Hendry and Nielsen [2007]) and then proceed with these residuals in their tests.
We propose here a non-linearity test in which all the possible orthogonalizations are used, for all the possible hierarchies (orders) of the Gram-Schmidt-Choleski procedure and for the principal components: the specification which is chosen is the one with the highest t-statistic. This note presents different orthogonalizations of regressors facing near-collinearity and constrained-parameter regressions.
To evaluate the procedure, we use a famous data set with four polar cases: Anscombe's quartet. We use Anscombe's first three small samples of 11 observations, which share the same correlation coefficient between the dependent variable and the regressor, and the same correlation coefficient between the regressor and its square (a high near-multicollinearity, but not an exact one, which is the case of Anscombe's fourth sample). The first sample is related to a true linear model, the second to a true quadratic model, the third to a true linear model with an outlier, and we construct a fourth one corresponding to a true quadratic model with zero or near-zero correlation of the dependent variable with the regressor and with the square of the regressor (described as a "near-spurious regression" in a companion paper, Chatelain and Ralf [2009]). The outcomes of the procedure are as follows.
- In the first case, a pair of non-orthogonalized near-collinear regressors would be eliminated in a general to specific approach (with a very low power of the t test, because the gain in R² from adding one regressor with respect to the simple regression is very small), whereas the procedure that we propose would retain only the linear term, which is preferred to the first principal component and to the quadratic term alone.
- In the second case, a pair of non-orthogonalized near-collinear regressors is kept in a general to specific approach, as are both orthogonalized regressors in the three methods (the two Gram-Schmidt orderings and the complete PC-OLS). A regression with near-collinear variables may be very precise, because the root mean square error of the residual is very small, so that orthogonalization is not necessary.
- In the third case, the method selects the model with the square term only (a non-linear model). However, this selection procedure fails to notice that the non-linear model is chosen only because of an outlier. The true model is a linear one with an outlier. A model with the linear term and a dummy for the outlier outperforms the quadratic model by a gain of 32% of variance for the coefficient of determination.
- In the fourth case, the method selects both regressors, with a large power of the t-test (the gain in R² when adding either of the two regressors with respect to the simple regression is near unity). However, this result presents a substantive paradox. First, each of the regressors explains 0% or 2% of the variance of the dependent variable, whereas together they explain 100% of the variance of the dependent variable. Second, the second principal component, or the residual of one regressor with respect to the other (both new variables are close to the difference of the two near-collinear regressors), which accounts for 0.6% of the total variance of the two regressors, is able to explain 99.4% of the variance of the dependent variable. This challenges the commonly held view that the difference of a pair of near-collinear variables is usually not precisely estimated (e.g., Verbeek [2000], p. 40). That the selected relevant orthogonalized variable is the second axis (with the lowest variance of the regressors and, hence, a large standard error of its parameter) instead of the first one, as in the other cases (1, 2 and 3), is a particular property of near-spurious regressions. It is then a substantive issue to state whether the difference of these two variables makes sense in such a regression. This sheds light on an old debate on selecting principal components on the basis of the eigenvalues or on the basis of t-statistics (Jolliffe). In a SAS book, the insight proposed was that there were outliers leading to specific effects, a mixture of the problem identified in case 2 (with r12 large) and the near-spurious regression of case 4.
Another section investigates the "ceteris paribus" interpretation. For example, Verbeek [2000] suggests that the "ceteris paribus" interpretation is not valid in case of near-multicollinearity. This section goes further using the usual interpretation of standardized coefficients when they exceed unity. In this case, a ceteris paribus interpretation over-forecasts the extreme tails of the distribution of the dependent variable. In practice, full-effect simple correlations are reliable to evaluate the effect of near-collinear variables. The logical consequence of Verbeek's point is the following: the ceteris paribus interpretation is only exactly valid with orthogonalized regressors, or with complete simple regression effects, and not with partial correlation effects. The "ceteris paribus" interpretation of standardized coefficients larger than unity is dubious.
Finally, the paper discusses how cases 2 and 4 appear in practice in various literatures: investment and polynomial adjustment costs, and the aid and growth literature.
The paper proceeds as follows. Section 2 defines near-collinearity. Sections 3 and 4 present the orthogonalization of regressors. Sections 5 and 6 present various constrained-parameter regressions and the application to Anscombe's quartet.

2. Near-collinearity: Definition
Let us consider a regression on standardized variables (hence, there is no constant in the model). Bold letters correspond to matrices and vectors:

\[
x_1 = X_k \beta + \varepsilon
\]

where x_1 is the vector of N observations of the explained variable, X_k is the matrix whose column i corresponds to the N observations of the regressor x_i for 2 ≤ i ≤ k + 1, β is a vector of k parameters to be estimated, and ε is a vector of random disturbances following a normal distribution with mean zero and variance σ²_ε.
Let us denote R_{k+1} the block sample correlation matrix of all variables, including the explained variable in the first row and column. The matrix R_k corresponds to the correlation matrix of the regressors. One has r²_{ij} ≤ 1 for 1 ≤ i ≤ k + 1 and 1 ≤ j ≤ k + 1.
" 0
#
1 r1 1 0
Rk+1 = with Rk = X Xk = [rij ] 2≤i≤k+1 and with r1 = X0k x1 = [r1i ]2≤i≤k+1
r1 Rk N k 2≤j≤k+1

A correlation matrix is positive semi-definite. Its determinant is such that 0 ≤ det(R_{k+1}) ≤ 1 for all values of k ≥ 1. Let us denote R_{k+1}/R_k the Schur complement of R_k in the matrix R_{k+1}, defined by:

\[
R_{k+1}/R_k = 1 - r_1' R_k^{-1} r_1 = 1 - R^2_{1.23\ldots k+1} \ge 0
\]

Because r_1 is a single-column matrix, the Schur complement R_{k+1}/R_k is a scalar. It is equal to one minus the coefficient of determination of the multiple correlation R²_{1.23...k+1}, so that 0 ≤ R_{k+1}/R_k ≤ 1. A property of the Schur complement is (Puntanen and Styan [2005]):

\[
0 \le \det(R_{k+1}) = \det(R_{k+1}/R_k)\cdot\det(R_k) = \left(1 - R^2_{1.23\ldots k+1}\right)\det(R_k) \le \det(R_k) \le 1
\]
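As a numerical illustration of these quantities, the following minimal Python sketch (assuming numpy is available; the data and variable names are purely illustrative, not from the paper) computes det(R_k), the Schur complement and the implied R² for a pair of near-collinear regressors:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200
    # Two near-collinear regressors and a dependent variable (illustrative data).
    x2 = rng.standard_normal(N)
    x3 = x2 + 0.05 * rng.standard_normal(N)          # r23 close to 1
    x1 = x2 - x3 + 0.01 * rng.standard_normal(N)

    # Build the (k+1)x(k+1) correlation matrix with x1 in the first row/column.
    R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
    Rk = R[1:, 1:]                                   # correlation matrix of the regressors
    r1 = R[1:, 0]                                    # correlations of x1 with the regressors

    schur = 1.0 - r1 @ np.linalg.solve(Rk, r1)       # R_{k+1}/R_k = 1 - R^2_{1.23...k+1}
    print("det(Rk)   =", np.linalg.det(Rk))          # near zero: near-collinearity
    print("det(Rk+1) =", np.linalg.det(R))
    print("R^2       =", 1.0 - schur)
    print("check det(Rk+1) = (Rk+1/Rk)*det(Rk):",
          np.isclose(np.linalg.det(R), schur * np.linalg.det(Rk)))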

Let us define:
(1) Exact collinearity between regressors: det(R_k) = 0.
(2) Near-collinearity between regressors: 0 < det(R_k) ≤ δ < 1, where δ is relatively small, defined by a rule of thumb such as δ = 0.1.
(3) An exact multiple regression (exact collinearity between the explained variable and its regressors): det(R_{k+1}) = 0 and det(R_k) ≠ 0.
Problems related to near-collinearity:
(1) Large (or small) estimated parameters.
(2) Large (or small) estimated standard errors.
(3) Small (or large) t statistics.
(4) Small (or large) coefficient of determination.
(5) Sensitivity to the removal or the addition of observations in the sample: possibly large variations of all estimates and possible changes of sign of the parameters, while the parameters remain large in both the positive and the negative case.
(6) Poor out-of-sample forecasting properties, while the model may look good in the sample. This property is also related to some extent to "over-fitting": too many regressors increase the probability of near-collinearity.
(7) In automatic model selection, with near-collinearity between relevant and irrelevant regressors, it becomes difficult to correctly eliminate irrelevant regressors.

This can however be avoided through orthogonalization. The order of the orthogonalization will matter and results in different selected models, which could be compared using encompassing and evaluated in the substantive context (Hendry and Nielsen [2007], page 297). For example, it has been observed that automatic selection procedures such as stepwise regression may not select the "best" variables when there is near-collinearity. These inference problems are referred to as "pre-test bias", "selection bias" or "post-model selection bias" (Hendry and Nielsen [2007], page 113).
How frequent is near-collinearity?
- For time series:
(TS1) Dynamic models with lags: r(x_t, x_{t-1}) could be large.
(TS2) Time series which are linear functions of time (with a common coefficient, i.e. a common trend, or not) or non-linear functions of time (cyclical component, seasonal component): two regressors are functions of a third factor (time).
- For time series and cross sections:
(TSCS1) Models that are non-linear in the variables: polynomial models (quadratic, cubic, quartic):
- r(x, x^k) increases when the mean of the observations is far from zero.
- r(x^{2k+1}, x^{2k'+1}) > r(x^{2k+1}, x^{2k'}): the correlation of two odd (respectively two even) powers is higher than the correlation of an odd with an even power, when the mean of the sample observations is close to zero.
- Multiple polynomials with interaction terms: r(x_1, x_1 x_2) and r(x_2, x_1 x_2) are generally high correlations.
(TSCS2) Endogeneity:
- One of the regressors is endogenous and a function of the other: x_3 = β_{23} x_2 + ε_{23}.
- The regressors depend on a common (non-measured) third factor, which is not necessarily "time"; for example, indicators measuring with error more or less the same phenomenon.
Estimated frequency: near-multicollinearity is likely to occur in roughly one multiple regression out of five or ten.

3. Complete and Incomplete Principal Component Regression Model
Let P be the k × k matrix whose columns are the orthonormal eigenvectors p_i (1 ≤ i ≤ k) of the correlation matrix (1/N) X'_{Nk}X_{Nk} (we do not mention the dimension indices for k × k matrices). The matrix P is orthogonal: P^{-1} = P'. The diagonal k × k matrix of eigenvalues Λ is such that:

\[
\frac{1}{N}X_{Nk}'X_{Nk} = P\Lambda P' = \left(P\sqrt{\Lambda}\right)\left(\sqrt{\Lambda}P'\right) = P_1P_1' \quad\text{with } P_1 = P\sqrt{\Lambda}
\]

The sum of the diagonal elements (the trace) of a correlation matrix is equal to k because its diagonal elements are all equal to unity. The trace of a matrix is also equal to the sum of its eigenvalues:

\[
\operatorname{trace}(R_k) = k = \operatorname{trace}\left(P\Lambda P^{-1}\right) = \operatorname{trace}(\Lambda) = \sum_{i=1}^{k}\lambda_i
\;\Rightarrow\;
\frac{1}{k}\sum_{i=1}^{k}\lambda_i = 1
\]

Hence, the average value of the eigenvalues of a correlation matrix is equal to unity. Because a correlation matrix is positive semi-definite, all its eigenvalues are non-negative. Hence, if one eigenvalue is above unity, there necessarily exists another eigenvalue below unity.
The N × k matrix of mutually orthogonal principal components Z_{Nk} stands in the following relation to X_{Nk}:

\[
Z_{Nk} = X_{Nk}P \quad\text{and}\quad X_{Nk} = Z_{Nk}P'
\]

The covariance matrix of the mutually orthogonal principal components is:

\[
\frac{1}{N}Z_{Nk}'Z_{Nk} = \frac{1}{N}\left(X_{Nk}P\right)'\left(X_{Nk}P\right) = \Lambda_k
\]
The variance of each principal component is equal to its respective eigenvalue. The principal components related to near-zero eigenvalues have very little variance. The principal components (the column vectors of the matrix Z_k) can be standardized to unit variance by post-multiplying by the inverse of the diagonal matrix of standard deviations, namely Λ_k^{-1/2}:

\[
Z^S_{Nk} = X_{Nk}P\Lambda_k^{-1/2} = \left(\frac{z_1}{\sqrt{\lambda_1}},\ldots,\frac{z_k}{\sqrt{\lambda_k}}\right)
\quad\text{with}\quad
Z^{S\prime}_{Nk}Z^S_{Nk} = N\cdot I_k
\]
Substituting this equation into the regression equation, the relation between the dependent variable and the principal components is:

\[
x_1 = X_{Nk}\beta + \varepsilon = Z_{Nk}P'\beta + \varepsilon = Z_{Nk}\beta_{PC} + \varepsilon
\]

where β_{PC} = P'β is the k × 1 vector of population parameters corresponding to the principal components Z_k. Using the properties of orthogonal matrices, the OLS estimates of the original parameters in equation (1) can be recovered as:

\[
\widehat{\beta}_{PC} = \left(Z_{Nk}'Z_{Nk}\right)^{-1}Z_{Nk}'x_1
= \left(P'X_{Nk}'X_{Nk}P\right)^{-1}P'X_{Nk}'x_1
= P'\left(X_{Nk}'X_{Nk}\right)^{-1}PP'X_{Nk}'x_1
= P'\left(X_{Nk}'X_{Nk}\right)^{-1}X_{Nk}'x_1
= P'\widehat{\beta}
\]
\[
\widehat{\beta}_{PC} = P'\widehat{\beta} \;\Leftrightarrow\; \widehat{\beta} = P\widehat{\beta}_{PC}
\]

Hence, the estimated residuals ε̂ do not change after an orthogonalization including ALL the principal components. The residuals, the root mean square error (the estimated standard error of the residuals), the coefficient of determination R², and the predictor x̂_1 are identical in the orthogonal and the non-orthogonal regression (they also exhibit an identical likelihood, see Hendry and Nielsen [2007], p. 106). Hence, the orthogonalization of regressors matters only for inference when selecting the relevant variables (t-tests).
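This equivalence can be checked numerically. The following minimal sketch (numpy assumed, synthetic illustrative data, not the paper's data) verifies that the residuals are identical and that β̂ = Pβ̂_PC:

    import numpy as np

    rng = np.random.default_rng(1)
    N, k = 100, 3
    X = rng.standard_normal((N, k))
    X[:, 2] = X[:, 1] + 0.05 * rng.standard_normal(N)   # make two columns near-collinear
    X = (X - X.mean(0)) / X.std(0)                       # standardized regressors
    y = X @ np.array([1.0, 0.5, -0.5]) + 0.1 * rng.standard_normal(N)

    # Eigendecomposition of the correlation matrix (1/N) X'X.
    lam, P = np.linalg.eigh(X.T @ X / N)
    Z = X @ P                                            # principal components

    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_pc = np.linalg.lstsq(Z, y, rcond=None)[0]

    print(np.allclose(y - X @ beta_ols, y - Z @ beta_pc))   # same residuals
    print(np.allclose(beta_ols, P @ beta_pc))               # beta = P beta_PC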
The estimated variance of the estimated parameter for the ith principal component is the ith diagonal entry of the covariance matrix of the parameters (its square root is the estimated standard error). It represents an orthogonal decomposition of the variance of the estimates before the orthogonalization:

\[
V\!\left(\widehat{\beta}_{PC}\right) = \sigma_\varepsilon^2\left(Z_{Nk}'Z_{Nk}\right)^{-1} = \sigma_\varepsilon^2\left(P'X_{Nk}'X_{Nk}P\right)^{-1} = \sigma_\varepsilon^2\left(N\Lambda\right)^{-1}
\]

Because the variance of the principal components related to near-zero eigenvalues is very small, the parameters of these principal components are likely to be imprecisely estimated, except when the estimated standard error σ̂_ε is very small (or when det(R_{k+1}) is very small).
Conversely, for the non-orthogonal initial estimates, one has the variance orthogonal decomposition (provided, for example, by the COLLIN option of the MODEL statement in SAS PROC REG):

\[
\sigma^2\!\left(\widehat{\beta}_i\right) = \sigma_\varepsilon^2\left[\left(X_{Nk}'X_{Nk}\right)^{-1}\right]_{ii}
= \sigma_\varepsilon^2\left[\left(NP\Lambda P'\right)^{-1}\right]_{ii}
= \frac{\sigma_\varepsilon^2}{N}\sum_{j=1}^{k}\frac{p_{ij}^2}{\lambda_j}
\quad\text{with}\quad
\sum_{j=1}^{k}p_{ij}^2 = 1,\qquad 0\le p_{ij}^2\le 1
\]

The properties of orthogonal matrices imply that the p²_{ij} are bounded. The parameters will be more precisely estimated when one deletes the components with the smallest λ_j, which greatly increase the standard errors in case of near-collinearity.
The incomplete principal component regression (IPC-OLS) estimate β̂_{IPC} is formed by deleting the effects of certain components in estimating β. This amounts to replacing certain orthonormal eigenvector columns of P by zero vectors, resulting in a new matrix P*. In this case, if the first regression is the true model, the estimates will be biased and the estimated residuals are different. The gain in the reduction of the standard errors of the remaining components is not always granted, for two reasons:
- In some cases with more than two variables, p²_{ij} may be very small as well.
- The omitted variable bias on β̂_{IPC} due to the omission of some principal components may increase the residuals, which may increase the mean square error (MSE) used to estimate σ²_ε and hence bias the estimated standard errors of the parameters of the remaining principal components. Then, in some cases (see Anscombe, case 2), omitting the second principal component does not reduce the standard error. This last problem is due to the fact that the deletion decisions are made without information on the correlation between the dependent variable and the regressors. In particular, in the trivariate case, there may be a problem when r12 − r13 turns out to be relatively large (for example over 0.03) along with r23 > 0.95.
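The following hedged sketch (numpy assumed; the helper name ipc_ols is ours, not a library routine, and the synthetic data are chosen so that the discarded component actually matters) illustrates an incomplete principal component regression of the kind described here:

    import numpy as np

    def ipc_ols(X, y, keep):
        """Incomplete principal component regression: regress y on the kept PCs and
        map the estimates back to the original regressors (a sketch)."""
        N = X.shape[0]
        lam, P = np.linalg.eigh(X.T @ X / N)
        Z = X @ P[:, keep]                          # kept principal components
        g = np.linalg.lstsq(Z, y, rcond=None)[0]    # coefficients on the kept PCs
        beta_ipc = P[:, keep] @ g                   # implied coefficients on X
        resid = y - Z @ g
        return beta_ipc, resid

    rng = np.random.default_rng(2)
    N = 100
    x2 = rng.standard_normal(N)
    x3 = x2 + 0.05 * rng.standard_normal(N)
    X = np.column_stack([x2, x3]); X = (X - X.mean(0)) / X.std(0)
    y = X @ np.array([1.0, -1.0]) + 0.1 * rng.standard_normal(N)
    y = (y - y.mean()) / y.std()

    lam, P = np.linalg.eigh(X.T @ X / N)
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    # Drop the small-eigenvalue PC: here the omitted component carries most of the
    # information on y, so the residual variance increases sharply (the caveat above).
    beta_ipc, resid = ipc_ols(X, y, keep=[int(np.argmax(lam))])
    print(beta_full, beta_ipc, (resid**2).mean())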

4. Choleski and QR Factorization: Gram-Schmidt Orthogonalization of the Linear Model
Orthogonalization may be an intermediate computation of the matrix numerical analysis involved in OLS estimation. Two numerical matrix algorithms are used for solving the normal equations: the unique Choleski factorization and the "QR" factorization related to Gram-Schmidt orthogonalization.
Let L_1 be a k × k lower (or left) triangular matrix with entries u_{1,ij} (1 ≤ j ≤ i ≤ k) and diagonal elements equal to unity, u_{1,ii} = 1. The Choleski factorization of the positive definite symmetric covariance matrix (1/N) X'_{Nk}X_{Nk} is the product of the lower triangular matrix L_1, a diagonal matrix D with positive diagonal elements, and the upper triangular matrix L_1' (the transpose of L_1):

\[
\frac{1}{N}X_{Nk}'X_{Nk} = L_1DL_1' = \left(L_1\sqrt{D}\right)\left(\sqrt{D}L_1'\right) = LL'
\]

The matrix L = L_1√D is a k × k lower triangular matrix whose diagonal elements are strictly positive and equal to √d_i.
If rank(X_{Nk}) = k, then L and its transpose are invertible. The diagonal matrix D differs from the diagonal matrix of eigenvalues Λ, but both have the same determinant, equal to det(R_k). The transition matrix is lower triangular instead of being orthogonal and containing normalized eigenvectors, as in the principal component orthogonalization. The matrix D is diagonal and easy to invert. Using the rule for the inverse of a matrix product, we can then compute:

\[
\left(\frac{1}{N}X_{Nk}'X_{Nk}\right)^{-1} = \left(L_1'\right)^{-1}D^{-1}L_1^{-1} = \left(L_1^{-1}\right)'D^{-1}L_1^{-1}
\]

Remark: the inverse of a lower triangular matrix is a lower triangular matrix. However, in general, a lower triangular matrix is not orthogonal (i.e. such that L' = L^{-1}).
The N × k matrix Z^S_{Nk} of observations of the orthogonalized variables stands in the following relation to the matrix X_{Nk} of observations of the (non-orthogonal) variables (matrix numerical analysis textbooks use the notation "QR factorization" for X_{Nk} = Z^S_{Nk}√D L_1', up to the normalization of the columns by √N):

\[
Z^S_{Nk} = X_{Nk}\left(L'\right)^{-1} = X_{Nk}\left(\sqrt{D}L_1'\right)^{-1}
\;\Leftrightarrow\;
X_{Nk} = Z^S_{Nk}L' = Z^S_{Nk}\sqrt{D}L_1'
\]

Let us check that the matrix Z^S_{Nk} is orthogonal (its column vectors are mutually orthogonal), which is the case if the Gram matrix of its column vectors is diagonal (Gram-Schmidt orthogonalization procedure). If the Gram matrix of the column vectors is N times the identity matrix, the sample variance of each column vector is equal to unity: the column vectors are standardized in the sample, due to the term D^{-1/2} (Gram-Schmidt orthonormalization procedure):

\[
Z^{S\prime}_{Nk}Z^S_{Nk} = L^{-1}X_{Nk}'X_{Nk}\left(L'\right)^{-1} = NL^{-1}LL'\left(L'\right)^{-1} = N\cdot I_k
\]

Substituting this equation into the regression equation, the relation between the dependent variable and the orthogonalized regressors is:

\[
x_1 = X_{Nk}\beta + \varepsilon = Z^S_{Nk}L'\beta + \varepsilon = Z^S_{Nk}\beta_{GS} + \varepsilon
\]

where β_{GS} = L'β is the k × 1 vector of population parameters corresponding to the Gram-Schmidt orthogonalized variables Z^S_{Nk}. Using the properties of triangular matrices, the OLS estimates of the original parameters in equation (1) can be recovered as:

\[
\widehat{\beta}_{GS} = \left(Z^{S\prime}_{Nk}Z^S_{Nk}\right)^{-1}Z^{S\prime}_{Nk}x_1
= \left(L^{-1}X_{Nk}'X_{Nk}\left(L'\right)^{-1}\right)^{-1}L^{-1}X_{Nk}'x_1
= L'\left(X_{Nk}'X_{Nk}\right)^{-1}X_{Nk}'x_1
= L'\widehat{\beta}
\]
\[
\widehat{\beta}_{GS} = L'\widehat{\beta} \;\Leftrightarrow\; \widehat{\beta} = \left(L'\right)^{-1}\widehat{\beta}_{GS}
\]

Hence, the estimated residuals ε̂ do not change after an orthogonalization including ALL the orthogonalized explanatory variables. The predictions x̂_1 are not changed, nor are any statistics based on the estimated residuals, such as R², the RMSE and the Durbin-Watson statistic for time series. Only the parameter estimates and their estimated standard errors benefit from the orthogonalization.
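A minimal numerical sketch of this Choleski/Gram-Schmidt orthogonalization (numpy assumed, illustrative synthetic data) checks that β̂_GS = L'β̂ and that the residuals are unchanged:

    import numpy as np

    rng = np.random.default_rng(3)
    N, k = 100, 3
    X = rng.standard_normal((N, k))
    X[:, 2] = X[:, 1] + 0.05 * rng.standard_normal(N)
    X = (X - X.mean(0)) / X.std(0)
    y = X @ np.array([0.5, 1.0, -1.0]) + 0.1 * rng.standard_normal(N)

    # Choleski factorization of the correlation matrix (1/N) X'X = L L'.
    L = np.linalg.cholesky(X.T @ X / N)
    Zs = X @ np.linalg.inv(L.T)        # Gram-Schmidt orthonormalized regressors, Zs'Zs = N I

    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_gs = np.linalg.lstsq(Zs, y, rcond=None)[0]

    print(np.allclose(beta_gs, L.T @ beta))                 # beta_GS = L' beta
    print(np.allclose(y - X @ beta, y - Zs @ beta_gs))      # identical residuals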
Note that, since Z^{S'}_{Nk}Z^S_{Nk} = N·I_k:

\[
\widehat{\beta}_{GS} = \left(Z^{S\prime}_{Nk}Z^S_{Nk}\right)^{-1}Z^{S\prime}_{Nk}x_1 = \frac{1}{N}Z^{S\prime}_{Nk}x_1
\]

Solving the normal equations for the OLS estimate, knowing the Choleski factorization (1/N) X'_{Nk}X_{Nk} = LL', amounts to solving two triangular systems (with inverses that are easy to compute):

\[
\left(X_{Nk}'X_{Nk}\right)\widehat{\beta} = NLL'\widehat{\beta} = X_{Nk}'x_1
\]
\[
\text{solve } L\widehat{\beta}_{GS} = \frac{1}{N}X_{Nk}'x_1 \;\Rightarrow\; \widehat{\beta}_{GS} = \frac{1}{N}L^{-1}X_{Nk}'x_1 = \frac{1}{N}Z^{S\prime}_{Nk}x_1
\]
\[
\text{then solve } L'\widehat{\beta} = \widehat{\beta}_{GS} \;\Rightarrow\; \widehat{\beta} = \left(L'\right)^{-1}\widehat{\beta}_{GS}
\]

Hence, the parameters of the orthogonalized regression are an intermediate numerical computation of β̂ in most software packages. Solving the normal equations for the OLS estimate, knowing the "QR" factorization X_{Nk} = Z^S_{Nk}L', amounts to solving only the last triangular system of the Choleski method:

\[
\left(X_{Nk}'X_{Nk}\right)\widehat{\beta} = X_{Nk}'x_1
\;\Rightarrow\;
\left(Z^SL'\right)'\left(Z^SL'\right)\widehat{\beta} = \left(Z^SL'\right)'x_1
\;\Rightarrow\;
NLL'\widehat{\beta} = LZ^{S\prime}x_1
\;\Rightarrow\;
\text{solve } L'\widehat{\beta} = \frac{1}{N}Z^{S\prime}x_1 = \widehat{\beta}_{GS}
\]

The estimated variance of the estimated parameter for the ith orthogonalized regressor is the ith diagonal entry of the covariance matrix of the parameters. It represents an orthogonal decomposition of the variance of the estimates before the orthogonalization:

\[
V\!\left(\widehat{\beta}_{GS}\right) = \sigma_\varepsilon^2\left(Z^{S\prime}_{Nk}Z^S_{Nk}\right)^{-1} = \sigma_\varepsilon^2\left(NI_k\right)^{-1}
\]

On the other hand, for the non-orthogonal initial estimates, one has the variance orthogonal decomposition:

\[
\sigma^2\!\left(\widehat{\beta}_i\right) = \sigma_\varepsilon^2\left[\left(X_{Nk}'X_{Nk}\right)^{-1}\right]_{ii}
= \sigma_\varepsilon^2\left[\left(NL_1DL_1'\right)^{-1}\right]_{ii}
= \frac{\sigma_\varepsilon^2}{N}\sum_{j=1}^{k}\frac{\left(\left[L_1^{-1}\right]_{ji}\right)^2}{d_j}
\]

This is a "hierarchical" orthogonalization, in the sense that the full effect (the simple correlation coefficient) is taken into account for the first variable and only the partial effect for the second variable (the in-sample residual of the auxiliary regression). By contrast, the principal component analysis is an egalitarian division of the information provided by each variable.
This is the method proposed by Hendry and Nielsen (chapter 7, p. 107): "Orthogonalization requires a hierarchy of the regressors to determine which regressors should be orthogonalized with respect to which. In most applications, econometric models will have a richer structure than theory models. Such a framework could suggest a hierarchy." "The wage schooling relation gives an example, with the levels and squares of schooling being highly correlated. Given the context, it seems more natural to orthogonalize the square with respect to the linear term than vice versa" (Hendry and Nielsen, p. 137). A linear model is simpler than a non-linear model and may be given priority.
By contrast, Buse [1984] mentions that this hierarchical orthogonal model biases the estimate β̂_12 of the first variable (by −β̂_13 r23) and biases its estimated standard error (a biased and inconsistent estimator), because the component r23 x2 is an omitted variable in the regression. However, the full effect of the first variable is correct for a ceteris paribus interpretation of the parameter β̂_1.2.
Incomplete regressions with orthogonal regressors:
(1) Principal component regression: one decides to omit the principal component
axis with the lowest eigenvalue (the lowest share of variance) or with the lowest non
significant t-statistic. This is known as the ”Incomplete Principal Component Regres-
sion”.
(2) With the hierarchical orthogonal regression, one decides to omit one of the variables, i.e. the residual of a regression between two near-collinear variables.
(3) One may omit directly one of the two variables.
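The three options can be sketched as follows in the trivariate case (numpy assumed; the helper tstats and the synthetic data are ours, purely illustrative):

    import numpy as np

    def tstats(X, y):
        """OLS t statistics for a no-intercept regression on standardized data (sketch)."""
        N, k = X.shape
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta
        s2 = resid @ resid / (N - k)
        se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
        return beta / se

    rng = np.random.default_rng(4)
    N = 100
    x2 = rng.standard_normal(N); x3 = x2 + 0.05 * rng.standard_normal(N)
    x2, x3 = (x2 - x2.mean()) / x2.std(), (x3 - x3.mean()) / x3.std()
    y = x2 + 0.3 * x3 + 0.2 * rng.standard_normal(N); y = (y - y.mean()) / y.std()

    r23 = np.corrcoef(x2, x3)[0, 1]
    pc1 = (x2 + x3) / np.sqrt(2 * (1 + r23))          # (1) keep only the first principal component
    gs2 = (x3 - r23 * x2) / np.sqrt(1 - r23**2)       # (2) hierarchical residual of x3 on x2
    print(tstats(np.column_stack([x2, x3]), y))       # near-collinear, no orthogonalization
    print(tstats(pc1[:, None], y))                    # incomplete PC regression
    print(tstats(np.column_stack([x2, gs2]), y))      # hierarchical orthogonalization; (3) would drop gs2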

5. The Trivariate Regression
5.1. Properties of the trivariate regression
Let us now consider the trivariate case (k = 2) for standardized variables. With near-collinearity, precise regressions (relatively small standard errors and large t statistics) are easily found. In these cases, a metric able to evaluate whether coefficients are oversized is needed for authors, referees and journal editors. First, let us consider standardized parameters:

\[
\frac{x_1-\bar{x}_1}{\sigma_1} = \widehat{\beta}_{12}\,\frac{x_2-\bar{x}_2}{\sigma_2} + \widehat{\beta}_{13}\,\frac{x_3-\bar{x}_3}{\sigma_3} + \varepsilon_{1.23}
\]
The interpretation of a standardized parameter is as follows: a deviation of the regressor x2 from its mean by one standard error σ2 (that is, (x2 − x̄2)/σ2 = 1) implies a prediction x̂1 which deviates from the mean of the explained variable x̄1 by β̂^S_12 times the standard error σ1 of the explained variable. In the case of the simple regression, the standardized parameter is exactly equal to the correlation coefficient r12, which is such that |r12| ≤ 1. In a multiple regression with near-collinearity, the standardized parameter can easily exceed unity.
The determinants of the correlation matrices, the coefficient of determination, the standardized coefficients, their standard errors, their t statistics and the partial correlation coefficient are given by:

\[
0 \le \det(R_3) = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23} \le \det(R_2) = 1 - r_{23}^2 \le 1
\]
\[
R^2_{1.23} = 1 - \frac{\det(R_3)}{\det(R_2)} = \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2} = 1 - \left(1 - r_{12}^2\right)\left(1 - r_{13.2}^2\right)
\]
\[
\begin{pmatrix}\widehat{\beta}_{12}\\ \widehat{\beta}_{13}\end{pmatrix}
= \frac{1}{1-r_{23}^2}\begin{pmatrix}1 & -r_{23}\\ -r_{23} & 1\end{pmatrix}\begin{pmatrix}r_{12}\\ r_{13}\end{pmatrix}
= \frac{1}{1-r_{23}^2}\begin{pmatrix}r_{12}-r_{13}r_{23}\\ r_{13}-r_{12}r_{23}\end{pmatrix}
\]
\[
\widehat{\sigma}_\varepsilon = \frac{\sqrt{MSE}}{\sqrt{N-2}} = \frac{\sqrt{1-R^2_{1.23}}}{\sqrt{N-2}} = \frac{1}{\sqrt{N-2}}\sqrt{\frac{\det(R_3)}{\det(R_2)}}
\]
\[
\begin{pmatrix}\widehat{\sigma}(\widehat{\beta}_{1})\\ \widehat{\sigma}(\widehat{\beta}_{2})\end{pmatrix}
= \widehat{\sigma}_\varepsilon\sqrt{\operatorname{diag}\left(\frac{1}{\det(R_2)}\begin{pmatrix}1 & -r_{23}\\ -r_{23} & 1\end{pmatrix}\right)}
= \frac{\sqrt{\det(R_3)}}{\sqrt{N-2}}\,\frac{1}{1-r_{23}^2}\begin{pmatrix}1\\ 1\end{pmatrix}
\]
\[
\begin{pmatrix}t_{\widehat{\beta}_1}\\ t_{\widehat{\beta}_2}\end{pmatrix}
= \frac{\sqrt{N-2}}{\sqrt{\det(R_3)}}\begin{pmatrix}r_{12}-r_{13}r_{23}\\ r_{13}-r_{12}r_{23}\end{pmatrix}
\]
\[
-1 < r_{12.3} = \frac{r_{12}-r_{13}r_{23}}{\sqrt{\left(1-r_{13}^2\right)\left(1-r_{23}^2\right)}}
= \frac{t_{12}}{\sqrt{t_{12}^2 + N - 2}}
= \frac{\sqrt{1-r_{23}^2}}{\sqrt{1-r_{13}^2}}\,\widehat{\beta}_{12} < 1
\]

The relation between the partial correlation coefficient r_{12.3} (once the influence of the second explanatory variable x3 is removed) and the t_{12} statistic of the parameter β̂_12 is given in Greene [2000], p. 234-235.
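These closed-form expressions are easy to evaluate directly from the three correlation coefficients. A small sketch follows (numpy assumed; the correlations are those of Anscombe's first sample in Table 3 below, so the output should roughly reproduce the trivariate regression of Table 4, up to the paper's small-sample conventions):

    import numpy as np

    def trivariate_stats(r12, r13, r23, N):
        """Closed-form trivariate OLS quantities from the correlation coefficients
        (a sketch of the formulas above; standardized variables, no intercept)."""
        detR2 = 1 - r23**2
        detR3 = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
        R2_123 = 1 - detR3 / detR2
        b12 = (r12 - r13 * r23) / detR2
        b13 = (r13 - r12 * r23) / detR2
        se = np.sqrt(detR3) / (np.sqrt(N - 2) * detR2)   # same for both standardized parameters
        return dict(R2=R2_123, b12=b12, b13=b13, se=se, t12=b12 / se, t13=b13 / se)

    # Anscombe's first sample: r12, r13, r23 and N = 11.
    print(trivariate_stats(0.81642, 0.78466, 0.98818, 11))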

In the trivariate case, the three collinearity indicators, the determinant det(R2), the variance inflation factor (VIF) and the condition index (CI), depend only on the correlation coefficient r23:

\[
\lambda_{\max} = 1 + r_{23} \quad\text{and}\quad \lambda_{\min} = 1 - r_{23}
\]
\[
\det(R_2) = \lambda_{\max}\lambda_{\min} = 1 - r_{23}^2
\]
\[
VIF = \frac{1}{1-r_{23}^2} = \frac{1}{\det(R_2)} = \frac{1}{\lambda_{\max}\lambda_{\min}}
\]
\[
CI = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} = \sqrt{\frac{1+r_{23}}{1-r_{23}}}
\]

where λmax = 1 + r23 and λmin = 1 − r23 are the two eigenvalues of the correlation matrix of the regressors R2 (see the next subsection on the principal component orthogonalization of the regressors). We assume from now on that r23 ≥ 0, the alternative leading to symmetric results which are easy to derive. Hence, near-collinearity can be defined by a single rule of thumb such as r23 ≥ 0.95, so that det(R2) < 0.1, VIF > 10 or CI > 6.24.
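A small helper (numpy assumed; the function name is ours) returns the three indicators from r23; the thresholds quoted above can be checked at r23 = 0.95 and at the Anscombe value r23 = 0.98818:

    import numpy as np

    def collinearity_indicators(r23):
        """det(R2), VIF and condition index for two standardized regressors (sketch)."""
        lam_max, lam_min = 1 + r23, 1 - r23
        return dict(detR2=lam_max * lam_min,
                    VIF=1.0 / (lam_max * lam_min),
                    CI=np.sqrt(lam_max / lam_min))

    print(collinearity_indicators(0.95))      # rule-of-thumb threshold of the text
    print(collinearity_indicators(0.98818))   # Anscombe's cases 1-3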

5.2. The Standard Principal Components Orthogonalization of Regressors

The principal component analysis leads to the following orthogonalization of the regressors:

\[
R_2 = \begin{pmatrix}1 & r_{23}\\ r_{23} & 1\end{pmatrix}
= P_2\Lambda_2 P_2'
= \begin{pmatrix}\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}
\begin{pmatrix}1-r_{23} & 0\\ 0 & 1+r_{23}\end{pmatrix}
\begin{pmatrix}\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}'
\tag{5.1}
\]

with characteristic polynomial of R2: X² − 2X + 1 − r23². The matrix of normalized eigenvectors is an orthogonal matrix, so that its inverse is equal to its transpose, P2^{-1} = P2'. It corresponds here to a rotation of −45 degrees, with the following normalized eigenvectors and their respective eigenvalues (either both eigenvalues are equal to unity, or one eigenvalue is between 1 and 2 and the other one is between zero and one):
Table 1: Eigenvalues, percentage of variance of each principal component and normed eigenvectors of the correlation matrix R2 (λmin + λmax = k = 2):

                     | λmin = 1 − r23 (PC2)                 | λmax = 1 + r23 (PC1)
over/below unity     | 0 ≤ λmin = 2 − λmax = 1 − r23 ≤ 1    | 1 ≤ λmax = 1 + r23 ≤ 2 (if r23 ≥ 0)
% of variance        | λmin/(λmin + λmax) = (1 − r23)/2     | λmax/(λmin + λmax) = (1 + r23)/2
x2                   | 1/√2 = 0.70711 = cos(π/4)            | 1/√2 = 0.70711 = cos(π/4)
x3                   | −1/√2 = −0.70711 = −sin(π/4)         | 1/√2 = 0.70711 = sin(π/4)
Let us now turn to the matrix R3, taking into account the orthogonalization of the regressors according to the principal component analysis:

\[
R_3 = G_3M_3G_3'
\]
\[
\begin{pmatrix}1 & r_{12} & r_{13}\\ r_{12} & 1 & r_{23}\\ r_{13} & r_{23} & 1\end{pmatrix}
= \begin{pmatrix}1 & 0 & 0\\ 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ 0 & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}
\begin{pmatrix}1 & \frac{r_{12}-r_{13}}{\sqrt{2}} & \frac{r_{12}+r_{13}}{\sqrt{2}}\\ \frac{r_{12}-r_{13}}{\sqrt{2}} & 1-r_{23} & 0\\ \frac{r_{12}+r_{13}}{\sqrt{2}} & 0 & 1+r_{23}\end{pmatrix}
\begin{pmatrix}1 & 0 & 0\\ 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ 0 & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}'
\]

The Givens orthogonal matrix G3 corresponds to a rotation of −45 degrees of the normalized eigenvectors in the plane orthogonal to the first axis and determined by the second and third eigenvectors. One has:

\[
\det(R_3) = \det(G_3)\det(M_3)\det\!\left(G_3^{-1}\right) = \det(M_3)
= \det(R_2) - (1-r_{23})\left(\frac{r_{12}+r_{13}}{\sqrt{2}}\right)^2 - (1+r_{23})\left(\frac{r_{12}-r_{13}}{\sqrt{2}}\right)^2 < \det(R_2)
\]

One also has:

\[
\det(R_3) = \left(1-r_{12}^2\right)\left(1-r_{23}^2\right) - \left(r_{13}-r_{12}r_{23}\right)^2 = -\left(r_{13}-\overline{r}_{13}\right)\left(r_{13}-\underline{r}_{13}\right) > 0
\]
\[
\Rightarrow\;
-1 \le r_{12}r_{23} - \sqrt{\left(1-r_{12}^2\right)\left(1-r_{23}^2\right)} = \underline{r}_{13} \le r_{13} \le \overline{r}_{13} = r_{12}r_{23} + \sqrt{\left(1-r_{12}^2\right)\left(1-r_{23}^2\right)} \le 1
\tag{5.2}
\]

As demonstrated in the general case, the regression with all principal components has the same residuals as the initial regression, with the following relation between parameter estimates:

\[
\widehat{\beta}_{PC} = P'\widehat{\beta}
\quad\text{and}\quad
\widehat{\beta} = P\widehat{\beta}_{PC}
\]
\[
\begin{pmatrix}\widehat{\beta}_{1,PC2}\\ \widehat{\beta}_{1,PC1}\end{pmatrix}
= \begin{pmatrix}\frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}}\\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}
\begin{pmatrix}\widehat{\beta}_{12}\\ \widehat{\beta}_{13}\end{pmatrix}
= \begin{pmatrix}\frac{\widehat{\beta}_{12}-\widehat{\beta}_{13}}{\sqrt{2}}\\ \frac{\widehat{\beta}_{12}+\widehat{\beta}_{13}}{\sqrt{2}}\end{pmatrix}
\]

One can compute the parameters of the principal components as functions of the correlation coefficients:

\[
\widehat{\beta}_{1,PC1} = \frac{\widehat{\beta}_{12}+\widehat{\beta}_{13}}{\sqrt{2}} = \frac{r_{12}+r_{13}}{\sqrt{2}}\,\frac{1}{1+r_{23}}
\tag{5.3}
\]
\[
\widehat{\beta}_{1,PC2} = \frac{\widehat{\beta}_{12}-\widehat{\beta}_{13}}{\sqrt{2}} = \frac{r_{12}-r_{13}}{\sqrt{2}}\,\frac{1}{1-r_{23}}
\tag{5.4}
\]
Because each principal component has a variance which differs from unity, the parameters β̂_{1,PC1} of the complete principal component regression are not standardized. Let β̂^S_{1,PC1} denote the standardized parameter of the first principal component. By definition, it is related to the non-standardized parameter by:

\[
\widehat{\beta}_{PC1} = \widehat{\beta}^S_{PC1}\,\frac{\sigma_{x_1}}{\sigma_{(x_2+x_3)/\sqrt{2}}} = \widehat{\beta}^S_{PC1}\,\frac{1}{\sqrt{1+r_{23}}}
\;\Rightarrow\;
\widehat{\beta}^S_{PC1} = \frac{r_{12}+r_{13}}{\sqrt{2}}\,\frac{1}{\sqrt{1+r_{23}}}
\]
\[
\widehat{\beta}_{PC2} = \widehat{\beta}^S_{PC2}\,\frac{\sigma_{x_1}}{\sigma_{(x_2-x_3)/\sqrt{2}}} = \widehat{\beta}^S_{PC2}\,\frac{1}{\sqrt{1-r_{23}}}
\;\Rightarrow\;
\widehat{\beta}^S_{PC2} = \frac{r_{12}-r_{13}}{\sqrt{2}}\,\frac{1}{\sqrt{1-r_{23}}}
\]

The estimated variance of the estimated parameter of the ith principal component is given by:

\[
V\!\left(\widehat{\beta}_{PC}\right) = \sigma_\varepsilon^2\left(Z_k'Z_k\right)^{-1} = \sigma_\varepsilon^2\left(P'X_k'X_kP\right)^{-1} = \sigma_\varepsilon^2\left(N\Lambda_k\right)^{-1}
\]
\[
V\!\left(\widehat{\beta}^S_{PC}\right) = \sigma_\varepsilon^2\left(NI_k\right)^{-1}
\]

The Student t statistics are:

\[
t\!\left(\widehat{\beta}^S_{1,PC1}\right) = \frac{\widehat{\beta}^S_{1,PC1}}{\widehat{\sigma}_\varepsilon}
= \frac{\sqrt{N-2}\sqrt{1-r_{23}^2}}{\sqrt{\det(R_3)}}\,\frac{r_{12}+r_{13}}{\sqrt{2}\sqrt{1+r_{23}}}
\]
\[
t\!\left(\widehat{\beta}^S_{1,PC2}\right) = \frac{\widehat{\beta}^S_{1,PC2}}{\widehat{\sigma}_\varepsilon}
= \frac{\sqrt{N-2}\sqrt{1-r_{23}^2}}{\sqrt{\det(R_3)}}\,\frac{r_{12}-r_{13}}{\sqrt{2}\sqrt{1-r_{23}}}
\]
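The following sketch evaluates the standardized principal component parameters and their t statistics from the correlations (numpy assumed; with Anscombe's first sample the values should approximately reproduce the PC-OLS rows E of Table 4 below, up to the paper's small-sample conventions):

    import numpy as np

    def pc_ols_trivariate(r12, r13, r23, N):
        """Standardized PC parameters and t statistics from the correlations
        (a sketch of equations (5.3)-(5.4) and the t formulas above)."""
        detR2 = 1 - r23**2
        detR3 = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
        sigma_eps = np.sqrt(detR3 / detR2) / np.sqrt(N - 2)  # se of a standardized orthogonal parameter
        b_pc1 = (r12 + r13) / (np.sqrt(2) * np.sqrt(1 + r23))
        b_pc2 = (r12 - r13) / (np.sqrt(2) * np.sqrt(1 - r23))
        return dict(b_pc1=b_pc1, b_pc2=b_pc2,
                    t_pc1=b_pc1 / sigma_eps, t_pc2=b_pc2 / sigma_eps)

    # Anscombe's first sample: the first PC is significant, the second is not.
    print(pc_ols_trivariate(0.81642, 0.78466, 0.98818, 11))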

Let us comment on the results:
(1) For a near-collinearity criterion such that r23 > 0.95, the parameter β̂_{1,PC1} of the first principal component (which accounts for more than 97.5% of the inertia (variance) of the correlation matrix of the two explanatory variables) is much smaller in the orthogonalized regression. For the second principal component (which accounts for a small share of the variance in this case, below 0.05/2 = 2.5%), the results are ambiguous: the parameter could be large when r12 − r13 is of the same order as 1 − r23.
(2) The estimated standard error of each parameter is smaller in the orthogonalized case: without orthogonalization VIF = 1/(1 − r23²), whereas with orthogonalization VIF = 1.
(3) A conjecture is that the estimated parameters may be less sensitive to the removal of one observation in the orthogonalized case, but DF-Beta statistics are not convincing.

5.3. The Gram-Schmidt Hierarchical Orthogonalization of Regressors

The Choleski hierarchical decomposition keeping the first variable unchanged is:

\[
R_2 = \begin{pmatrix}1 & r_{23}\\ r_{23} & 1\end{pmatrix}
= L_2D_2L_2'
= \begin{pmatrix}1 & 0\\ r_{23} & 1\end{pmatrix}
\begin{pmatrix}1 & 0\\ 0 & 1-r_{23}^2\end{pmatrix}
\begin{pmatrix}1 & 0\\ r_{23} & 1\end{pmatrix}'
\tag{5.5}
\]

The matrix R3, taking into account the Gram-Schmidt orthogonalization of the regressors, can be written as:

\[
R_3 = H_3N_3H_3'
\]
\[
\begin{pmatrix}1 & r_{12} & r_{13}\\ r_{12} & 1 & r_{23}\\ r_{13} & r_{23} & 1\end{pmatrix}
= \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & r_{23} & 1\end{pmatrix}
\begin{pmatrix}1 & r_{12} & r_{13}-r_{12}r_{23}\\ r_{12} & 1 & 0\\ r_{13}-r_{12}r_{23} & 0 & 1-r_{23}^2\end{pmatrix}
\begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & r_{23} & 1\end{pmatrix}'
\]

The hierarchical orthogonalization of the regressors computes the orthogonal vectors one after the other. In the trivariate case with standardized variables, one computes the in-sample orthogonal vector as the residual of the OLS auxiliary regression between the regressors:

\[
x_3 = r_{23}x_2 + \varepsilon_{32} \;\Rightarrow\; \operatorname{cov}\left(x_3 - r_{23}x_2,\, x_2\right) = 0
\]


One has the non-standardized parameters β̂_GS:

\[
\widehat{\beta}_{GS} = L_2'\widehat{\beta}
\]
\[
\begin{pmatrix}\widehat{\beta}_{GS1}\\ \widehat{\beta}_{GS2}\end{pmatrix}
= \begin{pmatrix}1 & r_{23}\\ 0 & 1\end{pmatrix}
\begin{pmatrix}\widehat{\beta}_{12}\\ \widehat{\beta}_{13}\end{pmatrix}
= \begin{pmatrix}\widehat{\beta}_{12}+\widehat{\beta}_{13}r_{23}\\ \widehat{\beta}_{13}\end{pmatrix}
= \begin{pmatrix}r_{12}\\ \dfrac{r_{13}-r_{12}r_{23}}{1-r_{23}^2}\end{pmatrix}
\tag{5.6}
\]

with the standardized parameter of the orthogonalized residual:

\[
\widehat{\beta}^S_{GS2} = \frac{r_{13}-r_{12}r_{23}}{1-r_{23}^2}\sqrt{1-r_{23}^2} = \frac{r_{13}-r_{12}r_{23}}{\sqrt{1-r_{23}^2}}
\]

The Student t statistics are:

\[
t\!\left(\widehat{\beta}^S_{GS1}\right) = \frac{\widehat{\beta}^S_{GS1}}{\widehat{\sigma}_\varepsilon}
= \frac{\sqrt{N-2}\sqrt{1-r_{23}^2}}{\sqrt{\det(R_3)}}\,r_{12}
\]
\[
t\!\left(\widehat{\beta}^S_{GS2}\right) = \frac{\widehat{\beta}^S_{GS2}}{\widehat{\sigma}_\varepsilon}
= \frac{\sqrt{N-2}\sqrt{1-r_{23}^2}}{\sqrt{\det(R_3)}}\,\frac{r_{13}-r_{12}r_{23}}{\sqrt{1-r_{23}^2}}
\]

If the estimated parameter of ε32 is small and not significantly different from zero, then one removes this variable. The regression is then a constrained regression assuming that the parameter of the orthogonalized residual ε32 (equivalently, of x3) is equal to zero.
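A companion sketch for the hierarchical orthogonalization with x2 first (numpy assumed; with Anscombe's first sample it should approximately reproduce the rows F of Table 4 below):

    import numpy as np

    def gs_trivariate(r12, r13, r23, N):
        """Hierarchical (Gram-Schmidt) orthogonalization with x2 first: standardized
        parameters and t statistics from the correlations (sketch of equation (5.6))."""
        detR2 = 1 - r23**2
        detR3 = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
        sigma_eps = np.sqrt(detR3 / detR2) / np.sqrt(N - 2)
        b_gs1 = r12                                      # full (simple-correlation) effect of x2
        b_gs2 = (r13 - r12 * r23) / np.sqrt(detR2)       # standardized residual effect of x3
        return dict(b_gs1=b_gs1, b_gs2=b_gs2,
                    t_gs1=b_gs1 / sigma_eps, t_gs2=b_gs2 / sigma_eps)

    # Anscombe's first sample: x2 is kept, the orthogonalized quadratic residual is not significant.
    print(gs_trivariate(0.81642, 0.78466, 0.98818, 11))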

5.4. A Comparison of Critical Regions for t Statistics With or Without Orthogonalization of the Regressors

In the four trivariate regressions, the RMSE is identical:

\[
\widehat{\sigma}_\varepsilon = \frac{\sqrt{MSE}}{\sqrt{N-2}} = \frac{\sqrt{1-R^2_{1.23}}}{\sqrt{N-2}} = \frac{1}{\sqrt{N-2}}\sqrt{\frac{\det(R_3)}{\det(R_2)}}
\]
But the standardized parameters and the t statistics are different. The VIF differs from unity when there is no orthogonalization (Table 3):

\[
t\,\frac{\sqrt{MSE}}{\sqrt{N-k}} = \frac{\widehat{\beta}^S}{\sqrt{VIF}}
\]

Table 3: Standardized parameters and t statistics.

Regression: regressor(s)        | β̂^S                      | √VIF         | t·√MSE/√(N−k)             | R²
A: x                            | r12                      | 1            | r12                       | r12²
B: x²                           | r13                      | 1            | r13                       | r13²
C: (x+x²)/(√2·√(1+r23))         | (r12+r13)/(√2·√(1+r23))  | 1            | (r12+r13)/(√2·√(1+r23))   | (r12+r13)²/(2(1+r23))
D: (x−x²)/(√2·√(1−r23))         | (r12−r13)/(√2·√(1−r23))  | 1            | (r12−r13)/(√2·√(1−r23))   | (r12−r13)²/(2(1−r23))
E: (x²−r23·x)/√(1−r23²)         | (r13−r12·r23)/√(1−r23²)  | 1            | (r13−r12·r23)/√(1−r23²)   | (r13−r12·r23)²/(1−r23²)
F: (x−r23·x²)/√(1−r23²)         | (r12−r13·r23)/√(1−r23²)  | 1            | (r12−r13·r23)/√(1−r23²)   | (r12−r13·r23)²/(1−r23²)
G1: x                           | (r12−r13·r23)/(1−r23²)   | 1/√(1−r23²)  | (r12−r13·r23)/√(1−r23²)   | R²_{1.23}
G2: x²                          | (r13−r12·r23)/(1−r23²)   | 1/√(1−r23²)  | (r13−r12·r23)/√(1−r23²)   | –
H1: (x+x²)/(√2·√(1+r23))        | (r12+r13)/(√2·√(1+r23))  | 1            | (r12+r13)/(√2·√(1+r23))   | R²_{1.23}
H2: (x−x²)/(√2·√(1−r23))        | (r12−r13)/(√2·√(1−r23))  | 1            | (r12−r13)/(√2·√(1−r23))   | –
I1: x                           | r12                      | 1            | r12                       | R²_{1.23}
I2: (x²−r23·x)/√(1−r23²)        | (r13−r12·r23)/√(1−r23²)  | 1            | (r13−r12·r23)/√(1−r23²)   | –
J1: x²                          | r13                      | 1            | r13                       | R²_{1.23}
J2: (x−r23·x²)/√(1−r23²)        | (r12−r13·r23)/√(1−r23²)  | 1            | (r12−r13·r23)/√(1−r23²)   | –

In all rows, √MSE/√(N−k) equals √(1−R²)/√(N−1) for the simple regressions A to F (k = 1, with R² taken from the last column) and √(1−R²_{1.23})/√(N−2) for the trivariate regressions G (no orthogonalization), H (complete PC-OLS), I (Gram-Schmidt with x first) and J (Gram-Schmidt with x² first), where k = 2.
The discussion is with respect to:
A) The Gram-Schmidt orthogonalization (with a hierarchy such that x is the first variable) leads to rejecting the null hypothesis more frequently than the non-orthogonalized regression, when N > 100, for:

\[
t(G) = t(J) = \left|\frac{r_{12}-r_{13}r_{23}}{\sqrt{1-r_{23}^2}}\right|
< 1.96\,\frac{\sqrt{MSE}}{\sqrt{N-k}}
< t(I) = \left|r_{12}\right|
\]
\[
\Rightarrow\; r_{12}-r_{13}r_{23} < r_{12}\sqrt{1-r_{23}^2} \quad\text{when } r_{12}>0
\]
\[
\Rightarrow\; 1.96\,\frac{\sqrt{MSE}}{\sqrt{N-k}} < r_{12} < \frac{r_{23}}{1-\sqrt{1-r_{23}^2}}\, r_{13}
\quad\text{when } r_{23}\neq 0,
\quad\text{with}\quad
\frac{r_{23}}{1-\sqrt{1-r_{23}^2}}\ge 1
\;\text{and}\;
\lim_{r_{23}\to 1}\frac{r_{23}}{1-\sqrt{1-r_{23}^2}} = 1
\]

B) The complete PC-OLS leads to rejecting the null hypothesis for the first principal component more frequently than the non-orthogonalized regression when:

\[
t(G) = t(J) = \frac{r_{12}-r_{13}r_{23}}{\sqrt{1-r_{23}^2}}
< 1.96\,\frac{\sqrt{MSE}}{\sqrt{N-k}}
< t(H) = \frac{r_{12}+r_{13}}{\sqrt{2}\sqrt{1+r_{23}}}
\]
\[
\Rightarrow\; r_{12}-r_{13}r_{23} < \frac{r_{12}+r_{13}}{\sqrt{2}}\cdot\frac{\sqrt{1-r_{23}^2}}{\sqrt{1+r_{23}}}
\]
\[
\Rightarrow\; r_{12} < r_{13}\left(\frac{\sqrt{1-r_{23}^2}}{\sqrt{2}\sqrt{1+r_{23}}} + r_{23}\right)
\frac{1}{1-\dfrac{\sqrt{1-r_{23}^2}}{\sqrt{2}\sqrt{1+r_{23}}}}
\quad\text{(check sign)}
\]

C) The complete PC-OLS leads to rejecting the null hypothesis for the second principal component more frequently than the non-orthogonalized regression when:

\[
t(G) = t(J) = \frac{r_{12}-r_{13}r_{23}}{\sqrt{1-r_{23}^2}}
< 1.96\,\frac{\sqrt{MSE}}{\sqrt{N-k}}
< t(H) = \frac{r_{12}-r_{13}}{\sqrt{2}\sqrt{1-r_{23}}}
\]
\[
\Rightarrow\; r_{12}-r_{13}r_{23} < \frac{r_{12}-r_{13}}{\sqrt{2}}\cdot\frac{\sqrt{1-r_{23}^2}}{\sqrt{1-r_{23}}}
\]
\[
\Rightarrow\; r_{12} < r_{13}\left(-\frac{\sqrt{1-r_{23}^2}}{\sqrt{2}\sqrt{1-r_{23}}} + r_{23}\right)
\frac{1}{1-\dfrac{\sqrt{1-r_{23}^2}}{\sqrt{2}\sqrt{1-r_{23}}}}
\]

Hence, with near-collinearity, orthogonalized regressors make it possible to select as significant variables which would not be significantly different from zero in the non-orthogonalized regression.
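To see how these critical regions play out, the t statistics of the different parametrizations can be computed directly from the correlations. The sketch below (numpy assumed; the helper name is ours) gathers them; with Anscombe's first sample the non-orthogonalized t statistics are small while the favoured variable and the first principal component are clearly significant, approximately reproducing Table 4:

    import numpy as np

    def t_statistics(r12, r13, r23, N):
        """t statistics of the trivariate parametrizations (no orthogonalization,
        PC-OLS, Gram-Schmidt with x first), from the correlations (sketch)."""
        detR3 = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
        scale = np.sqrt(N - 2) / np.sqrt(detR3)
        return dict(
            t_x_noorth=scale * (r12 - r13 * r23),    # also the t of the residual when x² comes first
            t_x2_noorth=scale * (r13 - r12 * r23),   # also the t of the residual when x comes first
            t_pc1=scale * np.sqrt(1 - r23**2) * (r12 + r13) / (np.sqrt(2) * np.sqrt(1 + r23)),
            t_pc2=scale * np.sqrt(1 - r23**2) * (r12 - r13) / (np.sqrt(2) * np.sqrt(1 - r23)),
            t_x_gs=scale * np.sqrt(1 - r23**2) * r12,
        )

    print(t_statistics(0.81642, 0.78466, 0.98818, 11))   # Anscombe's first sample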

6. Example: Anscombe's Quartet

Anscombe's four samples lead to the following correlation coefficients between x1 and the regressors x2 and its square x2²:
Table 3: Anscombe data set correlation matrix

Case | r12(x1,x2) | r13(x1,x2²) | r23(x2,x2²) | det(R2) | det(R3)        | R²_{1.23} | R²_{1.2} = r12²
1    | 0.81642    | 0.78466     | 0.98818     | 0.0235  | 0.00734        | 0.687     | 0.666
2    | 0.81624    | 0.71801     | 0.98818     | 0.0235  | ("−3") × 10⁻⁶  | 1         | 0.666
3    | 0.81629    | 0.82742     | 0.98818     | 0.0235  | 0.00741        | 0.685     | 0.666
4    | 0.81652    | 0.81652     | 1           | 0       | 0              | –         | 0.666
In all cases, Anscombe constructed the data sets such that their sample correlation coefficients are nearly identical, r12(x1,x2) = 0.816. Hence, the four simple regressions have the same slope.
In the fourth case, the correlation r23(x2,x2²) between x2 and x2² is equal to one, because x2 has only two distinct numerical values (one corresponds to 10 observations and the other one to 1 observation). The square of a binary variable remains a binary variable. Hence, there is exact collinearity between the two regressors, and one has r12 = r13. As well, the trivariate regression cannot be computed in this case, because det(R2) = 0.
The observations of x2 are identical between cases one, two and three. Hence, the correlation coefficients r23(x2,x2²) are identical and equal to 0.98818 > 0.95: there is near-multicollinearity.
In cases 1 and 3, the gain in the coefficient of determination R²_{1.23} − R²_{1.2} ≈ 0.02 is small when adding the third variable (the quadratic term).
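These correlations are easy to reproduce. The sketch below uses the x values common to Anscombe's first three samples and the y values of the first sample, transcribed from Anscombe [1973] (numpy assumed):

    import numpy as np

    # Anscombe's first sample (x is common to samples 1-3); values from Anscombe [1973].
    x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
    y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
    x_sq = x**2

    r12 = np.corrcoef(y1, x)[0, 1]
    r13 = np.corrcoef(y1, x_sq)[0, 1]
    r23 = np.corrcoef(x, x_sq)[0, 1]
    print(round(r12, 5), round(r13, 5), round(r23, 5))   # close to 0.81642, 0.78466, 0.98818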
The principal components of the matrix of the regressors are:

Case    | r23(x2,x2²) | det(R2) | λmax = 1 + r23 | % Var | λmin = 1 − r23 | % Var
1, 2, 3 | 0.98818     | 0.0235  | 1.98818        | 99.4% | 0.01182        | 0.6%
4       | 1           | 0       | 2              | 100%  | 0              | 0%

The percentages of variance accounted for by each principal component are 99.4% and 0.6%. However, this does not necessarily mean that the second component does not matter in a regression with x1, as shown in case 2.
The variance of an OLS regression coefficient can be shown to be equal to the residual variance (the MSE) multiplied by the SUM of the variance decomposition proportions (VDP) over all eigenvalues. The most common threshold for a high VDP is VDP > 0.50, associated with a high condition index:

\[
CI = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} = \sqrt{\frac{1+r_{23}}{1-r_{23}}} = \sqrt{\frac{1+0.98818}{1-0.98818}} = 12.969 > 10
\]
The variance decomposition is:

\[
\frac{1}{1-r_{23}^2} = \frac{p_{11}^2}{\lambda_1} + \frac{p_{12}^2}{\lambda_2} = \frac{0.5}{1+r_{23}} + \frac{0.5}{1-r_{23}} \gg \frac{p_{11}^2}{\lambda_1}
\]
\[
\frac{1}{1-0.98818^2} = 0.5\,\frac{1}{1+0.98818} + 0.5\,\frac{1}{1-0.98818}
\]
\[
42.553 = 0.5\cdot 0.50297 + 0.5\cdot 84.602 = 0.25149 + 42.301 = 0.6\% + 99.4\% = VDP(\lambda_1) + VDP(\lambda_2)
\]

Let us investigate the variance of the parameters of the incomplete principal component regression. Decomposing the variance of the non-orthogonalized standard error as above shows that omitting the second principal component axis sharply decreases the standard error of the remaining first principal component. Here, it eliminates the 99.4% of the variance decomposition proportion (not to be confused with the proportion of variance of the principal components) related to the smallest eigenvalue λ2 = 1 − r23 = 0.01182.

6.1. Case 1: Linear model

Let us compare the results for case 1 (without intercept) with the following regressions:

Table 4: Anscombe's true linear model regressions:

Regression: regressor(s)     | β̂^S      | σ̂(β̂^S)  | VIF      | RMSE    | t     | Power | R²
A: x                         | 0.81642  | 0.18261  | 1        | 0.57746 | 4.47  | 0.986 | 0.6665
B: x²                        | 0.78466  | 0.19604  | 1        | 0.61992 | 4.00  | 0.961 | 0.6157
C: (x+x²)/(√2·√(1+r23))      | 0.80292  | 0.18850  | 1        | 0.59602 | 4.26  | 0.977 | 0.6447
D: x                         | 1.74560  | 1.21567  | 42.53846 | 0.58942 | 1.44  | 0.288 | 0.6873
D: x²                        | -0.94030 | 1.21567  | 42.53846 | -       | -0.77 | 0.118 | -
E: (x+x²)/(√2·√(1+r23))      | 0.80292  | 0.18639  | 1        | 0.58942 | 4.31  | -     | 0.6873
E: (x−x²)/(√2·√(1−r23))      | 0.20652  | 0.18639  | 1        | -       | 1.11  | 0.191 | -
F: x                         | 0.81642  | 0.18639  | 1        | 0.58942 | 4.38  | -     | 0.6873
F: (x²−r23·x)/√(1−r23²)      | -0.14417 | 0.18639  | 1        | -       | -0.77 | 0.118 | -
G: x²                        | 0.78466  | 0.18639  | 1        | 0.58942 | 4.21  | -     | 0.6873
G: (x−r23·x²)/√(1−r23²)      | 0.26764  | 0.18639  | 1        | -       | 1.44  | 0.288 | -
One has (with k = 1 for the simple regressions and k = 2 for the trivariate regressions), following the SAS convention:

\[
\widehat{\sigma}\!\left(\widehat{\beta}^S\right) = \frac{RMSE}{\sqrt{N-1}}\sqrt{VIF}
= RMSE\cdot\frac{6.5222}{3.1623} = RMSE\cdot 2.0625 \;\text{with near-multicollinearity}
\]
\[
\text{or } RMSE\cdot\frac{1}{3.1623} = RMSE\cdot 0.31623 \;\text{with orthogonalized variables}
\]
Regression A is the simple regression. Regression B is the incomplete principal component regression keeping the first principal component. Regression C is the trivariate regression with multicollinearity. Regression D is the complete PC-OLS regression.
- Simple regressions A, B and C are very close. Taking the sum of the two variables (regression C) leads to an averaging of regressions A and B, with a small decrease of R² with respect to the best regression A.
- Comparing regression A with regression E: the standardized parameter of the variable x is multiplied by 2.1381 (it is over one, it cannot be interpreted as a "ceteris paribus" effect), but its standard error is multiplied by 6.6572 = √VIF · RMSE(C)/RMSE(A): it is no longer significant. As the RMSE is similar in model A and model C, the increase of the estimated standard error is due to the variance inflation factor (42.5 times the variance of a model where the variables are orthogonal, which multiplies the standard error by √VIF = 6.5222). For the trivariate regression, the estimated standard errors of the standardized parameters are the same for both parameters (this is not the case for non-standardized parameters, where the standard error is multiplied by σ(x1)/σ(xi)). The parameter of x² is negative (perhaps an unexpected sign), large, and close in absolute value to the change of the parameter of x with respect to the simple regression. Its standard error is relatively large, and it is not significantly different from zero. The gain in R² is very small.
- Comparing model A with model D: the standardized coefficient is close to the simple regression, with a similar t statistic (t statistics are identical for standardized and non-standardized parameters). The second principal component is not significantly different from zero.
- Comparing model E with model A and model D: the standardized coefficient is identical to the simple regression. However, its estimated standard error is identical to the one of the PC-OLS. The standardized orthogonal residual of the square of x with respect to x has a negative and relatively small standardized coefficient and a very small t-statistic: it is not significantly different from zero.
A general to specific specification search based on t-tests without orthogonalization, starting from regression C, would lead to rejecting both explanatory variables at the next step. By contrast, an orthogonalized regression (D or E) would lead to keeping the first principal component and eliminating the second principal component.
Note that the reverse error could also occur (cf. Roodman [2008], with the example of aid/GDP and aid/GDP × tropics): the t-statistics could lead to keeping both explanatory variables, whereas in an orthogonalized regression only one of them, or none of them, is accepted.
POWER: the problem of near-multicollinearity with respect to the power of t tests is the following:
- In regression D, both t-tests have low power, because R²(full model) − R²(restricted model) is small for both variables.
- By contrast, in the hierarchical orthogonalized regressions, the t-test of the favoured variable has high power because R²(full model) − R²(restricted model) is large. The residual variable has the same power as in the non-orthogonalized regression D.
- With the complete PC-OLS regression, the power of the t-test of the second principal component lies between the powers of the residual variables in the two hierarchical models.

6.2. Case 2: Non-Linear Quadratic Model

Table 5: Anscombe's true quadratic model regressions:

Regression: regressor(s)     | β̂^S      | σ̂(β̂^S)     | VIF      | RMSE       | t        | Power | R²
A: x                         | 0.81624  | 0.18269    | 1        | 0.57772    | 4.47     | 0.986 | 0.6662
B: x²                        | 0.71801  | 0.22011    | 1        | 0.69604    | 3.26     | 0.860 | 0.5155
C: (x+x²)/(√2·√(1+r23))      | 0.76940  | 0.20200    | 1        | 0.63877    | 3.81     | 0.943 | 0.5920
D: x                         | 4.53964  | 0.00160    | 42.53846 | 0.00077613 | 2835.93  | >.999 | 1
D: x²                        | -3.76796 | 0.00160    | 42.53846 | -          | -2353.90 | >.999 | -
E: (x+x²)/(√2·√(1+r23))      | 0.76940  | 0.00024543 | 1        | 0.00077613 | 3134.85  | -     | 1
E: (x−x²)/(√2·√(1−r23))      | 0.63877  | 0.00024543 | 1        | -          | 2602.60  | >.999 | -
F: x                         | 0.81624  | 0.00024543 | 1        | 0.00077613 | 3325.68  | -     | 1
F: (x²−r23·x)/√(1−r23²)      | -0.57772 | 0.00024543 | 1        | -          | -2353.9  | >.999 | -
G: x²                        | 0.71801  | 0.00024543 | 1        | 0.00077613 | 2925.46  | -     | 1
G: (x−r23·x²)/√(1−r23²)      | 0.69603  | 0.00024543 | 1        | -          | 2835.93  | >.999 | -
- Simple regressions A, B and C are very close. Taking the sum of the two variables (regression C) leads to an averaging of regressions A and B, with a small decrease of R² with respect to the best regression A.
- Comparing regression A with regression C: the standardized parameter of the variable x is multiplied by 5.5616 (it is over one, it cannot be interpreted as a "ceteris paribus" effect). Its standard error is multiplied by √VIF · RMSE(C)/RMSE(A), which is here far below one: the parameter remains highly significantly different from zero. As the RMSE is close to zero in model C, the estimated standard error of this standardized parameter is close to zero despite the high variance inflation factor (42.5 times the variance of a model where the variables are orthogonal, which multiplies the standard error by √VIF = 6.5222). The parameter of x² is negative, large, and close in absolute value to the change of the parameter of x with respect to the simple regression. Its standard error is identical to the one of the other standardized parameter. It is highly significantly different from zero. The gain in the coefficient of determination R²_{1.23} − R²_{1.2} ≈ 0.33 is large when adding the variable x² to the model, whereas it does not matter in cases 1 and 3. In case 2, the model corresponds to a nearly exact regression, det(R3) ≈ 0, with negligible residuals, a negligible root mean square error and a coefficient of determination equal to unity.
- Comparing model A with model D: the standardized coefficient is close to the simple regression, with a similar t statistic (t statistics are identical for standardized and non-standardized parameters). The second principal component has a relatively high standardized parameter and it is significantly different from zero.
- Comparing model E with model A and model D: the standardized coefficient is identical to the simple regression. However, its estimated standard error is identical to the one of the PC-OLS and very small, so its t-statistic is very large. The standardized orthogonal residual of the square of x with respect to x has a negative and relatively small standardized coefficient, but it is significantly different from zero.
A general to specific specification search based on t-tests without orthogonalization, starting from regression C, would lead to rightly accepting both explanatory variables, as would a complete PC-OLS regression.
Note that including the second principal component is easier to obtain than including the Gram-Schmidt residual variable, because the information content between the two variables is less asymmetric.

6.3. Case 3: Linear model with one outlier

Let us compare the results for individual 2 (to be computed):

Table 6: Anscombe "true outlier model" regressions:

Regression: regressor(s)     | β̂^S      | σ̂(β̂^S)     | VIF      | RMSE    | t       | Power | R²
A: x                         | 0.81629  | 0.18267    | 1        | 0.57765 | 4.47    | 0.986 | 0.6663
B: x²                        | 0.82742  | 0.17759    | 1        | 0.56159 | 4.66    | 0.991 | 0.6846
C: (x+x²)/(√2·√(1+r23))      | 0.82429  | 0.17904    | 1        | 0.56617 | 4.60    | 0.990 | 0.6795
D: (x−x²)/(√2·√(1−r23))      | -0.07237 | 0.31540    | 1        | 0.99738 | -0.23   | 0.055 | 0.0052
E: x                         | -0.05722 | 1.22078    | 42.53846 | 0.59190 | -0.05   | 0.050 | 0.6847
E: x²                        | 0.88396  | 1.22078    | 42.53846 | -       | 0.72    | 0.109 | -
F: (x+x²)/(√2·√(1+r23))      | 0.82429  | 0.18717    | 1        | 0.59190 | 4.40    | 0.988 | 0.6847
F: (x−x²)/(√2·√(1−r23))      | -0.07237 | 0.18717    | 1        | -       | -0.39   | 0.066 | -
G: x                         | 0.81629  | 0.18717    | 1        | 0.59190 | 4.36    | -     | 0.6847
G: (x²−r23·x)/√(1−r23²)      | 0.13553  | 0.18717    | 1        | -       | 0.72    | 0.109 | -
H: x²                        | 0.82742  | 0.18717    | 1        | 0.59190 | 4.42    | -     | 0.6847
H: (x−r23·x²)/√(1−r23²)      | -0.00877 | 0.18717    | 1        | -       | -0.05   | 0.050 | -
I: x                         | 0.81629  | 0.00045251 | 1        | 0.00143 | 1803.92 | -     | 1
I: ε̂(x, dummy obs10)        | 0.57765  | 0.00045251 | 1        | -       | 1276.55 | >.999 | -
2
In this case, a non-linear model with x² is favored, whereas the true model is an outlier-observation model. This is often the case: a non-linear model fits a general narrative. In the quadratic model, "too much of something kills something". In the interaction-term model, a conditional effect (aid works only with good policy, or aid works outside the tropics) looks like a general "conditional" result. A general (conditional) statement is more appealing for publication than a particular statement such as "Jordan and Egypt are special cases". Hence, a publication bias is more likely to happen with non-linear polynomial or interaction models, possibly favored by near-collinearity, than with the investigation of the robustness of the models to outliers.

6.4. Case 4: Quadratic model with near-zero bivariate effects and large and precise trivariate effects

The dependent variable is now the standardized residual x1 − r12 x2 from case 2. This residual has a zero sample correlation with the variable x2.

Table 7: Quadratic model with near-zero bivariate effects and large and precise trivariate effects:

Regression: regressor(s)     | β̂^S      | σ̂(β̂^S)     | VIF      | RMSE    | t       | R²
A: x                         | 0        | 0.31623    | 1        | 1.00000 | 0.00    | 0
B: x²                        | -0.15332 | 0.31249    | 1        | 0.98818 | -0.49   | 0.0235
C: (x+x²)/(√2·√(1+r23))      | -0.07689 | 0.31529    | 1        | 0.99704 | -0.24   | 0.0059
D: (x−x²)/(√2·√(1−r23))      | 0.99704  | 0.02432    | 1        | 0.07690 | 41.00   | 0.9941
E: (x²−r23·x)/√(1−r23²)      | -1       | 0.00040303 | 1        | 0.00127 | -2481.2 | 1
F: (x−r23·x²)/√(1−r23²)      | 0.98818  | 0.04849    | 1        | 0.15333 | 20.38   | 0.9765
G1: x                        | 6.44503  | 0.00277    | 42.53846 | 0.00134 | 2326.03 | 1
G2: x²                       | -6.52215 | 0.00277    | 42.53846 | -       | -2353.9 | -
H1: (x+x²)/(√2·√(1+r23))     | -0.07689 | 0.00042483 | 1        | 0.00134 | -180.99 | 1
H2: (x−x²)/(√2·√(1−r23))     | 0.99704  | 0.00042483 | 1        | -       | 2346.89 | -
I1: x                        | 0        | 0.00042483 | 1        | 0.00134 | 0.00    | 1
I2: (x²−r23·x)/√(1−r23²)     | -1       | 0.00042483 | 1        | -       | -2353.9 | -
J1: x²                       | -0.15332 | 0.00042483 | 1        | 0.00134 | -360.90 | 1
J2: (x−r23·x²)/√(1−r23²)     | 0.98818  | 0.00042483 | 1        | -       | 2326.03 | -
In this case, the simple regression effects are zero or small. However, because r12 − r13 is relatively large while det(R3) ≈ 0, the trivariate model leads to a perfect correlation. This nonetheless casts doubt on some non-linear models (square or interaction term) where one of the variables has a near-zero correlation with the dependent variable (r12 ≈ 0), such as in the growth and aid/GDP literature (Doucouliagos and Paldam [2008]). With data mining, trying a dozen interacting variables with x2, it is possible to find an interaction variable x4 such that x3 = x4·x2 and r12 − r13 > 0.05, which may be sufficiently large to turn a near-zero effect in the bivariate regression into a large trivariate effect in a precise trivariate regression with near-multicollinearity.
- The variable x3 − r23x2 is exactly negatively correlated with the dependent variable x1 − r12x2 (the partial correlation of x3 and x1 is equal to unity in Anscombe's case 2).
- The PC-OLS leads to focus on the second principal component, which represents the smallest part of the variance (0.6%) of the pair of explanatory variables (x2, x3), but explains 99.4% of the variance of the dependent variable x1.
- The second term of the hierarchical orthogonalization explains most of the variance. As r23 is close to unity, x² − r23x is close to the difference x² − x.
- As a consequence, the orthogonalization reveals an odd property: the second component turns out to be the most significant effect in the regression.
- The power of the t-test is very large for both variables in the trivariate regression, whereas it is close to zero in the bivariate regressions.
However, as seen in the next section, the "ceteris paribus" interpretation is not valid in case of multicollinearity.
In this case, the results to be chosen depend on substantive meaning. Does the difference between the variables, x − x², have a substantive meaning, despite the fact that it represents only 0.6% of the variance of the group of explanatory variables? Are the variations of the variable x − x² due to outliers?

6.5. The "Ceteris Paribus" interpretation of parameters is not correct for precise regressions with near-multicollinearity

Verbeek [2000] states: "xxx".
In case 2 and case 4, the standardized parameters are over unity and very large. When one omits the variable x2², there is a very large omitted variable bias on the estimated parameter β12:

\[
\widehat{\beta}_{12} - r_{12} = 4.5411 - 0.81625 = 3.7249
\]
\[
\frac{r_{12}-r_{13}r_{23}}{1-r_{23}^2} - r_{12} = -\frac{r_{13}-r_{12}r_{23}}{1-r_{23}^2}\,r_{23} = -\widehat{\beta}_{13}r_{23} = -(-3.7694)\cdot 0.98818
\]
The "ceteris paribus" ("all other things being equal", that is, "the other regressor being held fixed") sensitivity of the explained variable with respect to a SINGLE regressor is not valid in this multiple regression with near-collinearity. When the first regressor moves one standard error away from its mean, to x̄2 + σx2, the second regressor deviates from its mean by nearly one standard error, to x̄3 + 0.98818σx3. Then the deviation of the explained variable due to a shock of x̄2 + σx2 boils down to the simple regression effect, with a more reliable predicted value of x̂1:

\[
\widehat{x}_1 = \bar{x}_1 + \left(4.5411 - 3.7694\cdot 0.98818\right)\sigma_{x_1} = \bar{x}_1 + 0.81625\,\sigma_{x_1}
\]

Strictly speaking, the "ceteris paribus" interpretation is only exactly valid when the regressors are orthogonal.
Indeed, here the second regressor is explicitly the square of the first regressor, so that nobody falls into the error of the "ceteris paribus" interpretation. However, when the regressors are not explicitly written as functions of each other while facing near-collinearity, many applied researchers still interpret their parameters with the "ceteris paribus" assumption (all other factors being unchanged). In particular, standardized parameters over 2 should NEVER be interpreted as a "ceteris paribus" effect. If x1 and x2 are normally distributed:

\[
\left]-\infty,\bar{x}_2-\sigma_2\right[\;\cup\;\left]\bar{x}_2+\sigma_2,+\infty\right[
\;\longrightarrow\;
\left]-\infty,\bar{x}_1-2\sigma_1\right[\;\cup\;\left]\bar{x}_1+2\sigma_1,+\infty\right[
\]

(33% of the observations of x2 → predictions of x1 in the 5% extreme tails)

One third of the observations of x2 implies extreme predictions of x̂1 in the 5% tails of the distribution of x1. These are unreliable "catastrophic" forecasts.
However, the fact that, because of near-collinearity, the ceteris paribus interpretation is not valid does not mean that the regression is not relevant. Both parameters are precisely estimated because the residuals, their sum of squares and the root mean square error are close to zero: this more than offsets the rise of the variance due to near-collinearity via the VIF when computing the estimated standard errors of the estimated parameters.
In this respect, Buse's [1984] omitted variable bias interpretation is correct. However, the ceteris paribus interpretation is not valid, whereas in the Gram-Schmidt orthogonalization, the ceteris paribus interpretation of the coefficient of x is correct. Going back to the original parameters is always possible, even with the incomplete PC-OLS.
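A small numerical check of the omitted-variable-bias identity discussed above, using the case 2 correlations of Table 3 (plain Python; values transcribed from the paper's table):

    # Case 2 correlations from Table 3.
    r12, r13, r23 = 0.81624, 0.71801, 0.98818
    b12 = (r12 - r13 * r23) / (1 - r23**2)
    b13 = (r13 - r12 * r23) / (1 - r23**2)
    print(b12, b13)                 # large standardized parameters of opposite signs
    print(b12 - r12, -b13 * r23)    # omitted variable bias equals -b13 * r23
    print(b12 + b13 * r23)          # a one-sigma shock of x2 collapses to the simple effect r12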

6.6. “Near Spurious” regressions in practice


When revisiting the omitted variables argument, Spanos [2006] considered several
cases with ”spurious” omitted variable bias. In the trivariate case, when the null
hypothesis of a zero correlation coefficient between the dependent variable x1 and the
regressor x2 (H0 : r12 = 0) is accepted, one can infer (with the addition of a severe
testing procedure, see Spanos [2006]) that the relationship between x1 and x2 in a
simple regression model is spurious. When the textbook test of the null hypothesis
of a zero omitted variable bias,

\beta_{12} - r_{12} = -\frac{r_{23}\, r_{13}}{1 - r_{23}^{2}} = 0,

is rejected, this rejection can be misleading,
because “the difference β12 − r12 should not be interpreted as indicating the presence of
confounding, but the result of a spurious relationship between the dependent variable
x1 and the regressor x2 ” (Spanos [2006]).
Conversely, when the null hypothesis of a zero correlation coefficient between the
dependent variable x1 and the other regressor x3 (H0 : r13 = 0) is accepted, one can
infer (with the addition of a severe testing procedure) that the relationship between x1
and x3 in a simple regression model is spurious. When the null hypotheses r12 = 0 and
r23 = 0 are rejected, and the textbook test of the null hypothesis of a zero omitted variable
bias,

\beta_{12} - r_{12} = \frac{r_{12}}{1 - r_{23}^{2}} - r_{12} = r_{12} \cdot \frac{r_{23}^{2}}{1 - r_{23}^{2}} = 0,

is also rejected, this rejection can be misleading,
because “the difference β12 − r12 should not be interpreted as indicating the presence of
confounding. In reality, x3 has no role to play in establishing the ”true” relationship
between the dependent variable x1 and the regressor x2 , since there is no relationship
between the dependent variable x1 and the other regressor x3 ” (Spanos [2006]).
In practice, in order to avoid ”near spurious” regressions, an applied researcher
could remove, at the beginning of the multiple regression specification search (in
particular when it is a general-to-specific selection of regressors), all regressors for
which the composite null hypothesis H0 : r1j < 0.1 (for positive sample correlation
coefficients) or H0 : r1j > −0.1 (for negative sample correlation coefficients) is not
rejected. The thresholds |0.1| imply that the true correlation coefficient should explain
at least 1% of the variance of the dependent variable in a simple regression (the
coefficient of determination is such that R_{1.j}^2 > 1%). This refers to Cohen’s [1988]
classification of effect sizes in his overall evaluation of the power of tests for cross
sections: r1j = 0.1 (r_{1j}^2 = 1%) is a small effect, r1j = 0.3 (r_{1j}^2 = 9%) a medium-size
effect, and r1j = 0.5 (r_{1j}^2 = 25%) a large effect. This test is based on Fisher’s Z
transform, available in many statistical packages that compute correlation matrices.
For example, using SAS software, the instruction is: “proc corr data=database fisher
(rho0=0.1 lower);”. One may also apply Mayo and Spanos’s [2006] severe testing
procedure, which is more restrictive.
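For readers not using SAS, the same screening rule can be sketched directly with Fisher’s Z transform; the threshold 0.1 follows the text, but the Python function below is only an illustrative sketch, not the procedure implemented by the authors.

import math

def keep_regressor(r1j, n):
    """One-sided test of H0: |rho| <= 0.1 against H1: |rho| > 0.1 at the 5% level,
    using Fisher's Z transform; the regressor is kept only if H0 is rejected."""
    z_diff = math.atanh(abs(r1j)) - math.atanh(0.1)   # difference of Fisher's Z values
    stat = z_diff * math.sqrt(n - 3)                  # approximately N(0, 1) under H0
    return stat > 1.645                               # one-sided 5% normal critical value

print(keep_regressor(0.20, 400))   # True: 0.20 is significantly above the 0.1 threshold
print(keep_regressor(0.12, 400))   # False: too close to the threshold, the regressor is dropped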
In a second step, the researcher may look for stratification (Hendry and Nielsen [2008]),
that is, for subsets of observations (defined by a dummy variable 1_{subset}) such that
|r (x1 , x2 · 1_{subset})| ≥ 0.1 for the excluded variables with |r (x1 , x2 )| ≤ 0.1, as in the
sketch below. This dummy variable selecting observations may itself be endogenous and may
depend on variables zj , or on the variable x2 itself (quadratic model), so that interaction
terms with |r (x1 , x2 · zj )| ≥ 0.1 may matter as well. The key point is to exclude x2 from
the general model before the general-to-specific selection of regressors, so that the
interaction terms x2 · zj do not turn out to be significantly different from zero merely
because of the near-collinearity of the regression.
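The following sketch illustrates this screening on simulated data: x2 alone is nearly uncorrelated with x1, while the stratified term x2 · 1_subset and the interaction term x2 · z pass the 0.1 threshold. All variable names and the data-generating process are placeholders chosen for the illustration, not the paper’s data.

import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                      # candidate conditioning variable
subset = (z > 0).astype(float)              # dummy variable defining a stratum of observations
x2 = rng.normal(size=n)
# x2 raises x1 inside the stratum and lowers it outside, so its overall correlation with x1 is near zero.
x1 = 0.6 * x2 * (2.0 * subset - 1.0) + rng.normal(size=n)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(abs(corr(x1, x2)))            # near zero: x2 alone would be dropped by the |r| >= 0.1 rule
print(abs(corr(x1, x2 * subset)))   # well above 0.1: the stratified term survives the screening
print(abs(corr(x1, x2 * z)))        # the interaction term also passes the threshold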
An applied researcher should evaluate with care, for instance with the leave-one-out
check sketched below, whether the correlation r (x1 , x2 · zj ) is driven by only a few
observations (outliers): in this case, the initial robust selection of observations is likely
to perform much better, with r (x1 , x2 · 1_{subset}) > r (x1 , x2 · zj ). Because it is sometimes
more interesting to state general results than particular cases, one may be tempted to
favour regressions with x2 · zj because they yield a ”general conditional effect”, despite
the fact that they are not robust to the removal of a few outliers. As shown by Roodman’s
[2008] example, an editor would favour a general conditional argument such as ”aid is
correlated with growth for non-tropical countries” over ”large aid/GDP together with a
high growth rate is found in Jordan, Egypt and Botswana”.
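A simple way to perform this check is to recompute the correlation after deleting each observation in turn; a correlation driven by one or two outliers collapses when they are removed. The sketch below is a generic illustration with simulated data, not a procedure taken from the paper.

import numpy as np

def leave_one_out_correlations(a, b):
    """Correlation of a and b recomputed after deleting each observation in turn."""
    n = len(a)
    out = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        out[i] = np.corrcoef(a[mask], b[mask])[0, 1]
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=50)
b = rng.normal(size=50)
a[0], b[0] = 8.0, 8.0                  # a single outlier manufactures a sizeable correlation
full = np.corrcoef(a, b)[0, 1]
loo = leave_one_out_correlations(a, b)
print(full)                            # sizeable full-sample correlation
print(loo.min(), loo.max())            # the minimum (outlier deleted) falls back towards zero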

7. Conclusion
This note presented orthogonalized regressors and constrained estimations. Present-
ing the correlation matrix including the explained variable, the VIF and the confi-
dence interval of each estimated parameter, discussing the plausibility of the size
of the (standardized) parameters, and explaining the data-generating process of the
collinearity between regressors is useful.
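As a hedged illustration of such a reporting practice, the sketch below computes OLS coefficients, approximate 95% confidence intervals, and the VIF of each regressor (the diagonal of the inverse correlation matrix of the regressors) on simulated near-collinear data; names and data are placeholders, not the paper’s.

import numpy as np

def regression_report(y, X):
    """OLS with an intercept: coefficients, approximate 95% confidence intervals, VIFs."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - k - 1)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Z.T @ Z)))
    ci = np.column_stack([beta - 1.96 * se, beta + 1.96 * se])
    # VIF_j is the j-th diagonal element of the inverse correlation matrix of the regressors.
    vif = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
    return beta, ci, vif

rng = np.random.default_rng(2)
x = rng.uniform(4.0, 14.0, size=200)
X = np.column_stack([x, x**2])                       # near-collinear quadratic pair
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)
beta, ci, vif = regression_report(y, X)
print(beta)   # intercept and the two slopes
print(ci)     # approximate 95% confidence intervals
print(vif)    # large values flag near-collinearity between x and its square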

References
[1] Anscombe F.J. [1973]. Graphs in Statistical Analysis. American Statistician, 27,
pp. 17-21.

[2] Buse A. [1994]. Brickmaking and the collinear arts: a cautionary tale. Canadian
Journal of Economics (Revue Canadienne d’Economie), 27(2), pp. 408-414.

[3] Castle J. [2008]. Econometric Model Selection: Nonlinear Techniques and Fore-
casting. Saarbrücken: VDM Verlag. ISBN: 978-3-639-00458-8.

[4] Castle J. and Hendry D. [2008]. A Low-Dimension Portmanteau Test for Non-
linearity. Working paper No. 326, Economics Department, University of Oxford.

[5] Castle J. and Hendry D. [2008]. Extending the Boundaries of Automatic Model
Selection: Non-linear Models. Working paper, Economics Department, Univer-
sity of Oxford.

[6] Cohen J. [1988]. Statistical Power Analysis for the Behavioral Sciences.

[7] Doucouliagos H. and Paldam M. [2009]. The aid effectiveness literature: The
sad results of 40 years of research. Journal of Economic Surveys, forthcoming.
Working paper, M. Paldam’s webpage.

[8] Greene W.H. [2000]. Econometric analysis. Fourth edition.

[9] Hendry D.F. and Nielsen B. [2007]. Econometric Modelling. A Likelihood Ap-
proach. Princeton University Press.

[10] Massy W.F. [1965]. Principal Components Regression in Exploratory Statistical
Research. Journal of the American Statistical Association, 60(309), pp. 234-256.

[11] Puntanen S. and Styan G.P.H. [2005]. Schur complements in statistics and prob-
ability. In The Schur complement and its applications, F. Zhang editor, Springer
Verlag, pp. 163-226.

[12] Roodman D. [2008]. Through the Looking Glass, and What OLS Found There:
On Growth, Foreign Aid, and Reverse Causality. Center for Global Development
Working Paper Number 137.

[13] Spanos A. and McGuirk A. [2002]. The problem of near-multicollinearity revis-
ited: erratic vs systematic volatility. Journal of Econometrics, 108, pp. 365-393.

[14] Spanos A. [2006]. Revisiting the omitted variables argument: Substantive vs. statis-
tical adequacy. Journal of Economic Methodology, 13(2), pp. 179-218.

[15] Stanley T.D. [2005]. Beyond publication bias. Journal of Economic Surveys, 19,
pp. 309-345.

[16] Paper on selecting variables versus t statistics. SAS

7.1. Constrained Regressions and Near-Collinearity (Burnside and Dollar)
Not for publication. An answer to near-collinearity is to constrain parameters.
\hat{\beta}_{13} = 0, \qquad \hat{\beta}'_{13} = 0, \qquad \hat{\beta}''_{13} = 0 \qquad (7.1)
(1) Principal component regression: one decides to omit the principal component
axis with the lowest eigenvalue (the lowest share of variance) or with the lowest, non-
significant t-statistic. This is known as ”incomplete principal component regression”
(a minimal sketch follows this list).
(2) With the hierarchical orthogonal regression, one decides to omit one of the vari-
ables, namely the residual of a regression between two near-collinear variables.
(3) One may omit directly one of the two variables.
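A minimal sketch of option (1), incomplete principal component regression on standardized variables: the component with the smallest eigenvalue is dropped and the constrained coefficients are mapped back to the standardized regressors. This is a generic textbook construction with placeholder data, not the authors’ code.

import numpy as np

def incomplete_pcr(y, X, n_drop=1):
    """Regress standardized y on the principal components of standardized X,
    drop the n_drop components with the smallest eigenvalues, and map the
    constrained coefficients back to the standardized regressors."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Xs, rowvar=False))
    keep = np.argsort(eigval)[::-1][: X.shape[1] - n_drop]   # components sorted by decreasing variance
    pc = Xs @ eigvec[:, keep]                                # retained principal components
    gamma = np.linalg.lstsq(pc, ys, rcond=None)[0]
    return eigvec[:, keep] @ gamma                           # constrained standardized coefficients

rng = np.random.default_rng(3)
x = rng.uniform(4.0, 14.0, size=100)
X = np.column_stack([x, x**2])                               # near-collinear quadratic pair
y = 3.0 + 0.5 * x + 0.05 * x**2 + rng.normal(size=100)
print(incomplete_pcr(y, X))                                  # both constrained coefficients, nearly equal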
Let us mention a fourth constraint on parameters leading to unbiased parameter
estimations with a constrained parameter. In a first step, one estimates the regression
with the near-collinear variables:

x_1 = \hat{\beta}_{1.2}\, x_2 + \hat{\beta}_{1.3}\, x_3 + \varepsilon_{1.23} \qquad (7.2)

which can be rewritten as:

x_1 = \hat{\beta}_{1.2} \left( x_2 + \frac{\hat{\beta}_{1.3}}{\hat{\beta}_{1.2}}\, x_3 \right) + \varepsilon_{1.23}

z = x_2 + \frac{\hat{\beta}_{1.3}}{\hat{\beta}_{1.2}}\, x_3 = x_2 + \frac{r_{13} - r_{12} r_{23}}{r_{12} - r_{13} r_{23}}\, x_3 \qquad (7.3)

In a second step, one defines the variable z using the weights 1 and \hat{\beta}_{1.3} / \hat{\beta}_{1.2}. Let us
remark that with near-collinearity |\hat{\beta}_{1.3} / \hat{\beta}_{1.2}| \approx 1. As a consequence, z is close to the sum
of the two near-collinear variables, and close to being proportional to the first principal
component axis.
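A sketch of this fourth, two-step procedure on simulated data: the weight of x3 inside z is taken from the first-step unconstrained OLS, and the second step regresses x1 on the single constrained regressor z. The data and the presence of an intercept are assumptions of the illustration, not the paper’s setup.

import numpy as np

rng = np.random.default_rng(4)
n = 100
x2 = rng.uniform(4.0, 14.0, size=n)
x3 = x2**2                                       # two near-collinear regressors
x1 = 3.0 + 0.5 * x2 + 0.05 * x3 + rng.normal(size=n)

# First step: unconstrained OLS of x1 on (x2, x3).
Z = np.column_stack([np.ones(n), x2, x3])
b = np.linalg.lstsq(Z, x1, rcond=None)[0]
b12, b13 = b[1], b[2]

# Second step: regress x1 on the single constrained regressor z = x2 + (b13 / b12) * x3.
z = x2 + (b13 / b12) * x3
bc = np.linalg.lstsq(np.column_stack([np.ones(n), z]), x1, rcond=None)[0]
print(b12, bc[1])   # the slope on z reproduces the first-step coefficient on x2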
