1. Introduction
In today's world, a country whose unemployment results from economic and social
factors faces multidimensional problems. The condition and qualifications of a
country's labor force reflect its economic development. In light of these facts, a
developing country should overcome the problem of unemployment. According to the
Turkish Statistical Institute, the labor force consists of the active,
non-institutionalized population between the ages of 15 and 60. Unemployment is defined
740 The Comparison of Classical and Robust Biased Regression Methods for Determining Unemployment
Rate in Turkey: Period of 1985-2012
as the condition of being jobless while looking for a job that offers the current
wage level (Goktas and Isci, 2010).
In the literature, various factors are reported to affect unemployment.
Cascio (2001) investigates the relationship between monetary policy and unemployment
for 11 OECD countries over 1979:Q1-1998:Q4 by using a Vector Autoregressive (VAR)
model. According to Cascio (2001), monetary shocks influence unemployment, but their
effects differ from country to country. Namely, local factors are important in
determining how a labor market is influenced. Djivre and Ribon (2003) study the
influence of monetary policy on unemployment, inflation and the exchange rate over
1990-1999 for Israel and find that tight monetary policy shocks increase unemployment.
Ravn and Simonelli (2007) find that monetary policy shocks exert influence on
unemployment for the United States over 1953:Q3-2003:Q1. Karanasou and Sala (2010)
investigate the driving forces behind unemployment for Australia over time and find
that the reasons behind unemployment differ according to the period investigated. For
example, in the 1970s the driving force behind unemployment is the oil shock, while in
the 1990s and 2000s the interest rate is the important driving force. Currently,
however, the most influential factor is tight foreign demand due to the global crisis.
Further, another study on unemployment by Valletta and Kuang (2010) shows that the
recent increase in unemployment is conjunctural rather than structural for the United
States. In general, conjunctural fluctuations, such as fluctuations in the exchange
rate and the international interest rate and a decline in foreign demand, are the
shocks that exert influence on unemployment (Dogan, 2012).
There are not many macro-level empirical studies on unemployment in Turkey. Further,
in studies regarding Turkey's unemployment, structural breaks have not been considered,
since they have not been introduced into the VAR specification. For Turkey, Berument
et al. (2006, 2009) and Berument (2008) investigate the effects of macroeconomic policy
shocks on unemployment by using VAR models. The general conclusion derived from those
studies is that positive income shocks reduce unemployment. Aktar and Ozturk (2009)
study the interaction among macroeconomic variables for Turkey and find that positive
income shocks have a statistically significant negative effect on unemployment. They
also find that exports do not have a statistically significant influence on
unemployment. Dogrul and Soytas (2010) investigate the relationships among
unemployment, oil price and interest rate and find that interest rate shocks leave a
long-term impact on unemployment even though the initial
impact on unemployment is negative and insignificant. Goktas and Isci (2010) aimed to
remove the collinearity among the factors that affect the rate of unemployment and
obtained new variables from these factors by using principal components. The new
variables are used as regressors in constructing the unemployment regression model.
After checking the assumptions of statistical inference, they forecasted unemployment
in Turkey. Dogan (2012) investigates the response of unemployment to selected
macroeconomic shocks for the period 2000:Q1-2010:Q1 and finds that positive shocks to
growth, growth in exports and inflation reduce unemployment. On the other hand, shocks
to the exchange rate, the interbank interest rate and the money supply increase
unemployment. The results are consistent with the Phillips curve and Okun's Law:
negative relationships are found between output and unemployment and between
inflation and unemployment. Umit and Bulut (2013) examine the factors affecting
unemployment in Turkey after the 2008 Global Financial Crisis with Partial Least
Squares (PLS) analysis. In their study, the macroeconomic variables industrial
productivity index, real wage index, growth rate, consumer price index, the ratio of
imports to Gross National Product and the ratio of exports to Gross National Product
are used in modelling the response variable, the rate of unemployment. Explanatory
variables are taken at time t and with eight lags for 2005:Q1-2010:Q3. The results of
the analysis show that the same results are obtained for the industrial productivity
index and the real wage index. The ratio of imports to Gross National Product
contributes substantially to modelling the unemployment rate for all terms except the
first and second lags. The results also show that the consumer price index has a
significant contribution in all terms.
In many linear regression and prediction problems, the independent variables may be
numerous and highly collinear. This phenomenon is called multicollinearity, and it is
known that in this case the Ordinary Least Squares (OLS) estimator of the regression
coefficients, or a predictor based on these estimates, may give very poor results.
Therefore, several biased estimation methods, such as Ridge Regression (RR), Principal
Component Regression (PCR) and Partial Least Squares Regression (PLSR), have been
developed to overcome the multicollinearity problem. The main goal of biased methods
is to decrease the mean squared error of prediction by introducing a reasonable amount
of bias into the model. In most real systems, exact collinearity of the variables in X
is rather unusual because of the presence of random experimental noise. Nevertheless,
in systems producing nearly collinear data, the solution for the regression
coefficients is highly unstable, such that very small perturbations in the original
data (for example, because of noise or experimental error) cause the method to produce
wildly different results. In addition, the use of highly collinear variables in
Multiple Linear Regression (MLR) increases the possibility of overfitting the model.
Overfitting means that the model may fit the data very well but fails when used to
predict the properties of new samples (Martens and Naes, 1989; Naes et al., 2002;
Polat and Gunay, 2015). In addition to multicollinearity, outliers are another problem
encountered in regression models. If an observation does not follow the model that
fits the majority of the data points, it is an outlier. Its occurrence can be due, for
example, to a measurement error, to a change in the experimental conditions, or to the
fact that the sample belongs to a population other than the one under study. It is
also well known that the MLR method is highly sensitive to outliers. Both outliers in
the space of the response variables and those in the space of the explanatory
variables can unduly influence the parameter estimates (Hubert and Verboven, 2003). To
solve this problem, robust versions of regression models have been proposed. When both
multicollinearity and outliers are present, the robust versions of the biased RR, PCR
and PLSR methods, called Robust Ridge Regression (RRR), Robust Principal Component
Regression (RPCR) and RSIMPLS, are used.
In this study, the performance of the classical biased RR, PCR and PLSR methods and of
their robust versions RRR, RPCR and RSIMPLS is compared for determining the
unemployment rate in Turkey, where multicollinearity and outliers are simultaneously
present in the data set. The method giving the best model is then selected from among
these six methods. The rest of the paper is organized as follows. The classical biased
RR, PCR and PLSR methods are reviewed in Section 2. In Section 3, the robust versions
of these three methods, RRR, RPCR and RSIMPLS, are described. In Section 4, a real
unemployment data set of Turkey for the period 1985-2012, in which both
multicollinearity and outliers exist, is analyzed by using these six methods, and the
performance of these methods is compared by using the trimmed Root Mean Squared Error
(TRMSE) statistic.
2. Classical biased estimation methods
There are several classical biased estimation methods in the literature. In this
study, the three most commonly used of them, RR, PCR and PLSR, are used for the
analysis. In this section, MLR is first briefly outlined in order to clarify why
biased estimation methods are used in case of multicollinearity. Then the biased
estimation methods RR, PCR and PLSR are presented. The emphasis here is on the
algebraic derivation of the vector (or matrix) of coefficients in the linear
regression models for these four methods. Throughout this paper, matrices are denoted
by bold capital letters and vectors by bold lowercase letters.
2.1. Multiple Linear Regression
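Firstly, the regression model used for this method is defined by Equation (1).
Traditionally, the most frequently used method for finding $\boldsymbol{\beta}$ is OLS.
If $\mathbf{X}$ has full rank p (the number of independent variables), then the OLS
estimate of $\boldsymbol{\beta}$ is given by Equation (2).
$\hat{\boldsymbol{\beta}}_{OLS}$ gives unbiased estimates for the elements of
$\boldsymbol{\beta}$. The corresponding vector of fitted values is obtained as in
Equation (3), with $\mathbf{H}_{x}$ the projection matrix onto the column space of
$\mathbf{X}$ (Phatak and De Jong, 1997).

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{1}$$

$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \tag{2}$$

$$\hat{\mathbf{y}}_{OLS} = \mathbf{X}\hat{\boldsymbol{\beta}}_{OLS} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} = \mathbf{H}_{x}\mathbf{y} \tag{3}$$

In order to estimate $\boldsymbol{\beta}$ by the OLS method, the X variables must be
linearly independent and the number of independent variables, p, must be equal to or
smaller than the number of observations, n ($p \le n$) (Trygg, 2002).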
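A minimal sketch of Equations (1)-(3) on synthetic data may help fix the notation. It assumes NumPy is available; the names `ols_fit`, `X`, `y` and `beta_true` are illustrative and are not part of the study.

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate of Equation (2): solve (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic, noise-free data (an assumption made so the result can be checked).
rng = np.random.default_rng(0)
n, p = 50, 3                      # p <= n, as required for OLS
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true

beta_ols = ols_fit(X, y)          # Equation (2)
y_hat = X @ beta_ols              # Equation (3): fitted values
```

Solving the normal equations with a linear solve, rather than forming the inverse explicitly, is a common numerical choice; it gives the same estimate as Equation (2) whenever X has full column rank.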
If there is a multicollinearity problem, the variance of the least squares (LS)
estimator may be very large and subsequent predictions rather inaccurate. However, if
the insistence on unbiased estimators is given up, biased methods can be used to
overcome the problem of inaccurate predictions. For this reason, biased methods such
as RR, PCR and PLSR are used, with the consequent trade-off between increased bias and
decreased variance. The idea behind the PCR and PLSR methods is to discard the
irrelevant and unstable information and to use only the most relevant part of the
x-variation for regression. Hence, the collinearity problem can be solved and more
stable regression equations and predictions obtained (Naes et al., 2002; Polat and
Gunay, 2015).
$$\hat{\boldsymbol{\beta}}_{RR} = (\mathbf{X}^{T}\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{y} \tag{4}$$

$$k = \frac{p\,s^{2}}{\hat{\boldsymbol{\beta}}_{OLS0}^{T}\,\hat{\boldsymbol{\beta}}_{OLS0}} \tag{5}$$
excluded in the calculation. The simplest way to determine the optimum value of k is
to plot the values of each $\hat{\beta}_{RR}$ coefficient versus k (in the range 0-1),
which is called the ridge trace (Myers, 1990). In the ridge trace graph, a trace or
curve is formed for each of the coefficients. From the ridge trace, the minimum k
value that makes the $\hat{\beta}_{RR}$ coefficients stable can be chosen, and at this
k value the residual sum of squares should remain close to its minimum value (Hoerl
and Kennard, 1970). However, Van Nostrand (1980) stated that, since the determination
of stability in the ridge trace is subjective and hence the selection of k is
arbitrary, there is a tendency to choose a very large value of k when choosing k based
on the ridge trace. Hence, selecting k based on Equation (5) could be better
(Rawlings, 1988).
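The ridge estimator of Equation (4) and the data-driven choice of k in Equation (5) can be sketched as follows. This is only an illustration under stated assumptions: NumPy is available, the data are synthetic and centered, $s^2$ is taken to be the OLS residual mean square RSS/(n - p), and the function names `ridge_fit` and `k_from_ols` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, k):
    """Ridge estimate of Equation (4): (X'X + kI)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

def k_from_ols(X, y):
    """Ridge parameter of Equation (5): k = p * s^2 / (b'b).

    Assumption: s^2 is the OLS residual mean square RSS / (n - p) and
    b is the OLS estimate on centered data (intercept excluded).
    """
    n, p = X.shape
    b = ridge_fit(X, y, 0.0)                  # k = 0 reduces to OLS
    rss = float(np.sum((y - X @ b) ** 2))
    return p * (rss / (n - p)) / float(b @ b)

rng = np.random.default_rng(1)
n = 40
z = rng.normal(size=n)
# Two nearly collinear predictors plus an unrelated one (synthetic).
X = np.column_stack([z, z + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.0, -1.0]) + rng.normal(size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

k = k_from_ols(X, y)
beta_ols = ridge_fit(X, y, 0.0)
beta_rr = ridge_fit(X, y, k)
```

For any k > 0 the ridge coefficient vector is no longer than the OLS one in Euclidean norm, which is the shrinkage that trades bias for variance. A ridge trace could be produced by calling `ridge_fit` over a grid of k values.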
2.3. Principal Component Regression
PCR and PLSR methods assume that the p-dimensional independent x-variables and a set
of q-dimensional dependent y-variables are related through a bilinear model. With n
the number of observations, this bilinear model is given for i = 1, ..., n in Equation
(6) and Equation (7). Here $\bar{\mathbf{x}}$ is the mean of the x-variables,
$\bar{\mathbf{y}}$ is the mean of the y-variables, $\tilde{\mathbf{t}}_i$ are the
k-dimensional scores with $k \ll p$, $\mathbf{P}_{p,k}$ is the matrix of x-loadings and
$\mathbf{A}_{k,q}$ represents the slope matrix in the regression of $\mathbf{y}_i$ on
$\tilde{\mathbf{t}}_i$. The error terms are denoted by $\mathbf{f}_i$ and
$\mathbf{g}_i$. In terms of the original independent variables, this bilinear model
can be written as in Equation (8). The main difference between PCR and PLSR lies in
the construction of the scores $\tilde{\mathbf{t}}_i$. In PCR the scores are obtained
by extracting the most relevant information present in the x-variables by performing a
PCA on the independent variables, thus using a variance criterion. In contrast, the
PLSR scores are calculated by maximizing a covariance criterion between the x- and
y-variables (Hubert and Verboven, 2003; Hubert and Vanden Branden, 2003).

$$\mathbf{x}_i = \bar{\mathbf{x}} + \mathbf{P}_{p,k}\tilde{\mathbf{t}}_i + \mathbf{g}_i \tag{6}$$

$$\mathbf{y}_i = \bar{\mathbf{y}} + \mathbf{A}_{q,k}^{T}\tilde{\mathbf{t}}_i + \mathbf{f}_i \tag{7}$$

$$\mathbf{y}_i = \boldsymbol{\beta}_0 + \mathbf{B}_{q,p}^{T}\mathbf{x}_i + \mathbf{e}_i \tag{8}$$
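PCR starts by centering the data through the mean $\bar{\mathbf{x}}$ of the
x-variables and the mean $\bar{\mathbf{y}}$ of the y-variables. Let the centered
observations be denoted by (Hubert and Verboven, 2003)

$$\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}} \tag{9}$$

$$\tilde{\mathbf{y}}_i = \mathbf{y}_i - \bar{\mathbf{y}} \tag{10}$$

Next, the k-dimensional scores of each data point, $\tilde{\mathbf{t}}_i$, are
computed as the coordinates of the projections of $\tilde{\mathbf{x}}_i$ onto the
subspace spanned by the columns of $\mathbf{P}$, or equivalently as in Equation (11).

$$\tilde{\mathbf{t}}_i = \mathbf{P}^{T}\tilde{\mathbf{x}}_i \tag{11}$$

In the final step, $\tilde{\mathbf{y}}_i$ is regressed onto $\tilde{\mathbf{t}}_i$
via multiple linear regression. The fitted linear model is given in Equation (12).

$$\tilde{\mathbf{y}}_i = \mathbf{A}^{T}\tilde{\mathbf{t}}_i + \tilde{\boldsymbol{\varepsilon}}_i \tag{12}$$

The parameter estimates and fitted values can be expressed as in Equation (13) and
Equation (14), respectively.

$$\hat{\mathbf{A}}_{k,q} = (\mathbf{T}^{T}\mathbf{T})^{-1}\mathbf{T}^{T}\tilde{\mathbf{Y}} \tag{13}$$

$$\hat{\mathbf{y}}_i = \hat{\mathbf{A}}^{T}\tilde{\mathbf{t}}_i + \bar{\mathbf{y}} \tag{14}$$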
$$\hat{\boldsymbol{\beta}}_{PCR} = \mathbf{P}\hat{\mathbf{A}} \tag{15}$$
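The PCR steps of Equations (9)-(15) can be sketched for a single response as follows. The sketch assumes NumPy, synthetic data, and that the loadings are the dominant eigenvectors of the empirical covariance matrix of the centered x-variables; the function name `pcr_fit` is hypothetical.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Classical PCR for one response, following Equations (9)-(15)."""
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar                 # Equations (9)-(10)
    S_x = Xc.T @ Xc / (X.shape[0] - 1)            # empirical covariance matrix
    eigval, eigvec = np.linalg.eigh(S_x)          # eigenvalues in ascending order
    P = eigvec[:, np.argsort(eigval)[::-1][:k]]   # k dominant loading vectors
    T = Xc @ P                                    # scores, Equation (11)
    A = np.linalg.solve(T.T @ T, T.T @ yc)        # Equation (13)
    beta = P @ A                                  # Equation (15)
    intercept = y_bar - x_bar @ beta              # so that fits follow Equation (14)
    return beta, intercept

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 0.5, -1.5]) + 0.1 * rng.normal(size=30)

beta_full, intercept_full = pcr_fit(X, y, k=3)    # all components kept
beta_k2, intercept_k2 = pcr_fit(X, y, k=2)        # smallest-variance direction dropped
```

When all p components are kept, PCR reproduces the OLS slopes on the centered data; the bias is introduced only when k < p and low-variance directions are discarded.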
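$$\mathbf{x}_i = \bar{\mathbf{x}} + \mathbf{P}\tilde{\mathbf{t}}_i + \mathbf{g}_i \tag{16}$$

$$\mathbf{y}_i = \bar{\mathbf{y}} + \mathbf{A}^{T}\tilde{\mathbf{t}}_i + \mathbf{f}_i \tag{17}$$

where $\tilde{\mathbf{t}}_i$ are called scores, $\mathbf{P}$ is the matrix of
x-loadings, $\mathbf{A}$ is the slope matrix in the regression of $\mathbf{y}_i$ on
$\tilde{\mathbf{t}}_i$, and $\mathbf{g}_i$ and $\mathbf{f}_i$ are residuals (Hubert
and Vanden Branden, 2003).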
3. Robust versions of biased estimation methods
Besides multicollinearity, OLS estimates can be considerably affected by only one or a
few observations. Namely, not all data points in a data set have the same significance
in determining estimates, tests and other statistics. The data analyst should be aware
of such points, known as outliers. To overcome the effects of outliers, robust
regression models have been proposed. For the case in which both multicollinearity and
outliers are present, robust versions of the biased estimation methods have been
proposed. The robust versions of RR, PCR and PLSR are given as follows.
3.1. Robust Ridge Regression
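RR is based on least squares and is therefore sensitive to atypical observations.
Hence, the approach of MM estimation, which is a repeated M estimation, is proposed by
Maronna (2011) to ensure both robustness and efficiency under the normal model. The
vector of residuals in RR is given by Equation (18). A scale M estimator
$\hat{\sigma}$ of the data vector $\mathbf{e}_{RR}$ is defined as the solution of
Equation (19).

$$\mathbf{e}_{RR} = (e_1^{RR}, \ldots, e_n^{RR})^{T} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{RR} \tag{18}$$

$$\frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{e_i^{RR}}{\hat{\sigma}}\right) = \delta \tag{19}$$

For $\rho(t) = t^{2}$ and $\delta = 1$, the classical mean square error,
$\hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n}(e_i^{RR})^{2}$, is obtained. Maronna
(2011) employed the bisquare $\rho$-function given in Equation (20) (Maronna, 2011).

$$\rho_{bis}(t) = \min\{1,\; 1 - (1 - t^{2})^{3}\} \tag{20}$$

With an initial estimate $\hat{\boldsymbol{\beta}}_{ini}$, the initial scale
$\hat{\sigma}_{ini}$ satisfies Equation (21), and the RRR estimate minimizes the loss
in Equation (22).

$$\frac{1}{n}\sum_{i=1}^{n}\rho_{0}\!\left(\frac{e_i^{RR}(\hat{\boldsymbol{\beta}}_{ini})}{\hat{\sigma}_{ini}}\right) = \delta \tag{21}$$

$$L(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta}) = \hat{\sigma}_{ini}^{2}\sum_{i=1}^{n}\rho\!\left(\frac{e_i^{RR}(\boldsymbol{\beta})}{\hat{\sigma}_{ini}}\right) + k\,\lVert\boldsymbol{\beta}\rVert^{2} \tag{22}$$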
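Therefore, Equation (23) shall be used, where the constants $c_0 \le c$ are chosen to
control both robustness and efficiency (Maronna, 2011).

$$\rho_{0}(t) = \rho_{bis}\!\left(\frac{t}{c_0}\right), \qquad \rho(t) = \rho_{bis}\!\left(\frac{t}{c}\right) \tag{23}$$

$$\psi(t) = \rho'(t), \qquad W(t) = \frac{\psi(t)}{t} \tag{24}$$

Let

$$t_i = \frac{e_i^{RR}}{\hat{\sigma}_{ini}}, \qquad w_i = \frac{W(t_i)}{2}, \qquad \mathbf{w} = (w_1, \ldots, w_n)^{T}, \qquad \mathbf{W} = \mathrm{diag}(\mathbf{w}) \tag{25}$$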
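Setting the derivatives of Equation (22) with respect to $\boldsymbol{\beta}$ to zero
yields Equation (26) and Equation (27) for RRR. Since, for the chosen $\rho$, $W(t)$
is a decreasing function of $|t|$, observations with larger residuals will receive
lower weights $w_i$ (Maronna, 2011).

$$\mathbf{w}^{T}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{RR}) = 0 \tag{26}$$

$$(\mathbf{X}^{T}\mathbf{W}\mathbf{X} + k\mathbf{I})\hat{\boldsymbol{\beta}}_{RR} = \mathbf{X}^{T}\mathbf{W}\mathbf{y} \tag{27}$$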
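The bisquare functions of Equations (20) and (23)-(25) and a single solve of the weighted normal equations (27) can be sketched as below. This is a simplified illustration, not Maronna's full MM procedure: NumPy is assumed, the initial scale `sigma_ini` is simply fixed at 1.0, the tuning constant c = 3.44 is an illustrative value, no intercept is fitted, and the step is started at the true coefficients only so that the result can be checked exactly. The function names are hypothetical.

```python
import numpy as np

def bisquare_rho(t):
    """Bisquare rho-function of Equation (20)."""
    t = np.asarray(t, dtype=float)
    return np.minimum(1.0, 1.0 - (1.0 - t ** 2) ** 3)

def weight(t, c):
    """W(t) = psi(t) / t for rho(t) = rho_bis(t / c); zero when |t| > c."""
    u = np.asarray(t, dtype=float) / c
    return np.where(np.abs(u) <= 1.0, 6.0 * (1.0 - u ** 2) ** 2 / c ** 2, 0.0)

def rrr_step(X, y, beta, k, sigma_ini, c):
    """One reweighting step: solve (X'WX + kI) b = X'Wy, Equation (27)."""
    t = (y - X @ beta) / sigma_ini          # scaled residuals, Equation (25)
    w = weight(t, c) / 2.0                  # w_i = W(t_i) / 2
    XtW = X.T * w                           # same as X.T @ diag(w)
    return np.linalg.solve(XtW @ X + k * np.eye(X.shape[1]), XtW @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true
y[0] += 100.0                               # one gross outlier (synthetic)

# The outlier has |t| > c, so it receives weight zero and is ignored.
beta_new = rrr_step(X, y, beta_true, k=0.0, sigma_ini=1.0, c=3.44)
```

In practice the step would be iterated from a robust initial estimate until the coefficients stabilize, with k > 0 supplying the ridge penalty.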
As is usual in robust statistics, these weighted normal equations suggest an iterative
procedure. Starting with an initial $\hat{\boldsymbol{\beta}}_{RR}$, compute the
residual vector $\mathbf{e}_{RR}$ and the weights $\mathbf{w}$. In order to choose k,
the K-fold cross-validation process, CV(K), is considered, which requires recomputing
the estimate K times. $\hat{y}_{-i}$, the fit of $y_i$ computed without using the i-th
observation, is expressed as in Equation (28), where
$\hat{\boldsymbol{\beta}}_{-i}^{RRR}$ is the RRR estimate computed without observation
i. Then a first-order Taylor approximation of the estimator yields the approximate
prediction errors as shown in Equation (29).

$$\hat{y}_{-i} = \mathbf{x}_i^{T}\hat{\boldsymbol{\beta}}_{-i}^{RRR} \tag{28}$$

$$e_{-i}^{RRR} = y_i - \hat{y}_{-i} \approx e_i^{RRR}\left(1 + \frac{W(t_i)\,h_i}{1 - h_i\,\psi'(t_i)}\right) \tag{29}$$
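PCR starts by centring the data as $\tilde{\mathbf{x}}_i = \mathbf{x}_i -
\bar{\mathbf{x}}$ and $\tilde{\mathbf{y}}_i = \mathbf{y}_i - \bar{\mathbf{y}}$, then
performing a PCA on the x-variables in order to cope with multicollinearity. The PCA
loading matrix $\mathbf{P}_{p,k} = (\mathbf{p}_1, \ldots, \mathbf{p}_k)$ then contains
the first k dominant eigenvectors of the empirical covariance matrix,
$\mathbf{S}_x = \frac{1}{n-1}\tilde{\mathbf{X}}_{p,n}^{T}\tilde{\mathbf{X}}_{n,p}$, and
the scores satisfy $\tilde{\mathbf{t}}_i = \mathbf{P}_{k,p}^{T}\tilde{\mathbf{x}}_i$.
In the second step of PCR, the dependent variables $\tilde{\mathbf{y}}_i$ are regressed
onto $\tilde{\mathbf{t}}_i$ as $\tilde{\mathbf{y}}_i = \mathbf{A}^{T}\tilde{\mathbf{t}}_i
+ \tilde{\boldsymbol{\varepsilon}}_i$ using MLR. Then, the parameter estimates and
fitted values are obtained as $\hat{\mathbf{A}}_{k,q} =
(\mathbf{T}^{T}\mathbf{T})_{k,k}^{-1}\mathbf{T}_{k,n}^{T}\tilde{\mathbf{Y}}_{n,q}$ and
$\hat{\mathbf{y}}_i = \hat{\mathbf{A}}_{q,k}^{T}\tilde{\mathbf{t}}_i +
\bar{\mathbf{y}}$, respectively. The unknown regression parameters in Equation (1) are
then estimated as $\hat{\boldsymbol{\beta}}_{PCR} = \mathbf{P}\hat{\mathbf{A}}$ (Hubert
and Verboven, 2003).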
estimator then seeks that h-subset whose classical covariance matrix has minimal
determinant. The number h determines the robustness of the estimator and should be at
least (n+p+1)/2. The MCD location estimate is given by the mean $\bar{\mathbf{x}}_h$
of this subset and the MCD scatter estimate by its covariance matrix
$\hat{\boldsymbol{\Sigma}}_h$, multiplied by a consistency factor. A robust PCA method
yields a tolerance ellipse which captures the covariance structure of the majority of
the data points. The robust tolerance ellipse is obtained by applying the highly
robust MCD estimator of location and scatter to the data, yielding
$\hat{\boldsymbol{\mu}}_{MCD}$ and $\hat{\boldsymbol{\Sigma}}_{MCD}$, and by plotting
the points $\mathbf{x}$ whose robust distance

$$D(\mathbf{x}) = D(\mathbf{x}, \hat{\boldsymbol{\mu}}_{MCD}, \hat{\boldsymbol{\Sigma}}_{MCD}) = \sqrt{(\mathbf{x} - \hat{\boldsymbol{\mu}}_{MCD})^{T}\hat{\boldsymbol{\Sigma}}_{MCD}^{-1}(\mathbf{x} - \hat{\boldsymbol{\mu}}_{MCD})}$$

is equal to $\sqrt{\chi^{2}_{2,0.975}}$. Based on the raw MCD estimate, a reweighting
step can be added which increases the finite-sample efficiency considerably. In that
case, each data point receives weight one if it belongs to the robust tolerance
ellipse and weight zero otherwise. The reweighted MCD estimator then equals the
classical mean and covariance matrix of the data points with weight one. Hereafter, no
distinction is made between the raw and the reweighted MCD estimator, as it is assumed
that the reweighted one is always used. The first k eigenvectors of the MCD scatter
estimate, sorted in descending order of the eigenvalues, then yield robust loadings
(Engelen et al., 2004; Hubert and Verboven, 2003).
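The raw MCD idea and the robust-distance cutoff described above can be illustrated on a tiny two-dimensional data set. The sketch below is an assumption-laden toy: it finds the minimum-determinant h-subset by brute-force enumeration (feasible only for very small n, unlike the fast algorithms used in practice), it omits the consistency factor and the reweighting step, and the data and function names are made up for illustration. For two degrees of freedom the chi-square quantile has the closed form $-2\ln(1 - 0.975)$, which avoids any extra dependency.

```python
import itertools
import math
import numpy as np

def exact_mcd(X, h):
    """Raw MCD by enumerating every h-subset (toy sizes only)."""
    best = None
    for idx in itertools.combinations(range(X.shape[0]), h):
        sub = X[list(idx)]
        cov = np.cov(sub, rowvar=False)
        det = np.linalg.det(cov)
        if best is None or det < best[0]:
            best = (det, sub.mean(axis=0), cov)
    return best[1], best[2]                 # location and scatter estimates

def robust_distance(X, mu, sigma):
    """D(x) = sqrt((x - mu)' Sigma^{-1} (x - mu)) for every row of X."""
    d = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, np.linalg.inv(sigma), d))

# Six clustered points and two gross outliers (made-up data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [0.5, 0.5], [0.2, 0.8], [10.0, 10.0], [12.0, -9.0]])
n, p = X.shape
h = 6                                       # at least (n + p + 1) / 2 = 5.5
mu, sigma = exact_mcd(X, h)
cutoff = math.sqrt(-2.0 * math.log(0.025))  # sqrt of chi-square(2) 0.975 quantile
outliers = np.where(robust_distance(X, mu, sigma) > cutoff)[0]
```

Here the two distant points fall outside the tolerance ellipse and would receive weight zero in the reweighting step.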
ROBPCA is a robust PCA method which combines projection pursuit ideas with MCD
covariance estimation in lower dimensions. Firstly, the x-data are preprocessed by
reducing their data space to the affine subspace spanned by the n observations. This
can easily be performed by using a singular value decomposition of
$\mathbf{X}_{n,p}$. In the second step of the ROBPCA algorithm, a measure of
outlyingness is computed for each data point. This is obtained by projecting the
high-dimensional data points on many univariate directions. On every direction a
robust centre and scale of the projected data points are computed, and for every data
point its standardized distance to that centre is measured. Finally, for each data
point its largest distance over all the directions is considered. The h data points
with the smallest outlyingness are then retained and, from the covariance matrix of
this final h-subset, the number of principal components (PCs) to retain, k, is
selected. ROBPCA yields more accurate estimates for uncontaminated data sets and more
robust estimates for contaminated data sets. Briefly, the ROBPCA method applied to
$\mathbf{X}_{n,p}$ yields robust scores, which can be derived as $\mathbf{t}_i =
\mathbf{P}_{k,p}^{T}(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_x)$. Here,
$\mathbf{P}_{p,k}$ is the loading matrix with orthogonal columns and
$\hat{\boldsymbol{\mu}}_x$ is a robust centre (Engelen et al., 2004; Hubert and
Verboven, 2003).
The MCD scatter matrix defines a metric in the PCA subspace. Let $\mathbf{L}$ denote
the diagonal matrix which contains the eigenvalues $l_j$ of the MCD scatter matrix,
sorted from largest to smallest. Then, the score distance of a p-dimensional point
$\mathbf{x}$ with respect to $\hat{\boldsymbol{\mu}}_x$, $\mathbf{P}$ and $\mathbf{L}$
is defined as in Equation (30) (Hubert and Verboven, 2003).

$$SD(\mathbf{x}) = SD(\mathbf{x}, \hat{\boldsymbol{\mu}}_x, \mathbf{P}, \mathbf{L}) = D\big(\mathbf{t} = \mathbf{P}^{T}(\mathbf{x} - \hat{\boldsymbol{\mu}}_x), \mathbf{0}, \mathbf{L}\big) = \sqrt{\mathbf{t}^{T}\mathbf{L}^{-1}\mathbf{t}} = \sqrt{\sum_{j=1}^{k}\frac{t_j^{2}}{l_j}} \tag{30}$$
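Because $\mathbf{L}$ is diagonal, the score distance of Equation (30) reduces to a weighted sum of squared score coordinates. A small sketch with made-up scores and eigenvalues (the function name `score_distance` is hypothetical):

```python
import math

def score_distance(t, l):
    """Score distance of Equation (30): sqrt(sum_j t_j^2 / l_j)."""
    return math.sqrt(sum(tj ** 2 / lj for tj, lj in zip(t, l)))

# Made-up scores t = (3, 4) with eigenvalues l = (9, 16): sqrt(9/9 + 16/16).
sd = score_distance([3.0, 4.0], [9.0, 16.0])
```

Each coordinate is scaled by the spread of its own component, so a large score along a high-variance direction counts less than the same score along a low-variance one.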
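In the second stage of the RPCR method, $\mathbf{y}_i$ is regressed on
$\mathbf{t}_i$: if there is only one y-variable the reweighted LTS regression is
preferred, otherwise the MCD regression is performed. Here, the regression model with
intercept is written as in Equation (31) with $\mathrm{Cov}(\boldsymbol{\varepsilon}) =
\boldsymbol{\Sigma}_{\varepsilon}$. In case of one dependent variable (q=1), this model
simplifies as in Equation (32), with $\sigma$ the scale of the errors. The parameters
in Equation (32) can be estimated by using the LTS estimator. The raw LTS estimator
minimizes the sum of the h smallest squared residuals as shown in Equation (33). Here,
$r^{2}_{1:n} \le r^{2}_{2:n} \le \cdots \le r^{2}_{n:n}$ denote the ordered squared
residuals. An initial estimate of the error dispersion is given by Equation (34). Here
$c_h$ is a consistency factor for normally distributed errors. The reweighted LTS
estimator then corresponds to the LS estimator applied to the observations whose
absolute standardized residual is not too large, that is, to the observations with
$|r_i(\hat{\boldsymbol{\alpha}}, \hat{\alpha}_0)_{LTS} / \hat{\sigma}_0| \le 2.5$.

$$\mathbf{y}_i = \boldsymbol{\alpha}_0 + \mathbf{A}^{T}\mathbf{t}_i + \boldsymbol{\varepsilon}_i \tag{31}$$

$$y_i = \alpha_0 + \boldsymbol{\alpha}^{T}\mathbf{t}_i + \varepsilon_i \tag{32}$$

$$(\hat{\boldsymbol{\alpha}}, \hat{\alpha}_0)_{LTS} = \arg\min_{\boldsymbol{\alpha}, \alpha_0}\sum_{i=1}^{h} r^{2}_{i:n} \tag{33}$$

$$\hat{\sigma}_0 = c_h\sqrt{\frac{1}{h}\sum_{i=1}^{h} r^{2}_{i:n}(\hat{\boldsymbol{\alpha}}, \hat{\alpha}_0)_{LTS}} \tag{34}$$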
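The quantities in Equations (33) and (34) and the 2.5 cutoff of the reweighting step can be sketched for a given residual vector. The sketch only evaluates the LTS objective; it does not search for the minimizing coefficients. The consistency factor `c_h` is left at a placeholder value of 1.0 because its formula is not given here, and the residuals and function names are made up.

```python
import math

def lts_objective(residuals, h):
    """Sum of the h smallest squared residuals, Equation (33)."""
    return sum(sorted(r ** 2 for r in residuals)[:h])

def lts_scale(residuals, h, c_h=1.0):
    """Initial scale of Equation (34); c_h = 1.0 is only a placeholder."""
    return c_h * math.sqrt(lts_objective(residuals, h) / h)

def reweight(residuals, sigma0, cutoff=2.5):
    """Weight one when |r_i / sigma0| <= cutoff, weight zero otherwise."""
    return [1 if abs(r / sigma0) <= cutoff else 0 for r in residuals]

res = [1.0, -2.0, 3.0, -10.0]        # made-up residuals
obj = lts_objective(res, h=3)        # 1 + 4 + 9; the residual -10 is trimmed
sigma0 = lts_scale(res, h=3)
weights = reweight(res, sigma0)
```

The final reweighted LTS fit would then be an ordinary LS fit restricted to the observations with weight one.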
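In case of q>1, the MCD regression estimator is used. It starts by computing the
reweighted MCD estimator on the $(\mathbf{t}_i, \mathbf{y}_i)$ jointly, leading to a
(k+q)-dimensional location estimate $\hat{\boldsymbol{\mu}}$ and a scatter estimate
$\hat{\boldsymbol{\Sigma}}_{k+q,k+q}$, which can be split into blocks as shown in
Equation (35). Similarly to the MLR estimates, which are based on the empirical
covariance matrix of the joint $(\mathbf{t}_i, \mathbf{y}_i)$ variables, robust
parameter estimates are then obtained as shown in Equation (36). The efficiency of
this robust regression estimator can also be increased by performing a reweighting
step. To apply this reweighting scheme, each data point receives zero weight if its
initial residual distance is unusually large, as shown in Equation (37), with Equation
(38) and Equation (39). All other observations have weight $w_i = 1$ (Hubert and
Verboven, 2003). The reweighted MCD regression parameters then correspond to the MLR
estimates based on those observations with weight one. The final residual distances
are obtained by filling in the reweighted estimates for $\hat{\mathbf{A}}$ and
$\hat{\boldsymbol{\alpha}}_0$ in Equation (36), Equation (38) and Equation (39). A
different notation for the final estimates and residual distances is not introduced,
but it is assumed that the reweighting step is indeed applied. The fitted values are
obtained as in Equation (40) and the regression parameters are derived as in Equation
(41). Finally, $\hat{\boldsymbol{\Sigma}}_e = \hat{\boldsymbol{\Sigma}}_{\varepsilon}$
is set.

$$\hat{\boldsymbol{\mu}}_{MCD} = \begin{pmatrix}\hat{\boldsymbol{\mu}}_t \\ \hat{\boldsymbol{\mu}}_y\end{pmatrix}, \qquad \hat{\boldsymbol{\Sigma}}_{MCD} = \begin{pmatrix}\hat{\boldsymbol{\Sigma}}_t & \hat{\boldsymbol{\Sigma}}_{ty} \\ \hat{\boldsymbol{\Sigma}}_{yt} & \hat{\boldsymbol{\Sigma}}_y\end{pmatrix} \tag{35}$$

$$\hat{\mathbf{A}}_{k,q} = \hat{\boldsymbol{\Sigma}}_t^{-1}\hat{\boldsymbol{\Sigma}}_{ty}, \qquad \hat{\boldsymbol{\alpha}}_0 = \hat{\boldsymbol{\mu}}_y - \hat{\mathbf{A}}^{T}\hat{\boldsymbol{\mu}}_t, \qquad \hat{\boldsymbol{\Sigma}}_{\varepsilon} = \hat{\boldsymbol{\Sigma}}_y - \hat{\mathbf{A}}^{T}\hat{\boldsymbol{\Sigma}}_t\hat{\mathbf{A}} \tag{36}$$

$$w_i = 0 \quad \text{if} \quad RD_i > \sqrt{\chi^{2}_{q,0.975}} \tag{37}$$

$$\mathbf{r}_i = \mathbf{y}_i - \hat{\mathbf{A}}^{T}\mathbf{t}_i - \hat{\boldsymbol{\alpha}}_0 \tag{38}$$

$$RD_i = D(\mathbf{r}_i, \mathbf{0}, \hat{\boldsymbol{\Sigma}}_{\varepsilon}) = \sqrt{\mathbf{r}_i^{T}\hat{\boldsymbol{\Sigma}}_{\varepsilon}^{-1}\mathbf{r}_i} \tag{39}$$

$$\hat{\mathbf{y}}_i = \hat{\mathbf{A}}_{q,k}^{T}\mathbf{t}_i + \hat{\boldsymbol{\alpha}}_0 = \hat{\mathbf{A}}_{q,k}^{T}\mathbf{P}_{k,p}^{T}(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_x) + \hat{\boldsymbol{\alpha}}_0 \tag{40}$$

$$\hat{\mathbf{B}}_{p,q} = \mathbf{P}_{p,k}\hat{\mathbf{A}}_{k,q}, \qquad \hat{\boldsymbol{\beta}}_0 = \hat{\boldsymbol{\alpha}}_0 - \hat{\mathbf{B}}_{q,p}^{T}\hat{\boldsymbol{\mu}}_x \tag{41}$$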
As for the MCD estimator, the robustness of the RPCR algorithm depends on the value of
h, which is chosen in the ROBPCA algorithm, LTS regression and MCD regression. In the
MATLAB implementation the user can either choose h from the start or choose a value of
$\alpha$ with $0.5 \le \alpha \le 1$, such that $1 - \alpha$ corresponds to the
percentage of outliers that the algorithm should be able to resist. Then, h is set as
the maximum of $h_1 = \alpha n$ and $h_2 = (n + k + q + 1)/2$, where $h_2$ is the
required minimal value for the MCD
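[...] $\mathbf{r}_1$ and $\mathbf{q}_1$ are obtained as the first left and right
singular vectors of $\mathbf{S}_{xy} = \mathbf{S}_{yx}^{T} =
\tilde{\mathbf{X}}_{p,n}^{T}\tilde{\mathbf{Y}}_{n,q}/(n-1)$, the cross-covariance
matrix of the x- and y-variables. For each observation the first coordinate of the
score $\tilde{\mathbf{t}}_i$ is computed as $\tilde{t}_{i1} =
\tilde{\mathbf{x}}_i^{T}\mathbf{r}_1$. To obtain the other PLSR weight vectors
$\mathbf{r}_a$ and $\mathbf{q}_a$ for a = 2, ..., k, the components
$\tilde{\mathbf{X}}\mathbf{r}_j$ are [...] the cross-covariance matrix
$\mathbf{S}_{xy}$ provides the solutions for the other PLSR weight vectors. This
deflation is carried out by first calculating the x-loading $\mathbf{p}_j =
\mathbf{S}_x\mathbf{r}_j / (\mathbf{r}_j^{T}\mathbf{S}_x\mathbf{r}_j)$ with
$\mathbf{S}_x$ [...] In general, the PLSR weight vectors $\mathbf{r}_a$ and
$\mathbf{q}_a$ are obtained as the left and right singular vectors of
$\mathbf{S}_{xy}^{a}$. The elements of the scores $\tilde{\mathbf{t}}_i$ are then
defined as linear combinations of the mean-centered data, $\tilde{t}_{ia} =
\tilde{\mathbf{x}}_i^{T}\mathbf{r}_a$, or equivalently $\tilde{\mathbf{T}}_{n,k} =
\tilde{\mathbf{X}}_{n,p}\mathbf{R}_{p,k}$ with $\mathbf{R}_{p,k} = (\mathbf{r}_1,
\ldots, \mathbf{r}_k)$.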
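The first SIMPLS step described above, taking the first left and right singular vectors of the cross-covariance matrix, can be sketched as follows. Only the first component is computed; the deflation for a = 2, ..., k is not shown. NumPy and synthetic data are assumed, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 25, 4, 2
X = rng.normal(size=(n, p))                # synthetic x-variables
Y = rng.normal(size=(n, q))                # synthetic y-variables
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

S_xy = Xc.T @ Yc / (n - 1)                 # cross-covariance matrix (p x q)
U, s, Vt = np.linalg.svd(S_xy, full_matrices=False)
r1, q1 = U[:, 0], Vt[0, :]                 # first left / right singular vectors
t1 = Xc @ r1                               # first score coordinate per observation
```

The largest singular value equals the sample covariance between the x-score `Xc @ r1` and the y-combination `Yc @ q1`, which is the covariance criterion that PLSR maximizes over unit-length weight vectors.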
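Finally, where the scores are k-dimensional, MLR of the dependent variables
$\mathbf{y}_i$ on these scores $\tilde{\mathbf{t}}_i$ is performed. Thus, the formal
regression model under consideration is given in Equation (42), where
$E(\mathbf{f}_i) = \mathbf{0}$ and $\mathrm{Cov}(\mathbf{f}_i) =
\boldsymbol{\Sigma}_f$. MLR provides estimates as in Equation (43), Equation (44) and
Equation (45). By inserting $\tilde{\mathbf{t}}_i = \mathbf{R}_{k,p}^{T}(\mathbf{x}_i -
\bar{\mathbf{x}})$ in Equation (42), estimates for the parameters in the original
model in Equation (46) are obtained as in Equation (47). Finally, an estimate of
$\boldsymbol{\Sigma}_e$ is also provided by rewriting $\mathbf{S}_f$ in terms of the
original parameters: $\mathbf{S}_e = \mathbf{S}_y -
\hat{\mathbf{B}}^{T}\mathbf{S}_x\hat{\mathbf{B}}$ (Engelen et al., 2004; Hubert and
Vanden Branden, 2003).

$$\mathbf{y}_i = \boldsymbol{\alpha}_0 + \mathbf{A}_{q,k}^{T}\tilde{\mathbf{t}}_i + \mathbf{f}_i \tag{42}$$

$$\hat{\mathbf{A}}_{k,q} = \mathbf{S}_t^{-1}\mathbf{S}_{ty} = (\mathbf{R}_{k,p}^{T}\mathbf{S}_x\mathbf{R}_{p,k})^{-1}\mathbf{R}_{k,p}^{T}\mathbf{S}_{xy} \tag{43}$$

$$\hat{\boldsymbol{\alpha}}_0 = \bar{\mathbf{y}} - \hat{\mathbf{A}}_{q,k}^{T}\,\bar{\tilde{\mathbf{t}}} \tag{44}$$

$$\mathbf{S}_f = \mathbf{S}_y - \hat{\mathbf{A}}_{q,k}^{T}\mathbf{S}_t\hat{\mathbf{A}}_{k,q} = \frac{1}{n-1}\left(\tilde{\mathbf{Y}}_{q,n}^{T}\tilde{\mathbf{Y}}_{n,q} - \hat{\mathbf{A}}_{q,k}^{T}\tilde{\mathbf{T}}_{k,n}^{T}\tilde{\mathbf{T}}_{n,k}\hat{\mathbf{A}}_{k,q}\right) \tag{45}$$

$$\mathbf{y}_i = \boldsymbol{\beta}_0 + \mathbf{B}_{q,p}^{T}\mathbf{x}_i + \mathbf{e}_i \tag{46}$$

$$\hat{\mathbf{B}}_{p,q} = \mathbf{R}_{p,k}\hat{\mathbf{A}}_{k,q} \quad \text{and} \quad \hat{\boldsymbol{\beta}}_0 = \bar{\mathbf{y}} - \hat{\mathbf{B}}_{q,p}^{T}\bar{\mathbf{x}} \tag{47}$$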
Since SIMPLS is based on the empirical cross-covariance matrix between the y-variables
and the x-variables and on linear LS regression, the results are also affected by
outliers in the data set. Hence, Hubert and Vanden Branden (2003) suggested a robust
version of the SIMPLS method called RSIMPLS. RSIMPLS starts by applying ROBPCA on the
x- and y-variables in order to replace $\mathbf{S}_{xy}$ and $\mathbf{S}_x$, which are
used to calculate $\tilde{\mathbf{t}}_i$, by robust estimates, and then proceeds
analogously to the SIMPLS algorithm. Similar to RPCR, a robust regression method
(ROBPCA regression) is performed in the second stage instead of MLR (Engelen et al.,
2004; Hubert and Vanden Branden, 2003). To obtain robust scores, ROBPCA is first
applied on $\mathbf{Z}_{n,m} = (\mathbf{X}_{n,p}, \mathbf{Y}_{n,q})$. ROBPCA is a
robust covariance estimator for high-dimensional data sets (m>n). ROBPCA combines
projection pursuit with MCD estimation. Using projection pursuit ideas, it computes the
outlyingness of every data point and then considers the empirical covariance matrix of
the h data points with the smallest outlyingness. The data are then projected onto the
subspace $K_0$ spanned by the $k_0 \ll m$ dominant eigenvectors of this covariance
matrix. Next, the MCD method is applied to estimate the center and scatter of the data
in this low-dimensional subspace. Finally, these estimates are transformed back to the
original space and a robust estimate of the center $\hat{\boldsymbol{\mu}}_z$ of
$\mathbf{Z}_{n,m}$
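and of its scatter $\hat{\boldsymbol{\Sigma}}_z =
\mathbf{P}^{z}\mathbf{L}^{z}(\mathbf{P}^{z})^{T}$ are obtained, such that the diagonal
matrix $\mathbf{L}^{z}$ contains the $k_0$ largest eigenvalues of
$\hat{\boldsymbol{\Sigma}}_z$ in decreasing order. The estimates
$\hat{\boldsymbol{\mu}}_z$ and $\hat{\boldsymbol{\Sigma}}_z$ can be split into blocks
as in Equation (48). [...] The weight vectors $\mathbf{r}_a$ are computed as in the
SIMPLS algorithm, but now starting with $\hat{\boldsymbol{\Sigma}}_{xy}$ instead of
$\mathbf{S}_{xy}$. The x-loadings are defined as $\mathbf{p}_j =
\hat{\boldsymbol{\Sigma}}_x\mathbf{r}_j /
(\mathbf{r}_j^{T}\hat{\boldsymbol{\Sigma}}_x\mathbf{r}_j)$. The robust scores are
calculated as in Equation (49), where the $\tilde{\mathbf{x}}_i$ are the robustly
centered observations (Engelen et al., 2004).

$$\hat{\boldsymbol{\mu}}_z = \begin{pmatrix}\hat{\boldsymbol{\mu}}_x \\ \hat{\boldsymbol{\mu}}_y\end{pmatrix}, \qquad \hat{\boldsymbol{\Sigma}}_z = \begin{pmatrix}\hat{\boldsymbol{\Sigma}}_x & \hat{\boldsymbol{\Sigma}}_{xy} \\ \hat{\boldsymbol{\Sigma}}_{yx} & \hat{\boldsymbol{\Sigma}}_y\end{pmatrix} \tag{48}$$

$$t_{ia} = \tilde{\mathbf{x}}_i^{T}\mathbf{r}_a = (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_x)^{T}\mathbf{r}_a \tag{49}$$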
Once the robust scores are derived, a robust linear regression is performed. The
regression model based on the robust scores is written as in Equation (50). In order
to estimate the parameters in this model, a robust regression method called ROBPCA
regression is used. This method uses additional information from the previous ROBPCA
step (Engelen et al., 2004).

$$\mathbf{y}_i = \boldsymbol{\alpha}_0 + \mathbf{A}_{q,k}^{T}\mathbf{t}_i + \mathbf{f}_i \tag{50}$$
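Firstly, to obtain the robust scores $\mathbf{t}_i$, ROBPCA has been applied to the
(x,y)-variables and a $k_0$-dimensional subspace $K_0$, which represents these
(x,y)-variables well, has been obtained. Because the scores were then constructed to
summarize the most important information given in the x-variables, it is expected that
outliers with respect to this $k_0$-dimensional subspace are often also outlying in
the (t,y)-space. Hence, the center $\hat{\boldsymbol{\mu}}$ and scatter
$\hat{\boldsymbol{\Sigma}}$ of the (t,y)-variables are estimated as the weighted mean
and covariance matrix of those $(\mathbf{t}_i, \mathbf{y}_i)$ whose corresponding
$(\mathbf{x}_i, \mathbf{y}_i)$ are not outlying to $K_0$, as shown in Equation (51).
If observation i is not identified as an outlier by applying ROBPCA on (x,y), then
$w_i = 1$, and otherwise $w_i = 0$.

$$\hat{\boldsymbol{\mu}} = \begin{pmatrix}\hat{\boldsymbol{\mu}}_t \\ \hat{\boldsymbol{\mu}}_y\end{pmatrix} = \frac{\sum_{i=1}^{n} w_i\begin{pmatrix}\mathbf{t}_i \\ \mathbf{y}_i\end{pmatrix}}{\sum_{i=1}^{n} w_i}, \qquad \hat{\boldsymbol{\Sigma}} = \begin{pmatrix}\hat{\boldsymbol{\Sigma}}_t & \hat{\boldsymbol{\Sigma}}_{ty} \\ \hat{\boldsymbol{\Sigma}}_{yt} & \hat{\boldsymbol{\Sigma}}_y\end{pmatrix} = \frac{\sum_{i=1}^{n} w_i\begin{pmatrix}\mathbf{t}_i \\ \mathbf{y}_i\end{pmatrix}\begin{pmatrix}\mathbf{t}_i^{T} & \mathbf{y}_i^{T}\end{pmatrix}}{\sum_{i=1}^{n} w_i - 1} \tag{51}$$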
Two types of outliers can be identified by applying ROBPCA: those which are outlying
within $K_0$ and those which are lying far from $K_0$. The first type of outliers can
easily be identified as those observations whose robust distance $D_i(k_0) =
\sqrt{(\mathbf{t}_i^{z})^{T}(\mathbf{L}^{z})^{-1}\mathbf{t}_i^{z}}$ is large [...]. To
identify the second type, for each data point its orthogonal distance $OD_i$ to the
subspace $K_0$ is considered. The distribution of these orthogonal distances is
difficult to determine exactly; however, motivated by the central limit theorem, it
appears that the squared orthogonal distances are roughly normally distributed. Hence,
their center and variance are estimated with the univariate MCD, yielding
$\hat{\mu}_{od^{2}}$ and $\hat{\sigma}^{2}_{od^{2}}$. Then, if $OD_i^{2} >
\hat{\mu}_{od^{2}} + \hat{\sigma}_{od^{2}}\,z_{0.975}$, $w_i$ is set to zero. Having
identified the observations with weight one, $\hat{\boldsymbol{\mu}}$ and
$\hat{\boldsymbol{\Sigma}}$ are computed from Equation (51). Then these estimates are
inserted in Equations (43)-(45) and a reweighted MLR is performed. It is recommended to
reweight these initial regression estimates in order to improve the finite-sample
efficiency. Let $\mathbf{r}_{i(k)}$ be the residual of the ith observation based on
the initial estimates that were calculated with k components. If
$\hat{\boldsymbol{\Sigma}}_f$ is the initial estimate for the covariance matrix of the
errors, then the robust distance of the residuals is defined as in Equation (52). The
weights $c_{i(k)}$ are computed as in Equation (53), with I the indicator function.
The final regression estimates are then calculated as in classical MLR, but only based
on those observations with weight $c_{i(k)}$ equal to one. This reweighting step has
the advantage that it might again include observations with $w_i = 0$ which are not
regression outliers. The robust residual distances $RD_{i(k)}$ are recomputed as in
Equation (52) and the weights $c_{i(k)}$ are adapted accordingly. Robust parameters
for the original model in Equation (46) are then given by Equation (54).

$$RD_{i(k)} = \left(\mathbf{r}_{i(k)}^{T}\hat{\boldsymbol{\Sigma}}_f^{-1}\mathbf{r}_{i(k)}\right)^{1/2} \tag{52}$$

$$c_{i(k)} = I\left(RD_{i(k)}^{2} \le \chi^{2}_{q,0.975}\right) \tag{53}$$

$$\hat{\mathbf{B}}_{p,q} = \mathbf{R}_{p,k}\hat{\mathbf{A}}_{k,q}, \qquad \hat{\boldsymbol{\beta}}_0 = \hat{\boldsymbol{\alpha}}_0 - \hat{\mathbf{B}}_{q,p}^{T}\hat{\boldsymbol{\mu}}_x, \qquad \hat{\boldsymbol{\Sigma}}_e = \hat{\boldsymbol{\Sigma}}_f \tag{54}$$
Agriculture (CEA), Unit Labor Cost (ULC), Purchasing Power Parities (PPP), Trend (T).
The data set is taken from the OECD database.
First, MLR analysis is applied to the data set, and the model obtained by the OLS
method is found to be significant at the 95% confidence level (F=4.95; p=0.002).
Even though the MLR model fits the data well, multicollinearity may severely degrade
the quality of the prediction. Table 1 shows that none of the independent variables is
individually significant at the 5% level, which is an indicator of a multicollinearity problem.
Since $\lambda_1, \ldots, \lambda_p$ are the eigenvalues of the correlation matrix $X^T X$, the existence of a
collinearity problem can also be seen by examining the condition number, calculated as
$\lambda_{max} / \lambda_{min}$ = 9.112/0.0003 = 30373. A condition number greater than 30 indicates
multicollinearity. Another multicollinearity measure is the Variance Inflation Factor (VIF).
The larger the VIF value, the more serious the collinearity problem. In practice, if any of
the VIF values is equal to or larger than 10, there is near-collinearity. In this case, the
regression coefficients are not reliable. The VIF values of GDP, IMP, EXP, P, E, CPI, RCPI,
CEA, ULC, PPP and T are found to be 3.32, 9.79, 6.05, 6.79, 39.25, 32.99, 22.25, 67.86,
2.98, 157.42 and 51.57, respectively. Since the VIF values of E, CPI, RCPI, CEA, PPP and T
are greater than 10, there is a near-collinearity problem for this data set.
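The two diagnostics above can be reproduced from the reported numbers with a short Python sketch. The eigenvalues and VIF values are those quoted in the text; the variable names and the dictionary layout are illustrative choices.

```python
# Largest and smallest eigenvalues of the correlation matrix, as reported.
eig_max, eig_min = 9.112, 0.0003
condition_number = eig_max / eig_min

# VIF values as reported in the text, keyed by variable name.
vif = {
    "GDP": 3.32, "IMP": 9.79, "EXP": 6.05, "P": 6.79, "E": 39.25,
    "CPI": 32.99, "RCPI": 22.25, "CEA": 67.86, "ULC": 2.98,
    "PPP": 157.42, "T": 51.57,
}

# Rule of thumb used in the text: VIF >= 10 signals near-collinearity.
flagged = [name for name, value in vif.items() if value >= 10]

print(round(condition_number))
print(flagged)
```

With these inputs the condition number rounds to 30373, and the flagged variables are E, CPI, RCPI, CEA, PPP and T, in agreement with the text.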
Table 1: The estimated regression coefficients for the MLR model
Model      Coefficients   Standard Error of Coefficients   t       p
Constant   24.866         9.464                            2.63    0.018
Trend      -0.3588        0.1838                           -1.95   0.069
GDP        -0.11589       0.08178                          -1.42   0.176
EXP        -0.01865       0.01123                          -1.66   0.116
IMP        0.00542        0.01735                          0.31    0.759
P          1.120          1.787                            0.63    0.540
E          -2.001         1.867                            -1.07   0.300
PPP        12.982         6.365                            2.04    0.058
CPI        0.02521        0.03897                          0.65    0.527
RCPI       -0.11798       0.06699                          -1.76   0.097
CEA        -0.2399        0.1841                           -1.30   0.211
ULC        0.03096        0.01681                          1.84    0.084
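As a consistency check on Table 1, each t statistic should equal the coefficient divided by its standard error. The sketch below recomputes the ratios from the tabulated values; the tolerance of half a unit in the second decimal place is an assumption that reflects the rounding of the reported statistics.

```python
# Coefficients, standard errors and reported t statistics from Table 1.
rows = {
    "Constant": (24.866, 9.464, 2.63),
    "Trend": (-0.3588, 0.1838, -1.95),
    "GDP": (-0.11589, 0.08178, -1.42),
    "EXP": (-0.01865, 0.01123, -1.66),
    "IMP": (0.00542, 0.01735, 0.31),
    "P": (1.120, 1.787, 0.63),
    "E": (-2.001, 1.867, -1.07),
    "PPP": (12.982, 6.365, 2.04),
    "CPI": (0.02521, 0.03897, 0.65),
    "RCPI": (-0.11798, 0.06699, -1.76),
    "CEA": (-0.2399, 0.1841, -1.30),
    "ULC": (0.03096, 0.01681, 1.84),
}

# t = coefficient / standard error for every term in the model.
computed_t = {name: coef / se for name, (coef, se, _) in rows.items()}

# Terms whose recomputed t differs from the reported one beyond rounding.
mismatches = [
    name for name, (_, _, reported) in rows.items()
    if abs(computed_t[name] - reported) > 0.005
]
print(mismatches)
```

An empty list means every reported t value agrees with its coefficient and standard error up to rounding.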
Second, the presence of outliers is examined using the normal Q-Q plot of the
MLR residuals given in Figure 1. As seen from Figure 1, there are two outliers in the
data.
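The coordinates of a normal Q-Q plot of residuals can be computed as in the following sketch. It assumes plotting positions (i - 0.5)/n and standardization by the sample mean and standard deviation, which may differ from the conventions of the software used for the figures; the residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (normal quantile, standardized residual) pairs, sorted."""
    n = len(residuals)
    center, scale = mean(residuals), stdev(residuals)
    ordered = sorted((r - center) / scale for r in residuals)
    # Assumed plotting positions: (i - 0.5) / n for i = 1, ..., n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals with two unusually large values.
residuals = [-0.4, 0.1, -0.2, 0.3, 2.4, -0.1, 0.2, -0.3, 2.1, 0.0]
points = qq_points(residuals)
for quantile, value in points:
    print(f"{quantile:6.2f} {value:6.2f}")
```

Points that lie far from the straight line through the bulk of the pairs, here the two largest residuals, are the candidates for outliers.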
[Figure 1. Panel: MLR. Axes: normal quantiles (horizontal), residuals (vertical).]
[Figure 2. Panels: RR, RRR, PCR, RPCR, SIMPLS, RSIMPLS. Axes: normal quantiles (horizontal), residuals (vertical).]
Figure 2: Residuals Q-Q Plots of the Classical and Robust Biased Regression Methods
The predictive performance of the methods is evaluated by using the Root Mean
Squared Error (RMSE), $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ [...]
Table 2
                     RR           PCR       PLSR      RRR          RPCR      RSIMPLS
Constant             17.9265      25.6056   28.2426   23.2473      27.1417   13.4399
T                    -.0976       -.1611    -.2322    -.4540       -.3524    -.4302
GDP                  -.0819       -.0929    -.0895    -.0430       -.1003    -.0937
IMP                  -.0007       .0004     .0003     .0048        .0162     .0246
EXP                  -.0076       -.0094    -.0136    -.0073       -.0054    .0017
P                    .1890        -.0903    1.2939    -1.1617      -.2164    -.1439
E                    -.2168       .6063     1.0358    -1.0185      2.3267    3.2841
PPP                  3.1610       .1285     1.1919    11.9718      .5134     .9214
CPI                  -.0153       -.0204    .0001     .0189        -.0167    -.0077
RCPI                 -.0526       -.0644    -.0913    -.0829       -.0680    .0482
CEA                  -.1119       -.2333    -.3283    -.1669       -.2870    -.1663
ULC                  .0130        .0142     .0212     .0272        .0256     .0090
Biasing Parameters   k_RR = 0.5                       k_RR = 11
TRMSE(0.9)           0.6935       0.6640    0.6575    0.3240       0.5059    0.5984
Since there are both multicollinearity and outliers in the data set, it is expected that
robust biased regression methods give more accurate results than their classical versions.
As seen from Table 2, the smallest TRMSE values are obtained using the robust RRR and
RPCR methods, respectively. According to the TRMSE values, the robust methods RRR, RPCR
and RSIMPLS are clearly better than their classical counterparts RR, PCR and PLSR.
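To make the idea behind a trimmed criterion such as TRMSE(0.9) concrete, the sketch below computes a trimmed RMSE that keeps only the smallest 90% of the squared prediction errors. This trimming rule is an assumption made for illustration, since the exact definition used for TRMSE(0.9) is not reproduced in this part of the text; the function name `trimmed_rmse` and the data are hypothetical.

```python
import math


def trimmed_rmse(y, y_hat, keep=0.9):
    """Root mean of the smallest `keep` fraction of squared errors.

    Assumed definition: sort the squared errors, keep the h smallest
    with h = floor(keep * n), and take the root of their mean.
    """
    squared = sorted((a - b) ** 2 for a, b in zip(y, y_hat))
    h = max(1, math.floor(keep * len(squared) + 1e-9))
    return math.sqrt(sum(squared[:h]) / h)


# Hypothetical data: nine small errors of 0.1 and one gross error of 5.
y = [float(v) for v in range(1, 11)]
y_hat = [v + 0.1 for v in y[:-1]] + [y[-1] + 5.0]

plain = trimmed_rmse(y, y_hat, keep=1.0)   # ordinary RMSE
trimmed = trimmed_rmse(y, y_hat, keep=0.9)
print(round(plain, 3), round(trimmed, 3))
```

The single gross error dominates the ordinary RMSE but is discarded by the trimmed version, which is why a trimmed criterion is better suited to comparing methods on data that contain outliers.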
As mentioned in Maronna (2011), this version of robust ridge regression, which we
briefly call the RRR estimator here, is also robust when p/n is large, and is resistant to both
bad leverage points and vertical outliers. Maronna (2011) reports a simulation study for the
case n>p in which the proposed RRR estimator is seen to outperform its competitors
(other robust ridge estimators previously proposed in the literature). Moreover, he
describes a simulation for p>n in which the RRR estimator is seen to outperform the
RSIMPLS method. In many applications in the literature, as in this study, the PCR
and PLSR methods (and their robust counterparts) give close results. This comparative
study and Maronna's study show that RRR is a good alternative to the most popular robust
biased methods, namely RPCR and RSIMPLS. All three robust methods,
RRR, RPCR and RSIMPLS, can be used in both the n>p and p>n situations and resist different
types of outliers (bad leverage points and vertical outliers). Hence, in general, we can
say that RRR is a good alternative, but we cannot claim that it always gives better
results than RPCR and RSIMPLS, since the situation may differ across real data
applications. However, by applying the methods to a real data set, this study shows that RPCR and
RSIMPLS are not the only methods to be preferred in the presence of multicollinearity and
outliers. Thus, especially in fields such as chemometrics, where such data
sets (with multicollinearity and outliers) usually arise with p>>n,
RRR can be used as a good alternative to the RPCR and RSIMPLS methods, and the best
model for the purpose (fitting the data set or prediction) can be selected.
5. Conclusion
In this study, the aim is to choose the best model for determining the unemployment
rate in Turkey during the 1985-2012 period by comparing the classical biased RR, PCR and
PLSR methods with the robust biased RRR, RPCR and RSIMPLS methods in the presence of both
multicollinearity and outliers. For the unemployment data set, the RRR model is
chosen as the best model according to the TRMSE(0.9) criterion. In addition, the robust methods
RRR, RPCR and RSIMPLS outperform the classical RR, PCR and PLSR methods in terms
of predictive ability. The results obtained from the RRR robust biased regression method
show that the most important independent variable affecting the unemployment rate is
Purchasing Power Parities (PPP). The least important variables affecting the
unemployment rate are Import Growth Rate (IMP) and Export Growth Rate (EXP),
respectively. Hence, an increase in PPP causes a substantial increase in the
unemployment rate, whereas an increase in IMP causes only a negligible increase and
an increase in EXP causes only a negligible decrease in the
unemployment rate.
References
[1] Aktar, I. and Ozturk, L. (2009). Can Unemployment be Cured by Economic Growth
and Foreign Direct Investment in Turkey. International Research Journal of
Finance and Economics 27, 203-211.
[2] Aktas, C. (2007). Çoklu Bağıntı ve Liu Kestiricisiyle Enflasyon Modeli için Bir
Uygulama. ZKÜ Sosyal Bilimler Dergisi, Cilt 3, Sayı 6, 67-79.
[3] Aqil, M., Qureshi, M. A., Ahmed, R. R. and Qadeer, S. (2014). Determinants of
Unemployment in Pakistan. International Journal of Physical and Social Sciences
4:4.
[4] Berument, M. H., Dogan, N. and Tansel, A. (2006). Economic Performance and
Unemployment: Evidence from an Emerging Economy. International Journal of
Manpower 27(7), 604-623.
[5] Berument, M. H., Dogan, N. and Tansel, A. (2009). Macroeconomic Policy and
Unemployment by Economic Activity: Evidence from Turkey. Emerging Markets
Finance and Trade 45(3), 21-34.
[6] Bilgin, H. (2004). Döviz Kuru ve İşsizlik İlişkisi: Türkiye Üzerine Bir İnceleme.
Kocaeli Üniversitesi Sosyal Bilimler Enstitüsü Dergisi 8(2), 1-15.
[7] Cascio, I. L. (2001). Do Labour Markets Really Matter? Monetary Shocks and
Asymmetric Effects across Europe. Unpublished, Department of Economics,
University of Essex, Colchester.
[8] De Jong, S. (1993). SIMPLS: an alternative approach to partial least squares
regression. Chemometrics Intell. Lab. Syst. 18, 251-263.
[9] Djivre, J. and Ribon, S. (2003). Inflation, Unemployment, the Exchange Rate, and
Monetary Policy in Israel, 1990-1999: A SVAR Approach. Israel Economic Review
2, 1-29.
[10] Dogan, T. T. (2012). Macroeconomic Variables and Unemployment: The Case of
Turkey. International Journal of Economics and Financial Issues 2(1), 71-78.
[11] Dogrul, H. G. and Soytas, U. (2010). Relationship between Oil Prices, Interest Rate,
and Unemployment: Evidence from an Emerging Market. Energy Economics 32,
1523-1528.
[12] Engelen, S., Hubert, M., Vanden Branden, K. and Verboven, S. (2004). Robust PCR
and Robust PLSR: a comparative study. Theory and Applications of Recent Robust
Methods, Statistics for Industry and Technology, 105-117.
[13] Goktas, A. and Isci, O. (2010). Turkiye'de İşsizlik Oraninin Temel Bilesenli
Regresyon Analizi ile Belirlenmesi. Sosyal ve Ekonomik Arastirmalar Dergisi, Selcuk Universitesi 14:20, 279-294.
[14] Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for
Nonorthogonal Problems. Technometrics 12:1, 55-67.
[15] Hubert, M. and Verboven, S. (2003). A robust PCR method for high-dimensional
regressors. Journal of Chemometrics 17, 438-452.
[16] Hubert, M. and Vanden Branden, K. (2003). Robust methods for Partial Least
Squares Regression. Journal of Chemometrics 17, 537-549.
[17] Karanassou, M. and Sala, H. (2010). Labour Market Dynamics in Australia: What
Drives Unemployment? The Economic Record, The Economic Society of Australia,
vol. 86 (273), 185-209.
[29] Valletta, R. and Kuang, K. (2010). Is the Structural Unemployment on the Rise? The
Economic Report 86(273), 185-209.
[30] Verboven, S. and Hubert, M. (2005). LIBRA: a MATLAB Library for Robust
Analysis. Chemometrics and Intelligent Laboratory Systems 75, 127-136.
[31] Walker, E., and Birch, J. B. (1988). Influence Measures in Ridge Regression.
Technometrics 30, 221-227.
[32] Yılmaz, Ö. (2005). Türkiye Ekonomisinde Büyüme ile İşsizlik Oranları Arasındaki
Nedensellik İlişkisi. İstanbul Üniversitesi İktisat Fakültesi Ekonometri ve İstatistik
Dergisi S.2: 11-29.
[33] Data sources: http://www.mod.gov.tr/en/SitePages/mod_easi.aspx and
http://www.yaklasim.com/BookList.aspx?AnnouncementId=2780&AnnouncementCategoryId=8&_
Esra Polat
Department of Statistics
Hacettepe University, Faculty of Science
Beytepe, 06800, Ankara, Turkey.
espolat@hacettepe.edu.tr
Semra Turkan
Department of Statistics
Hacettepe University, Faculty of Science
Beytepe, 06800, Ankara, Turkey.
sturkan@hacettepe.edu.tr