Journal of Data Science 13(2014), 739-768
The Comparison of Classical and Robust Biased Regression Methods for Determining Unemployment Rate in Turkey: Period of 1985-2012
Esra Polat1, Semra Turkan2
1,2
Department of Statistics, Hacettepe University
Abstract. Unemployment is one of the most important issues in macroeconomics. It creates many economic and social problems. The condition and qualification of the labor force in a country reflect its economic development. In light of these facts, a developing country should overcome the problem of unemployment. In this study, the robust biased methods Robust Ridge Regression (RRR), Robust Principal Component Regression (RPCR) and RSIMPLS are compared with each other and with their classical versions, Ridge Regression (RR), Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR), in terms of predictive ability. The comparison uses the trimmed Root Mean Squared Error (TRMSE) statistic on an unemployment data set of Turkey in which both multicollinearity and outliers are present. The analysis results show that the RRR model is the best model for determining the unemployment rate in Turkey for the period 1985-2012. The robust biased RRR method shows that the most important independent variable affecting the unemployment rate is Purchasing Power Parities (PPP). The least important variables affecting the unemployment rate are Import Growth Rate (IMP) and Export Growth Rate (EXP). Hence, an increase in PPP causes an important increase in the unemployment rate, whereas an increase in IMP causes an unimportant increase and an increase in EXP causes an unimportant decrease in the unemployment rate.
Keywords: biased estimation, multicollinearity, outliers, partial least squares
regression, principal component regression, ridge regression, robust biased
estimation.
1. Introduction
In today's world, a country whose unemployment results from economic and social factors faces multidimensional problems. The condition and qualification of the labor force in a country reflect its economic development. In light of these facts, a developing country should overcome the problem of unemployment. According to the Turkish Statistical Institute, the labor force consists of the active non-institutional population between the ages of 15 and 60. Unemployment refers to jobless people who are looking for a job at the current wage level (Goktas and Isci, 2010).
In the literature, various factors are reported to affect unemployment. Cascio (2001) investigates the relationship between monetary policy and unemployment for 11 OECD countries over 1979:Q1-1998:Q4 by using a Vector Autoregressive (VAR) model. According to Cascio (2001), monetary shocks influence unemployment, but their effects differ from country to country; that is, local factors are important in how a labor market is influenced. Djivre and Ribon (2003) study the influence of monetary policy on unemployment, inflation and the exchange rate for Israel over 1990-1999 and find that tight monetary policy shocks increase unemployment. Ravn and Simonelli (2007) find that monetary policy shocks exert influence on unemployment in the United States over 1953:Q3-2003:Q1. Karanasou and Sala (2010) investigate the driving forces behind unemployment in Australia over time and find that they differ according to the period investigated. For example, in the 1970s the driving force behind unemployment is the oil shock, while in the 1990s and 2000s the interest rate is an important driving force. Currently, however, the most influential factor is tight foreign demand due to the global crisis. Further, another study on unemployment by Valletta and Kuang (2010) shows that the recent increase in unemployment in the United States is conjunctural rather than structural. In general, conjunctural fluctuations, such as fluctuations in the exchange rate and the international interest rate and a decline in foreign demand, are the shocks that exert influence on unemployment (Dogan, 2012).
There are not many macro empirical studies on unemployment in Turkey. Further, in studies regarding Turkey's unemployment, structural breaks have not been considered, so they have not been introduced in the VAR specification. For Turkey, Berument et al. (2006, 2009) and Berument (2008) investigate the effects of macroeconomic policy shocks on unemployment by using VAR models. The general conclusion derived from those studies is that positive income shocks reduce unemployment. Aktar and Ozturk (2009) study the interaction among macroeconomic variables for Turkey and find that positive income shocks have a statistically significant negative effect on unemployment. They also find that export does not have a statistically significant influence on unemployment. Dogrul and Soytas (2010) investigate the relationships among unemployment, oil price and interest rate and find that interest rate shocks leave a long-term impact on unemployment even though the initial
impact on unemployment is negative and insignificant. Goktas and Isci (2010) aimed to
remove the collinearity among the factors that affect the rate of unemployment and obtained new variables from these factors by using principal components. These new variables are used as regressors in constructing the unemployment regression model. After checking the assumptions of statistical inference, they forecasted unemployment in Turkey. Dogan (2012) investigates the response of unemployment to selected macroeconomic shocks for the period 2000:Q1-2010:Q1 and finds that positive shocks to growth, growth in export and inflation reduce unemployment. On the other hand, shocks to the exchange rate, interbank interest rate and money supply increase unemployment. The results are consistent with what the Phillips curve and Okun's Law suggest; namely, a negative relationship between output and unemployment and a positive relationship between unemployment and inflation are found. Umit and Bulut (2013)
examine the factors affecting unemployment in Turkey after the 2008 Global Financial Crisis with Partial Least Squares (PLS) analysis. In their study, the macroeconomic variables industrial productivity index, real wage index, growth rate, consumer price index, the ratio of import to Gross National Product and the ratio of export to Gross National Product are used in modelling the response variable, the rate of unemployment. Explanatory variables are taken at time t and at eight lags for 2005:Q1-2010:Q3. The results of the analysis show that the same results are obtained for the industrial productivity index and the real wage index. The ratio of import to Gross National Product makes a great contribution to modelling the unemployment rate for all terms except the first and second lags. The results also show that the consumer price index makes a significant contribution in all terms.
In several linear regression and prediction problems, the independent variables may be many and highly collinear. This phenomenon is called multicollinearity, and it is known that in this case the Ordinary Least Squares (OLS) estimator of the regression coefficients, or a predictor based on these estimates, may give very poor results. Therefore, several biased estimation methods, such as Ridge Regression (RR), Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR), have been developed to overcome the multicollinearity problem. The main goal of biased methods is to decrease the mean squared error of prediction by introducing a reasonable amount of bias into the model. In most real systems, exact collinearity of the variables in X is rather unusual because of the presence of random experimental noise. Nevertheless, in systems producing nearly collinear data, the solution for the regression coefficients is highly unstable, such that very small perturbations in the original data (for example, because of noise or experimental error) cause the method to produce wildly different results. In addition, the use of highly collinear variables in Multiple Linear Regression (MLR) increases the possibility of overfitting the model. Overfitting means that the model may fit the data very well but fails when used to predict the properties of new samples (Martens and Naes, 1989; Naes et al., 2002; Polat and Gunay, 2015). In addition to multicollinearity, outliers are another problem encountered in regression models. If an observation does not follow the model that fits the majority of the data points, it is an outlier. Its occurrence can be due, for example, to a measurement error, to a change in the experimental conditions or to the fact that the sample belongs to a population other than the one under study. It is also well known that the MLR method is highly sensitive to outliers. Both outliers in the space of the response variables and those in the space of the explanatory variables can unduly influence the parameter estimates (Hubert and Verboven, 2003). To solve this problem, robust versions of regression models have been proposed. When both multicollinearity and outliers are present, the robust versions of the biased RR, PCR and PLSR methods, called Robust Ridge Regression (RRR), Robust Principal Component Regression (RPCR) and RSIMPLS, are used.
In this study, because multicollinearity and outliers are simultaneously present in the data set, the performance of the classical biased RR, PCR and PLSR methods and of their robust versions RRR, RPCR and RSIMPLS is compared for determining the unemployment rate in Turkey. Consequently, the method giving the best model is selected from among these six methods. The rest of the paper is organized as follows. The classical biased RR, PCR and PLSR methods are reviewed in Section 2. In Section 3, the robust versions of these three methods, RRR, RPCR and RSIMPLS, are described. In Section 4, a real unemployment data set of Turkey for the period 1985-2012, in which both multicollinearity and outliers exist, is analyzed by using these six methods, and the performance of these methods is compared by using the trimmed Root Mean Squared Error (TRMSE) statistic.
2. Classical biased estimation methods
There are several classical biased estimation methods in the literature. In this study, the three most commonly used of them, RR, PCR and PLSR, are used for analysis. In this section, MLR is first briefly outlined in order to clarify the reason for using biased estimation methods in the case of multicollinearity. Then, the most popular biased estimation methods RR, PCR and PLSR are presented. The emphasis here is on the algebraic derivation of the vector (or matrix) of coefficients in the linear regression models for these four methods. Throughout this paper, matrices are denoted by bold capital letters and vectors are denoted by bold lowercase letters.
2.1. Multiple Linear Regression
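Firstly, the regression model used for this method is defined by Equation (1). Traditionally, the most frequently used method for finding $\boldsymbol{\beta}$ is OLS. If $\mathbf{X}$ has full rank p (the number of independent variables), then the OLS estimate of $\boldsymbol{\beta}$ is given by Equation (2). $\hat{\boldsymbol{\beta}}_{OLS}$ gives unbiased estimates of the elements of $\boldsymbol{\beta}$. The corresponding vector of fitted values is obtained as in Equation (3) (Phatak and De Jong, 1997).

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{1}$$

$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \tag{2}$$

$$\hat{\mathbf{y}}_{OLS} = \mathbf{X}\hat{\boldsymbol{\beta}}_{OLS} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} = \mathbf{P}_{X}\mathbf{y} \tag{3}$$

Estimating $\boldsymbol{\beta}$ by the OLS method requires that the X variables be linearly independent and that the number of independent variables, p, be smaller than or equal to the number of observations, n ($p \le n$) (Trygg, 2002).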
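Equations (2) and (3) can be illustrated with a minimal Python sketch. The function name `ols_fit` and the small data set are hypothetical and chosen only for illustration; the code is not part of the analysis in this paper. The rank check mirrors the requirement that X has full column rank with $p \le n$.

```python
import numpy as np

def ols_fit(X, y):
    """Return the OLS coefficients (X'X)^{-1} X'y and the fitted values X b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # OLS needs linearly independent columns and p <= n.
    if p > n or np.linalg.matrix_rank(X) < p:
        raise ValueError("X must have full column rank p with p <= n")
    beta = np.linalg.solve(X.T @ X, X.T @ y)   # Equation (2)
    fitted = X @ beta                          # Equation (3)
    return beta, fitted

# Made-up data for this sketch: y = 2*x1 - x2 exactly, so OLS recovers (2, -1).
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_demo = X_demo @ np.array([2.0, -1.0])
beta_demo, fitted_demo = ols_fit(X_demo, y_demo)
```

When two columns of X are exactly collinear the function raises an error, which is the situation that motivates the biased methods discussed next.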
If there is a multicollinearity problem, the variance of the least squares (LS) estimator may be very large and subsequent predictions rather inaccurate. However, if the insistence on unbiased estimators is given up, biased methods can be used to overcome the problem of inaccurate predictions. For this reason, biased methods such as RR, PCR and PLSR are used, with the consequent trade-off between increased bias and decreased variance. The idea behind the PCR and PLSR methods is to discard the irrelevant and unstable information and to use only the most relevant part of the x-variation for regression. Hence, the collinearity problem is solved, and more stable regression equations and predictions are obtained (Naes et al., 2002; Polat and Gunay, 2015).
2.2. Ridge Regression

The OLS estimator is unbiased and has minimum variance among linear unbiased estimators. However, when multicollinearity exists, the matrix $\mathbf{X}^{T}\mathbf{X}$ becomes ill-conditioned (nearly singular). Since $\mathrm{Var}(\hat{\boldsymbol{\beta}}_{OLS}) = \sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}$ and the diagonal elements of $(\mathbf{X}^{T}\mathbf{X})^{-1}$ become quite large, the variance of $\hat{\boldsymbol{\beta}}_{OLS}$ becomes large. This leads to an unstable estimate of $\boldsymbol{\beta}$, and some of the coefficients may have the wrong sign. In order to prevent these difficulties of OLS, Hoerl and Kennard (1970) suggested RR as an alternative procedure to the OLS method in regression analysis, especially when multicollinearity exists. In practical terms, RR consists of adding a biasing constant k to the diagonal elements of the $\mathbf{X}^{T}\mathbf{X}$ matrix, where $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_p)$ is a centered and standardized matrix. The resulting estimators are more stable than the least squares (LS) estimators. The RR estimator of $\boldsymbol{\beta}$ can be expressed as in Equation (4) (Hoerl and Kennard, 1970; Myers, 1990; Salh, 2014; Walker and Birch, 1988).

$$\hat{\boldsymbol{\beta}}_{RR} = (\mathbf{X}^{T}\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{y} \tag{4}$$

Here, k is the ridge parameter, $\mathbf{I}$ is the $p \times p$ identity matrix and $\mathbf{X}^{T}\mathbf{X}$ is the correlation matrix of the independent variables. Values of k lie in the range 0-1. Note that if k=0, the ridge estimator ($\hat{\boldsymbol{\beta}}_{RR}$) becomes the OLS estimator ($\hat{\boldsymbol{\beta}}_{OLS}$) (Myers, 1990; Salh, 2014).
The key step in RR is to determine the optimum value of k for developing a predictive model. There are many procedures in the literature for determining the best value. Hoerl, Kennard and Baldwin (1975) suggested using the criterion given in Equation (5) in order to choose the optimum k value (Rawlings, 1988).

$$k = \frac{p\,s^{2}}{\hat{\boldsymbol{\beta}}_{OLS0}^{T}\hat{\boldsymbol{\beta}}_{OLS0}} \tag{5}$$

Here, p is the number of regression coefficients excluding the constant term ($\beta_0$) and $s^{2}$ is the estimated residual mean square of the OLS method. The denominator of Equation (5) is the sum of squares of the classical OLS regression parameters $\hat{\boldsymbol{\beta}}_{OLS0}$, which are calculated from the centered and scaled independent variables, with the constant term excluded from the calculation. The simplest way to determine the optimum value of k is to plot the values of each $\hat{\beta}_{RR}$ versus k (in the range 0-1), which is called the ridge trace (Myers, 1990). In the ridge trace graph, a trace or curve is formed for each of the coefficients. From the ridge trace, the minimum k value that makes the $\hat{\boldsymbol{\beta}}_{RR}$ stable can be chosen, and at this k value the residual sum of squares should remain close to its minimum value (Hoerl and Kennard, 1970). However, Van Nostrand (1980) stated that, since the determination of stability in the ridge trace is subjective and hence the selection of k is arbitrary, there is a tendency to choose a very large value of k when choosing k based on the ridge trace. Hence, the selection of k based on Equation (5) could be better (Rawlings, 1988).
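The ridge estimator of Equation (4) and the choice of k in Equation (5) can be sketched in Python as follows. The helper names (`standardize`, `ridge_fit`, `hkb_k`), the simulated collinear data and the use of $n-p-1$ degrees of freedom for $s^2$ are assumptions made for this illustration only, not the computations reported in this paper.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit length, so Z'Z is the correlation matrix."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc ** 2).sum(axis=0))

def ridge_fit(Z, y, k):
    """Ridge estimator (Z'Z + kI)^{-1} Z'y of Equation (4)."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ y)

def hkb_k(Z, y):
    """Hoerl-Kennard-Baldwin choice k = p s^2 / (b'b) of Equation (5)."""
    n, p = Z.shape
    yc = y - y.mean()                      # constant term handled by centering
    b_ols = ridge_fit(Z, yc, 0.0)          # k = 0 gives the OLS coefficients
    resid = yc - Z @ b_ols
    s2 = resid @ resid / (n - p - 1)       # residual mean square (assumed d.f.)
    return p * s2 / (b_ols @ b_ols)

# Simulated, nearly collinear regressors (illustrative only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
x2 = x1 + 0.1 * rng.normal(size=30)
X_demo = np.column_stack([x1, x2])
y_demo = x1 + x2 + rng.normal(size=30)

Z_demo = standardize(X_demo)
k_demo = hkb_k(Z_demo, y_demo)
b_ols_demo = ridge_fit(Z_demo, y_demo - y_demo.mean(), 0.0)
b_ridge_demo = ridge_fit(Z_demo, y_demo - y_demo.mean(), k_demo)
```

For any k > 0 the ridge coefficient vector is shorter than the OLS one, which is the shrinkage that stabilises the estimates under multicollinearity.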
2.3. Principal Component Regression
PCR and PLSR methods assume that the p-dimensional independent x-variables and the q-dimensional dependent y-variables are related through a bilinear model. With n the number of observations, for i=1,...,n this bilinear model is given by Equation (6) and Equation (7). Here $\bar{\mathbf{x}}$ is the mean of the x-variables, $\bar{\mathbf{y}}$ the mean of the y-variables, $\tilde{\mathbf{t}}_i$ are the k-dimensional scores with k<<p, $\mathbf{P}_{p,k}$ is the matrix of x-loadings and $\mathbf{A}_{k,q}$ represents the slope matrix in the regression of $\mathbf{y}_i$ on $\tilde{\mathbf{t}}_i$ (a transposed matrix is written with the subscripts of its transposed dimensions, so that $\mathbf{A}^{T}_{q,k}$ is the transpose of $\mathbf{A}_{k,q}$). The error terms are denoted by $\mathbf{f}_i$ and $\mathbf{g}_i$. In terms of the original independent variables, this bilinear model can be written as in Equation (8). The main difference between PCR and PLSR lies in the construction of the scores $\tilde{\mathbf{t}}_i$. In PCR the scores are obtained by extracting the most relevant information present in the x-variables by performing a PCA on the independent variables, thus using a variance criterion. In contrast, the PLSR scores are calculated by maximizing a covariance criterion between the x- and y-variables (Hubert and Verboven, 2003; Hubert and Vanden Branden, 2003).

$$\mathbf{x}_i = \bar{\mathbf{x}} + \mathbf{P}_{p,k}\tilde{\mathbf{t}}_i + \mathbf{g}_i \tag{6}$$

$$\mathbf{y}_i = \bar{\mathbf{y}} + \mathbf{A}^{T}_{q,k}\tilde{\mathbf{t}}_i + \mathbf{f}_i \tag{7}$$

$$\mathbf{y}_i = \boldsymbol{\beta}_0 + \mathbf{B}^{T}_{q,p}\mathbf{x}_i + \mathbf{e}_i \tag{8}$$

PCR starts by centering the data through the mean $\bar{\mathbf{x}}$ of the x-variables and the mean $\bar{\mathbf{y}}$ of the y-variables. Let the centered observations be denoted by (Hubert and Verboven, 2003)

$$\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}} \tag{9}$$

$$\tilde{\mathbf{y}}_i = \mathbf{y}_i - \bar{\mathbf{y}} \tag{10}$$
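Firstly, the first k principal components of $\tilde{\mathbf{X}}_{n,p}$ are computed to handle multicollinearity. The loading vectors $\mathbf{P}_{p,k} = (\mathbf{p}_1, \ldots, \mathbf{p}_k)$ are the k eigenvectors that correspond to the k largest eigenvalues of the empirical covariance matrix $\mathbf{S}_x = \frac{1}{n-1}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}$ (Hubert and Verboven, 2003). Next, the k-dimensional scores of each data point, $\tilde{\mathbf{t}}_i$, are computed as the coordinates of the projections of $\tilde{\mathbf{x}}_i$ onto this subspace, or equivalently as in Equation (11).

$$\tilde{\mathbf{t}}_i = \mathbf{P}^{T}_{k,p}\tilde{\mathbf{x}}_i \tag{11}$$

In the final step, $\tilde{\mathbf{y}}_i$ is regressed onto $\tilde{\mathbf{t}}_i$ via multiple linear regression. The fitted linear model is given below.

$$\tilde{\mathbf{y}}_i = \mathbf{A}^{T}_{q,k}\tilde{\mathbf{t}}_i + \tilde{\boldsymbol{\varepsilon}}_i \tag{12}$$

The parameter estimates and fitted values can be expressed as in Equation (13) and Equation (14), respectively, where $\mathbf{T}_{n,k}$ is the matrix whose rows are the scores and $\tilde{\mathbf{Y}}_{n,q}$ the matrix whose rows are the centered responses.

$$\hat{\mathbf{A}}_{k,q} = (\mathbf{T}^{T}\mathbf{T})^{-1}\mathbf{T}^{T}\tilde{\mathbf{Y}} \tag{13}$$

$$\hat{\mathbf{y}}_i = \hat{\mathbf{A}}^{T}_{q,k}\tilde{\mathbf{t}}_i + \bar{\mathbf{y}} \tag{14}$$

The unknown regression parameter in Equation (1) without a constant term is estimated as in Equation (15) (Hubert and Verboven, 2003).

$$\hat{\boldsymbol{\beta}}_{PCR} = \mathbf{P}_{p,k}\hat{\mathbf{A}}_{k,q} \tag{15}$$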
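The steps in Equations (9)-(15) can be put together in a short Python sketch of classical PCR. The function name `pcr_fit` and the simulated data are hypothetical; the intercept returned here is simply the one implied by centering, which is an assumption of this sketch.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Classical PCR: PCA on centered X, then least squares of centered y on k scores."""
    x_bar, y_bar = X.mean(axis=0), y.mean(axis=0)
    Xc, yc = X - x_bar, y - y_bar                  # Equations (9) and (10)
    S_x = Xc.T @ Xc / (X.shape[0] - 1)             # empirical covariance matrix
    eigval, eigvec = np.linalg.eigh(S_x)           # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:k]           # indices of the k largest
    P = eigvec[:, order]                           # p x k loading matrix
    T = Xc @ P                                     # n x k scores, Equation (11)
    A = np.linalg.solve(T.T @ T, T.T @ yc)         # Equation (13)
    beta = P @ A                                   # Equation (15)
    intercept = y_bar - x_bar @ beta               # implied by centering
    return beta, intercept

# Simulated data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = 1.0 + X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

beta_full, b0_full = pcr_fit(X_demo, y_demo, k=3)   # all components kept
beta_one, b0_one = pcr_fit(X_demo, y_demo, k=1)     # one component kept
```

With k = p no information is discarded and PCR coincides with OLS; the bias-variance trade-off only appears when k < p.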
2.4. Partial Least Squares Regression

PLSR is a linear regression technique proposed to cope with high-dimensional regressors and one or several response variables. There are several ways to calculate PLSR model parameters. One of the significant algorithms for PLSR, Straightforward Implementation of a Statistically Inspired Modification of the Partial Least Squares Method (SIMPLS), was proposed by De Jong (1993). Because of its speed and efficiency, SIMPLS is the leading PLSR algorithm, and therefore PLSR with the SIMPLS algorithm is used in this study (De Jong, 1993; Hubert and Vanden Branden, 2003). The SIMPLS algorithm assumes that the x- and y-variables are related through

$$\mathbf{x}_i = \bar{\mathbf{x}} + \mathbf{P}\tilde{\mathbf{t}}_i + \mathbf{g}_i \tag{16}$$

$$\mathbf{y}_i = \bar{\mathbf{y}} + \mathbf{A}^{T}\tilde{\mathbf{t}}_i + \mathbf{f}_i \tag{17}$$

where $\tilde{\mathbf{t}}_i$ are called scores, $\mathbf{P}$ is the matrix of x-loadings, $\mathbf{A}$ is the slope matrix in the regression of $\mathbf{y}_i$ on $\tilde{\mathbf{t}}_i$, and $\mathbf{g}_i$ and $\mathbf{f}_i$ are residuals (Hubert and Vanden Branden, 2003).
3. Robust versions of biased estimation methods
Besides multicollinearity, OLS is considerably affected by a single observation or a few observations. Namely, not all data points in a data set have the same significance in determining estimates, tests and other statistics. It is important that the data analyst be aware of such points, known as outliers. To overcome the effects of outliers, robust regression models have been proposed. For the case in which both multicollinearity and outliers are present, robust versions of biased estimation methods have been proposed. The robust versions of RR, PCR and PLSR are given as follows.
3.1. Robust Ridge Regression
RR is based on least squares and it is sensitive to atypical observations. Hence, the approach of MM estimation, which is repeated M estimation, is proposed by Maronna (2011) to ensure both robustness and efficiency under the normal model. The vector of residuals in RR is given by Equation (18). A scale M estimator of the data vector $\mathbf{e}^{RR} = (e_1^{RR}, \ldots, e_n^{RR})^{T}$ is defined as the solution $\hat{\sigma} = \hat{\sigma}(\mathbf{e}^{RR})$ of Equation (19) (Maronna, 2011):

$$\mathbf{e}^{RR} = (e_1^{RR}, \ldots, e_n^{RR})^{T} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{RR} \tag{18}$$

$$\frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{e_i^{RR}}{\hat{\sigma}}\right) = \delta \tag{19}$$

where $\rho$ is a bounded $\rho$-function and $\delta \in (0, 0.5)$ determines the breakdown point of $\hat{\sigma}$. Recall that the breakdown point (BDP) of an estimator is the maximum proportion of observations that can be arbitrarily altered with the estimator remaining bounded away from the border of the parameter set (which can be at infinity). When $\rho(t) = t^{2}$ and $\delta = 1$, the classical mean squared error $\hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n}(e_i^{RR})^{2}$ is obtained. Maronna (2011) employed for $\rho$ the bisquare $\rho$-function given in Equation (20) (Maronna, 2011).

$$\rho_{bis}(t) = \min\{1,\ 1 - (1 - t^{2})^{3}\} \tag{20}$$
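As an illustration of Equations (19) and (20), the following Python sketch solves the M-scale equation by bisection, using the fact that the left-hand side of Equation (19) does not increase as $\hat{\sigma}$ grows. The function names, the tuning constant `c`, the value $\delta = 0.4$ and the residual vector are hypothetical choices for this sketch; it is not Maronna's implementation.

```python
import numpy as np

def rho_bisquare(t):
    """Bisquare rho-function of Equation (20): min{1, 1 - (1 - t^2)^3}."""
    t = np.asarray(t, dtype=float)
    return np.minimum(1.0, 1.0 - (1.0 - t ** 2) ** 3)

def m_scale(e, delta, c=1.0, tol=1e-10):
    """Solve mean(rho(e / (c * sigma))) = delta for sigma, Equation (19), by bisection."""
    e = np.asarray(e, dtype=float)
    if np.mean(e != 0) <= delta:
        return 0.0                                   # too many exact zeros: scale is 0
    g = lambda s: rho_bisquare(e / (c * s)).mean() - delta
    lo = hi = np.median(np.abs(e[e != 0]))
    while g(lo) < 0:                                 # shrink until mean rho exceeds delta
        lo /= 2.0
    while g(hi) > 0:                                 # grow until mean rho falls below delta
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Made-up residuals with one gross outlier (illustrative only).
e_demo = np.array([0.3, -1.2, 0.8, -0.5, 2.0, -0.1, 15.0])
sigma_demo = m_scale(e_demo, delta=0.4)
```

Because $\rho$ is bounded, the single value 15.0 contributes at most 1/n to the average in Equation (19), so it cannot inflate the scale without limit.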
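Let $\hat{\boldsymbol{\beta}}_{ini}$ be an initial estimator. Let $\hat{\sigma}_{ini}$ be an M-scale estimator of the residuals $\mathbf{e}^{RR}(\hat{\boldsymbol{\beta}}_{ini})$, defined as the solution of Equation (21). Here, $\hat{\boldsymbol{\beta}}_{ini}$ and $\hat{\sigma}_{ini}$ are initial estimators, $\rho_0$ is a bounded $\rho$-function and $\delta$ is to be chosen. Then the MM estimator for RR (in this study named Robust Ridge Regression (RRR)) is defined as the minimizer of Equation (22), where $\rho$ is another bounded $\rho$-function such that $\rho \le \rho_0$. The factor $\hat{\sigma}_{ini}^{2}$ before the summation is employed to make the estimator coincide with the classical one when $\rho(t) = t^{2}$ (Maronna, 2011).

$$\frac{1}{n}\sum_{i=1}^{n}\rho_0\!\left(\frac{e_i^{RR}(\hat{\boldsymbol{\beta}}_{ini})}{\hat{\sigma}_{ini}}\right) = \delta \tag{21}$$

$$L(\mathbf{X}, \mathbf{y}, \boldsymbol{\beta}) = \hat{\sigma}_{ini}^{2}\sum_{i=1}^{n}\rho\!\left(\frac{e_i^{RR}(\boldsymbol{\beta})}{\hat{\sigma}_{ini}}\right) + k\,\lVert\boldsymbol{\beta}\rVert^{2} \tag{22}$$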
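Therefore, Equation (23) is used, where the constants $c_0 \le c$ are chosen to control both robustness and efficiency (Maronna, 2011).

$$\rho_0(t) = \rho_{bis}\!\left(\frac{t}{c_0}\right), \qquad \rho(t) = \rho_{bis}\!\left(\frac{t}{c}\right) \tag{23}$$

It is known that the classical estimator $\hat{\boldsymbol{\beta}}_{RR}$ satisfies the normal equations. A similar system of equations is satisfied by RRR. Define (Maronna, 2011)

$$\psi(t) = \rho'(t), \qquad W(t) = \frac{\psi(t)}{t} \tag{24}$$

Let

$$t_i = \frac{e_i^{RR}}{\hat{\sigma}_{ini}}, \qquad w_i = \frac{W(t_i)}{2}, \qquad \mathbf{w} = (w_1, \ldots, w_n)^{T}, \qquad \mathbf{W} = \mathrm{diag}(\mathbf{w}) \tag{25}$$

Setting the derivatives of Equation (22) with respect to $\boldsymbol{\beta}$ to zero yields, for RRR, Equation (26) and Equation (27). Since, for the chosen $\rho$, $W(t)$ is a decreasing function of $|t|$, observations with larger residuals will receive lower weights $w_i$ (Maronna, 2011).

$$\mathbf{w}^{T}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{RR}) = 0 \tag{26}$$

$$(\mathbf{X}^{T}\mathbf{W}\mathbf{X} + k\mathbf{I})\hat{\boldsymbol{\beta}}_{RR} = \mathbf{X}^{T}\mathbf{W}\mathbf{y} \tag{27}$$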
As is usual in robust statistics, these weighted normal equations suggest an iterative procedure. Starting with an initial $\hat{\boldsymbol{\beta}}_{RR}$, compute the residual vector $\mathbf{e}^{RR}$ and the weights $\mathbf{w}$, update the estimate from Equation (26) and Equation (27), and repeat until convergence. In order to choose k, K-fold cross-validation, CV(K), can be used, which requires recomputing the estimate K times. Let $\hat{y}_{-i}$, the fit of $y_i$ computed without using the i-th observation, be expressed as in Equation (28), where $\hat{\boldsymbol{\beta}}^{RRR}_{-i}$ is the RRR estimate computed without observation i. Then a first-order Taylor approximation of the estimator yields the approximate prediction errors shown in Equation (29), where $h_i = \mathbf{x}_i^{T}\left(\sum_{j=1}^{n}\psi'(t_j)\mathbf{x}_j\mathbf{x}_j^{T} + 2k\mathbf{I}\right)^{-1}\mathbf{x}_i$. See Maronna (2011) for details on choosing k and the constants (c or $c_0$).

$$\hat{y}_{-i} = \mathbf{x}_i^{T}\hat{\boldsymbol{\beta}}^{RRR}_{-i} \tag{28}$$

$$e^{RRR}_{-i} = y_i - \hat{y}_{-i} \approx e_i^{RRR}\left(1 + \frac{W(t_i)\,h_i}{1 - h_i\,\psi'(t_i)}\right) \tag{29}$$
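The iterative procedure suggested by Equation (27) can be sketched in Python as an iteratively reweighted ridge fit. Several simplifications are assumed here and are not Maronna's procedure: the data are taken as already centered so that the intercept equation (26) is omitted, the initial estimate and initial scale are supplied by the caller, and in the example they are obtained from a crude median of slopes and a normalized MAD. All names, the constant `c = 4.0` and the data are hypothetical.

```python
import numpy as np

def bisquare_weights(t, c):
    """w = W(t)/2 with W(t) = psi(t)/t for rho(t) = rho_bis(t/c), Equations (24)-(25)."""
    u = np.asarray(t, dtype=float) / c
    W = np.where(np.abs(u) <= 1.0, (6.0 / c ** 2) * (1.0 - u ** 2) ** 2, 0.0)
    return W / 2.0

def robust_ridge_irls(X, y, k, beta_ini, sigma_ini, c=4.0, max_iter=100, tol=1e-8):
    """Iterate the weighted normal equations (27) with sigma_ini held fixed."""
    p = X.shape[1]
    beta = np.asarray(beta_ini, dtype=float)
    for _ in range(max_iter):
        w = bisquare_weights((y - X @ beta) / sigma_ini, c)
        XtW = X.T * w                                    # X'W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X + k * np.eye(p), XtW @ y)
        if np.linalg.norm(beta_new - beta) <= tol * (1.0 + np.linalg.norm(beta)):
            return beta_new, w
        beta = beta_new
    return beta, w

# Made-up data: y is close to 2x, except the last observation, which is an outlier.
x_demo = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.0])
y_demo[-1] = 100.0
X_demo = x_demo[:, None]
k_demo = 0.01

# Stand-in robust start (assumption of this sketch): median slope and normalized MAD.
beta_start = np.array([np.median(y_demo / x_demo)])
e_start = y_demo - X_demo @ beta_start
sigma_start = 1.4826 * np.median(np.abs(e_start - np.median(e_start)))

beta_rrr, w_rrr = robust_ridge_irls(X_demo, y_demo, k_demo, beta_start, sigma_start)
beta_ridge = np.linalg.solve(X_demo.T @ X_demo + k_demo * np.eye(1), X_demo.T @ y_demo)
```

In this example the outlying observation receives weight zero and the robust slope stays close to 2, while the classical ridge slope is pulled far away. The result depends on a robust starting point; starting the iteration from a non-robust fit can leave the outlier with a positive weight.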
3.2. Robust Principal Component Regression

Outliers, especially bad leverage points and vertical outliers, are known to be very influential for the classical least squares (LS) regression fit, because they cause the slope to be tilted in order to accommodate the outliers. PCR combines PCA on the x-variables with LS regression. However, both stages yield very unreliable results when the data set contains outlying observations. Hence, Hubert and Verboven (2003) suggested a robust version of the PCR method, called RPCR.
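PCR starts by centring the data as $\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$ and $\tilde{\mathbf{y}}_i = \mathbf{y}_i - \bar{\mathbf{y}}$, and then performs a PCA on the x-variables in order to cope with multicollinearity. The PCA loading matrix $\mathbf{P}_{p,k} = (\mathbf{p}_1, \ldots, \mathbf{p}_k)$ then contains the first k dominant eigenvectors of the empirical covariance matrix $\mathbf{S}_x = \frac{1}{n-1}\tilde{\mathbf{X}}^{T}_{p,n}\tilde{\mathbf{X}}_{n,p}$, and the scores satisfy $\tilde{\mathbf{t}}_i = \mathbf{P}^{T}_{k,p}\tilde{\mathbf{x}}_i$. In the second step of PCR, the dependent variables $\tilde{\mathbf{y}}_i$ are regressed onto $\tilde{\mathbf{t}}_i$ as $\tilde{\mathbf{y}}_i = \mathbf{A}^{T}\tilde{\mathbf{t}}_i + \tilde{\boldsymbol{\varepsilon}}_i$ using MLR. Then, the parameter estimates and fitted values are obtained as $\hat{\mathbf{A}}_{k,q} = (\mathbf{T}^{T}\mathbf{T})^{-1}_{k,k}\mathbf{T}^{T}_{k,n}\tilde{\mathbf{Y}}_{n,q}$ and $\hat{\mathbf{y}}_i = \hat{\mathbf{A}}^{T}_{q,k}\tilde{\mathbf{t}}_i + \bar{\mathbf{y}}$, respectively. The unknown regression parameters in Equation (1) are then estimated as $\hat{\boldsymbol{\beta}}_{PCR} = \mathbf{P}\hat{\mathbf{A}}$ (Hubert and Verboven, 2003).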
In Hubert and Verboven (2003) a robust PCR method is proposed by robustifying both steps of PCR. Firstly, a robust PCA method is applied on the x-variables. Secondly, a robust regression method is applied. In the first step, for low-dimensional data (p<n/2), the highly robust MCD estimator is used as a robust estimator of the covariance matrix of the $\mathbf{x}_i$, and for high-dimensional data the ROBPCA method is used. To define the MCD estimator, we consider subsets of size h out of the whole data set (of size n). The MCD estimator then seeks the h-subset whose classical covariance matrix has minimal determinant. The number h determines the robustness of the estimator and should be at least (n+p+1)/2. The MCD location estimate $\hat{\boldsymbol{\mu}}_{MCD}$ is given by the mean $\bar{\mathbf{x}}_h$ of that subset and the MCD scatter estimator $\hat{\boldsymbol{\Sigma}}_{MCD}$ by its covariance matrix $\hat{\boldsymbol{\Sigma}}_h$, multiplied by a consistency factor. A robust PCA method yields a tolerance ellipse which captures the covariance structure of the majority of the data points. The robust tolerance ellipse is obtained by applying the highly robust MCD estimator of location and scatter to the data, yielding $\hat{\boldsymbol{\mu}}_{MCD}$ and $\hat{\boldsymbol{\Sigma}}_{MCD}$, and by plotting the points $\mathbf{x}$ whose robust distance $D(\mathbf{x}) = D(\mathbf{x}, \hat{\boldsymbol{\mu}}_{MCD}, \hat{\boldsymbol{\Sigma}}_{MCD}) = \sqrt{(\mathbf{x} - \hat{\boldsymbol{\mu}}_{MCD})^{T}\hat{\boldsymbol{\Sigma}}_{MCD}^{-1}(\mathbf{x} - \hat{\boldsymbol{\mu}}_{MCD})}$ is equal to $\sqrt{\chi^{2}_{2,0.975}}$. Based on the raw MCD estimate, a reweighting step can be added which increases the finite sample efficiency considerably. In that case, each data point receives a weight of one if it belongs to the robust tolerance ellipse and a weight of zero otherwise. The reweighted MCD estimator then equals the classical mean and covariance matrix of the data points with weight one. In what follows, no distinction is made between the raw and the reweighted MCD estimator, as it is assumed that the reweighted one is always used. The first k eigenvectors of the MCD estimator, sorted in descending order of the eigenvalues, then yield robust loadings (Engelen et al., 2004; Hubert and Verboven, 2003).
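For a very small data set, the definition of the raw MCD estimator can be illustrated by enumerating every h-subset. The Python sketch below does exactly that; it omits the consistency factor and the reweighting step, and practical implementations use fast approximate algorithms rather than enumeration. The function names and the ten bivariate points are hypothetical, and the cutoff uses the closed form $-2\ln(0.025)$ for $\chi^2_{2,0.975}$, which holds only for two degrees of freedom.

```python
import itertools
import numpy as np

def raw_mcd_exhaustive(X, h):
    """Raw MCD by enumeration: the h-subset whose covariance has minimal determinant."""
    n = X.shape[0]
    best_det, best_idx = np.inf, None
    for idx in itertools.combinations(range(n), h):
        det = np.linalg.det(np.cov(X[list(idx)], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    sub = X[list(best_idx)]
    # No consistency factor and no reweighting in this sketch.
    return sub.mean(axis=0), np.cov(sub, rowvar=False), best_idx

def robust_distances(X, center, scatter):
    """D(x) = sqrt((x - center)' scatter^{-1} (x - center)) for every row of X."""
    diff = X - center
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, np.linalg.inv(scatter), diff))

# Eight points in the unit square plus two far-away points (made up for illustration).
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                   [0.5, 0.5], [0.2, 0.8], [0.8, 0.3], [0.4, 0.1],
                   [10.0, 10.0], [11.0, 9.0]])
n_demo, p_demo = X_demo.shape
h_demo = (n_demo + p_demo + 1) // 2                  # smallest h allowed, here 6

center_demo, scatter_demo, subset_demo = raw_mcd_exhaustive(X_demo, h_demo)
rd_demo = robust_distances(X_demo, center_demo, scatter_demo)
cutoff_demo = np.sqrt(-2.0 * np.log(0.025))          # sqrt(chi^2_{2,0.975}), 2 d.f. only
flagged_demo = rd_demo > cutoff_demo
```

The two distant points are excluded from the selected subset and fall outside the tolerance ellipse, so they would receive weight zero in the reweighting step.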
ROBPCA is a robust PCA method which combines projection pursuit ideas with MCD covariance estimation in lower dimensions. Firstly, the x data are preprocessed by reducing their data space to the affine subspace spanned by the n observations. This can be easily performed using a singular value decomposition of $\mathbf{X}_{n,p}$. In the second step of the ROBPCA algorithm, a measure of outlyingness is computed for each data point. This is obtained by projecting the high-dimensional data points on many univariate directions. On every direction a robust centre and scale of the projected data points are computed, and for every data point its standardized distance to that centre is measured. Finally, for each data point its largest distance over all the directions is considered. The h data points with smallest outlyingness are then retained, and from the covariance matrix of this final h-subset the number of principal components (PCs) to retain, k, is selected. ROBPCA yields more accurate estimates for uncontaminated data sets and more robust estimates for contaminated data sets. Briefly, the ROBPCA method applied to $\mathbf{X}_{n,p}$ yields robust scores that can be derived as $\mathbf{t}_i = \mathbf{P}^{T}_{k,p}(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_x)$. Here, $\mathbf{P}_{p,k}$ is the loading matrix with orthogonal columns and $\hat{\boldsymbol{\mu}}_x$ is a robust centre (Engelen et al., 2004; Hubert and Verboven, 2003).
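The MCD scatter matrix defines a metric in the PCA subspace. Let $\mathbf{L}$ denote the diagonal matrix which contains the eigenvalues $l_j$ of the MCD scatter matrix, sorted from largest to smallest. Then, the score distance of a p-dimensional point $\mathbf{x}$ with respect to $\hat{\boldsymbol{\mu}}_x$, $\mathbf{P}$ and $\mathbf{L}$ is defined as in Equation (30) (Hubert and Verboven, 2003).

$$SD(\mathbf{x}) = SD(\mathbf{x}, \hat{\boldsymbol{\mu}}_x, \mathbf{P}, \mathbf{L}) = D\big(\mathbf{t} = \mathbf{P}^{T}(\mathbf{x} - \hat{\boldsymbol{\mu}}_x), \mathbf{0}, \mathbf{L}\big) = \sqrt{(\mathbf{x} - \hat{\boldsymbol{\mu}}_x)^{T}\mathbf{P}_{p,k}\mathbf{L}^{-1}_{k,k}\mathbf{P}^{T}_{k,p}(\mathbf{x} - \hat{\boldsymbol{\mu}}_x)} = \sqrt{\mathbf{t}^{T}\mathbf{L}^{-1}\mathbf{t}} = \sqrt{\sum_{j=1}^{k}\frac{t_j^{2}}{l_j}} \tag{30}$$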
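Equation (30) amounts to a scaled Euclidean norm of the scores, as the following Python sketch shows. The function name `score_distances` is hypothetical, and the centre, loadings and eigenvalues in the example are simple made-up values rather than robust estimates from MCD or ROBPCA.

```python
import numpy as np

def score_distances(X, center, P, eigvals):
    """Score distance of Equation (30): sqrt(sum_j t_j^2 / l_j) with t = P'(x - center)."""
    T = (np.asarray(X, dtype=float) - center) @ P      # one row of scores per observation
    return np.sqrt((T ** 2 / eigvals).sum(axis=1))

# Made-up inputs: centre at the origin, identity loadings, eigenvalues 4 and 1.
X_demo = np.array([[2.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
sd_demo = score_distances(X_demo, np.zeros(2), np.eye(2), np.array([4.0, 1.0]))
```

Here the first two points lie one standard deviation from the centre along a single component, and the third lies one standard deviation along both.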
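In the second stage of the RPCR method, $\mathbf{y}_i$ is regressed on $\mathbf{t}_i$: if there is only one y-variable, the reweighted least trimmed squares (LTS) regression is preferred; otherwise, the MCD regression is performed. Here, the regression model with intercept is written as in Equation (31) with $\mathrm{Cov}(\boldsymbol{\varepsilon}) = \boldsymbol{\Sigma}_{\varepsilon}$. In the case of one dependent variable (q=1), this model simplifies to Equation (32), with $\sigma$ the scale of the errors. The parameters in Equation (32) can be estimated by using the LTS estimator. The raw LTS estimator minimizes the sum of the h smallest squared residuals, as shown in Equation (33). Here, $r^{2}_{1:n} \le r^{2}_{2:n} \le \cdots \le r^{2}_{n:n}$ denote the ordered squared residuals. An initial estimate of the error dispersion is given by Equation (34). Here $c_h$ is a consistency factor for normally distributed errors. The reweighted LTS estimator then corresponds to the LS estimator applied to the observations whose absolute standardized residual is not too large. That is, if $|r_i(\hat{\boldsymbol{\alpha}}, \hat{\alpha}_0)_{LTS}/\hat{\sigma}_0| > 2.5$, $w_i = 0$ is set, and otherwise $w_i = 1$. Then, the final estimates $(\hat{\boldsymbol{\alpha}}, \hat{\alpha}_0)$ are obtained as the vector which minimizes $\sum_{i=1}^{n} w_i\,(y_i - \boldsymbol{\alpha}^{T}\mathbf{t}_i - \alpha_0)^{2}$ (Hubert and Verboven, 2003).

$$\mathbf{y}_i = \boldsymbol{\alpha}_0 + \mathbf{A}^{T}\mathbf{t}_i + \boldsymbol{\varepsilon}_i \tag{31}$$

$$y_i = \alpha_0 + \boldsymbol{\alpha}^{T}\mathbf{t}_i + \varepsilon_i \tag{32}$$

$$(\hat{\boldsymbol{\alpha}}, \hat{\alpha}_0)_{LTS} = \arg\min_{\boldsymbol{\alpha}, \alpha_0}\sum_{i=1}^{h} r^{2}_{i:n}(\boldsymbol{\alpha}, \alpha_0) \tag{33}$$

$$\hat{\sigma}_0 = c_h\sqrt{\frac{1}{h}\sum_{i=1}^{h} r^{2}_{i:n}(\hat{\boldsymbol{\alpha}}, \hat{\alpha}_0)_{LTS}} \tag{34}$$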
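For a handful of observations, the raw LTS estimator of Equation (33) can be computed exactly by fitting least squares on every h-subset, and the reweighting step can then be applied. The Python sketch below illustrates this; it is not the algorithm used in practice, which relies on fast approximate searches. The function names and data are hypothetical, and the consistency factor is the usual truncated-normal expression, stated here as an assumption since the text does not give $c_h$ explicitly.

```python
import itertools
from statistics import NormalDist
import numpy as np

def consistency_factor(n, h):
    """Assumed c_h: 1/sqrt(1 - 2 q phi(q) n/h) with q = Phi^{-1}((h + n)/(2n)), for h < n."""
    q = NormalDist().inv_cdf((h + n) / (2.0 * n))
    return 1.0 / np.sqrt(1.0 - 2.0 * q * NormalDist().pdf(q) * n / h)

def reweighted_lts(T, y, h):
    """Raw LTS by enumerating h-subsets, followed by the 2.5-sigma reweighting step."""
    n = len(y)
    Z = np.column_stack([np.ones(n), T])               # intercept column first
    best_obj, best_coef = np.inf, None
    for idx in itertools.combinations(range(n), h):
        rows = list(idx)
        coef, *_ = np.linalg.lstsq(Z[rows], y[rows], rcond=None)
        obj = np.sort((y - Z @ coef) ** 2)[:h].sum()   # objective of Equation (33)
        if obj < best_obj:
            best_obj, best_coef = obj, coef
    sigma0 = consistency_factor(n, h) * np.sqrt(best_obj / h)   # Equation (34)
    w = (np.abs(y - Z @ best_coef) / sigma0 <= 2.5).astype(float)
    keep = w == 1.0
    final_coef, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
    return final_coef, w, sigma0

# Made-up data: y = 1 + 2t plus small errors, with the last two responses contaminated.
t_demo = np.arange(8, dtype=float)
y_demo = 1.0 + 2.0 * t_demo
y_demo[:6] += np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
y_demo[6:] = [50.0, 60.0]

coef_demo, w_demo, sigma0_demo = reweighted_lts(t_demo, y_demo, h=6)
```

In this example the two contaminated observations receive weight zero, and the final intercept and slope are close to 1 and 2.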
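In the case of q>1, the MCD regression estimator is used. It starts by computing the reweighted MCD estimator on the $(\mathbf{t}_i, \mathbf{y}_i)$ jointly, leading to a (k+q)-dimensional location estimate $\hat{\boldsymbol{\mu}}$ and a scatter estimate $\hat{\boldsymbol{\Sigma}}_{k+q,k+q}$, which can be split as shown in Equation (35). Similarly to the MLR estimates, which are based on the empirical covariance matrix of the joint $(\mathbf{t}_i, \mathbf{y}_i)$ variables, robust parameter estimates are then obtained as shown in Equation (36). The efficiency of this robust regression estimator can also be increased by performing a reweighting step. To apply this reweighting scheme, each data point receives a zero weight if its initial residual distance is unusually large, as shown in Equation (37), with Equation (38) and Equation (39). All other observations have a weight $w_i = 1$ (Hubert and Verboven, 2003). The reweighted MCD regression parameters then correspond to the MLR estimates based on those observations with weight one. The final residual distances are obtained by filling in the reweighted estimates for $\mathbf{A}$ and $\boldsymbol{\alpha}_0$ in Equation (36), Equation (38) and Equation (39). A different notation for the final estimates and residual distances is not introduced, but it is assumed that the reweighting step is indeed applied. The fitted values are obtained as in Equation (40) and the regression parameters are derived as in Equation (41). Finally, $\hat{\boldsymbol{\Sigma}}_e = \hat{\boldsymbol{\Sigma}}_{\varepsilon}$ is set (Hubert and Verboven, 2003).

$$\hat{\boldsymbol{\mu}} = (\hat{\boldsymbol{\mu}}_t^{T}, \hat{\boldsymbol{\mu}}_y^{T})^{T}, \qquad \hat{\boldsymbol{\Sigma}} = \begin{pmatrix}\hat{\boldsymbol{\Sigma}}_t & \hat{\boldsymbol{\Sigma}}_{ty}\\ \hat{\boldsymbol{\Sigma}}_{yt} & \hat{\boldsymbol{\Sigma}}_y\end{pmatrix} \tag{35}$$

$$\hat{\mathbf{A}}_{k,q} = \hat{\boldsymbol{\Sigma}}_t^{-1}\hat{\boldsymbol{\Sigma}}_{ty}, \qquad \hat{\boldsymbol{\alpha}}_0 = \hat{\boldsymbol{\mu}}_y - \hat{\mathbf{A}}^{T}\hat{\boldsymbol{\mu}}_t, \qquad \hat{\boldsymbol{\Sigma}}_{\varepsilon} = \hat{\boldsymbol{\Sigma}}_y - \hat{\mathbf{A}}^{T}\hat{\boldsymbol{\Sigma}}_t\hat{\mathbf{A}} \tag{36}$$

$$w_i = 0 \quad \text{if} \quad RD_i > \sqrt{\chi^{2}_{q,0.975}} \tag{37}$$

$$\mathbf{r}_i = \mathbf{y}_i - \hat{\mathbf{A}}^{T}\mathbf{t}_i - \hat{\boldsymbol{\alpha}}_0 \tag{38}$$

$$RD_i = D(\mathbf{r}_i, \mathbf{0}, \hat{\boldsymbol{\Sigma}}_{\varepsilon}) = \sqrt{\mathbf{r}_i^{T}\hat{\boldsymbol{\Sigma}}_{\varepsilon}^{-1}\mathbf{r}_i} \tag{39}$$

$$\hat{\mathbf{y}}_i = \hat{\mathbf{A}}^{T}_{q,k}\mathbf{t}_i + \hat{\boldsymbol{\alpha}}_0 = \hat{\mathbf{A}}^{T}_{q,k}\mathbf{P}^{T}_{k,p}(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_x) + \hat{\boldsymbol{\alpha}}_0 \tag{40}$$

$$\hat{\mathbf{B}}_{p,q} = \mathbf{P}_{p,k}\hat{\mathbf{A}}_{k,q}, \qquad \hat{\boldsymbol{\beta}}_0 = \hat{\boldsymbol{\alpha}}_0 - \hat{\mathbf{B}}^{T}_{q,p}\hat{\boldsymbol{\mu}}_x \tag{41}$$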
As for the MCD estimator, the robustness of the RPCR algorithm depends on the value of h, which is chosen in the ROBPCA algorithm, LTS regression and MCD regression. In the MATLAB implementation the user can either choose h from the start or choose a value of $\alpha$ with $0.5 \le \alpha \le 1$, such that $1 - \alpha$ corresponds to the percentage of outliers that the algorithm should be able to resist. Then, h is set as the maximum of $h_1 = [\alpha n]$ and $h_2 = [(n + k + q + 1)/2]$, where $h_2$ is the required minimal value for the MCD regression estimator to have a positive breakdown value. When a large proportion of contamination is presumed, $\alpha$ should be chosen close to 0.5. Otherwise an intermediate value for $\alpha$, such as 0.75, is recommended because it increases the finite sample efficiency. Therefore, the default choice is $\alpha = 0.75$ (Engelen et al., 2004; Hubert and Verboven, 2003).
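The rule for h described above can be written as a small Python helper. The function name `choose_h` is hypothetical, and taking the integer part of both candidate values is an assumption of this sketch. The example uses n = 28, the number of yearly observations in 1985-2012, with illustrative values k = 2 and q = 1.

```python
import math

def choose_h(n, k, q, alpha=0.75):
    """h = max(h1, h2) with h1 = [alpha * n] and h2 = [(n + k + q + 1) / 2]."""
    if not 0.5 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0.5 and 1")
    h1 = math.floor(alpha * n)                 # integer part assumed
    h2 = math.floor((n + k + q + 1) / 2)       # minimal value for positive breakdown
    return max(h1, h2)

h_default = choose_h(28, 2, 1)                 # alpha = 0.75
h_robust = choose_h(28, 2, 1, alpha=0.5)       # heavy contamination presumed
```

With the default $\alpha = 0.75$ the first term dominates, whereas with $\alpha = 0.5$ the lower bound $h_2$ takes over.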
3.3. Robust Partial Least Squares Regression

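The SIMPLS algorithm is the leading PLSR method because of its speed and efficiency. It assumes that the x- and y-variables are related through a bilinear model as given in Equation (16) and Equation (17). This bilinear structure implies a two-step algorithm. Let $\tilde{\mathbf{X}}_{n,p}$, with rows $(\mathbf{x}_i - \bar{\mathbf{x}})^{T}$, and $\tilde{\mathbf{Y}}_{n,q}$, with rows $(\mathbf{y}_i - \bar{\mathbf{y}})^{T}$, i=1,...,n, be the centered data matrices. After mean-centering the data, SIMPLS first constructs k latent variables (LVs) $\tilde{\mathbf{T}}_{n,k} = (\tilde{\mathbf{t}}_1, \ldots, \tilde{\mathbf{t}}_n)^{T}$, and then the dependent variables are regressed on these k variables. The k components, the columns of $\tilde{\mathbf{T}}_{n,k}$, are obtained as linear combinations of the x-variables which have maximum covariance with a certain linear combination of the y-variables. In order to obtain the k components, weight vectors must first be calculated. The first normalized PLSR weight vectors $\mathbf{r}_1$ and $\mathbf{q}_1$ define the linear combinations of $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{Y}}$ that maximize $\mathrm{cov}(\tilde{\mathbf{Y}}_{n,q}\mathbf{q}_1, \tilde{\mathbf{X}}_{n,p}\mathbf{r}_1)$. The solution of this maximization problem is found by taking $\mathbf{r}_1$ and $\mathbf{q}_1$ as the first left and right singular vectors of $\mathbf{S}_{xy} = \mathbf{S}_{yx}^{T} = \tilde{\mathbf{X}}^{T}_{p,n}\tilde{\mathbf{Y}}_{n,q}/(n-1)$, the cross-covariance matrix of the x- and y-variables. For each observation the first coordinate of the score $\tilde{\mathbf{t}}_i$ is computed as $\tilde{t}_{i1} = \tilde{\mathbf{x}}_i^{T}\mathbf{r}_1$. To obtain the other PLSR weight vectors $\mathbf{r}_a$ and $\mathbf{q}_a$ for $a = 2, \ldots, k$, the components $\tilde{\mathbf{X}}\mathbf{r}_j$ are required to be orthogonal. Hence, under the requirement that $\sum_{i=1}^{n} t_{ia}t_{ib} = 0$ for $a \ne b$, a deflation of the cross-covariance matrix $\mathbf{S}_{xy}$ provides the solutions for the other PLSR weight vectors. This deflation is carried out by first calculating the x-loading $\mathbf{p}_j = \mathbf{S}_x\mathbf{r}_j/(\mathbf{r}_j^{T}\mathbf{S}_x\mathbf{r}_j)$, with $\mathbf{S}_x$ the empirical covariance matrix of the x-variables. Next, an orthonormal base $\{\mathbf{v}_1, \ldots, \mathbf{v}_a\}$ of $\{\mathbf{p}_1, \ldots, \mathbf{p}_a\}$ is constructed and $\mathbf{S}_{xy}$ is deflated as $\mathbf{S}^{a}_{xy} = \mathbf{S}^{a-1}_{xy} - \mathbf{v}_a(\mathbf{v}_a^{T}\mathbf{S}^{a-1}_{xy})$ with $\mathbf{S}^{1}_{xy} = \mathbf{S}_{xy}$. In general, the PLSR weight vectors $\mathbf{r}_a$ and $\mathbf{q}_a$ are obtained as the left and right singular vectors of $\mathbf{S}^{a}_{xy}$. The elements of the scores $\tilde{\mathbf{t}}_i$ are then defined as linear combinations of the mean-centered data, $\tilde{t}_{ia} = \tilde{\mathbf{x}}_i^{T}\mathbf{r}_a$, or equivalently $\tilde{\mathbf{T}}_{n,k} = \tilde{\mathbf{X}}_{n,p}\mathbf{R}_{p,k}$ with $\mathbf{R}_{p,k} = (\mathbf{r}_1, \ldots, \mathbf{r}_k)$.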
Finally, where the scores are k-dimensional, MLR is performed of the dependent
~
variables yi on these scores ti . Thus, the formal regression model under consideration in
Equation (42), where E f i 0 and Covf i f . MLR provides estimates as in
~
Equation (43), Equation (44) and Equation (45). By inserting ti R k,T p xi x in
Equation (42), estimates for the parameters in the original model in Equation (46) are
obtained as in Equation (47). Finally, also an estimate of e is provided by rewriting S f
TS B
(Engelen et al., 2004; Hubert and
in terms of the original parameters: S S B
e
Vanden Branden, 2003).
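$$y_i = \alpha_0 + A^T_{q,k}\tilde{t}_i + f_i \tag{42}$$

$$\hat{A}_{k,q} = S_t^{-1} S_{ty} = (R^T_{k,p} S_x R_{p,k})^{-1} R^T_{k,p} S_{xy} \tag{43}$$

$$\hat{\alpha}_0 = \bar{y} - \hat{A}^T_{q,k}\bar{\tilde{t}} \tag{44}$$

$$S_f = S_y - \hat{A}^T_{q,k} S_t \hat{A}_{k,q} = \frac{1}{n-1}\left(\tilde{Y}^T_{q,n}\tilde{Y}_{n,q} - \hat{A}^T_{q,k}\tilde{T}^T_{k,n}\tilde{T}_{n,k}\hat{A}_{k,q}\right) \tag{45}$$

$$y_i = \beta_0 + B^T_{q,p} x_i + e_i \tag{46}$$

$$\hat{B}_{p,q} = R_{p,k}\hat{A}_{k,q} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{B}^T_{q,p}\bar{x} \tag{47}$$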
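The two steps of classical SIMPLS described above (weight vectors from a deflated cross-covariance matrix, followed by MLR on the scores) can be sketched as follows. This is a minimal NumPy illustration written for this text, not the authors' code or a library routine; the function name `simpls` and the simulated data are hypothetical.

```python
import numpy as np


def simpls(X, Y, k):
    """Sketch of classical SIMPLS with k components; returns R, scores, B and beta0."""
    n, p = X.shape
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_bar, Y - y_bar                 # mean-centered data
    Sx = Xc.T @ Xc / (n - 1)                      # empirical covariance of x
    Sxy = Xc.T @ Yc / (n - 1)                     # cross-covariance matrix
    S = Sxy.copy()
    R = np.zeros((p, k))
    V = []                                        # orthonormal base of the x-loadings
    for a in range(k):
        U, _, _ = np.linalg.svd(S, full_matrices=False)
        r = U[:, 0]                               # first left singular vector
        R[:, a] = r
        load = Sx @ r / (r @ Sx @ r)              # x-loading p_a
        v = load.copy()
        for v_prev in V:                          # Gram-Schmidt step
            v -= (v_prev @ v) * v_prev
        v /= np.linalg.norm(v)
        V.append(v)
        S = S - np.outer(v, v @ S)                # deflation of the cross-covariance
    T = Xc @ R                                    # scores
    A = np.linalg.solve(R.T @ Sx @ R, R.T @ Sxy)  # regression on the scores
    B = R @ A                                     # slopes in the original model
    beta0 = y_bar - B.T @ x_bar                   # intercept in the original model
    return R, T, B, beta0


# Simulated data (hypothetical) with p = 3 regressors and q = 1 response
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
Y = X @ np.array([[1.0], [-2.0], [0.5]]) + 0.1 * rng.normal(size=(30, 1))
R, T, B, beta0 = simpls(X, Y, k=3)
print(B.ravel(), beta0)
```

Two properties make a convenient sanity check: the score columns are mutually orthogonal, and with k = p components on full-rank data the coefficients coincide with the ordinary least squares fit.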
Since SIMPLS is based on the empirical cross-covariance matrix between the y-variables
and the x-variables and on linear LS regression, its results are also affected by
outliers in the data set. Hence, Hubert and Vanden Branden (2003) suggested
a robust version of the SIMPLS method, called RSIMPLS. RSIMPLS starts by applying
ROBPCA on the x- and y-variables in order to replace $S_{xy}$ and $S_x$, which are used to
calculate $\tilde{t}_i$, by robust estimates, and then proceeds analogously to the SIMPLS
algorithm. Similar to RPCR, instead of MLR a robust regression method (ROBPCA
regression) is performed in the second stage (Engelen et al., 2004; Hubert and Vanden
Branden, 2003). To obtain robust scores, ROBPCA is first applied on
$Z_{n,m} = (X_{n,p}, Y_{n,q})$ with $m = p + q$. ROBPCA is a robust covariance estimator for
high-dimensional data sets (m>n) that combines projection pursuit with robust
covariance estimation. Using projection pursuit ideas, it computes the outlyingness of
every data point and then considers the empirical covariance matrix of the h data points
with smallest outlyingness. The data are then projected onto the subspace $K_0$ spanned by
the $k_0 \ll m$ dominant eigenvectors of this covariance matrix. Next the MCD method is
applied to estimate the center and scatter of the data in this low-dimensional subspace.
Finally these estimates are back-transformed to the original space, and a robust estimate
of the center $\hat{\mu}_z$ of $Z_{n,m}$ and of its scatter $\hat{\Sigma}_z$ are obtained. This scatter matrix can be
decomposed as $\hat{\Sigma}_z = P^z L^z (P^z)^T$ with robust Z-eigenvectors $P^z_{m,k_0}$ and Z-eigenvalues
$\mathrm{diag}(L^z_{k_0,k_0})$. Note that the diagonal matrix $L^z$ contains the $k_0$ largest eigenvalues of
$\hat{\Sigma}_z$ in decreasing order. Then the Z-scores $T^z$ can be obtained from
$T^z = (Z - 1_n\hat{\mu}_z^T)P^z$. After application of ROBPCA on $Z_{n,m}$, this yields robust estimates
$\hat{\mu}_z = (\hat{\mu}_x^T, \hat{\mu}_y^T)^T$ and $\hat{\Sigma}_z$, which can be split as in Equation (48). The cross-covariance
matrix $\Sigma_{xy}$ is estimated by $\hat{\Sigma}_{xy}$ and the PLS weight vectors $r_a$ are computed as in the
SIMPLS algorithm, but now starting with $\hat{\Sigma}_{xy}$ instead of $S_{xy}$. The x-loadings are defined
as $p_j = \hat{\Sigma}_x r_j / (r_j^T \hat{\Sigma}_x r_j)$. Then the deflation of the scatter matrix $\hat{\Sigma}^a_{xy}$ is performed as in
SIMPLS. In each step the robust scores are calculated as in Equation (49), where the
$\tilde{x}_i = x_i - \hat{\mu}_x$ are the robustly centered observations (Engelen et al., 2004).
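$$\hat{\Sigma}_z = \begin{pmatrix} \hat{\Sigma}_x & \hat{\Sigma}_{xy} \\ \hat{\Sigma}_{yx} & \hat{\Sigma}_y \end{pmatrix} \tag{48}$$

$$t_{ia} = \tilde{x}_i^T r_a = (x_i - \hat{\mu}_x)^T r_a \tag{49}$$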
Once the robust scores are derived, a robust linear regression is performed. The
regression model based on robust scores is written as in Equation (50). In order to
estimate parameters in this model a robust regression method called ROBPCA regression
is used. This method uses additional information from the previous ROBPCA step
(Engelen et al., 2004).
$$y_i = \alpha_0 + A^T_{q,k} t_i + f_i \tag{50}$$
Firstly, to obtain the robust scores $t_i$, ROBPCA has been applied to the (x,y)-variables
and a $k_0$-dimensional subspace $K_0$, which represents these (x,y)-variables
well, has been obtained. Because the scores were then constructed to summarize the most
important information given in the x-variables, it is expected that outliers with respect to
this $k_0$-dimensional subspace are often also outlying in the (t,y)-space. Hence, the
center $\hat{\mu}$ and scatter $\hat{\Sigma}$ of the (t,y)-variables are estimated as the weighted mean and
covariance matrix of those $(t_i, y_i)$ whose corresponding $(x_i, y_i)$ are not outlying to $K_0$,
as shown in Equation (51). If observation i is not identified as an outlier by applying
ROBPCA on (x,y), then $w_i = 1$, and otherwise $w_i = 0$.
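$$\hat{\mu} = \begin{pmatrix} \hat{\mu}_t \\ \hat{\mu}_y \end{pmatrix} = \sum_{i=1}^{n} w_i \begin{pmatrix} t_i \\ y_i \end{pmatrix} \Big/ \sum_{i=1}^{n} w_i, \qquad \hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_t & \hat{\Sigma}_{ty} \\ \hat{\Sigma}_{yt} & \hat{\Sigma}_y \end{pmatrix} = \sum_{i=1}^{n} w_i \begin{pmatrix} t_i \\ y_i \end{pmatrix} \begin{pmatrix} t_i^T & y_i^T \end{pmatrix} \Big/ \left( \sum_{i=1}^{n} w_i - 1 \right) \tag{51}$$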
Two types of outliers can be identified by applying ROBPCA: those which are
outlying within $K_0$ and those which lie far from $K_0$. The first type of outliers
can be easily identified as those observations whose robust distance
$D_i^{(k_0)} = \sqrt{(t_i^z)^T (L^z)^{-1} t_i^z}$ exceeds $\sqrt{\chi^2_{k_0,0.975}}$. To determine the second type of outliers, for
each data point its orthogonal distance $OD_i = \lVert (z_i - \hat{\mu}_z) - P^z t_i^z \rVert$ to the subspace $K_0$ is
considered. The distribution of these orthogonal distances is difficult to determine
exactly; however, motivated by the central limit theorem, the squared
orthogonal distances appear to be roughly normally distributed. Hence, their center and variance
are estimated with the univariate MCD, yielding $\hat{\mu}_{od^2}$ and $\hat{\sigma}^2_{od^2}$. Then, if
$OD_i > \sqrt{\hat{\mu}_{od^2} + \hat{\sigma}_{od^2} z_{0.975}}$, where $z_{0.975}$ is the 0.975 quantile of the standard normal
distribution, $w_i$ is set to zero. Having identified the observations with
weight one, $\hat{\mu}$ and $\hat{\Sigma}$ are computed from (51). Then these estimates are inserted in
Equations (43)-(45) and a reweighted MLR is performed. It is recommended to reweight
these initial regression estimates in order to improve the finite sample efficiency. Let $r_i^{(k)}$
be the residual of the ith observation based on the initial estimates that were calculated
with k components. If $\hat{\Sigma}_f$ is the initial estimate for the covariance matrix of the errors,
then the robust distance of the residuals is defined as in Equation (52). The weights $c_i^{(k)}$
are computed as in Equation (53), with I the indicator function. The final regression
estimates are then calculated as in classical MLR, but only based on those observations
with weight $c_i^{(k)}$ equal to one. This reweighting step has the advantage that it might again
include observations with $w_i = 0$ which are not regression outliers. The robust residual
distances $RD_i^{(k)}$ are recomputed as in Equation (52) and the weights $c_i^{(k)}$ are
adapted accordingly. Robust parameters for the original model in Equation (46) are then given by
Equation (54).
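$$RD_i^{(k)} = \left( r_i^{(k)T} \hat{\Sigma}_f^{-1} r_i^{(k)} \right)^{1/2} \tag{52}$$

$$c_i^{(k)} = I\left( RD_i^{(k)2} \le \chi^2_{q,0.975} \right) \tag{53}$$

$$\hat{B}_{p,q} = R_{p,k}\hat{A}_{k,q}, \qquad \hat{\beta}_0 = \hat{\alpha}_0 - \hat{B}^T\hat{\mu}_x, \qquad \hat{\Sigma}_e = \hat{\Sigma}_f \tag{54}$$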
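The cutoff for the orthogonal distances used in the weighting step above can be illustrated with a small sketch. It relies on the known fact that, in one dimension, the h-subset with smallest variance is contiguous in the sorted data. The consistency and finite-sample correction factors that a full MCD implementation would apply are omitted here (an assumption of this sketch), and the function names and the toy distances are hypothetical.

```python
from statistics import NormalDist, fmean


def univariate_mcd(values, h):
    """Mean and variance of the contiguous h-subset with the smallest variance."""
    xs = sorted(values)
    best = None
    for start in range(len(xs) - h + 1):
        window = xs[start:start + h]
        m = fmean(window)
        var = sum((v - m) ** 2 for v in window) / (h - 1)
        if best is None or var < best[1]:
            best = (m, var)
    return best


def od_cutoff(od, h):
    """Cutoff sqrt(mu + sigma * z_0.975), with mu and sigma^2 taken from the squared distances."""
    mu, var = univariate_mcd([d ** 2 for d in od], h)
    z = NormalDist().inv_cdf(0.975)
    return (mu + var ** 0.5 * z) ** 0.5


# Toy orthogonal distances (hypothetical); the last point lies far from the subspace.
od = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 5.0]
cutoff = od_cutoff(od, h=6)
weights = [1 if d <= cutoff else 0 for d in od]
print(round(cutoff, 3), weights)
```

Only the last observation exceeds the cutoff in this toy example, so it alone receives weight zero.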
4. Application on a real unemployment data set of Turkey

In this study, the annual data set of Turkey (1985-2012), including eleven
factors affecting the unemployment rate, is analyzed using the RR, PCR, PLSR, RRR,
RPCR and RSIMPLS methods. The unemployment rate (UR) is considered as the dependent
variable. The variables that explain unemployment are determined by following
previous studies on this issue. These independent variables are Gross Domestic Product
Growth Rate with Expenditure Approach (GDP), Import Growth Rate (IMP), Export
Growth Rate (EXP), Population Growth Rate (P), Exchange Rate (E), Consumer Price
Index (CPI), Relative Consumer Price Indices (RCPI), Civilization of Employment
Agriculture (CEA), Unit Labor Cost (ULC), Purchasing Power Parities (PPP) and Trend (T).
The data set is taken from the OECD database.
Firstly, MLR analysis is applied to the data set, and the model obtained by the
OLS method is found to be significant at the 95% confidence level (F=4.95; p=0.002).
Even though the MLR model fits the data well, multicollinearity may severely degrade
the quality of the prediction. Table 1 shows that none of the independent variables is
individually significant at the 5% level, which is an indicator of a multicollinearity
problem. Since $\lambda_1, \dots, \lambda_p$ are the eigenvalues of the correlation matrix $X^TX$, the
existence of a collinearity problem can also be seen by examining the condition number,
calculated as $\lambda_{max}/\lambda_{min} = 9.112/0.0003 = 30373$. A condition number greater than 30
indicates multicollinearity. Another multicollinearity measure is the Variance Inflation
Factor (VIF). The larger the VIF value, the more serious the collinearity problem. In
practice, if any of the VIF values is equal to or larger than 10, there is near-collinearity.
In this case, the regression coefficients are not reliable. The VIF values of GDP, IMP,
EXP, P, E, CPI, RCPI, CEA, ULC, PPP and T are found as 3.32, 9.79, 6.05, 6.79, 39.25,
32.99, 22.25, 67.86, 2.98, 157.42 and 51.57, respectively. Since the VIF values of E, CPI,
RCPI, CEA, PPP and T are greater than 10, there is a near-collinearity problem for this
data set.
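Both diagnostics used above can be computed directly from the correlation matrix of the regressors. The sketch below is an illustration on simulated data, not on the unemployment data set: the function name and the three synthetic regressors are hypothetical. It uses the standard identity that the VIF of the jth regressor equals the jth diagonal element of the inverse correlation matrix.

```python
import numpy as np


def collinearity_diagnostics(X):
    """Condition number (lambda_max / lambda_min) and VIFs from the correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)
    cond = eigvals.max() / eigvals.min()
    vif = np.diag(np.linalg.inv(corr))    # VIF_j = j-th diagonal of corr^{-1}
    return cond, vif


# Simulated regressors (hypothetical): x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=28)
x2 = x1 + 0.05 * rng.normal(size=28)
x3 = rng.normal(size=28)
cond, vif = collinearity_diagnostics(np.column_stack([x1, x2, x3]))
print(cond, vif)

# The ratio reported in the text for the unemployment data
reported_condition_number = 9.112 / 0.0003
print(round(reported_condition_number))
```

In the simulated example the two nearly identical regressors have VIF values far above 10, while the unrelated one stays close to 1.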
Table 1: The estimated regression coefficients for the MLR model

Model       Coefficients   Standard Error     t        p
                           of Coefficients
Constant     24.866         9.464             2.63    0.018
Trend        -0.3588        0.1838           -1.95    0.069
GDP          -0.11589       0.08178          -1.42    0.176
EXP          -0.01865       0.01123          -1.66    0.116
IMP           0.00542       0.01735           0.31    0.759
P             1.120         1.787             0.63    0.540
E            -2.001         1.867            -1.07    0.300
PPP          12.982         6.365             2.04    0.058
CPI           0.02521       0.03897           0.65    0.527
RCPI         -0.11798       0.06699          -1.76    0.097
CEA          -0.2399        0.1841           -1.30    0.211
ULC           0.03096       0.01681           1.84    0.084
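As a quick consistency check on Table 1, each t statistic should equal the coefficient divided by its standard error. The sketch below copies the coefficients, standard errors and third-column values from the table and compares them; the variable names are hypothetical.

```python
# (coefficient, standard error, reported t) for each term, copied from Table 1
table1 = {
    "Constant": (24.866, 9.464, 2.63),
    "Trend": (-0.3588, 0.1838, -1.95),
    "GDP": (-0.11589, 0.08178, -1.42),
    "EXP": (-0.01865, 0.01123, -1.66),
    "IMP": (0.00542, 0.01735, 0.31),
    "P": (1.120, 1.787, 0.63),
    "E": (-2.001, 1.867, -1.07),
    "PPP": (12.982, 6.365, 2.04),
    "CPI": (0.02521, 0.03897, 0.65),
    "RCPI": (-0.11798, 0.06699, -1.76),
    "CEA": (-0.2399, 0.1841, -1.30),
    "ULC": (0.03096, 0.01681, 1.84),
}

# t statistic = coefficient / standard error
t_ratios = {name: coef / se for name, (coef, se, _) in table1.items()}

# Largest absolute gap between the computed ratio and the reported value
max_gap = max(abs(t_ratios[name] - row[2]) for name, row in table1.items())
print(max_gap)
```

Every computed ratio agrees with the tabulated value to two decimals.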
Secondly, the existence of outliers is examined using the normal Q-Q plot of the
MLR residuals given in Figure 1. As seen from Figure 1, there are two outliers in the
data.
[Plot: MLR residuals (vertical axis) against normal quantiles (horizontal axis)]
Figure 1: Q-Q plot for MLR residuals

It is found that the data set simultaneously includes multicollinearity and outliers.
Hence, the RR, PCR, PLSR, RRR, RPCR and RSIMPLS methods are applied to the data set.
Firstly, the Q-Q plots for the residuals of these methods are given in Figure 2. As seen
from Figure 2, all methods indicate two outliers. In addition, the residuals of the classical
RR, PCR and PLSR methods are relatively much more dispersed than those of the
RRR, RPCR and RSIMPLS methods.
[Plots: six panels (RR, RRR, PCR, RPCR, SIMPLS, RSIMPLS), each showing residuals (vertical axis) against normal quantiles (horizontal axis)]
Figure 2: Residuals Q-Q Plots of the Classical and Robust Biased Regression Methods
The predictive performance of the methods is evaluated using the Root Mean
Square Error (RMSE), $RMSE = \sqrt{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / n}$, with upper 10% trimming (TRMSE),
which is considered to be safer in the presence of outliers. The exclusion of a certain
percentage of unusually large residuals leads to an acceptable robust performance
criterion. The regression coefficients and TRMSE values of the methods are given in Table 2.
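A trimmed RMSE of the kind described above can be sketched as follows. The exact trimming convention is not spelled out in the text, so this sketch assumes that the largest 10% of squared residuals are dropped (the number kept is rounded down) and that the mean is taken over the retained residuals; the function name and the toy data are hypothetical.

```python
import math


def trimmed_rmse(y, y_hat, keep=0.9):
    """RMSE computed after discarding the largest (1 - keep) share of squared residuals."""
    squared = sorted((a - b) ** 2 for a, b in zip(y, y_hat))
    m = math.floor(keep * len(squared))   # number of residuals retained (assumption)
    kept = squared[:m]
    return math.sqrt(sum(kept) / len(kept))


# Toy example (hypothetical): nine small errors and one gross outlier
y = [float(i) for i in range(10)]
y_hat = [v + 0.5 for v in y]
y_hat[-1] = y[-1] + 10.0

plain = trimmed_rmse(y, y_hat, keep=1.0)   # ordinary RMSE
trimmed = trimmed_rmse(y, y_hat, keep=0.9)
print(plain, trimmed)
```

The single outlier inflates the ordinary RMSE to more than 3, whereas the trimmed version recovers the typical error size of 0.5.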
Table 2: The estimated regression coefficients and TRMSE values of methods

Regression Coefficients
Variables             RR         PCR        PLSR       RRR        RPCR       RSIMPLS
Constant              17.9265    25.6056    28.2426    23.2473    27.1417    13.4399
T                     -.0976     -.1611     -.2322     -.4540     -.3524     -.4302
GDP                   -.0819     -.0929     -.0895     -.0430     -.1003     -.0937
IMP                   -.0007      .0004      .0003      .0048      .0162      .0246
EXP                   -.0076     -.0094     -.0136     -.0073     -.0054      .0017
P                      .1890     -.0903     1.2939    -1.1617     -.2164     -.1439
E                     -.2168      .6063     1.0358    -1.0185     2.3267     3.2841
PPP                   3.1610      .1285     1.1919    11.9718      .5134      .9214
CPI                   -.0153     -.0204      .0001      .0189     -.0167     -.0077
RCPI                  -.0526     -.0644     -.0913     -.0829     -.0680      .0482
CEA                   -.1119     -.2333     -.3283     -.1669     -.2870     -.1663
RULC                   .0130      .0142      .0212      .0272      .0256      .0090
Biasing Parameters    k_RR=0.5                         k_RR=11
TRMSE(0.9)            0.6935     0.6640     0.6575     0.3240     0.5059     0.5984
Since the data set contains both multicollinearity and outliers, the robust biased
regression methods are expected to give more accurate results than their classical versions.
As seen from Table 2, the smallest TRMSE values are obtained with the robust RRR and
RPCR methods, respectively. According to the TRMSE values, the robust methods RRR, RPCR
and RSIMPLS are clearly better than their classical counterparts RR, PCR and PLSR.
As mentioned in Maronna (2011), this version of robust ridge regression, which we
briefly call the RRR estimator here, is also robust when p/n is large, and is resistant to both
bad leverage points and vertical outliers. Maronna (2011) describes a simulation study for
the case n>p in which the proposed RRR estimator is seen to outperform its competitors
(other robust ridge estimators proposed earlier in the literature). Moreover, he
describes a simulation for p>n in which the RRR estimator is seen to outperform the
RSIMPLS method. In many applications in the literature, as in this study, the PCR
and PLSR methods (and their robust counterparts) give close results. This comparative
study and Maronna's study show that RRR is a good alternative to the most popular robust
biased methods, namely RPCR and RSIMPLS. All three robust methods,
RRR, RPCR and RSIMPLS, can be used in both the n>p and p>n situations and resist different
types of outliers (bad leverage points and vertical outliers). Hence, in general, RRR is
a good alternative, but it cannot be claimed that it always gives better
results than RPCR and RSIMPLS, since the situation may differ across real data
applications. However, by applying the methods to a real data set, this study shows that
RPCR and RSIMPLS are not the only methods to be preferred in the presence of
multicollinearity and outliers. Thus, especially in fields such as chemometrics, where data
sets of this kind (including multicollinearity and outliers) usually occur with p>>n,
RRR can be used as a good alternative to the RPCR and RSIMPLS methods, and the best
model for the purpose (fitting the data set or prediction) can be selected.
5. Conclusion
In this study, the aim is to choose the best model for determining the unemployment
rate in Turkey during the 1985-2012 period by comparing the classical biased RR, PCR and
PLSR methods with the robust biased RRR, RPCR and RSIMPLS methods when both
multicollinearity and outliers are present. For the unemployment data set, the RRR model is
chosen as the best model according to the TRMSE(0.9) criterion. In addition, the robust methods
RRR, RPCR and RSIMPLS outperform the classical RR, PCR and PLSR methods in terms
of predictive ability. The results obtained from the robust biased RRR method
show that the most important independent variable affecting the unemployment rate is
Purchasing Power Parities (PPP). The least important variables affecting the
unemployment rate are Import Growth Rate (IMP) and Export Growth Rate (EXP),
respectively. Hence, any increment in PPP causes an important increment in the
unemployment rate, whereas any increment in IMP causes an unimportant increase in the
unemployment rate and any increment in EXP causes an unimportant decrease in the
unemployment rate.
References
[1] Aktar, I. and Ozturk, L. (2009). Can Unemployment be Cured by Economic Growth
and Foreign Direct Investment in Turkey. International Research Journal of
Finance and Economics 27, 203-211.
[2] Aktas, C. (2007). Çoklu Bağıntı ve Liu Kestiricisiyle Enflasyon Modeli İçin Bir
Uygulama. ZKÜ Sosyal Bilimler Dergisi, Cilt 3, Sayı 6, 67-79.
[3] Aqil, M., Qureshi, M. A., Ahmed, R. R. and Qadeer, S. (2014). Determinants of
Unemployment in Pakistan. International Journal of Physical and Social Sciences
4:4.
[4] Berument, M. H., Dogan, N. and Tansel, A. (2006). Economic Performance and
Unemployment: Evidence from an Emerging Economy. International Journal of
Manpower 27(7), 604-623.
[5] Berument, M. H., Dogan, N. and Tansel, A. (2009). Macroeconomic Policy and
Unemployment by Economic Activity: Evidence from Turkey. Emerging Markets
Finance and Trade 45(3), 21-34.
[6] Bilgin, H. (2004). Döviz Kuru ve İşsizlik İlişkisi: Türkiye Üzerine Bir İnceleme.
Kocaeli Üniversitesi Sosyal Bilimler Enstitüsü Dergisi 8(2), 1-15.
[7] Cascio, I. L. (2001). Do Labour Markets Really Matter? Monetary Shocks and
Asymmetric Effects across Europe. Unpublished, Department of Economics,
University of Essex, Colchester.
[8] De Jong, S. (1993). SIMPLS: an alternative approach to partial least squares
regression, Chemometrics Intell. Lab. Syst. 18, 251-263.
[9] Djivre, J. and Ribon, S. (2003). Inflation, Unemployment, the Exchange Rate, and
Monetary Policy in Israel, 1990-1999: A SVAR Approach. Israel Economic Review
2, 1-29.
[10] Dogan, T. T. (2012). Macroeconomic Variables and Unemployment: The Case of
Turkey. International Journal of Economics and Financial Issues 2(1), 71-78.
[11] Dogrul, H. G. and Soytas, U. (2010). Relationship between Oil Prices, Interest Rate,
and Unemployment: Evidence from an Emerging Market. Energy Economics 32,
1523-1528.
[12] Engelen, S., Hubert, M., Vanden Branden, K. and Verboven, S. (2004). Robust PCR
and Robust PLSR: a comparative study. Theory and Applications of Recent Robust
Methods, Statistics for Industry and Technology, 105-117.
[13] Goktas, A. and Isci, O. (2010). Turkiye'de Issizlik Oraninin Temel Bilesenli
Regresyon Analizi ile Belirlenmesi. Sosyal ve Ekonomik Arastirmalar Dergisi, Selcuk Universitesi 14:20, 279-294.
[14] Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for
Nonorthogonal Problems. Technometrics 12:1, 55-67.
[15] Hubert, M. and Verboven, S. (2003). A robust PCR method for high-dimensional
regressors. Journal of Chemometrics 17, 438-452.
[16] Hubert, M. and Vanden Branden, K. (2003). Robust methods for Partial Least
Squares Regression. Journal of Chemometrics 17, 537-549.
[17] Karanassou, M. and Sala, H. (2010). Labour Market Dynamics in Australia: What
Drives Unemployment? The Economic Record, The Economic Society of Australia,
vol. 86 (273), 185-209.
[18] Maronna, R. A. (2011). Robust ridge regression for high-dimensional data,
Technometrics 53(1), 44-53.
[19] Martens, H. and Naes, T. (1989). Multivariate Calibration, New York, Brisbane,
Toronto, Singapore: John Wiley & Sons.
[20] Myers, R. H. (1990). Classical and modern regression with applications, 2nd
edition, Duxbury Press.
[21] Naes, T., Isaksson, T., Fearn, T. and Davies, T. (2002). A User-Friendly Guide to
Multivariate Calibration and Classification. UK: NIR Publications Chichester.
[22] Phatak, A. and De Jong, S. (1997). The geometry of partial least squares. Journal of
Chemometrics 11, 311-338.
[23] Polat, E. and Gunay, S. (2015). The Comparison of Partial Least Squares Regression,
Principal Component Regression and Ridge Regression with Multiple Linear
Regression for Predicting PM10 Concentration Level Based on Meteorological
Parameters. Journal of Data Science 13, 663-692.
[24] Ravn, M. O. and Simonelli, S. (2007). Labor Market Dynamics and the Business
Cycle: Structural Evidence for the United States. Working Paper Series, no 182,
Centre for Studies in Economics and Finance.
[25] Rawlings, J. O. (1988). Applied Regression Analysis: A Research Tool, Pacific
Grove, California: Wadsworth & Brooks/Cole Advanced Books & Software.
[26] Salh, S. M. (2014). Using Ridge Regression model to solving multicollinearity
problem. International Journal of Scientific & Engineering Research 5:10, 992-998.
[27] Trygg, J. (2002). Have you ever wondered why PLS sometimes needs more than
one component for a single-y vector? Chemometrics Homepage, February 2002,
http://www.chemometrics.se/editorial/feb2002.html.
[28] Umit, A. O. and Bulut, E. (2013). Türkiye'de İşsizliği Etkileyen Faktörlerin Kısmi
En Küçük Kareler Regresyon Yöntemi İle Analizi: 2005-2010 Dönemi,
Dumlupınar Üniversitesi Sosyal Bilimler Dergisi 37, 131-142.
[29] Valletta, R. and Kuang, K. (2010). Is the Structural Unemployment on the Rise? The
Economic Report 86(273), 185-209.
[30] Verboven, S. and Hubert, M. (2005). LIBRA: a MATLAB Library for Robust
Analysis. Chemometrics and Intelligent Laboratory Systems 75, 127-136.
[31] Walker, E., and Birch, J. B. (1988). Influence Measures in Ridge Regression.
Technometrics 30, 221-227.
[32] Yılmaz, Ö. (2005). Türkiye Ekonomisinde Büyüme ile İşsizlik Oranları Arasındaki
Nedensellik İlişkisi, İstanbul Üniversitesi İktisat Fakültesi Ekonometri ve İstatistik
Dergisi S.2: 11-29.
[33] Data sources: http://www.mod.gov.tr/en/SitePages/mod_easi.aspx and
http://www.yaklasim.com/BookList.aspx?AnnouncementId=2780&AnnouncementCategoryId=8&_
Esra Polat
Department of Statistics
Hacettepe University, Faculty of Science
Beytepe, 06800, Ankara, Turkey.
espolat@hacettepe.edu.tr
Semra Turkan
Department of Statistics
Hacettepe University, Faculty of Science
Beytepe, 06800, Ankara, Turkey.
sturkan@hacettepe.edu.tr
