Sylvain Arlot
CNRS; Willow Project-Team
Laboratoire d'Informatique de l'École Normale Supérieure (CNRS/ENS/INRIA UMR 8548)
45, rue d'Ulm, 75230 Paris, France
Sylvain.Arlot@ens.fr

Alain Celisse
Laboratoire Paul Painlevé, UMR CNRS 8524
Université des Sciences et Technologies de Lille 1
F-59655 Villeneuve d'Ascq Cedex, France
Alain.Celisse@math.univ-lille1.fr
Abstract

Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its apparent universality. Many results exist on the model selection performances of cross-validation procedures. This survey intends to relate these results to the most recent advances of model selection theory, with guidelines for choosing the best cross-validation procedure according to the particular features of the problem at hand.
Contents

1 Introduction                                                          3
  1.1 Statistical framework . . . . . . . . . . . . . . . . . . . . .   3
  1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . .  4
  1.3 Statistical algorithms . . . . . . . . . . . . . . . . . . . . .  5

3 Overview of some model selection procedures                           8
  3.1 The unbiased risk estimation principle . . . . . . . . . . . . .  9
  3.2 Biased estimation of the risk . . . . . . . . . . . . . . . . .  10
  3.3 Procedures built for identification . . . . . . . . . . . . . .  11
  3.4 Structural risk minimization . . . . . . . . . . . . . . . . . . 11
  3.5 Ad hoc penalization . . . . . . . . . . . . . . . . . . . . . .  12
  3.6 Where are cross-validation procedures in this picture? . . . . . 12
1 Introduction

Many statistical algorithms, such as likelihood maximization, least squares and empirical contrast minimization, rely on the preliminary choice of a model, that is, of a set of parameters from which an estimator will be returned. When several candidate models (thus algorithms) are available, choosing one of them is called the model selection problem.

Cross-validation (CV) is a popular strategy for model selection, and more generally for algorithm selection. The main idea behind CV is to split the data (once or several times) for estimating the risk of each algorithm: part of the data (the training sample) is used for training each algorithm, and the remaining part (the validation sample) is used for estimating the risk of the algorithm. Then, CV selects the algorithm with the smallest estimated risk.

Compared to the resubstitution error, CV avoids overfitting because the training sample is independent from the validation sample (at least when the data are i.i.d.). The popularity of CV mostly comes from the generality of the data splitting heuristics, which only assumes that the data are i.i.d. Nevertheless, theoretical and empirical studies of CV procedures do not entirely confirm this universality. Some CV procedures have been proved to fail for some model selection problems, depending on the goal of model selection: estimation or identification (see Section 2). Furthermore, many theoretical questions about CV remain widely open.

The aim of the present survey is to provide a clear picture of what is known about CV, from both theoretical and empirical points of view. More precisely, the aim is to answer the following questions: What is CV doing? When does CV work for model selection, keeping in mind that model selection can target different goals? Which CV procedure should be used for each model selection problem?

The paper is organized as follows. First, the rest of Section 1 presents the statistical framework. Although not exhaustive, the present setting has been chosen general enough to sketch the complexity of CV for model selection. The model selection problem is introduced in Section 2. A brief overview of some model selection procedures that are important to keep in mind for understanding CV is given in Section 3. The most classical CV procedures are defined in Section 4. Since they are the keystone of the behaviour of CV for model selection, the main properties of CV estimators of the risk for a fixed model are detailed in Section 5. Then, the general performances of CV for model selection are described, when the goal is either estimation (Section 6) or identification (Section 7). Specific properties of CV in some particular frameworks are discussed in Section 8. Finally, Section 9 focuses on the algorithmic complexity of CV procedures, and Section 10 concludes the survey by tackling several practical questions about CV.
The quality of t ∈ S, as an approximation of s, is measured by its loss L(t), where L : S → ℝ is called the loss function, and is assumed to be minimal for t = s. Many loss functions can be chosen for a given statistical problem. Several classical loss functions are defined by

    L(t) = L_P(t) := E_{ξ∼P}[γ(t; ξ)],   (1)

where γ is a contrast function; the excess loss is ℓ(s, t) := L(t) − L(s), and the risk of an estimator ŝ is

    R(ŝ) = E_{ξ1,...,ξn∼P}[ℓ(s, ŝ)].

1.2 Examples

The purpose of this subsection is to show that the framework of Section 1.1 includes several important statistical frameworks. This list of examples does not pretend to be exhaustive.

Regression corresponds to data ξi = (Xi, Yi) with, for every i, Yi = s(Xi) + εi and E[εi | Xi] = 0. A popular contrast in regression is the least-squares contrast γ(t; (x, y)) = (t(x) − y)², which is minimal over S for t = s, and the excess loss is

    ℓ(s, t) = E_{(X,Y)∼P}[(s(X) − t(X))²].
Note that the excess loss of t is the square of the L² distance between t and s, so that prediction and estimation are equivalent goals.

Classification corresponds to finite (or at least discrete) Y. In particular, when Y = {0, 1}, the prediction problem is called binary (supervised) classification. With the 0-1 contrast function γ(t; (x, y)) = 1_{t(x)≠y}, the minimizer of the loss is the so-called Bayes classifier s, defined by

    s(x) = 1_{η(x)≥1/2},   where η(x) = P(Y = 1 | X = x).

Empirical contrast minimization consists in minimizing

    t ↦ L_{Pn}(t) = (1/n) Σ_{i=1}^n γ(t; ξi),   where Pn = (1/n) Σ_{i=1}^n δ_{ξi},

over S. The idea is that the empirical contrast L_{Pn}(t) has an expectation L_P(t) which is minimal over S at s. Hence, minimizing L_{Pn}(t) over a set S of candidate values for s hopefully leads to a good estimator of s. Let us now give three popular examples of empirical contrast minimizers:
Least-squares estimators: take γ(t; (x, y)) = (t(x) − y)², the least-squares contrast, in the regression setting. For instance, S can be the set of piecewise constant functions on some fixed partition of X (leading to regressograms), or a vector space spanned by the first vectors of a wavelet or Fourier basis, among many others. Note that regularized least-squares algorithms such as the Lasso, ridge regression and spline smoothing also are least-squares estimators, the model S being some ball of a (data-dependent) radius for the L¹ (resp. L²) norm in some high-dimensional space. Hence, tuning the regularization parameter for the Lasso or SVM, for instance, amounts to performing model selection from a collection of models.
On the contrary, when Sm is huge, its bias ℓ(s, Sm) is small for most targets s, but ŝm clearly overfits. Think for instance of Sm as the set of all continuous functions on [0, 1] in the regression framework. More generally, if Sm is a vector space of dimension Dm, in several classical frameworks the estimation error is of order Dm/n, so that choosing a model amounts to making a bias-variance trade-off.

Before giving examples of classical model selection procedures, let us mention the two main different goals that model selection can target: estimation and identification.
Second, in the non-asymptotic framework, a model selection procedure satisfies an oracle inequality with constant Cn ≥ 1 and remainder term Rn ≥ 0 when

    ℓ(s, ŝ_{m̂(Dn)}(Dn)) ≤ Cn inf_{m∈Mn} { ℓ(s, ŝm(Dn)) } + Rn   (4)

holds either in expectation or with large probability (that is, with probability larger than 1 − C/n², for some positive constant C). Note that if (4) holds on a large-probability event with Cn tending to 1 when n tends to infinity and Rn negligible in front of ℓ(s, ŝ_{m⋆}(Dn)), then the model selection procedure m̂ is efficient.

In the estimation setting, model selection is often used for building adaptive estimators, assuming that s belongs to some function space T_α (Barron et al., 1999). Then, a model selection procedure m̂ is optimal when it leads to an estimator ŝ_{m̂(Dn)}(Dn) that is (approximately) minimax with respect to T_α without knowing α, provided the family (Sm)_{m∈Mn} has been well chosen.

When the goal is identification, a model selection procedure m̂ is model consistent when

    P(m̂(Dn) = m0) → 1 as n → ∞,

where m0 denotes the true model.
Like CV, all the procedures considered in this section select

    m̂(Dn) ∈ argmin_{m∈Mn} { crit(m; Dn) },   (5)

and penalization procedures select

    m̂(Dn) ∈ argmin_{m∈Mn} { L_{Pn}(ŝm) + pen(m; Dn) }.   (6)

This section does not pretend to be exhaustive. Completely different approaches exist for model selection, such as the Minimum Description Length (MDL) (Rissanen, 1983) and the Bayesian approaches. The interested reader will find more details and references on model selection procedures in the books by Burnham and Anderson (2002) or Massart (2007) for instance.

Let us focus here on five main categories of model selection procedures, the first three coming from a classification made by Shao (1997) in the linear regression framework.

The unbiased risk estimation principle relies on the remark that, for every m ∈ Mn,

    ℓ(s, ŝ_m̂) + crit(m̂) − L_P(ŝ_m̂) ≤ ℓ(s, ŝm) + crit(m) − L_P(ŝm).   (7)

Hence, if crit(m) − L_P(ŝm) is small for every m ∈ Mn and every n ≥ 1, then m̂ approximately minimizes the loss ℓ(s, ŝm) over Mn. Several classical penalization procedures follow this principle, for instance:

With the log-likelihood contrast, AIC (Akaike Information Criterion, Akaike, 1973) and its corrected versions (Sugiura, 1978; Hurvich and Tsai, 1989).

AIC, Mallows' Cp and related procedures have been proved to be optimal for estimation in several frameworks, provided Card(Mn) ≤ Cn^α for some constants C, α ≥ 0 (see the paper by Birgé and Massart, 2007, and references therein).
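As a concrete illustration of the unbiased risk estimation principle, the sketch below selects the number of pieces of a regressogram by minimizing a Mallows-type criterion, empirical risk plus 2σ²Dm/n. The data-generating function, the noise level and all names are our own illustrative choices, and σ² is assumed known for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y_i = s(x_i) + noise, with s smooth (our choice).
n, sigma = 200, 0.5
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n)

def regressogram_risk(x, y, D):
    """Empirical least-squares risk of the regressogram with D regular pieces."""
    bins = np.minimum((x * D).astype(int), D - 1)
    pred = np.zeros_like(y)
    for b in range(D):
        mask = bins == b
        if mask.any():
            pred[mask] = y[mask].mean()
    return np.mean((y - pred) ** 2)

# Mallows-type criterion: crit(m) = empirical risk + 2 * sigma^2 * D_m / n.
dims = range(1, 51)
crit = {D: regressogram_risk(x, y, D) + 2 * sigma**2 * D / n for D in dims}
D_hat = min(crit, key=crit.get)
```

The empirical risk alone decreases with Dm, so without the penalty the largest model would always win; the 2σ²Dm/n term restores (approximate) unbiasedness of the criterion.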
When the goal is estimation, there are two main reasons for using biased model selection procedures. First, experimental evidence shows that overpenalizing often yields better performance when the signal-to-noise ratio is small (see for instance Arlot, 2007, Chapter 11).

Second, when the number of models Card(Mn) grows faster than any power of n, as in the complete variable selection problem with n variables, the unbiased risk estimation principle fails. From the penalization point of view, Birgé and Massart (2007) proved that when Card(Mn) = e^{κn} for some κ > 0, the minimal amount of penalty required so that an oracle inequality holds with Cn = O(1) is much larger than pen_id(m). In addition to the FPE and GIC with suitably chosen parameters, several penalization procedures have been proposed for taking into account the size of Mn (Barron et al., 1999; Baraud, 2002; Birgé and Massart, 2001; Sauvé, 2009). In the same papers, these procedures are proved to satisfy oracle inequalities with Cn as small as possible, typically of order ln(n) when Card(Mn) = e^{κn}.
and can be difficult to compute in practice because of several unknown constants.

A non-asymptotic analysis of several global and local penalties can be found in the book by Massart (2007) for instance; see also Koltchinskii (2006) for recent results on local penalties.
3.5 Ad hoc penalization

Let us finally mention that penalties can also be built according to particular features of the problem. For instance, penalties can be proportional to the ℓ^p norm of ŝm (similarly to ℓ^p-regularized learning algorithms) when having an estimator with a controlled ℓ^p norm seems better. The penalty can also be proportional to the squared norm of ŝm in some reproducing kernel Hilbert space (similarly to kernel ridge regression or spline smoothing), with a kernel adapted to the specific framework. More generally, any penalty can be used, as soon as pen(m) is larger than the estimation error (to avoid overfitting) and the best model for the final user is not the oracle m⋆, but more like

    argmin_{m∈Mn} { ℓ(s, Sm) + pen(m) }.
4.1 Cross-validation philosophy

As noticed in the early 1930s by Larson (1931), training an algorithm and evaluating its statistical performance on the same data yields an overoptimistic result. CV was raised to fix this issue (Mosteller and Tukey, 1968; Stone, 1974; Geisser, 1975), starting from the remark that testing the output of the algorithm on new data would yield a good estimate of its performance (Breiman, 1998).

In most real applications, only a limited amount of data is available, which led to the idea of splitting the data: part of the data (the training sample) is used for training the algorithm, and the remaining data (the validation sample) are used for evaluating its performance. The validation sample can play the role of new data as soon as the data are i.i.d.

Data splitting yields the validation estimate of the risk, and averaging over several splits yields a cross-validation estimate of the risk. As will be shown in Sections 4.2 and 4.3, various splitting strategies lead to various CV estimates of the risk.

The major interest of CV lies in the universality of the data splitting heuristics, which only assumes that the data are identically distributed and that the training and validation samples are independent, two assumptions which can even be relaxed (see Section 8.3). Therefore, CV can be applied to (almost) any algorithm in (almost) any framework, for instance regression (Stone, 1974; Geisser, 1975), density estimation (Rudemo, 1982; Stone, 1984) and classification (Devroye and Wagner, 1979; Bartlett et al., 2002), among many others. On the contrary, most other model selection procedures (see Section 3) are specific to a framework: for instance, Cp (Mallows, 1973) is specific to least-squares regression.
4.2.1 Hold-out

The hold-out (Devroye and Wagner, 1979), or (simple) validation, relies on a single split of the data. Formally, let I^(t) be a non-empty proper subset of {1, . . . , n}, that is, such that both I^(t) and its complement I^(v) = (I^(t))^c = {1, . . . , n} \ I^(t) are non-empty. The hold-out estimator of the risk of A(Dn) with training set I^(t) is defined by

    L̂^HO(A; Dn; I^(t)) := (1/nv) Σ_{i∈I^(v)} γ( A(Dn^(t)); (Xi, Yi) ),   (8)

where Dn^(t) := (ξi)_{i∈I^(t)} is the training sample, of size nt = Card(I^(t)), and Dn^(v) := (ξi)_{i∈I^(v)} is the validation sample, of size nv = n − nt; I^(v) is called the validation set. The question of choosing nt, and I^(t) given its cardinality nt, is discussed in the rest of this survey.
4.2.2 General definition of cross-validation

A general description of the CV strategy has been given by Geisser (1975): in brief, CV consists in averaging several hold-out estimators of the risk corresponding to different splits of the data. Formally, let B ≥ 1 be an integer and I1^(t), . . . , IB^(t) be a sequence of non-empty proper subsets of {1, . . . , n}. The CV estimator of the risk of A(Dn) with training sets (Ij^(t))_{1≤j≤B} is defined by

    L̂^CV( A; Dn; (Ij^(t))_{1≤j≤B} ) := (1/B) Σ_{j=1}^B L̂^HO( A; Dn; Ij^(t) ).   (9)

All existing CV estimators of the risk are of the form (9), each one being uniquely determined by the way the sequence (Ij^(t))_{1≤j≤B} is chosen, that is, by the choice of the splitting scheme.
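As an illustration of definitions (8) and (9), here is a minimal sketch of the hold-out and CV estimators of the risk for a generic algorithm. The function names (`holdout_risk`, `cv_risk`) and the toy constant-predictor example are ours, not the survey's.

```python
import numpy as np

rng = np.random.default_rng(1)

def holdout_risk(fit, contrast, X, y, train_idx):
    """Hold-out estimator (8): train on I^(t), average the contrast on I^(v)."""
    n = len(y)
    val_idx = np.setdiff1d(np.arange(n), train_idx)
    predictor = fit(X[train_idx], y[train_idx])
    return np.mean([contrast(predictor, X[i], y[i]) for i in val_idx])

def cv_risk(fit, contrast, X, y, train_sets):
    """CV estimator (9): average of hold-out estimators over B splits."""
    return np.mean([holdout_risk(fit, contrast, X, y, I) for I in train_sets])

# Toy algorithm: the constant predictor (empirical mean), squared-error contrast.
fit = lambda X, y: (lambda x, m=y.mean(): m)
contrast = lambda t, x, y: (t(x) - y) ** 2

X = rng.normal(size=40)
y = 2.0 + rng.normal(size=40)
splits = [rng.choice(40, size=30, replace=False) for _ in range(5)]
risk_hat = cv_risk(fit, contrast, X, y, splits)
```

With B = 1 the CV estimator reduces to the hold-out estimator, exactly as (9) reduces to (8).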
Note that when CV is used in model selection for identification, an alternative definition of CV was proposed by Yang (2006, 2007) and called CV with voting (CV-v). When two algorithms A1 and A2 are compared, A1 is selected by CV-v if and only if L̂^HO(A1; Dn; Ij^(t)) < L̂^HO(A2; Dn; Ij^(t)) for a majority of the splits j = 1, . . . , B. By contrast, CV procedures of the form (9) can be called CV with averaging (CV-a), since the estimates of the risk of the algorithms are averaged before their comparison.
Leave-one-out (LOO) corresponds to B = n and Ij^(t) = {1, . . . , n} \ {j}, so that

    L̂^LOO( A; Dn ) = (1/n) Σ_{j=1}^n γ( A(Dn^(−j)); ξj ),   (10)

where Dn^(−j) = (ξi)_{i≠j}. The name LOO can be traced back to papers by Picard and Cook (1984) and by Breiman and Spector (1992), but LOO has several other names in the literature, such as delete-one CV (see Li, 1987), ordinary CV (Stone, 1974; Burman, 1989), or even only CV (Efron, 1983; Li, 1987).
Leave-p-out (LPO, Shao, 1993) with p ∈ {1, . . . , n − 1} is the exhaustive CV with nt = n − p: every possible set of p data points is successively left out of the sample and used for validation. Therefore, LPO is defined by (9) with B = (n choose p), the validation sets ranging over all the subsets of {1, . . . , n} of size p. LPO is also called delete-p CV or delete-p multifold CV (Zhang, 1993). Note that LPO with p = 1 is LOO.
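The splitting schemes behind LOO and LPO can be sketched as follows; the helper names are ours, and each returned family of training sets is meant to be plugged into an estimator of the form (9).

```python
from itertools import combinations

def loo_train_sets(n):
    """LOO: the n training sets {0,...,n-1} \\ {j}, one per left-out point."""
    full = set(range(n))
    return [sorted(full - {j}) for j in range(n)]

def lpo_train_sets(n, p):
    """LPO: one training set per left-out subset of size p; B = C(n, p)."""
    full = set(range(n))
    return [sorted(full - set(v)) for v in combinations(range(n), p)]
```

For p = 1 the two families coincide, which is the plain combinatorial form of the remark that LPO with p = 1 is LOO.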
V-fold CV (VFCV) relies on a preliminary partition (Aj)_{1≤j≤V} of {1, . . . , n} into V blocks of approximately equal size; each block successively plays the role of the validation set, so that VFCV is defined by (9) with B = V and Ij^(t) = Aj^c, where Dn^(−Aj) = (ξi)_{i∈Aj^c}. By construction, the algorithmic complexity of VFCV is only V times that of training A with n − n/V data points, which is much less than that of LOO or LPO if V ≪ n. Note that VFCV with V = n is LOO.
Monte-Carlo CV (MCCV, Picard and Cook, 1984) is very close to RLT: B independent subsets of {1, . . . , n} are randomly drawn, with uniform distribution among the subsets of size nt. The only difference with RLT is that MCCV allows the same split to be chosen several times.
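A minimal sketch of the MCCV splitting scheme (names are ours); as with the other schemes, the output is a family of training sets for estimator (9).

```python
import random

def mccv_train_sets(n, n_t, B, seed=0):
    """MCCV: draw B training sets of size n_t uniformly at random and
    independently, so the same split may appear twice (unlike RLT)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n), n_t)) for _ in range(B)]
```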
Bias-corrected versions of the VFCV and RLT risk estimators have been proposed by Burman (1989, 1990), and a closely related penalization procedure called V-fold penalization has been defined by Arlot (2008c); see Section 5.1.2 for details.

Analytic Approximation. When CV is used for selecting among linear models, Shao (1993) proposed an analytic approximation to LPO with p ∼ n, which is called APCV.

LOO bootstrap and .632 bootstrap. The bootstrap is often used for stabilizing an estimator or an algorithm, replacing A(Dn) by the average of A(Dn⋆) over several bootstrap resamples Dn⋆. This idea was applied by Efron (1983) to the LOO estimator of the risk, leading to the LOO bootstrap. Noting that the LOO bootstrap was biased, Efron (1983) gave a heuristic argument leading to the .632 bootstrap estimator of the risk, later modified into the .632+ bootstrap by Efron and Tibshirani (1997). The main drawback of these procedures is the weakness of their theoretical justifications. Only empirical studies have supported the good behaviour of the .632+ bootstrap (Efron and Tibshirani, 1997; Molinaro et al., 2005).
controllable and uncontrollable data splits were distinguished; an instance of uncontrollable division can be found in the book by Simon (1971).

A primitive LOO procedure was used by Hills (1966) and by Lachenbruch and Mickey (1968) for evaluating the error rate of a prediction rule, and a primitive formulation of LOO can be found in a paper by Mosteller and Tukey (1968). Nevertheless, LOO was actually introduced independently by Stone (1974), by Allen (1974) and by Geisser (1975). The relationship between LOO and the jackknife (Quenouille, 1949), which both rely on the idea of removing one observation from the sample, has been discussed by Stone (1974) for instance.

The hold-out and CV were originally used only for estimating the risk of an algorithm. The idea of using CV for model selection arose in the discussion of a paper by Efron and Morris (1973) and in a paper by Geisser (1974). The first author to study LOO as a model selection procedure was Stone (1974), who proposed to use LOO again for estimating the risk of the selected model.
5.1 Bias

Dealing with the bias incurred by CV estimates can be done by two strategies: evaluating the amount of bias in order to choose the least biased CV procedure, or correcting for this bias.

Therefore, assuming that Card(Ij^(t)) = nt for j = 1, . . . , B, the expectation of the CV estimator of the risk only depends on nt:

    E[ L̂^CV( A; Dn; (Ij^(t))_{1≤j≤B} ) ] = E[ L_P( A(D_{nt}) ) ].   (12)

In particular, (12) shows that the bias of the CV estimator of the risk of A is the difference between the risks of A computed respectively with nt and n data points. Since nt < n, the bias of CV is usually nonnegative, which can be proved rigorously when the risk of A is a decreasing function of n, that is, when A is a smart rule; note however that a classical algorithm such as the 1-nearest-neighbour rule in classification is not smart (Devroye et al., 1996, Section 6.8). Similarly, the bias of CV tends to decrease with nt, which is rigorously true if A is smart.
More precisely, (12) has led to several results on the bias of CV, which can be split into three main categories: asymptotic results (A is fixed and the sample size n tends to infinity), non-asymptotic results (where A is allowed to make use of a number of parameters growing with n, say n^{1/2}, as often in model selection), and empirical results. They are listed below by statistical framework.

Classification. For the simple problem of discriminating between two populations with shifted distributions, Davison and Hall (1992) compared the asymptotic bias of LOO and bootstrap, showing the superiority of LOO when the shift size is n^{−1/2}: as n tends to infinity, the bias of LOO stays of order n^{−1}, whereas that of bootstrap worsens to the order n^{−1/2}. On realistic synthetic and real biological data, Molinaro et al. (2005) compared the bias of LOO, VFCV and .632+ bootstrap: the bias decreases with nt, and is generally minimal for LOO. Nevertheless, the 10-fold CV bias is nearly minimal uniformly over their experiments. In the same experiments, the .632+ bootstrap exhibits the smallest bias for moderate sample sizes and small signal-to-noise ratios, but a much larger bias otherwise.
A′ : Dn ↦ A_{m̂(Dn)}(Dn), and then to compute L̂^CV( A′; Dn ). The resulting corrected VFCV estimator is

    L̂^corrVF(A; Dn) = L̂^VF(A; Dn) + L_{Pn}( A(Dn) ) − (1/V) Σ_{j=1}^V L_{Pn}( A(Dn^(−Aj)) ),

and a corrected RLT estimator was defined similarly. Both estimators have been proved to be asymptotically unbiased for least-squares estimators in linear regression.

When the Aj's have exactly the same size n/V, the corrected VFCV criterion is equal to the sum of the empirical risk and the V-fold penalty (Arlot, 2008c), defined by

    pen_VF(A; Dn) = ((V − 1)/V) Σ_{j=1}^V [ L_{Pn}( A(Dn^(−Aj)) ) − L_{Pn^(−Aj)}( A(Dn^(−Aj)) ) ].

The V-fold penalized criterion was proved to be (almost) unbiased in the non-asymptotic framework for regressogram estimators.
5.2 Variance

CV estimators of the risk using training sets of the same size nt have the same bias, but they still behave quite differently; their variance var( L̂^CV( A; Dn; (Ij^(t))_{1≤j≤B} ) ) captures most of the information needed to explain these differences.

Influence of (nt, nv). Let us consider the hold-out estimator of the risk. Following in particular Nadeau and Bengio (2003),

    var[ L̂^HO( A; Dn; I^(t) ) ] = E[ var( L_{Pn^(v)}( A(Dn^(t)) ) | Dn^(t) ) ] + var[ L_P( A(D_{nt}) ) ]
                                = (1/nv) E[ var( γ(ŝ; ξ) | ŝ = A(Dn^(t)) ) ] + var[ L_P( A(D_{nt}) ) ].   (13)

The first term, proportional to 1/nv, shows that more data for validation decreases the variance of L̂^HO, because it yields a better estimator of L_P( A(Dn^(t)) ). The second term shows that the variance of L̂^HO also depends on the distribution of L_P( A(Dn^(t)) ) around its expectation; in particular, it
Stability and variance. When A is unstable, L̂^LOO( A ) has often been pointed out as a variable estimator (Section 7.10, Hastie et al., 2001; Breiman, 1996). Conversely, this trend disappears when A is stable, as noticed by Molinaro et al. (2005) from a simulation experiment.

The relation between the stability of A and the variance of L̂^CV( A ) was pointed out by Devroye and Wagner (1979) in classification, through upper bounds on the variance of L̂^LOO( A ). Bousquet and Elisseeff (2002) extended these results to the regression setting, and proved upper bounds on the maximal upward deviation of L̂^LOO( A ).

Note finally that several approaches based on the bootstrap have been proposed for reducing the variance of L̂^LOO( A ), such as the LOO bootstrap, the .632 bootstrap and the .632+ bootstrap (Efron, 1983); see also Section 4.3.3.
An asymptotic closed-form expansion of the variance of the VFCV estimator has the form

    var( L̂^VF(A) ) = a/n + (1/n²) [ b + c/(V − 1) + d/(V − 1)² + e/(V − 1)³ ] + o( n^{−2} ),

for nonnegative constants a, b, c, d, e depending on the framework.
The asymptotic variance of the VFCV estimator of the risk decreases with V, implying that LOO asymptotically has the minimal variance.

Non-asymptotic closed-form formulas for the variance of the LPO estimator of the risk have been proved by Celisse (2008b) in regression, for projection and kernel estimators for instance. On the variance of RLT in the regression setting, see the asymptotic results of Girard (1998) for Nadaraya-Watson kernel estimators, as well as the non-asymptotic computations and simulation experiments by Nadeau and Bengio (2003) with several learning algorithms.

Classification. For the simple problem of discriminating between two populations with shifted distributions, Davison and Hall (1992) showed that the gap between the asymptotic variances of LOO and bootstrap becomes larger when the data are noisier. Nadeau and Bengio (2003) made non-asymptotic computations and simulation experiments with several learning algorithms. Hastie et al. (2001) empirically showed that VFCV has a minimal variance for some 2 < V < n, whereas LOO usually has a large variance; this fact certainly depends on the stability of the algorithm considered, as shown by simulation experiments by Molinaro et al. (2005).
6.1 Relationship between risk estimation and model selection

As shown in Section 3.1, minimizing an unbiased estimator of the risk leads to an efficient model selection procedure. One could conclude here that the best CV procedure for estimation is the one with the smallest bias and variance (at least asymptotically), for instance, LOO in the least-squares regression framework (Burman, 1989).

Nevertheless, the best CV estimator of the risk is not necessarily the best model selection procedure. For instance, Breiman and Spector (1992) observed that uniformly over the models, the best risk estimator is LOO, whereas 10-fold CV is more accurate for model selection. Three main reasons for such a difference can be invoked. First, the asymptotic framework (A fixed, n → ∞) may not apply to models close to the oracle, which typically has a dimension growing with n when s does not belong to any model. Second, as explained in Section 3.2, estimating the risk of each model with some bias can be beneficial and compensate for the effect of a large variance, in particular when the signal-to-noise ratio is small. Third, for model selection, what matters is not that every estimate of the risk has a small bias and variance, but rather that the risk estimates rank the models in the correct order with the largest probability for models m1, m2 near the oracle.

Therefore, specific studies are required to evaluate the performances of the various CV procedures in terms of model selection efficiency. In most frameworks, the model selection performance directly follows from the properties of CV as an estimator of the risk, but not always.
The above results have been proved by Shao (1997) for LPO (see also Li, 1987, for the LOO); they also hold for RLT when B ≫ n², since RLT is then equivalent to LPO (Zhang, 1993).

In a general statistical framework, the model selection performance of MCCV, VFCV, LOO, LOO bootstrap, and .632 bootstrap for selection among minimum contrast estimators was studied in a series of papers (van der Laan and Dudoit, 2003; van der Laan et al., 2004, 2006; van der Vaart et al., 2006); these results apply in particular to least-squares regression and density estimation. It turns out that under mild conditions, an oracle-type inequality is proved, showing that up to a multiplying factor Cn → 1, the risk of CV is smaller than the minimum of the risks of the models with a sample size nt. In particular, in most frameworks, this implies the asymptotic optimality of CV as soon as nt ∼ n. When nt ∼ κn with κ ∈ (0, 1), this naturally generalizes Shao's results.
non-asymptotic oracle inequalities with constant C > 1 have been proved by Celisse (2008b) for the LPO when p/n ∈ [a, b] for some 0 < a < b < 1.

The performance of CV for selecting the bandwidth of kernel density estimators has been studied in several papers. With the least-squares contrast, the efficiency of LOO was proved by Hall (1983) and generalized to the multivariate framework by Stone (1984); an oracle inequality asymptotically leading to efficiency was recently proved by Dalelane (2005). With the Kullback-Leibler divergence, CV can suffer from troubles in performing model selection (see also Schuster and Gregory, 1981; Chow et al., 1987). The influence of the tails of the target s was studied by Hall (1987), who gave conditions under which CV is efficient and the chosen bandwidth is optimal at first order.
Classification. In the framework of binary classification by intervals (that is, with X = [0, 1] and piecewise constant classifiers), Kearns et al. (1997) proved an oracle inequality for the hold-out. Furthermore, empirical experiments show that CV almost always yields the best performance, compared to deterministic penalties (Kearns et al., 1997). On the contrary, simulation experiments by Bartlett et al. (2002) in the same setting showed that random penalties such as the Rademacher complexity and the maximal discrepancy usually perform much better than the hold-out, which is shown to be more variable.

Nevertheless, the hold-out still enjoys quite good theoretical properties: it was proved to adapt to the margin condition by Blanchard and Massart (2006), a property nearly unachievable with usual model selection procedures (see also Massart, 2007, Section 8.5). This suggests that CV procedures are naturally adaptive to several unknown properties of the data in the statistical learning framework.

The performance of the LOO in binary classification was related to the stability of the candidate algorithms by Kearns and Ron (1999); they proved oracle-type inequalities called sanity-check bounds, describing the worst-case performance of the LOO (see also Bousquet and Elisseeff, 2002).

An experimental comparison of several CV methods and bootstrap-based CV (in particular the .632+ bootstrap) in classification can also be found in papers by Efron (1986) and Efron and Tibshirani (1997).

When the goal of model selection is identification, a procedure m̂ is model consistent when

    P( m̂(Dn) = m0 ) → 1 as n → ∞.

Classical model selection procedures built for identification, such as BIC, are described in Section 3.3.
estimation principle, which is only efficient when the goal is estimation. Furthermore, estimation and identification are somehow contradictory goals, as explained in Section 2.4.

This intuition about the inconsistency of some CV procedures is confirmed by several theoretical results. Shao (1993) proved that several CV methods are inconsistent for variable selection in linear regression: LOO, LPO, and BICV when lim inf_{n→∞}( nt/n ) > 0. Even if these CV methods asymptotically select all the true variables with probability 1, the probability that they select too many variables does not tend to zero. More generally, Shao (1997) proved that CV procedures behave asymptotically like GIC_{λn} with λn = 1 + n/nt, which leads to inconsistency as soon as n/nt = O(1).

In the context of ordered variable selection in linear regression, Zhang (1993) computed the asymptotic value of the probability of selecting the true model for several CV procedures. He also numerically compared the values of this probability for the same CV procedures in a specific example. For LPO with p/n → λ ∈ (0, 1) as n tends to +∞, P( m̂ = m0 ) increases with λ. The result is slightly different for VFCV: P( m̂ = m0 ) increases with V (hence, it is maximal for the LOO, which is the worst case among LPO procedures). The variability induced by the number V of splits seems to be more important here than the bias of VFCV. Nevertheless, P( m̂ = m0 ) is almost constant between V = 10 and V = n, so that taking V > 10 is not advised, for computational reasons.

These results suggest that if the training sample size nt is negligible in front of n, then model consistency could be obtained. This has been confirmed theoretically by Shao (1993, 1997) for the variable selection problem in linear regression: CV is consistent when n ≫ nt, in particular RLT, BICV (defined in Section 4.3.2) and LPO with p = pn ∼ n and n − pn → ∞.

Therefore, when the goal is to identify the true model, a larger proportion of the data should be put in the validation set in order to improve the performance. This phenomenon is somewhat related to the cross-validation paradox (Yang, 2006).
under some conditions on ||ŝi − s||_p for p = 2, 4, ∞. In the classification framework, if the risk of ŝi is measured by P( ŝi ≠ s ), Yang (2006) proved the same consistency result for CV-v under the condition

    nv, nt → ∞   and   √nv · max_i { r_{nt,i}² } / s_{nt} → ∞.   (15)
They proved that smoothed CV yields an excellent asymptotic model selection performance under various smoothness conditions on the density.

Second, when the goal is to estimate the density at one point (and not globally), Hall and Schucany (1989) proposed a local version of CV and proved its asymptotic optimality.
the bias-
orre
ted CV by Burman (1989) (see Se
tion 5.1). Simulation experi-
ments in several (short range) dependent frameworks show that this
orre
tive
term matters when h/n is not small, in parti
ular when n is small.
The partitioned CV has been proposed by Chu and Marron (1991) for bandwidth selection: an integer g > 0 is chosen, a bandwidth bk is chosen by CV based upon the subsample (ξ_{k+gj})_{j≥0} for each k = 1, . . . , g, and the selected bandwidth is a combination of the (bk).
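The partitioned-CV recipe above can be sketched in a few lines. This is a minimal illustrative sketch, not Chu and Marron's exact procedure: the Gaussian-KDE least-squares CV selector (`lscv_score`, `select_bandwidth`), the bandwidth grid, and the g^(−1/5) rescaling of the averaged subsample bandwidths (reflecting the usual n^(−1/5) optimal-bandwidth scaling) are all assumptions made here, and i.i.d. data are used for simplicity even though the method is designed for dependent data.

```python
import numpy as np

def lscv_score(x, h):
    # Least-squares CV score for a Gaussian KDE with bandwidth h:
    # integral of fhat^2 minus (2/n) * sum of the leave-one-out density
    # estimates at the x_i; both terms have closed forms for this kernel.
    n = len(x)
    d = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * d ** 2) / np.sqrt(2 * np.pi)
    K2 = np.exp(-0.25 * d ** 2) / np.sqrt(4 * np.pi)  # kernel convolved with itself
    int_f2 = K2.sum() / (n ** 2 * h)
    loo_sum = (K.sum() - np.trace(K)) / ((n - 1) * h)
    return int_f2 - 2.0 * loo_sum / n

def select_bandwidth(x, grid):
    # plain grid search over candidate bandwidths
    return grid[np.argmin([lscv_score(x, h) for h in grid])]

def partitioned_cv_bandwidth(x, g, grid):
    # One bandwidth b_k per interleaved subsample (x_k, x_{k+g}, ...),
    # combined by averaging and rescaling by g**(-1/5), since optimal
    # bandwidths scale as (sample size)**(-1/5).
    b = [select_bandwidth(x[k::g], grid) for k in range(g)]
    return np.mean(b) * g ** (-1.0 / 5.0)

rng = np.random.default_rng(0)
x = rng.normal(size=300)
grid = np.linspace(0.05, 1.0, 20)
bw = partitioned_cv_bandwidth(x, 5, grid)
```

The interleaved subsamples x[k::g] are the point of the construction: under short-range dependence, observations g steps apart are nearly independent, so ordinary CV on each subsample behaves well.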
When a parametric model is available for the dependency structure, Hart (1994) proposed the time series CV.
9 Closed-form formulas and fast computation

Resampling strategies, like CV, are known to be time-consuming. The naive implementation of CV has a computational complexity of B times the complexity of training each algorithm A, which is usually intractable for LPO, even with p = 1. The computational cost of VFCV or RLT can still be quite high when B > 10 in many practical problems. Nevertheless, closed-form formulas for CV estimators of the risk can be obtained in several frameworks, which greatly decreases the computational cost of CV.
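As a concrete example of such a closed-form formula, the leave-one-out (LPO with p = 1) least-squares risk estimator of a regular histogram on [0, 1] with D bins reduces to an expression in the bin counts N_k alone: CV(h) = 2/((n−1)h) − (n+1) Σ_k N_k² / (n²(n−1)h), with h = 1/D. This is a classical computation in the spirit of Rudemo (1982); the sketch below (helper names and the Beta-distributed sample are illustrative, not from the survey) checks the closed form against the naive leave-one-out computation.

```python
import numpy as np

def hist_counts(x, D):
    # bin counts N_k on the regular partition of [0, 1) into D bins
    idx = np.minimum((x * D).astype(int), D - 1)
    return np.bincount(idx, minlength=D)

def loo_cv_closed_form(x, D):
    # CV(h) = 2/((n-1)h) - (n+1) * sum_k N_k^2 / (n^2 (n-1) h), h = 1/D
    n, h = len(x), 1.0 / D
    N = hist_counts(x, D)
    return 2.0 / ((n - 1) * h) - (n + 1) * np.sum(N ** 2) / (n ** 2 * (n - 1) * h)

def loo_cv_explicit(x, D):
    # Brute force: integral of shat^2 minus twice the mean of the n
    # leave-one-out estimates, each evaluated at its held-out point.
    n, h = len(x), 1.0 / D
    N = hist_counts(x, D)
    int_sq = np.sum((N / (n * h)) ** 2) * h
    idx = np.minimum((x * D).astype(int), D - 1)
    return int_sq - 2.0 * np.mean((N[idx] - 1) / ((n - 1) * h))

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=200)  # any sample supported on [0, 1]
scores = {D: (loo_cv_closed_form(x, D), loo_cv_explicit(x, D)) for D in (5, 10, 20)}
```

The closed form costs O(n) per candidate D, versus O(n²) for the naive loop over held-out points, which is the kind of saving this section refers to.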
10 Conclusion: which cross-validation method for which problem?

This conclusion collects a few guidelines aiming at helping CV users, first interpreting the results of CV, second appropriately using CV in each specific problem.

Bias: CV roughly estimates the risk of a model with a sample size nt < n (see Section 5.1). Usually, this implies that CV overestimates the variance term compared to the bias term in the bias-variance decomposition (2) with sample size n.

When the goal is estimation and the signal-to-noise ratio (SNR) is large, the smaller the bias, the better, which is obtained by taking nt ∼ n. Otherwise, CV can be asymptotically suboptimal. Nevertheless, when the goal is estimation and the SNR is small, keeping a small upward bias for the variance term often improves the performance, which is obtained by taking nt ∼ κn with κ ∈ (0, 1). See Section 6.

When the goal is identification, a large bias is often needed, which is obtained by taking nt ≪ n; depending on the framework, larger values of nt can also lead to model consistency, see Section 7.
Nevertheless, in density estimation, closed-form expressions of the LPO estimator have been derived by Celisse and Robin (2008) for histograms and kernel estimators, and by Celisse (2008a) for projection estimators. These expressions make it possible to perform LPO without additional computational cost, which reduces the aforementioned trade-off to the easier bias-variability trade-off. In particular, Celisse and Robin (2008) proposed to choose p for LPO by minimizing a criterion defined as the sum of a squared bias term and a variance term (see also Politis et al., 1999, Chapter 9).
On the one hand, the bias of VFCV decreases with V, since nt = n(1 − 1/V) observations are used in the training set. On the other hand, the variance of VFCV decreases with V for small values of V, whereas the LOO (V = n) is known to suffer from a high variance in several frameworks such as classification or density estimation. Note however that the variance of VFCV is minimal for V = n in some frameworks like linear regression (see Section 5.2). Furthermore, estimating the variance of VFCV from data is a difficult problem in general, see Section 5.2.3.
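The dependence of the bias on nt = n(1 − 1/V) can be made concrete with a small sketch. The VFCV estimator below is generic; the polynomial regression example and the helper names (`vfold_cv_risk`, `poly_fit`) are illustrative choices, not from the survey.

```python
import numpy as np

def vfold_cv_risk(x, y, fit, V):
    # V-fold CV risk estimate: each fold trains on nt = n(1 - 1/V)
    # points, which is the source of the V-dependent bias.
    n = len(x)
    folds = np.array_split(np.arange(n), V)  # data assumed already in random order
    errs = []
    for k in range(V):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(V) if j != k])
        pred = fit(x[train], y[train])
        errs.append(np.mean((y[val] - pred(x[val])) ** 2))
    return np.mean(errs)

def poly_fit(degree):
    # illustrative learner: least-squares polynomial regression
    def fit(xt, yt):
        coefs = np.polyfit(xt, yt, degree)
        return lambda xv: np.polyval(coefs, xv)
    return fit

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = x ** 2 + 0.1 * rng.normal(size=100)
# compare candidate models by their 5-fold CV risk estimates
risks = {d: vfold_cv_risk(x, y, poly_fit(d), V=5) for d in (1, 2, 3)}
```

With a quadratic signal and small noise, the CV risk of the degree-2 model should be well below that of the degree-1 model, as the survey's estimation-goal discussion predicts.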
When the goal of model selection is estimation, it is often reported in the literature that the optimal V is between 5 and 10, because the statistical performance does not increase much for larger values of V, and averaging over 5 or 10 splits remains computationally feasible (Hastie et al., 2001, Section 7.10). Even if this claim is clearly true for many problems, the conclusion of this survey is that better statistical performance can sometimes be obtained with other values of V, for instance depending on the SNR value.
When the SNR is large, the asymptotic comparison of CV procedures recalled in Section 6.2 can be trusted: LOO performs (nearly) unbiased risk estimation, hence is asymptotically optimal, whereas VFCV with V = O(1) is suboptimal. On the contrary, when the SNR is small, overpenalization can improve the performance. Therefore, VFCV with V < n can yield a smaller risk than LOO thanks to its bias and despite its variance when V is small (see simulation experiments by Arlot, 2008c). Furthermore, other CV procedures like RLT can be interesting alternatives to VFCV, since they allow one to choose the bias (through nt) independently from B, which mainly governs the variance. Another possible alternative is V-fold penalization, which is related to corrected VFCV (see Section 4.3.3).
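A minimal sketch of RLT as described above, in which the training-set size nt (governing the bias) and the number of splits B (mainly governing the variance) are chosen independently; the linear-regression example and the parameter values are illustrative, not from the survey.

```python
import numpy as np

def rlt_risk(x, y, nt, B, seed=0):
    # Repeated learning-testing: B independent random splits, each with
    # a training set of size nt; here the learner is a simple
    # least-squares linear fit, chosen only for illustration.
    rng = np.random.default_rng(seed)
    n, errs = len(x), []
    for _ in range(B):
        perm = rng.permutation(n)
        train, val = perm[:nt], perm[nt:]
        coefs = np.polyfit(x[train], y[train], 1)
        errs.append(np.mean((y[val] - np.polyval(coefs, x[val])) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)
# nt and B are tuned separately, unlike VFCV where nt = n(V-1)/V ties them together
risk = rlt_risk(x, y, nt=80, B=25)
```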
When the goal of model selection is identification, the main drawback of VFCV is that nt ≪ n is often required for choosing the true model consistently (see Section 7), whereas VFCV does not allow nt < n/2. Depending on the framework, different (empirical) recommendations for choosing V can be found in the literature. In ordered variable selection, the largest V seems to be the best, with V = 10 providing results close to the optimal ones (Zhang, 1993). On the contrary, Dietterich (1998) and Alpaydin (1999) recommend V = 2 for choosing the best learning procedure among two candidates.
that takes into account the variance of CV when comparing CV methods such as VFCV and RLT with nt = n(V − 1)/V but B ≠ V.
References

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217.

Allen, D. M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127.

Anderson, R. L., Allen, D. M., and Cady, F. B. (1972). Selection of predictor variables in linear multiple regression. In Bancroft, T. A., editor, Statistical Papers in Honor of George W. Snedecor. Iowa State University Press.

Arlot, S. (2007). Resampling and Model Selection. PhD thesis, University Paris-Sud 11. oai:tel.archives-ouvertes.fr:tel-00198803_v1.

Arlot, S. (2008a). Model selection by resampling penalization. Electronic Journal of Statistics. To appear. oai:hal.archives-ouvertes.fr:hal-00262478_v1.

Baraud, Y. (2002). Model selection for regression on a random design. ESAIM Probab. Statist., 6:127-146 (electronic).

Barron, A., Birgé, L., and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301-413.

Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48:85-113.
Bellman, R. E. and Dreyfus, S. E. (1962). Applied Dynamic Programming. Princeton.

Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203-268.

Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73.

Blanchard, G. and Massart, P. (2006). Discussion: Local Rademacher complexities and oracle inequalities in risk minimization [Ann. Statist. 34 (2006), no. 6, 2593-2656] by V. Koltchinskii. Ann. Statist., 34(6):2664-2671.

Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM Probab. Stat., 9:323-375 (electronic).

Bousquet, O. and Elisseeff, A. (2002). Stability and Generalization. J. Machine Learning Research, 2:499-526.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Wadsworth Statistics/Probability Series. Wadsworth Advanced Books and Software, Belmont, CA.

Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case. International Statistical Review, 60(3):291-319.

Burman, P., Chow, E., and Nolan, D. (1994). A cross-validatory method for dependent data. Biometrika, 81(2):351-358.
Burnham, K. P. and Anderson, D. R. (2002). Model selection and multimodel inference. Springer-Verlag, New York, second edition. A practical information-theoretic approach.

Chu, C.-K. and Marron, J. S. (1991). Comparison of two bandwidth selectors with dependent errors. Ann. Statist., 19(4):1906-1918.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377-403.

Dalelane, C. (2005). Exact oracle inequality for sharp adaptive kernel density estimator. Technical report, arXiv.

Devroye, L., Györfi, L., and Lugosi, G. (1996). A probabilistic theory of pattern recognition, volume 31 of Applications of Mathematics (New York). Springer-Verlag, New York.

Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316-331.

Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81(394):461-470.
Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. J. Amer. Statist. Assoc., 99(467):619-642. With comments and a rejoinder by the author.

Fromont, M. (2007). Model selection by bootstrap penalization for classification. Mach. Learn., 66(2-3):165-207.

Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70:320-328.

Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press, Cambridge, MA, USA.

Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A distribution-free theory of nonparametric regression. Springer Series in Statistics. Springer-Verlag, New York.

Hall, P. (1987). On Kullback-Leibler loss and density estimation. The Annals of Statistics, 15(4):1491-1519.

Hall, P., Lahiri, S. N., and Polzehl, J. (1995). On bandwidth choice in nonparametric regression with both short- and long-range dependent errors. Ann. Statist., 23(6):1921-1936.

Härdle, W., Hall, P., and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? J. Amer. Statist. Assoc., 83(401):86-101. With comments by David W. Scott and Iain Johnstone and a reply by the authors.
Hart, J. D. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. J. Roy. Statist. Soc. Ser. B, 56(3):529-542.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning. Springer Series in Statistics. Springer-Verlag, New York. Data mining, inference, and prediction.

Hesterberg, T. C., Choi, N. H., Meier, L., and Fraley, C. (2008). Least angle and l1 penalized regression: A review. Statistics Surveys, 2:61-93 (electronic).

Hills, M. (1966). Allocation Rules and their Error Rates. J. Royal Statist. Soc. Series B, 28(1):1-31.

Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2):297-307.

Kearns, M., Mansour, Y., Ng, A. Y., and Ron, D. (1997). An Experimental and Theoretical Comparison of Model Selection Methods. Machine Learning, 27:7-50.

Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47(5):1902-1914.
Leung, D., Marriott, F., and Wu, E. (1993). Bandwidth selection in robust smoothing. J. Nonparametr. Statist., 2:333-339.

Li, K.-C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. Ann. Statist., 13(4):1352-1377.

Markatou, M., Tian, H., Biswas, S., and Hripcsak, G. (2005). Analysis of variance of cross-validation estimators of the generalization error. J. Mach. Learn. Res., 6:1127-1168 (electronic).

Massart, P. (2007). Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6-23, 2003. With a foreword by Jean Picard.

Molinaro, A. M., Simon, R., and Pfeiffer, R. M. (2005). Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21(15):3301-3307.

Mosteller, F. and Tukey, J. W. (1968). Data analysis, including statistics. In Lindzey, G. and Aronson, E., editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley.

Nadeau, C. and Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52:239-281.

Opsomer, J., Wang, Y., and Yang, Y. (2001). Nonparametric regression with correlated errors. Statist. Sci., 16(2):134-153.

Ronchetti, E., Field, C., and Blanchard, W. (1997). Robust linear model selection by cross-validation. J. Amer. Statist. Assoc., 92:1017-1023.
Rudemo, M. (1982). Empirical Choice of Histograms and Kernel Density Estimators. Scandinavian Journal of Statistics, 9:65-78.

Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica, 7(2):221-264. With comments and a rejoinder by the author.

Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika, 71(1):43-49.

Stone, C. (1984). An asymptotically optimal window selection rule for kernel density estimates. The Annals of Statistics, 12(4):1285-1297.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. J. Royal Statist. Soc. Series B, 58(1):267-288.

van der Laan, M. J., Dudoit, S., and Keles, S. (2004). Asymptotic optimality of likelihood-based cross-validation. Stat. Appl. Genet. Mol. Biol., 3:Art. 4, 27 pp. (electronic).
van der Laan, M. J., Dudoit, S., and van der Vaart, A. W. (2006). The cross-validated adaptive epsilon-net estimator. Statist. Decisions, 24(3):373-395.

van der Vaart, A. W., Dudoit, S., and van der Laan, M. J. (2006). Oracle inequalities for multi-fold cross validation. Statist. Decisions, 24(3):351-371.

van Erven, T., Grünwald, P. D., and de Rooij, S. (2008). Catching up faster by switching sooner: A prequential solution to the AIC-BIC dilemma. arXiv:0807.1005.

Wahba, G. (1975). Periodic splines for spectral density estimation: The use of cross validation for determining the degree of smoothing. Communications in Statistics, 4:125-142.

Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937-950.

Yang, Y. (2006). Comparing learning methods for classification. Statist. Sinica, 16(2):635-657.

Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist., 21(1):299-313.