
A survey of cross-validation procedures for model selection

arXiv:0907.4728v1 [math.ST] 27 Jul 2009

July 27, 2009

Sylvain Arlot,
CNRS; Willow Project-Team,
Laboratoire d'Informatique de l'École Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
45, rue d'Ulm, 75230 Paris, France
Sylvain.Arlot@ens.fr

Alain Celisse,
Laboratoire Paul Painlevé, UMR CNRS 8524,
Université des Sciences et Technologies de Lille 1
F-59655 Villeneuve d'Ascq Cedex, France
Alain.Celisse@math.univ-lille1.fr
Abstract

Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its apparent universality. Many results exist on the model selection performances of cross-validation procedures. This survey intends to relate these results to the most recent advances of model selection theory, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results. As a conclusion, guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem in hand.

Contents

1 Introduction 3
  1.1 Statistical framework 3
  1.2 Examples 4
  1.3 Statistical algorithms 5

2 Model selection 6
  2.1 The model selection paradigm 6
  2.2 Model selection for estimation 7
  2.3 Model selection for identification 8
  2.4 Estimation vs. identification 8

3 Overview of some model selection procedures 8
  3.1 The unbiased risk estimation principle 9
  3.2 Biased estimation of the risk 10
  3.3 Procedures built for identification 11
  3.4 Structural risk minimization 11
  3.5 Ad hoc penalization 12
  3.6 Where are cross-validation procedures in this picture? 12

4 Cross-validation procedures 12
  4.1 Cross-validation philosophy 13
  4.2 From validation to cross-validation 13
    4.2.1 Hold-out 13
    4.2.2 General definition of cross-validation 14
  4.3 Classical examples 14
    4.3.1 Exhaustive data splitting 14
    4.3.2 Partial data splitting 15
    4.3.3 Other cross-validation-like risk estimators 16
  4.4 Historical remarks 16

5 Statistical properties of cross-validation estimators of the risk 17
  5.1 Bias 17
    5.1.1 Theoretical assessment of the bias 17
    5.1.2 Correction of the bias 19
  5.2 Variance 19
    5.2.1 Variability factors 19
    5.2.2 Theoretical assessment of the variance 20
    5.2.3 Estimation of the variance 21

6 Cross-validation for efficient model selection 21
  6.1 Relationship between risk estimation and model selection 22
  6.2 The global picture 22
  6.3 Results in various frameworks 23

7 Cross-validation for identification 24
  7.1 General conditions towards model consistency 24
  7.2 Refined analysis for the algorithm selection problem 25

8 Specificities of some frameworks 26
  8.1 Density estimation 26
  8.2 Robustness to outliers 27
  8.3 Time series and dependent observations 27
  8.4 Large number of models 28

9 Closed-form formulas and fast computation 29

10 Conclusion: which cross-validation method for which problem? 30
  10.1 The general picture 30
  10.2 How should the splits be chosen? 31
  10.3 V-fold cross-validation 31
  10.4 Future research 32
1 Introduction

Many statistical algorithms, such as likelihood maximization, least squares and empirical contrast minimization, rely on the preliminary choice of a model, that is, of a set of parameters from which an estimate will be returned. When several candidate models (thus algorithms) are available, choosing one of them is called the model selection problem.

Cross-validation (CV) is a popular strategy for model selection, and more generally algorithm selection. The main idea behind CV is to split the data (once or several times) for estimating the risk of each algorithm: Part of the data (the training sample) is used for training each algorithm, and the remaining part (the validation sample) is used for estimating the risk of the algorithm. Then, CV selects the algorithm with the smallest estimated risk.

Compared to the resubstitution error, CV avoids overfitting because the training sample is independent from the validation sample (at least when data are i.i.d.). The popularity of CV mostly comes from the generality of the data splitting heuristics, which only assumes that data are i.i.d. Nevertheless, theoretical and empirical studies of CV procedures do not entirely confirm this universality. Some CV procedures have been proved to fail for some model selection problems, depending on the goal of model selection: estimation or identification (see Section 2). Furthermore, many theoretical questions about CV remain widely open.

The aim of the present survey is to provide a clear picture of what is known about CV, from both theoretical and empirical points of view. More precisely, the aim is to answer the following questions: What is CV doing? When does CV work for model selection, keeping in mind that model selection can target different goals? Which CV procedure should be used for each model selection problem?

The paper is organized as follows. First, the rest of Section 1 presents the statistical framework. Although not exhaustive, the present setting has been chosen general enough for sketching the complexity of CV for model selection. The model selection problem is introduced in Section 2. A brief overview of some model selection procedures that are important to keep in mind for understanding CV is given in Section 3. The most classical CV procedures are defined in Section 4. Since they are the keystone of the behaviour of CV for model selection, the main properties of CV estimators of the risk for a fixed model are detailed in Section 5. Then, the general performances of CV for model selection are described, when the goal is either estimation (Section 6) or identification (Section 7). Specific properties of CV in some particular frameworks are discussed in Section 8. Finally, Section 9 focuses on the algorithmic complexity of CV procedures, and Section 10 concludes the survey by tackling several practical questions about CV.

1.1 Statistical framework

Assume that some data ξ_1, ..., ξ_n ∈ Ξ with common distribution P are observed. Throughout the paper (except in Section 8.3) the ξ_i are assumed to be independent. The purpose of statistical inference is to estimate from the data (ξ_i)_{1≤i≤n} some target feature s of the unknown distribution P, such as the mean or the variance of P. Let 𝒮 denote the set of possible values for s.

The quality of t ∈ 𝒮, as an approximation of s, is measured by its loss L(t), where L : 𝒮 → ℝ is called the loss function, and is assumed to be minimal for t = s. Many loss functions can be chosen for a given statistical problem. Several classical loss functions are defined by

    L(t) = L_P(t) := E_{ξ∼P}[ γ(t; ξ) ],    (1)

where γ : 𝒮 × Ξ → [0, ∞) is called a contrast function. Basically, for t ∈ 𝒮 and ξ ∈ Ξ, γ(t; ξ) measures how well t is in accordance with the observation ξ, so that the loss of t, defined by (1), measures the average accordance between t and new observations with distribution P. Therefore, several frameworks such as transductive learning do not fit definition (1). Nevertheless, as detailed in Section 1.2, definition (1) includes most classical statistical frameworks.

Another useful quantity is the excess loss

    ℓ(s, t) := L_P(t) − L_P(s) ≥ 0,

which is related to the risk of an estimator ŝ of the target s by

    R(ŝ) = E_{ξ_1,...,ξ_n ∼ P}[ ℓ(s, ŝ) ].

1.2 Examples

The purpose of this subsection is to show that the framework of Section 1.1 includes several important statistical frameworks. This list of examples does not pretend to be exhaustive.

Density estimation aims at estimating the density s of P with respect to some given measure μ on Ξ. Then, 𝒮 is the set of densities on Ξ with respect to μ. For instance, taking γ(t; x) = −ln(t(x)) in (1), the loss is minimal when t = s and the excess loss

    ℓ(s, t) = L_P(t) − L_P(s) = E_{ξ∼P}[ ln( s(ξ) / t(ξ) ) ] = ∫ s ln( s/t ) dμ

is the Kullback-Leibler divergence between the distributions tμ and sμ.

Prediction aims at predicting a quantity of interest Y ∈ 𝒴 given an explanatory variable X ∈ 𝒳 and a sample of observations (X_1, Y_1), ..., (X_n, Y_n). In other words, Ξ = 𝒳 × 𝒴, 𝒮 is the set of measurable mappings 𝒳 → 𝒴 and the contrast γ(t; (x, y)) measures the discrepancy between the observed y and its predicted value t(x). Two classical prediction frameworks are regression and classification, which are detailed below.

Regression corresponds to continuous 𝒴, that is 𝒴 ⊂ ℝ (or ℝ^k for multivariate regression), the feature space 𝒳 being typically a subset of ℝ^ℓ. Let s denote the regression function, that is s(x) = E_{(X,Y)∼P}[ Y | X = x ], so that

    ∀i,  Y_i = s(X_i) + ε_i   with   E[ ε_i | X_i ] = 0.

A popular contrast in regression is the least-squares contrast γ(t; (x, y)) = (t(x) − y)², which is minimal over 𝒮 for t = s, and the excess loss is

    ℓ(s, t) = E_{(X,Y)∼P}[ ( s(X) − t(X) )² ].

Note that the excess loss of t is the square of the L² distance between t and s, so that prediction and estimation are equivalent goals.

Classification corresponds to finite 𝒴 (at least discrete). In particular, when 𝒴 = { 0, 1 }, the prediction problem is called binary (supervised) classification. With the 0-1 contrast function γ(t; (x, y)) = 1_{t(x)≠y}, the minimizer of the loss is the so-called Bayes classifier s defined by

    s(x) = 1_{η(x)≥1/2},

where η denotes the regression function η(x) = P_{(X,Y)∼P}( Y = 1 | X = x ).

Remark that a slightly different framework is often considered in binary classification. Instead of looking only for a classifier, the goal is to estimate also the confidence in the classification made at each point: 𝒮 is the set of measurable mappings 𝒳 → ℝ, the classifier x ↦ 1_{t(x)≥0} being associated to any t ∈ 𝒮. Basically, the larger |t(x)|, the more confident we are in the classification made from t(x). A classical family of losses associated with this problem is defined by (1) with the contrast γ(t; (x, y)) = φ( −(2y − 1) t(x) ), where φ : ℝ → [0, ∞) is some function. The 0-1 contrast corresponds to φ(u) = 1_{u≥0}. The convex loss functions correspond to the case where φ is convex, nondecreasing, with lim_{u→−∞} φ(u) = 0 and φ(0) = 1. Classical examples are φ(u) = max{ 1 + u, 0 } (hinge), φ(u) = exp(u), and φ(u) = log₂( 1 + exp(u) ) (logit). The corresponding losses are used as objective functions by several classical learning algorithms such as support vector machines (hinge) and boosting (exponential and logit).

Many references on classification theory, including model selection, can be found in the survey by Boucheron et al. (2005).

1.3 Statistical algorithms

In this survey, a statistical algorithm A is any (measurable) mapping A : ⋃_{n∈ℕ} Ξⁿ → 𝒮. The idea is that data D_n = (ξ_i)_{1≤i≤n} ∈ Ξⁿ will be used as an input of A, and that the output of A, A(D_n) = ŝ^A(D_n) ∈ 𝒮, is an estimator of s. The quality of A is then measured by L_P( ŝ^A(D_n) ), which should be as small as possible. In the sequel, the algorithm A and the estimator ŝ^A(D_n) are often identified when no confusion is possible.

Minimum contrast estimators form a classical family of statistical algorithms, defined as follows. Given some subset S of 𝒮 that we call a model, a minimum contrast estimator of s is any minimizer of the empirical contrast

    t ↦ L_{P_n}(t) = (1/n) Σ_{i=1}^{n} γ(t; ξ_i),   where   P_n = (1/n) Σ_{i=1}^{n} δ_{ξ_i},

over S. The idea is that the empirical contrast L_{P_n}(t) has an expectation L_P(t) which is minimal over 𝒮 at s. Hence, minimizing L_{P_n}(t) over a set S of candidate values for s hopefully leads to a good estimator of s. Let us now give three popular examples of empirical contrast minimizers:

• Maximum likelihood estimators: take γ(t; x) = −ln(t(x)) in the density estimation setting. A classical choice for S is the set of piecewise constant functions on a regular partition of Ξ with K pieces.

• Least-squares estimators: take γ(t; (x, y)) = (t(x) − y)², the least-squares contrast, in the regression setting. For instance, S can be the set of piecewise constant functions on some fixed partition of 𝒳 (leading to regressograms), or a vector space spanned by the first vectors of a wavelet or Fourier basis, among many others. Note that regularized least-squares algorithms such as the Lasso, ridge regression and spline smoothing also are least-squares estimators, the model S being some ball of a (data-dependent) radius for the L¹ (resp. L²) norm in some high-dimensional space. Hence, tuning the regularization parameter for the Lasso or SVM, for instance, amounts to performing model selection from a collection of models.

• Empirical risk minimizers, following the terminology of Vapnik (1982): take any contrast function γ in the prediction setting. When γ is the 0-1 contrast, popular choices for S lead to linear classifiers, partitioning rules, and neural networks. Boosting and Support Vector Machine classifiers also are empirical contrast minimizers over some data-dependent model S, with contrast γ(t; (x, y)) = φ( −(2y − 1) t(x) ) for some convex function φ.

Let us finally mention that many other classical statistical algorithms can be considered with CV, for instance local average estimators in the prediction framework such as k-Nearest Neighbours and Nadaraya-Watson kernel estimators. The focus will mainly be kept on minimum contrast estimators to keep the length of the survey reasonable.
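To make the notion of minimum contrast estimator concrete, here is a minimal Python sketch (ours, not from the survey; the function names and NumPy-array data layout are assumptions) of a least-squares regressogram: minimizing the empirical least-squares contrast over piecewise constant functions on a fixed partition amounts to averaging the responses within each cell.

```python
import numpy as np

def regressogram(x_train, y_train, bin_edges):
    """Minimum contrast estimator for the least-squares contrast over the
    model of piecewise constant functions on the partition `bin_edges`:
    the minimizer averages the Y_i falling in each cell."""
    cells = np.digitize(x_train, bin_edges)
    means = np.array([y_train[cells == k].mean() if np.any(cells == k) else 0.0
                      for k in range(len(bin_edges) + 1)])
    def predict(x_new):
        return means[np.digitize(x_new, bin_edges)]
    return predict
```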

2 Model selection

Usually, several statistical algorithms can be used for solving a given statistical problem. Let ( ŝ_λ )_{λ∈Λ} denote such a family of candidate statistical algorithms. The algorithm selection problem aims at choosing from data one of these algorithms, that is, choosing some λ̂(D_n) ∈ Λ. Then, the final estimator of s is given by ŝ_{λ̂(D_n)}(D_n). The main difficulty is that the same data are used for training the algorithms, that is, for computing ( ŝ_λ(D_n) )_{λ∈Λ}, and for choosing λ̂(D_n).

2.1 The model selection paradigm

Following Section 1.3, let us focus on the model selection problem, where candidate algorithms are minimum contrast estimators and the goal is to choose a model S. Let ( S_m )_{m∈M_n} be a family of models, that is, S_m ⊂ 𝒮. Let γ be a fixed contrast function, and for every m ∈ M_n, let ŝ_m be a minimum contrast estimator over the model S_m with contrast γ. The goal is to choose m̂(D_n) ∈ M_n from data only.

The choice of a model S_m has to be done carefully. Indeed, when S_m is a small model, ŝ_m is a poor statistical algorithm except when s is very close to S_m, since

    ℓ(s, ŝ_m) ≥ inf_{t∈S_m} { ℓ(s, t) } =: ℓ(s, S_m).

The lower bound ℓ(s, S_m) is called the bias of the model S_m, or approximation error. The bias is a nonincreasing function of S_m.
On the contrary, when S_m is huge, its bias ℓ(s, S_m) is small for most targets s, but ŝ_m clearly overfits. Think for instance of S_m as the set of all continuous functions on [0, 1] in the regression framework. More generally, if S_m is a vector space of dimension D_m, in several classical frameworks,

    E[ ℓ(s, ŝ_m(D_n)) ] ≈ ℓ(s, S_m) + λ D_m    (2)

where λ > 0 does not depend on m. For instance, λ = 1/(2n) in density estimation using the likelihood contrast, and λ = σ²/n in regression using the least-squares contrast and assuming that var( Y | X ) = σ² does not depend on X. The meaning of (2) is that a good model choice should balance the bias term ℓ(s, S_m) and the variance term λ D_m, that is, solve the so-called bias-variance trade-off. By extension, the variance term, also called estimation error, can be defined by

    E[ ℓ(s, ŝ_m(D_n)) ] − ℓ(s, S_m) = E[ L_P(ŝ_m) ] − inf_{t∈S_m} L_P(t),

even when (2) does not hold.

The interested reader can find a much deeper insight into model selection in the Saint-Flour lecture notes by Massart (2007).

Before giving examples of classical model selection procedures, let us mention the two main different goals that model selection can target: estimation and identification.

2.2 Model selection for estimation

On the one hand, the goal of model selection is estimation when ŝ_{m̂(D_n)}(D_n) is used as an approximation of the target s, and the goal is to minimize its loss. For instance, the AIC and Mallows' C_p model selection procedures are built for estimation (see Section 3.1).

The quality of a model selection procedure D_n ↦ m̂(D_n), designed for estimation, is measured by the excess loss of ŝ_{m̂(D_n)}(D_n). Hence, the best possible model choice for estimation is the so-called oracle model S_{m*}, defined by

    m* = m*(D_n) ∈ arg min_{m∈M_n} { ℓ(s, ŝ_m(D_n)) }.    (3)

Since m*(D_n) depends on the unknown distribution P of the data, one cannot expect to select m̂(D_n) = m*(D_n) almost surely. Nevertheless, we can hope to select m̂(D_n) such that ŝ_{m̂(D_n)} is almost as close to s as ŝ_{m*(D_n)}. Note that there is no requirement for s to belong to ⋃_{m∈M_n} S_m.

Depending on the framework, the optimality of a model selection procedure for estimation is assessed in at least two different ways.

First, in the asymptotic framework, a model selection procedure m̂ is called efficient (or asymptotically optimal) when it leads to m̂ such that

    ℓ( s, ŝ_{m̂(D_n)}(D_n) ) / inf_{m∈M_n} { ℓ(s, ŝ_m(D_n)) } → 1  almost surely as n → ∞.

Sometimes, a weaker result is proved, the convergence holding only in probability.
Second, in the non-asymptotic framework, a model selection procedure satisfies an oracle inequality with constant C_n ≥ 1 and remainder term R_n ≥ 0 when

    ℓ( s, ŝ_{m̂(D_n)}(D_n) ) ≤ C_n inf_{m∈M_n} { ℓ(s, ŝ_m(D_n)) } + R_n    (4)

holds either in expectation or with large probability (that is, a probability larger than 1 − C′/n², for some positive constant C′). Note that if (4) holds on a large probability event with C_n tending to 1 when n tends to infinity and R_n ≪ ℓ( s, ŝ_{m*}(D_n) ), then the model selection procedure m̂ is efficient.

In the estimation setting, model selection is often used for building adaptive estimators, assuming that s belongs to some function space T_α (Barron et al., 1999). Then, a model selection procedure m̂ is optimal when it leads to an estimator ŝ_{m̂(D_n)}(D_n) (approximately) minimax with respect to T_α without knowing α, provided the family ( S_m )_{m∈M_n} has been well-chosen.

2.3 Model selection for identification

On the other hand, model selection can aim at identifying the true model S_{m₀}, defined as the smallest model among ( S_m )_{m∈M_n} to which s belongs. In particular, s ∈ ⋃_{m∈M_n} S_m is assumed in this setting. A typical example of a model selection procedure built for identification is BIC (see Section 3.3).

The quality of a model selection procedure designed for identification is measured by its probability of recovering the true model m₀. Then, a model selection procedure is called (model) consistent when

    P( m̂(D_n) = m₀ ) → 1  as  n → ∞.

Note that identification can naturally be extended to the general algorithm selection problem, the true model being replaced by the statistical algorithm whose risk converges at the fastest rate (see for instance Yang, 2007).

2.4 Estimation vs. identification

When a true model exists, model consistency is clearly a stronger property than the efficiency defined in Section 2.2. However, in many frameworks, no true model exists, so that efficiency is the only well-defined property.

Could a model selection procedure be model consistent in the former case (like BIC) and efficient in the latter case (like AIC)? The general answer to this question, often called the AIC-BIC dilemma, is negative: Yang (2005) proved in the regression framework that no model selection procedure can be simultaneously model consistent and minimax rate optimal. Nevertheless, the strengths of AIC and BIC can sometimes be shared; see for instance the introduction of a paper by Yang (2005) and a recent paper by van Erven et al. (2008).

3 Overview of some model selection procedures

Several approaches can be used for model selection. Let us briefly sketch here some of them, which are particularly helpful for understanding how CV works. Like CV, all the procedures considered in this section select

    m̂(D_n) ∈ arg min_{m∈M_n} { crit(m; D_n) },    (5)

where for every m ∈ M_n, crit(m; D_n) = crit(m) ∈ ℝ is some data-dependent criterion.

A particular case of (5) is penalization, which consists in choosing the model minimizing the sum of the empirical contrast and some measure of complexity of the model (called a penalty), which can depend on the data, that is,

    m̂(D_n) ∈ arg min_{m∈M_n} { L_{P_n}(ŝ_m) + pen(m; D_n) }.    (6)

This section does not pretend to be exhaustive. Completely different approaches exist for model selection, such as the Minimum Description Length (MDL) (Rissanen, 1983) and the Bayesian approaches. The interested reader will find more details and references on model selection procedures in the books by Burnham and Anderson (2002) or Massart (2007) for instance.

Let us focus here on five main categories of model selection procedures, the first three coming from a classification made by Shao (1997) in the linear regression framework.

3.1 The unbiased risk estimation principle

When the goal of model selection is estimation, many model selection procedures are of the form (5) where crit(m; D_n) unbiasedly estimates (at least asymptotically) the loss L_P(ŝ_m). This general idea is often called the unbiased risk estimation principle, or Mallows' or Akaike's heuristics.

In order to explain why this strategy can perform well, let us write the starting point of most theoretical analyses of procedures defined by (5): By definition (5), for every m ∈ M_n,

    ℓ(s, ŝ_{m̂}) + ( crit(m̂) − L_P(ŝ_{m̂}) ) ≤ ℓ(s, ŝ_m) + ( crit(m) − L_P(ŝ_m) ).    (7)

If E[ crit(m) − L_P(ŝ_m) ] = 0 for every m ∈ M_n, then concentration inequalities are likely to prove that Δ_n⁺, Δ_n⁻ > 0 exist such that

    ∀m ∈ M_n,    Δ_n⁺ ≥ ( crit(m) − L_P(ŝ_m) ) / ℓ(s, ŝ_m) ≥ −Δ_n⁻ > −1

with high probability, at least when Card(M_n) ≤ C n^α for some C, α ≥ 0. Then, (7) directly implies an oracle inequality like (4) with C_n = (1 + Δ_n⁺)/(1 − Δ_n⁻). If Δ_n⁺, Δ_n⁻ → 0 when n → ∞, this proves that the procedure defined by (5) is efficient.

Examples of model selection procedures following the unbiased risk estimation principle are FPE (Final Prediction Error, Akaike, 1970), several cross-validation procedures including the Leave-one-out (see Section 4), and GCV (Generalized Cross-Validation, Craven and Wahba, 1979; see Section 4.3.3).

With the penalization approach (6), the unbiased risk estimation principle is that E[ pen(m) ] should be close to the ideal penalty

    pen_id(m) := L_P(ŝ_m) − L_{P_n}(ŝ_m).

Several classical penalization procedures follow this principle, for instance:
• With the log-likelihood contrast, AIC (Akaike Information Criterion, Akaike, 1973) and its corrected versions (Sugiura, 1978; Hurvich and Tsai, 1989).

• With the least-squares contrast, Mallows' C_p (Mallows, 1973) and several refined versions of C_p (see for instance Baraud, 2002).

• With a general contrast, covariance penalties (Efron, 2004).

AIC, Mallows' C_p and related procedures have been proved to be optimal for estimation in several frameworks, provided Card(M_n) ≤ C n^α for some constants C, α ≥ 0 (see the paper by Birgé and Massart, 2007, and references therein).

The main drawback of penalties such as AIC or Mallows' C_p is their dependence on some assumptions on the distribution of the data. For instance, Mallows' C_p assumes that the variance of Y does not depend on X. Otherwise, it has a suboptimal performance (Arlot, 2008b). Several resampling-based penalties have been proposed to overcome this problem, at the price of a larger computational complexity, and possibly slightly worse performance in simpler frameworks; see the paper by Efron (1983) for the bootstrap, and the paper by Arlot (2008a) and references therein for the generalization to exchangeable weights.

Finally, note that all these penalties depend on multiplying factors which are not always known (for instance, the noise level for Mallows' C_p). Birgé and Massart (2007) proposed a general data-driven procedure for estimating such multiplying factors, which satisfies an oracle inequality with C_n → 1 in regression (see also Arlot and Massart, 2009).

3.2 Biased estimation of the risk

Several model selection procedures are of the form (5) where crit(m) does not unbiasedly estimate the loss L_P(ŝ_m): The weight of the variance term compared to the bias in E[ crit(m) ] is slightly larger than in the decomposition (2) of L_P(ŝ_m). From the penalization point of view, such procedures are overpenalizing.

Examples of such procedures are FPE_α (Bhansali and Downham, 1977) and GIC_λ (Generalized Information Criterion, Nishii, 1984; Shao, 1997) with α, λ > 2, which are closely related. Some cross-validation procedures, such as Leave-p-out with p/n ∈ (0, 1) fixed, also belong to this category (see Section 4.3.1). Note that FPE_α with α = 2 is FPE, and GIC_λ with λ = 2 is close to FPE and Mallows' C_p.

When the goal is estimation, there are two main reasons for using biased model selection procedures. First, experimental evidence shows that overpenalizing often yields better performance when the signal-to-noise ratio is small (see for instance Arlot, 2007, Chapter 11).

Second, when the number of models Card(M_n) grows faster than any power of n, as in the complete variable selection problem with n variables, the unbiased risk estimation principle fails. From the penalization point of view, Birgé and Massart (2007) proved that when Card(M_n) = e^{κn} for some κ > 0, the minimal amount of penalty required so that an oracle inequality holds with C_n = O(1) is much larger than pen_id(m). In addition to FPE_α and GIC_λ with suitably chosen α, λ, several penalization procedures have been proposed for taking into account the size of M_n (Barron et al., 1999; Baraud, 2002; Birgé and Massart, 2001; Sauvé, 2009). In the same papers, these procedures are proved to satisfy oracle inequalities with C_n as small as possible, typically of order ln(n) when Card(M_n) = e^{κn}.

3.3 Procedures built for identification

Some specific model selection procedures are used for identification. A typical example is BIC (Bayesian Information Criterion, Schwarz, 1978).

More generally, Shao (1997) showed that several procedures identify the correct model consistently in the linear regression framework as soon as they overpenalize within a factor tending to infinity with n, for instance, GIC_{λ_n} with λ_n → +∞, FPE_{α_n} with α_n → +∞ (Shibata, 1984), and several CV procedures such as Leave-p-out with p = p_n ∼ n. BIC is also part of this picture, since it coincides with GIC_{ln(n)}.

In another paper, Shao (1996) showed that m_n-out-of-n bootstrap penalization is also model consistent as soon as m_n ≪ n. Compared to Efron's bootstrap penalties, the idea is to estimate pen_id with the m_n-out-of-n bootstrap instead of the usual bootstrap, which results in overpenalization within a factor tending to infinity with n (Arlot, 2008a).

Most MDL-based procedures can also be put into this category of model selection procedures (see Grünwald, 2007). Let us finally mention the Lasso (Tibshirani, 1996) and other ℓ¹-penalization procedures, which have recently attracted much attention (see for instance Hesterberg et al., 2008). They are a computationally efficient way of identifying the true model in the context of variable selection with many variables.

3.4 Structural risk minimization

In the context of statistical learning, Vapnik and Chervonenkis (1974) proposed the structural risk minimization approach (see also Vapnik, 1982, 1998). Roughly, the idea is to penalize the empirical contrast with a penalty (over)-estimating

    pen_{id,g}(m) := sup_{t∈S_m} { L_P(t) − L_{P_n}(t) } ≥ pen_id(m).

Such penalties have been built using the Vapnik-Chervonenkis dimension, the combinatorial entropy, (global) Rademacher complexities (Koltchinskii, 2001; Bartlett et al., 2002), (global) bootstrap penalties (Fromont, 2007), Gaussian complexities or the maximal discrepancy (Bartlett and Mendelson, 2002). These penalties are often called global because pen_{id,g}(m) is a supremum over S_m.

The localization approach (see Boucheron et al., 2005) has been introduced in order to obtain penalties closer to pen_id (such as local Rademacher complexities), hence smaller prediction errors when possible (Bartlett et al., 2005; Koltchinskii, 2006). Nevertheless, these penalties are still larger than pen_id(m) and can be difficult to compute in practice because of several unknown constants.

A non-asymptotic analysis of several global and local penalties can be found in the book by Massart (2007) for instance; see also Koltchinskii (2006) for recent results on local penalties.

3.5 Ad hoc penalization

Let us finally mention that penalties can also be built according to particular features of the problem. For instance, penalties can be proportional to the ℓᵖ norm of ŝ_m (similarly to ℓᵖ-regularized learning algorithms) when having an estimator with a controlled ℓᵖ norm seems better. The penalty can also be proportional to the squared norm of ŝ_m in some reproducing kernel Hilbert space (similarly to kernel ridge regression or spline smoothing), with a kernel adapted to the specific framework. More generally, any penalty can be used, as soon as pen(m) is larger than the estimation error (to avoid overfitting), and the best model for the final user is then not the oracle m*, but more like

    arg min_{m∈M_n} { ℓ(s, S_m) + κ pen(m) }

for some κ > 0.

3.6 Where are cross-validation procedures in this picture?

The family of CV procedures, which will be described and deeply investigated in the next sections, contains procedures from the first three categories. CV procedures are all of the form (5), where crit(m) either estimates (almost) unbiasedly the loss L_P(ŝ_m), or overestimates the variance term (see Section 2.1). In the latter case, CV procedures belong either to the second or the third category, depending on the overestimation level.

This fact has two major implications. First, CV itself does not take into account prior information for selecting a model. To do so, one can either add to the CV estimate of the risk a penalty term (such as ‖ŝ_m‖ₚ), or use prior information to pre-select a subset of models M̃(D_n) ⊂ M_n before letting CV select a model among ( S_m )_{m∈M̃(D_n)}.

Second, in statistical learning, CV and resampling-based procedures are the most widely used model selection procedures. Structural risk minimization is often too pessimistic, and other alternatives rely on unrealistic assumptions. But if CV and resampling-based procedures are the most likely to yield good prediction performances, their theoretical grounds are not that firm, and too few CV users are careful enough when choosing a CV procedure to perform model selection. Among the aims of this survey is to point out both positive and negative results about the model selection performance of CV.

4 Cross-validation procedures

The purpose of this section is to describe the rationale behind CV and to define the different CV procedures. Since all CV procedures are of the form (5), defining a CV procedure amounts to defining the corresponding CV estimator of the risk of an algorithm A, which will be crit(·) in (5).

4.1 Cross-validation philosophy

As noticed in the early 30s by Larson (1931), training an algorithm and evaluating its statistical performance on the same data yields an overoptimistic result. CV was raised to fix this issue (Mosteller and Tukey, 1968; Stone, 1974; Geisser, 1975), starting from the remark that testing the output of the algorithm on new data would yield a good estimate of its performance (Breiman, 1998).

In most real applications, only a limited amount of data is available, which led to the idea of splitting the data: Part of the data (the training sample) is used for training the algorithm, and the remaining data (the validation sample) is used for evaluating its performance. The validation sample can play the role of new data as soon as data are i.i.d.

Data splitting yields the validation estimate of the risk, and averaging over several splits yields a cross-validation estimate of the risk. As will be shown in Sections 4.2 and 4.3, various splitting strategies lead to various CV estimates of the risk.

The major interest of CV lies in the universality of the data splitting heuristics, which only assumes that data are identically distributed and that the training and validation samples are independent, two assumptions which can even be relaxed (see Section 8.3). Therefore, CV can be applied to (almost) any algorithm in (almost) any framework, for instance regression (Stone, 1974; Geisser, 1975), density estimation (Rudemo, 1982; Stone, 1984) and classification (Devroye and Wagner, 1979; Bartlett et al., 2002), among many others. On the contrary, most other model selection procedures (see Section 3) are specific to a framework: For instance, C_p (Mallows, 1973) is specific to least-squares regression.

4.2 From validation to cross-validation

In this section, the hold-out (or validation) estimator of the risk is defined, leading to a general definition of CV.

4.2.1 Hold-out

The hold-out (Devroye and Wagner, 1979), or (simple) validation, relies on a single split of the data. Formally, let I^{(t)} be a non-empty proper subset of { 1, ..., n }, that is, such that both I^{(t)} and its complement I^{(v)} = ( I^{(t)} )^c = { 1, ..., n } \ I^{(t)} are non-empty. The hold-out estimator of the risk of A(D_n) with training set I^{(t)} is defined by

    L̂^{HO}( A; D_n; I^{(t)} ) := (1/n_v) Σ_{i∈I^{(v)}} γ( A(D_n^{(t)}); ξ_i ),    (8)

where D_n^{(t)} := (ξ_i)_{i∈I^{(t)}} is the training sample, of size n_t = Card(I^{(t)}), and D_n^{(v)} := (ξ_i)_{i∈I^{(v)}} is the validation sample, of size n_v = n − n_t; I^{(v)} is called the validation set. The question of choosing n_t, and I^{(t)} given its cardinality n_t, is discussed in the rest of this survey.
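As an illustration, here is a minimal sketch of the hold-out estimator (8); `algo` (any training routine returning a predictor) and `contrast` (the contrast function γ) are hypothetical names, and the data are assumed to be stored in a NumPy array of observations.

```python
import numpy as np

def hold_out_risk(algo, contrast, data, train_idx):
    """Hold-out estimator (8): train on the subsample indexed by I^(t),
    then average the contrast over the complementary validation sample."""
    val_idx = np.setdiff1d(np.arange(len(data)), train_idx)  # I^(v)
    predictor = algo(data[train_idx])                        # A(D_n^(t))
    return np.mean([contrast(predictor, data[i]) for i in val_idx])
```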

4.2.2 General definition of cross-validation

A general description of the CV strategy has been given by Geisser (1975): In brief, CV consists in averaging several hold-out estimators of the risk corresponding to different splits of the data. Formally, let B ≥ 1 be an integer and I_1^{(t)}, ..., I_B^{(t)} be a sequence of non-empty proper subsets of { 1, ..., n }. The CV estimator of the risk of A(D_n) with training sets ( I_j^{(t)} )_{1≤j≤B} is defined by

    L̂^{CV}( A; D_n; ( I_j^{(t)} )_{1≤j≤B} ) := (1/B) Σ_{j=1}^{B} L̂^{HO}( A; D_n; I_j^{(t)} ).    (9)

All existing CV estimators of the risk are of the form (9), each one being uniquely determined by the way the sequence ( I_j^{(t)} )_{1≤j≤B} is chosen, that is, by the choice of the splitting scheme.

Note that when CV is used in model selection for identification, an alternative definition of CV was proposed by Yang (2006, 2007), called CV with voting (CV-v). When two algorithms A₁ and A₂ are compared, A₁ is selected by CV-v if and only if L̂^{HO}( A₁; D_n; I_j^{(t)} ) < L̂^{HO}( A₂; D_n; I_j^{(t)} ) for a majority of the splits j = 1, ..., B. By contrast, CV procedures of the form (9) can be called CV with averaging (CV-a), since the estimates of the risk of the algorithms are averaged before their comparison.
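Building on the hold-out sketch above, the general CV estimator (9) is then simply an average of hold-out estimators over the chosen splits:

```python
def cv_risk(algo, contrast, data, train_sets):
    """CV estimator (9): average of the hold-out estimators over the
    training sets (I_j^(t))_{1<=j<=B}."""
    return np.mean([hold_out_risk(algo, contrast, data, I_t)
                    for I_t in train_sets])
```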

4.3 Classical examples

Most classical CV estimators split the data with a fixed size n_t of the training set, that is, Card(I_j^{(t)}) ≈ n_t for every j. The question of choosing n_t is discussed extensively in the rest of this survey. In this subsection, several CV estimators are defined. Given n_t, two main categories of splitting schemes can be distinguished: exhaustive data splitting, that is, considering all training sets I^{(t)} of size n_t, and partial data splitting.

4.3.1 Exhaustive data splitting

Leave-one-out (LOO, Stone, 1974; Allen, 1974; Geisser, 1975) is the most classical exhaustive CV procedure, corresponding to the choice n_t = n − 1: Each data point is successively left out from the sample and used for validation. Formally, LOO is defined by (9) with B = n and I_j^{(t)} = { j }^c for j = 1, ..., n:

    L̂^{LOO}( A; D_n ) = (1/n) Σ_{j=1}^{n} γ( A( D_n^{(−j)} ); ξ_j )    (10)

where D_n^{(−j)} = (ξ_i)_{i≠j}. The name LOO can be traced back to papers by Picard and Cook (1984) and by Breiman and Spector (1992), but LOO has several other names in the literature, such as delete-one CV (see Li, 1987), ordinary CV (Stone, 1974; Burman, 1989), or even only CV (Efron, 1983; Li, 1987).
Leave-p-out (LPO, Shao, 1993) with p ∈ { 1, ..., n } is the exhaustive CV with n_t = n − p: every possible set of p data points is successively left out from the sample and used for validation. Therefore, LPO is defined by (9) with B = (n choose p), the validation sets ranging over all the subsets of { 1, ..., n } of size p. LPO is also called delete-p CV or delete-p multifold CV (Zhang, 1993). Note that LPO with p = 1 is LOO.
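With the sketches above, exhaustive splitting only requires enumerating the splits; the LPO training sets are the complements of all size-p subsets (p = 1 giving LOO). The (n choose p) splits make this illustrative code intractable except for small n or p.

```python
from itertools import combinations

def lpo_risk(algo, contrast, data, p):
    """Leave-p-out: average hold-out risk over all (n choose p) splits."""
    n = len(data)
    train_sets = [np.setdiff1d(np.arange(n), val_set)        # I^(t) = complement
                  for val_set in combinations(range(n), p)]  # size-p validation sets
    return cv_risk(algo, contrast, data, train_sets)
```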

4.3.2 Partial data splitting

Considering all (n choose p) training sets can be computationally intractable, even for small p, so that partial data splitting methods have been proposed.

V-fold CV (VFCV) with V ∈ { 1, ..., n } was introduced by Geisser (1975) as an alternative to the computationally expensive LOO (see also Breiman et al., 1984, for instance). VFCV relies on a preliminary partitioning of the data into V subsamples of approximately equal cardinality n/V; each of these subsamples successively plays the role of validation sample. Formally, let A_1, ..., A_V be some partition of { 1, ..., n } with Card(A_j) ≈ n/V. Then, the VFCV estimator of the risk of A is defined by (9) with B = V and I_j^{(t)} = A_j^c for j = 1, ..., B, that is,

    L̂^{VF}( A; D_n; ( A_j )_{1≤j≤V} ) = (1/V) Σ_{j=1}^{V} (1/Card(A_j)) Σ_{i∈A_j} γ( A( D_n^{(−A_j)} ); ξ_i )    (11)

where D_n^{(−A_j)} = (ξ_i)_{i∈A_j^c}. By construction, the algorithmic complexity of VFCV is only V times that of training A with n − n/V data points, which is much smaller than for LOO or LPO if V ≪ n. Note that VFCV with V = n is LOO.
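A corresponding V-fold sketch, with the same hypothetical helpers; here the folds come from a simple contiguous partition, whereas in practice the data would usually be permuted first:

```python
def vfold_risk(algo, contrast, data, V):
    """V-fold CV (11): each block A_j of a partition of {1,...,n} serves once
    as validation set, the training set being its complement."""
    n = len(data)
    folds = np.array_split(np.arange(n), V)  # V blocks of cardinality ~ n/V
    train_sets = [np.setdiff1d(np.arange(n), A_j) for A_j in folds]
    return cv_risk(algo, contrast, data, train_sets)
```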

Balanced Incomplete CV (BICV, Shao, 1993) can be seen as an alternative to VFCV well-suited for small training sample sizes n_t. Indeed, BICV is defined by (9) with training sets ( A^c )_{A∈T}, where T is a balanced incomplete block design (BIBD, John, 1971), that is, a collection of B > 0 subsets of { 1, ..., n } of size n_v = n − n_t such that:

1. Card{ A ∈ T s.t. k ∈ A } does not depend on k ∈ { 1, ..., n }.

2. Card{ A ∈ T s.t. k, ℓ ∈ A } does not depend on k ≠ ℓ ∈ { 1, ..., n }.

The idea of BICV is to give to each data point (and each pair of data points) the same role in the training and validation tasks. Note that VFCV relies on a similar idea, since the set of training sample indices used by VFCV satisfies the first property and almost the second one: Pairs (k, ℓ) belonging to the same A_j appear in one validation set more than other pairs.

Repeated learning-testing (RLT) was introduced by Breiman et al. (1984) and further studied by Burman (1989) and by Zhang (1993) for instance. The RLT estimator of the risk of A is defined by (9) with any B > 0, the ( I_j^{(t)} )_{1≤j≤B} being B different subsets of { 1, ..., n }, chosen randomly and independently from the data. RLT can be seen as an approximation to LPO with p = n − n_t, with which it coincides when B = (n choose p).
Monte-Carlo CV (MCCV, Picard and Cook, 1984) is very close to RLT: B independent subsets of { 1, ..., n } are randomly drawn, with uniform distribution among subsets of size n_t. The only difference with RLT is that MCCV allows the same split to be chosen several times.
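MCCV split generation is equally short (RLT would additionally reject splits already drawn):

```python
def mccv_risk(algo, contrast, data, n_t, B, seed=None):
    """Monte-Carlo CV: B training sets of size n_t drawn uniformly at random,
    independently of each other (the same split may appear twice)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    train_sets = [rng.choice(n, size=n_t, replace=False) for _ in range(B)]
    return cv_risk(algo, contrast, data, train_sets)
```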

4.3.3 Other cross-validation-like risk estimators

Several procedures have been introduced which are close to, or based on, CV. Most of them aim at fixing an observed drawback of CV.

Bias-corrected versions of the VFCV and RLT risk estimators have been proposed by Burman (1989, 1990), and a closely related penalization procedure called V-fold penalization has been defined by Arlot (2008c); see Section 5.1.2 for details.

Generalized CV (GCV, Craven and Wahba, 1979) was introduced as a rotation-invariant version of LOO in least-squares regression, for estimating the risk of a linear estimator ŝ = MY, where Y = (Y_i)_{1≤i≤n} ∈ ℝⁿ and M is an n × n matrix independent of Y:

    crit_GCV(M, Y) := ( n⁻¹ ‖Y − MY‖² ) / ( 1 − n⁻¹ tr(M) )²,   where   ∀t ∈ ℝⁿ, ‖t‖² = Σ_{i=1}^{n} t_i².

GCV is actually closer to C_L (Mallows, 1973) than to CV, since GCV can be seen as an approximation to C_L with a particular estimator of the variance (Efron, 1986). The efficiency of GCV has been proved in various frameworks, in particular by Li (1985, 1987) and by Cao and Golubev (2006).
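Since the GCV criterion is a closed-form function of M and Y, it needs no data splitting at all; a direct sketch (with NumPy as above):

```python
def gcv(M, Y):
    """GCV criterion for a linear estimator s_hat = M @ Y."""
    n = len(Y)
    residual = Y - M @ Y
    return (residual @ residual / n) / (1.0 - np.trace(M) / n) ** 2
```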

Analytic Approximation When CV is used for selecting among linear models, Shao (1993) proposed an analytic approximation to LPO with p ∼ n, which is called APCV.

LOO bootstrap and .632 bootstrap The bootstrap is often used for stabilizing an estimator or an algorithm, replacing A(D_n) by the average of A(D_n*) over several bootstrap resamples D_n*. This idea was applied by Efron (1983) to the LOO estimator of the risk, leading to the LOO bootstrap. Noting that the LOO bootstrap was biased, Efron (1983) gave a heuristic argument leading to the .632 bootstrap estimator of the risk, later modified into the .632+ bootstrap by Efron and Tibshirani (1997). The main drawback of these procedures is the weakness of their theoretical justifications. Only empirical studies have supported the good behaviour of the .632+ bootstrap (Efron and Tibshirani, 1997; Molinaro et al., 2005).

4.4 Historical remarks

Simple validation or hold-out was the first CV-like procedure. It was introduced in the psychology area (Larson, 1931) from the need for a reliable alternative to the resubstitution error, as illustrated by Anderson et al. (1972). The hold-out was used by Herzberg (1969) for assessing the quality of predictors. The problem of choosing the training set was first considered by Stone (1974), where "controllable" and "uncontrollable" data splits were distinguished; an instance of uncontrollable division can be found in the book by Simon (1971).

A primitive LOO procedure was used by Hills (1966) and by Lachenbruch and Mickey (1968) for evaluating the error rate of a prediction rule, and a primitive formulation of LOO can be found in a paper by Mosteller and Tukey (1968). Nevertheless, LOO was actually introduced independently by Stone (1974), by Allen (1974) and by Geisser (1975). The relationship between LOO and the jackknife (Quenouille, 1949), which both rely on the idea of removing one observation from the sample, has been discussed by Stone (1974) for instance.

The hold-out and CV were originally used only for estimating the risk of an algorithm. The idea of using CV for model selection arose in the discussion of a paper by Efron and Morris (1973) and in a paper by Geisser (1974). The first author to study LOO as a model selection procedure was Stone (1974), who proposed to use LOO again for estimating the risk of the selected model.

5 Statistical properties of cross-validation estimators of the risk

Understanding the behaviour of CV for model selection, which is the purpose of this survey, requires first to analyze the performance of CV as an estimator of the risk of a single algorithm. Two main properties of CV estimators of the risk are of particular interest: their bias and their variance.

5.1 Bias

Dealing with the bias incurred by CV estimates can be done by two strategies: evaluating the amount of bias in order to choose the least biased CV procedure, or correcting for this bias.

5.1.1 Theoretical assessment of the bias

The independence of the training and the validation samples implies that for every algorithm A and any I^{(t)} ⊂ { 1, ..., n } with cardinality n_t,

    E[ L̂^{HO}( A; D_n; I^{(t)} ) ] = E[ γ( A(D_n^{(t)}); ξ ) ] = E[ L_P( A(D_{n_t}) ) ].

Therefore, assuming that Card(I_j^{(t)}) = n_t for j = 1, ..., B, the expectation of the CV estimator of the risk only depends on n_t:

    E[ L̂^{CV}( A; D_n; ( I_j^{(t)} )_{1≤j≤B} ) ] = E[ L_P( A(D_{n_t}) ) ].    (12)

In particular, (12) shows that the bias of the CV estimator of the risk of A is the difference between the risks of A, computed respectively with n_t and n data points. Since n_t < n, the bias of CV is usually nonnegative, which can be proved rigorously when the risk of A is a decreasing function of n, that is, when A is a smart rule; note however that a classical algorithm such as 1-nearest-neighbour in classification is not smart (Devroye et al., 1996, Section 6.8). Similarly, the bias of CV tends to decrease with n_t, which is rigorously true if A is smart.

More precisely, (12) has led to several results on the bias of CV, which can be split into three main categories: asymptotic results (A is fixed and the sample size n tends to infinity), non-asymptotic results (where A is allowed to make use of a number of parameters growing with n, say n^{1/2}, as often in model selection), and empirical results. They are listed below by statistical framework.

Regression The general behaviour of the bias of CV (positive, decreasing with n_t) is confirmed by several papers and for several CV estimators. For LPO, non-asymptotic expressions of its bias were proved by Celisse (2008b) for projection estimators, and by Arlot and Celisse (2009) for regressogram and kernel estimators when the design is fixed. For VFCV and RLT, an asymptotic expansion of their bias was given by Burman (1989) for least-squares estimators in linear regression, and extended to spline smoothing (Burman, 1990). Note finally that Efron (1986) proved non-asymptotic analytic expressions of the expectations of the LOO and GCV estimators of the risk in regression with binary data (see also Efron, 1983, for some explicit calculations).

Density estimation shows a similar picture. Non-asymptotic expressions for the bias of LPO estimators for kernel and projection estimators with the quadratic risk were proved by Celisse and Robin (2008) and by Celisse (2008a). Asymptotic expansions of the bias of the LOO estimator for histogram and kernel estimators were previously proved by Rudemo (1982); see Bowman (1984) for simulations. Hall (1987) derived similar results with the log-likelihood contrast for kernel estimators, and related the performance of LOO to the interaction between the kernel and the tails of the target density s.

Classification For the simple problem of discriminating between two populations with shifted distributions, Davison and Hall (1992) compared the asymptotic bias of LOO and bootstrap, showing the superiority of LOO when the shift size is n^{−1/2}: As n tends to infinity, the bias of LOO stays of order n^{−1}, whereas that of bootstrap worsens to the order n^{−1/2}. On realistic synthetic and real biological data, Molinaro et al. (2005) compared the bias of LOO, VFCV and .632+ bootstrap: The bias decreases with n_t, and is generally minimal for LOO. Nevertheless, the 10-fold CV bias is nearly minimal uniformly over their experiments. In the same experiments, the .632+ bootstrap exhibits the smallest bias for moderate sample sizes and small signal-to-noise ratios, but a much larger bias otherwise.

CV-calibrated algorithms When a family of algorithms ( A_λ )_{λ∈Λ} is given, and λ̂ is chosen by minimizing L̂^{CV}( A_λ; D_n ) over λ ∈ Λ, L̂^{CV}( A_{λ̂}; D_n ) is biased for estimating the risk of A_{λ̂}(D_n), as reported from simulation experiments by Stone (1974) for the LOO, and by Jonathan et al. (2000) for VFCV in the variable selection setting. This bias is of a different nature compared to the previous frameworks. Indeed, L̂^{CV}( A_{λ̂}; D_n ) is biased simply because λ̂ was chosen using the same data as L̂^{CV}( A_λ; D_n ). This phenomenon is similar to the optimism of L_{P_n}( ŝ(D_n) ) as an estimator of the loss of ŝ(D_n). The correct way of estimating the risk of A_{λ̂}(D_n) with CV is to consider the full algorithm A′ : D_n ↦ A_{λ̂(D_n)}(D_n), and then to compute L̂^{CV}( A′; D_n ). The resulting procedure is called "double cross" by Stone (1974).

5.1.2 Correction of the bias

An alternative to choosing the CV estimator with the smallest bias is to correct for the bias of the CV estimator of the risk. Burman (1989, 1990) proposed a corrected VFCV estimator, defined by

    L̂^{corrVF}( A; D_n ) = L̂^{VF}( A; D_n ) + L_{P_n}( A(D_n) ) − (1/V) Σ_{j=1}^{V} L_{P_n}( A(D_n^{(−A_j)}) ),

and a corrected RLT estimator was defined similarly. Both estimators have been proved to be asymptotically unbiased for least-squares estimators in linear regression.

When the A_j's have exactly the same size n/V, the corrected VFCV criterion is equal to the sum of the empirical risk and the V-fold penalty (Arlot, 2008c), defined by

    pen_VF( A; D_n ) = ((V − 1)/V) Σ_{j=1}^{V} [ L_{P_n}( A(D_n^{(−A_j)}) ) − L_{P_n^{(−A_j)}}( A(D_n^{(−A_j)}) ) ].

The V-fold penalized criterion was proved to be (almost) unbiased in the non-asymptotic framework for regressogram estimators.
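A sketch of Burman's corrected VFCV criterion, reusing the earlier hypothetical helpers; the empirical risk L_{P_n} is the average contrast over the full sample:

```python
def corrected_vfold_risk(algo, contrast, data, V):
    """Burman's corrected VFCV: the VFCV estimate, plus the full-sample
    empirical risk, minus the average (full-sample) empirical risk of the
    V fold-wise predictors."""
    n = len(data)
    folds = np.array_split(np.arange(n), V)
    emp_risk = lambda pred: np.mean([contrast(pred, z) for z in data])
    full_term = emp_risk(algo(data))
    fold_term = np.mean([emp_risk(algo(data[np.setdiff1d(np.arange(n), A_j)]))
                         for A_j in folds])
    return vfold_risk(algo, contrast, data, V) + full_term - fold_term
```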

5.2 Variance

CV estimators of the risk using training sets of the same size n_t have the same bias, but they still behave quite differently; their variance var( L̂^{CV}( A; D_n; ( I_j^{(t)} )_{1≤j≤B} ) ) captures most of the information needed to explain these differences.

5.2.1 Variability factors

Assume that Card(I_j^{(t)}) = n_t for every j. The variance of CV results from the combination of several factors, in particular (n_t, n_v) and B.

Influence of (n_t, n_v) Let us consider the hold-out estimator of the risk. Following in particular Nadeau and Bengio (2003),

    var( L̂^{HO}( A; D_n; I^{(t)} ) )
        = E[ var( L_{P_n^{(v)}}( A(D_n^{(t)}) ) | D_n^{(t)} ) ] + var( L_P( A(D_{n_t}) ) )
        = (1/n_v) E[ var( γ(ŝ; ξ) | ŝ = A(D_n^{(t)}) ) ] + var( L_P( A(D_{n_t}) ) ).    (13)

The first term, proportional to 1/n_v, shows that more data for validation decreases the variance of L̂^{HO}, because it yields a better estimator of L_P( A(D_n^{(t)}) ). The second term shows that the variance of L̂^{HO} also depends on the distribution of L_P( A(D_n^{(t)}) ) around its expectation; in particular, it strongly depends on the stability of A.

Stability and variance When A is unstable, L̂^{LOO}( A ) has often been pointed out as a variable estimator (Section 7.10, Hastie et al., 2001; Breiman, 1996). Conversely, this trend disappears when A is stable, as noticed by Molinaro et al. (2005) from a simulation experiment.

The relation between the stability of A and the variance of L̂^{CV}( A ) was pointed out by Devroye and Wagner (1979) in classification, through upper bounds on the variance of L̂^{LOO}( A ). Bousquet and Elisseeff (2002) extended these results to the regression setting, and proved upper bounds on the maximal upward deviation of L̂^{LOO}( A ).

Note finally that several approaches based on the bootstrap have been proposed for reducing the variance of L̂^{LOO}( A ), such as the LOO bootstrap, the .632 bootstrap and the .632+ bootstrap (Efron, 1983); see also Section 4.3.3.

Partial splitting and variance When (n_t, n_v) is fixed, the variability of CV tends to be larger for partial data splitting methods than for LPO. Indeed, having to choose B < (n choose n_t) subsets ( I_j^{(t)} )_{1≤j≤B} of { 1, ..., n }, usually randomly, induces an additional variability compared to L̂^{LPO} with p = n − n_t. In the case of MCCV, this variability decreases like B⁻¹ since the I_j^{(t)} are chosen independently. The dependence on B is slightly different for other CV estimators such as RLT or VFCV, because the I_j^{(t)} are not independent. In particular, it is maximal for the hold-out, and minimal (null) for LOO (if n_t = n − 1) and LPO (with p = n − n_t).

Note that the dependence on V for VFCV is more complex to evaluate, since B, n_t, and n_v simultaneously vary with V. Nevertheless, a non-asymptotic theoretical quantification of this additional variability of VFCV has been obtained by Celisse and Robin (2008) in the density estimation framework (see also empirical considerations by Jonathan et al., 2000).

5.2.2 Theoretical assessment of the variance

Understanding precisely how var( L̂^{CV}( A ) ) depends on the splitting scheme is complex in general, since n_t and n_v have a fixed sum n, and the number of splits B is generally linked with n_t (for instance, for LPO and VFCV). Furthermore, the variance of CV behaves quite differently in different frameworks, depending in particular on the stability of A. The consequence is that contradictory results have been obtained in different frameworks, in particular on the value of V for which the VFCV estimator of the risk has minimal variance (Burman, 1989; Hastie et al., 2001, Section 7.10). Despite the difficulty of the problem, the variance of several CV estimators of the risk has been assessed in several frameworks, as detailed below.

Regression In the linear regression setting, Burman (1989) derived asymptotic expansions of the variance of the VFCV and RLT estimators of the risk with homoscedastic data. The variance of RLT decreases with B, and in the case of VFCV, in a particular setting,

    var( L̂^{VF}( A ) ) = (2σ⁴/n) + (4σ⁴/n²) ( 4 + 2/(V − 1) + 1/(V − 1)² + 1/(V − 1)³ ) + o( n⁻² ).

The asymptotic variance of the VFCV estimator of the risk decreases with V, implying that LOO asymptotically has the minimal variance.

Non-asymptotic closed-form formulas of the variance of the LPO estimator of the risk have been proved by Celisse (2008b) in regression, for projection and kernel estimators for instance. On the variance of RLT in the regression setting, see the asymptotic results of Girard (1998) for Nadaraya-Watson kernel estimators, as well as the non-asymptotic computations and simulation experiments by Nadeau and Bengio (2003) with several learning algorithms.

Density estimation Non-asymptotic closed-form formulas of the variance of the LPO estimator of the risk have been proved by Celisse and Robin (2008) and by Celisse (2008a) for projection and kernel estimators. In particular, the dependence of the variance of L̂^{LPO} on p has been quantified explicitly for histogram and kernel estimators by Celisse and Robin (2008).

Classification For the simple problem of discriminating between two populations with shifted distributions, Davison and Hall (1992) showed that the gap between the asymptotic variances of LOO and bootstrap becomes larger when data are noisier. Nadeau and Bengio (2003) made non-asymptotic computations and simulation experiments with several learning algorithms. Hastie et al. (2001) empirically showed that VFCV has a minimal variance for some 2 < V < n, whereas LOO usually has a large variance; this fact certainly depends on the stability of the algorithm considered, as shown by simulation experiments by Molinaro et al. (2005).

5.2.3 Estimation of the variance

There is no universal (valid under all distributions) unbiased estimator of the variance of the RLT (Nadeau and Bengio, 2003) and VFCV estimators (Bengio and Grandvalet, 2004). In particular, Bengio and Grandvalet (2004) recommend the use of variance estimators taking into account the correlation structure between test errors; otherwise, the variance of CV can be strongly underestimated.

Despite these negative results, (biased) estimators of the variance of L̂^{CV} have been proposed by Nadeau and Bengio (2003), by Bengio and Grandvalet (2004) and by Markatou et al. (2005), and tested in simulation experiments in regression and classification. Furthermore, in the framework of density estimation with histograms, Celisse and Robin (2008) proposed an estimator of the variance of the LPO risk estimator. Its accuracy is assessed by a concentration inequality. These results have recently been extended to projection estimators by Celisse (2008a).

6 Cross-validation for efficient model selection

This section tackles the properties of CV procedures for model selection when the goal is estimation (see Section 2.2).
6.1 Relationship between risk estimation and model se-
le tion
As shown in Section 3.1, minimizing an unbiased estimator of the risk leads to an efficient model selection procedure. One could conclude here that the best CV procedure for estimation is the one with the smallest bias and variance (at least asymptotically), for instance, LOO in the least-squares regression framework (Burman, 1989).
Nevertheless, the best CV estimator of the risk is not necessarily the best model selection procedure. For instance, Breiman and Spector (1992) observed that uniformly over the models, the best risk estimator is LOO, whereas 10-fold CV is more accurate for model selection. Three main reasons for such a difference can be invoked. First, the asymptotic framework (A fixed, n → ∞) may not apply to models close to the oracle, which typically has a dimension growing with n when s does not belong to any model. Second, as explained in Section 3.2, estimating the risk of each model with some bias can be beneficial and compensate for the effect of a large variance, in particular when the signal-to-noise ratio is small. Third, for model selection, what matters is not that every estimate of the risk has small bias and variance, but rather that

\[
\operatorname{sign}\left( \operatorname{crit}(m_1) - \operatorname{crit}(m_2) \right) = \operatorname{sign}\left( L_P(\widehat{s}_{m_1}) - L_P(\widehat{s}_{m_2}) \right)
\]

with the largest possible probability for models m_1, m_2 near the oracle.
Therefore, specific studies are required to evaluate the performance of the various CV procedures in terms of model selection efficiency. In most frameworks, the model selection performance directly follows from the properties of CV as an estimator of the risk, but not always.
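The sign condition above can be probed empirically. The toy experiment below is our own illustration (not taken from the papers cited above): it compares a degree-1 and a degree-5 polynomial model, and estimates how often 10-fold CV ranks them in the same order as their true risks, approximated on a large test sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def cv_criterion(x, y, degree, V=10):
    """10-fold CV estimate of the prediction error of a polynomial model."""
    folds = np.array_split(rng.permutation(len(x)), V)
    errors = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(x)), val)
        coef = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(coef, x[val]) - y[val]) ** 2))
    return np.mean(errors)

agreements, trials = 0, 200
for _ in range(trials):
    x = rng.uniform(-1, 1, 50)
    y = x + rng.normal(0, 0.5, 50)       # true regression function has degree 1
    xt = rng.uniform(-1, 1, 5000)        # large test sample approximating the risk
    yt = xt + rng.normal(0, 0.5, 5000)
    crit = [cv_criterion(x, y, d) for d in (1, 5)]
    risk = [np.mean((np.polyval(np.polyfit(x, y, d), xt) - yt) ** 2) for d in (1, 5)]
    agreements += np.sign(crit[0] - crit[1]) == np.sign(risk[0] - risk[1])
print("sign agreement frequency:", agreements / trials)
```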

6.2 The global picture


Let us start with the classification of model selection procedures made by Shao (1997) in the linear regression framework, since it gives a good idea of the performance of CV procedures for model selection in general. Typically, the efficiency of CV only depends on the asymptotics of n_t/n:

When n_t ∼ n, CV is asymptotically equivalent to Mallows' C_p, hence asymptotically optimal.

When n_t ∼ λn with λ ∈ (0, 1), CV is asymptotically equivalent to GIC_κ with κ = 1 + 1/λ, which is defined as AIC with a penalty multiplied by κ/2. Hence, such CV procedures are overpenalizing by a factor (1 + λ)/(2λ) > 1.

The above results have been proved by Shao (1997) for LPO (see also Li, 1987, for the LOO); they also hold for RLT when B ≫ n², since RLT is then equivalent to LPO (Zhang, 1993).
In a general statistical framework, the model selection performance of MCCV, VFCV, LOO, LOO bootstrap, and .632 bootstrap for selection among minimum contrast estimators was studied in a series of papers (van der Laan and Dudoit, 2003; van der Laan et al., 2004, 2006; van der Vaart et al., 2006); these results apply in particular to least-squares regression and density estimation. It turns out that under mild conditions, an oracle-type inequality is proved, showing that up to a multiplying factor C_n → 1, the risk of CV is smaller than the minimum of the risks of the models with a sample size n_t. In particular, in most frameworks, this implies the asymptotic optimality of CV as soon as n_t ∼ n. When n_t ∼ λn with λ ∈ (0, 1), this naturally generalizes Shao's results.

6.3 Results in various frameworks


This section gathers results about the model selection performance of CV when the goal is estimation, in various frameworks. Note that model selection is considered here with a general meaning, including in particular bandwidth choice for kernel estimators.

Regression First, the results of Section 6.2 suggest that CV is suboptimal when n_t is not asymptotically equivalent to n. This fact has been proved rigorously for VFCV when V = O(1) with regressograms (Arlot, 2008c): with large probability, the risk of the model selected by VFCV is larger than 1 + κ(V) times the risk of the oracle, with κ(V) > 0 for every fixed V. Note however that the best V for VFCV is not the largest one in every regression framework, as shown empirically in linear regression (Breiman and Spector, 1992; Herzberg and Tsukanov, 1986); Breiman (1996) proposed to explain this phenomenon by relating the stability of the candidate algorithms to the model selection performance of LOO in various regression frameworks.
Second, the universality of CV has been confirmed by showing that it naturally adapts to heteroscedasticity of the data when selecting among regressograms. Despite its suboptimality, VFCV with V = O(1) satisfies a non-asymptotic oracle inequality with constant C > 1 (Arlot, 2008c). Furthermore, V-fold penalization (which often coincides with corrected VFCV, see Section 5.1.2) satisfies a non-asymptotic oracle inequality with C_n → 1 as n → +∞, both when V = O(1) (Arlot, 2008c) and when V = n (Arlot, 2008a). Note that n-fold penalization is very close to LOO, suggesting that it is also asymptotically optimal with heteroscedastic data. Simulation experiments in the context of change-point detection confirmed that CV adapts well to heteroscedasticity, contrary to usual model selection procedures in the same framework (Arlot and Celisse, 2009).
The performance of CV has also been assessed for other kinds of estimators in regression. For choosing the number of knots in spline smoothing, Burman (1990) proved that corrected versions of VFCV and RLT are asymptotically optimal provided n/(B n_v) = O(1). Furthermore, several CV methods have been compared to GCV in kernel regression by Härdle et al. (1988) and by Girard (1998); the conclusion is that GCV and related criteria are computationally more efficient than MCCV or RLT, for a similar statistical performance.
Finally, note that asymptotic results about CV in regression have been proved by Györfi et al. (2002), and an oracle inequality with constant C > 1 has been proved by Wegkamp (2003) for the hold-out, with least-squares estimators.

Density estimation CV performs similarly to the regression case for selecting among least-squares estimators (van der Laan et al., 2004): It yields a risk smaller than the minimum of the risks with a sample size n_t. In particular, non-asymptotic oracle inequalities with constant C > 1 have been proved by Celisse (2008b) for the LPO when p/n ∈ [a, b] for some 0 < a < b < 1.
The performance of CV for selecting the bandwidth of kernel density estimators has been studied in several papers. With the least-squares contrast, the efficiency of LOO was proved by Hall (1983) and generalized to the multivariate framework by Stone (1984); an oracle inequality asymptotically leading to efficiency was recently proved by Dalelane (2005). With the Kullback-Leibler divergence, CV can run into trouble when performing model selection (see also Schuster and Gregory, 1981; Chow et al., 1987). The influence of the tails of the target s was studied by Hall (1987), who gave conditions under which CV is efficient and the chosen bandwidth is optimal at first order.

Classification In the framework of binary classification by intervals (that is, with X = [0, 1] and piecewise constant classifiers), Kearns et al. (1997) proved an oracle inequality for the hold-out. Furthermore, empirical experiments show that CV (almost) always yields the best performance, compared to deterministic penalties (Kearns et al., 1997). On the contrary, simulation experiments by Bartlett et al. (2002) in the same setting showed that random penalties such as the Rademacher complexity and maximal discrepancy usually perform much better than the hold-out, which is shown to be more variable.
Nevertheless, the hold-out still enjoys quite good theoretical properties: It was proved to adapt to the margin condition by Blanchard and Massart (2006), a property nearly unachievable with usual model selection procedures (see also Massart, 2007, Section 8.5). This suggests that CV procedures are naturally adaptive to several unknown properties of the data in the statistical learning framework.
The performance of the LOO in binary classification was related to the stability of the candidate algorithms by Kearns and Ron (1999); they proved oracle-type inequalities called sanity-check bounds, describing the worst-case performance of LOO (see also Bousquet and Elisseeff, 2002).
An experimental comparison of several CV methods and bootstrap-based CV (in particular the .632+ bootstrap) in classification can also be found in papers by Efron (1986) and Efron and Tibshirani (1997).

7 Cross-validation for identification


Let us now focus on model selection when the goal is to identify the true model S_{m_0}, as described in Section 2.3. In this framework, asymptotic optimality is replaced by (model) consistency, that is,

\[
P\left( \widehat{m}(D_n) = m_0 \right) \xrightarrow[n \to \infty]{} 1 .
\]

Classical model selection procedures built for identification, such as BIC, are described in Section 3.3.

7.1 General conditions towards model consistency


At first sight, it may seem strange to use CV for identification: LOO, which is the pioneering CV procedure, is actually closely related to the unbiased risk estimation principle, which is only efficient when the goal is estimation. Furthermore, estimation and identification are somehow contradictory goals, as explained in Section 2.4.
This intuition about the inconsistency of some CV procedures is confirmed by several theoretical results. Shao (1993) proved that several CV methods are inconsistent for variable selection in linear regression: LOO, LPO, and BICV when lim inf_{n→∞} (n_t/n) > 0. Even if these CV methods asymptotically select all the true variables with probability 1, the probability that they select too many variables does not tend to zero. More generally, Shao (1997) proved that CV procedures behave asymptotically like GIC_{κ_n} with κ_n = 1 + n/n_t, which leads to inconsistency as soon as n/n_t = O(1).
In the context of ordered variable selection in linear regression, Zhang (1993) computed the asymptotic value of the probability of selecting the true model for several CV procedures. He also numerically compared the values of this probability for the same CV procedures in a specific example. For LPO with p/n → λ ∈ (0, 1) as n tends to +∞, P(m̂ = m_0) increases with λ. The result is slightly different for VFCV: P(m̂ = m_0) increases with V (hence, it is maximal for the LOO, which is the worst case of LPO). The variability induced by the number V of splits seems to be more important here than the bias of VFCV. Nevertheless, P(m̂ = m_0) is almost constant between V = 10 and V = n, so that taking V > 10 is not advised, for computational reasons.
These results suggest that if the training sample size n_t is negligible in front of n, then model consistency can be obtained. This has been confirmed theoretically by Shao (1993, 1997) for the variable selection problem in linear regression: CV is consistent when n ≫ n_t, in particular RLT, BICV (defined in Section 4.3.2), and LPO with p = p_n ∼ n and n − p_n → ∞.
Therefore, when the goal is to identify the true model, a larger proportion of the data should be put in the validation set in order to improve the performance. This phenomenon is somewhat related to the cross-validation paradox (Yang, 2006).
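To illustrate the cross-validation paradox, here is a small hold-out experiment (our own deliberately simplified stand-in for the CV procedures above; all settings are ours): the true model uses one variable, the competitor adds a spurious one, and the frequency of correct identification is compared for a large and a small training fraction.

```python
import numpy as np

rng = np.random.default_rng(1)

def holdout_identifies_true_model(n, n_t, trials=500):
    """Frequency with which a single train/validation split selects the
    one-variable (true) model over a model with an extra spurious variable."""
    hits = 0
    for _ in range(trials):
        X = rng.normal(size=(n, 2))
        y = X[:, 0] + rng.normal(size=n)    # only the first variable matters
        errors = []
        for cols in ([0], [0, 1]):          # true model, then the larger one
            beta, *_ = np.linalg.lstsq(X[:n_t, cols], y[:n_t], rcond=None)
            errors.append(np.mean((y[n_t:] - X[n_t:, cols] @ beta) ** 2))
        hits += errors[0] <= errors[1]
    return hits / trials

n = 200
print("n_t = n/2 :", holdout_identifies_true_model(n, n // 2))
print("n_t = n/10:", holdout_identifies_true_model(n, n // 10))
```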

7.2 Refined analysis for the algorithm selection problem


The behaviour of CV for identification is better understood by considering a more general framework, where the goal is to select among statistical algorithms the one with the fastest convergence rate. Yang (2006, 2007) considered this problem for two candidate algorithms (or more generally any finite number of algorithms). Let us mention here that Stone (1977) considered a few specific examples of this problem, and showed that LOO can be inconsistent for choosing the best among two good estimators.
The conclusion of Yang's papers is that the sufficient condition on n_t for the consistency in selection of CV strongly depends on the convergence rates (r_{n,i})_{i=1,2} of the candidate algorithms. Let us assume that r_{n,1} and r_{n,2} differ at least by a multiplicative constant C > 1. Then, in the regression framework, if the risk of ŝ_i is measured by E‖ŝ_i − s‖², Yang (2007) proved that the hold-out, VFCV, RLT, and LPO with voting (CV-v, see Section 4.2.2) are consistent in selection if

\[
n_v \to \infty, \qquad n_t \to \infty \qquad \text{and} \qquad \sqrt{n_v}\, \max_i r_{n_t,i} \to \infty, \tag{14}
\]

under some conditions on ‖ŝ_i − s‖_p for p = 2, 4, ∞. In the classification framework, if the risk of ŝ_i is measured by P(ŝ_i ≠ s), Yang (2006) proved the same consistency result for CV-v under the condition

\[
n_v \to \infty, \qquad n_t \to \infty \qquad \text{and} \qquad \frac{n_v \max_i r_{n_t,i}^2}{s_{n_t}} \to \infty, \tag{15}
\]

where s_n is the convergence rate of P(ŝ_1(D_n) ≠ ŝ_2(D_n)).


Intuitively, consistency holds as soon as the uncertainty of each estimate of the risk (roughly proportional to n_v^{−1/2}) is negligible in front of the risk gap |r_{n_t,1} − r_{n_t,2}| (which is of the same order as max_i r_{n_t,i}). This condition holds either when at least one of the algorithms converges at a non-parametric rate, or when n_t ≪ n, which artificially widens the risk gap.
Empirical results in the same direction were obtained by Dietterich (1998) and by Alpaydin (1999), leading to the advice that V = 2 is the best choice when VFCV is used for comparing two learning procedures. See also the results by Nadeau and Bengio (2003) about CV considered as a testing procedure comparing two candidate algorithms.
The sufficient conditions (14) and (15) can be simplified depending on max_i r_{n,i}, so that the ability of CV to distinguish between two algorithms depends on their convergence rates. On the one hand, if max_i r_{n,i} ∼ n^{−1/2}, then (14) or (15) only hold when n_v ≫ n_t (under some conditions on s_n in classification). Therefore, the cross-validation paradox holds for comparing algorithms converging at the parametric rate (model selection when a true model exists being only a particular case). Note that possibly stronger conditions can be required in classification, where algorithms can converge at fast rates, between n^{−1} and n^{−1/2}.
On the other hand, (14) and (15) are milder conditions when max_i r_{n,i} ≫ n^{−1/2}: They are implied by n_t/n_v = O(1), and they even allow n_t ∼ n (under some conditions on s_n in classification). Therefore, non-parametric algorithms can be compared by more usual CV procedures (n_t > n/2), even if LOO is still excluded by conditions (14) and (15).

Note that according to simulation experiments, CV with averaging (that is, CV as usual) and CV with voting are equivalent at first order but not at second order, so that they can differ when n is small (Yang, 2007).

8 Specificities of some frameworks


Originally, the CV principle has been proposed for i.i.d. observations and usual contrasts such as least-squares and log-likelihood. Therefore, CV procedures may have to be modified in other specific frameworks, such as estimation in the presence of outliers or with dependent data.

8.1 Density estimation


In the density estimation framework, some specific modifications of CV have been proposed.
First, Hall et al. (1992) defined the smoothed CV, which consists in pre-smoothing the data before using CV, an idea related to the smoothed bootstrap. They proved that smoothed CV yields an excellent asymptotic model selection performance under various smoothness conditions on the density.
Second, when the goal is to estimate the density at one point (and not globally), Hall and Schucany (1989) proposed a local version of CV and proved its asymptotic optimality.

8.2 Robustness to outliers


In the presence of outliers in regression, Leung (2005) studied how CV must be modified to get both asymptotic efficiency and a consistent bandwidth estimator (see also Leung et al., 1993).
Two changes are possible to achieve robustness: choosing a robust regressor, or choosing a robust loss function. In the presence of outliers, classical CV with a non-robust loss function has been shown to fail by Härdle (1984).
Leung (2005) described a CV procedure based on robust losses such as the L1 and Huber (Huber, 1964) losses. The same strategy remains applicable to other setups such as the linear models in Ronchetti et al. (1997).

8.3 Time series and dependent observations


As explained in Section 4.1, CV is built upon the heuristics that part of the sample (the validation set) can play the role of new data with respect to the rest of the sample (the training set). Here, "new" means that the validation set is independent from the training set, with the same distribution.
Therefore, when data ξ_1, …, ξ_n are not independent, CV must be modified, like other model selection procedures (in non-parametric regression with dependent data, see the review by Opsomer et al., 2001).

Let us first consider the statistical framework of Section 1 with ξ_1, …, ξ_n identically distributed but not independent. Then, when for instance the data are positively correlated, Hart and Wehrly (1986) proved that CV overfits when choosing the bandwidth of a kernel estimator in regression (see also Chu and Marron, 1991; Opsomer et al., 2001).
The main approach used in the literature for solving this issue is to choose I^(t) and I^(v) such that min_{i∈I^(t), j∈I^(v)} |i − j| > h > 0, where h controls the distance from which observations ξ_i and ξ_j are independent. For instance, the LOO can be changed into: I^(v) = {J} where J is uniformly chosen in {1, …, n}, and I^(t) = {1, …, J − h − 1, J + h + 1, …, n}, a method called modified CV by Chu and Marron (1991) in the context of bandwidth selection. Then, for short-range dependences, ξ_i is almost independent from ξ_j when |i − j| > h is large enough, so that (ξ_j)_{j∈I^(t)} is almost independent from (ξ_j)_{j∈I^(v)}. Several asymptotic optimality results have been proved for modified CV, for instance by Hart and Vieu (1990) for bandwidth choice in kernel density estimation, when data are α-mixing (hence, with a short-range dependence structure) and h = h_n → ∞ not too fast. Note that modified CV also enjoys some asymptotic optimality results with long-range dependences, as proved by Hall et al. (1995), even if an alternative block bootstrap method seems more appropriate in such a framework.
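In practice, building such splits is straightforward; here is a minimal sketch (function name and parameters are ours) of the modified-CV split just described:

```python
import numpy as np

def modified_cv_split(n, J, h):
    """One modified-CV split (Chu and Marron, 1991): the validation set is
    {J} and all observations within distance h of J are removed from the
    training set, so that training and validation data are nearly
    independent under short-range dependence."""
    indices = np.arange(n)
    training = indices[np.abs(indices - J) > h]
    return training, np.array([J])

train, val = modified_cv_split(n=20, J=9, h=2)
print("validation:", val, "training:", train)
```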

Several alternatives to modified CV have also been proposed. The h-block CV (Burman et al., 1994) is modified CV plus a corrective term, similarly to the bias-corrected CV by Burman (1989) (see Section 5.1). Simulation experiments in several (short-range) dependent frameworks show that this corrective term matters when h/n is not small, in particular when n is small.
The partitioned CV was proposed by Chu and Marron (1991) for bandwidth selection: An integer g > 0 is chosen, a bandwidth b_k is chosen by CV based upon the subsample (ξ_{k+gj})_{j≥0} for each k = 1, …, g, and the selected bandwidth is a combination of the (b_k).
When a parametric model is available for the dependency structure, Hart (1994) proposed the time series CV.

An important framework where data often are dependent is time-series analysis, in particular when the goal is to predict the next observation ξ_{n+1} from the past ξ_1, …, ξ_n. When the data are stationary, h-block CV and similar approaches can be used to deal with (short-range) dependences. Nevertheless, Burman and Nolan (1992) proved in some specific framework that unaltered CV is asymptotically optimal when ξ_1, …, ξ_n is a stationary Markov process.
On the contrary, using CV for non-stationary time series is a quite difficult problem. The only reasonable approach in general is the hold-out, that is, I^(t) = {1, …, m} and I^(v) = {m + 1, …, n} for some deterministic m. Each model is first trained with (ξ_j)_{j∈I^(t)}. Then, it is used for predicting successively ξ_{m+1} from (ξ_j)_{j≤m}, ξ_{m+2} from (ξ_j)_{j≤m+1}, and so on. The model with the smallest average error for predicting (ξ_j)_{j∈I^(v)} from the past is chosen.
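A minimal sketch of this sequential hold-out evaluation (our own illustration; `forecast` stands for any one-step-ahead predictor built from the past):

```python
import numpy as np

def sequential_holdout_error(series, m, forecast):
    """Average squared error of one-step-ahead predictions of xi_{t+1}
    from (xi_1, ..., xi_t), for t = m, ..., n - 1, as described above.
    `forecast(past)` is a hypothetical one-step-ahead predictor."""
    errors = [(forecast(series[:t]) - series[t]) ** 2
              for t in range(m, len(series))]
    return float(np.mean(errors))

rng = np.random.default_rng(2)
walk = np.cumsum(rng.normal(size=300))   # a toy non-stationary series
print(sequential_holdout_error(walk, m=150, forecast=lambda past: past[-1]))
```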

8.4 Large number of models


As mentioned in Section 3, model selection procedures estimating the risk of each model unbiasedly fail when, in particular, the number of models grows exponentially with n (Birgé and Massart, 2007). Therefore, CV cannot be used directly, except maybe with n_t ≪ n, provided n_t is well chosen (see Section 6 and Celisse, 2008b, Chapter 6).
For least-squares regression with homoscedastic data, Wegkamp (2003) proposed to add to the hold-out estimator of the risk a penalty term depending on the number of models. This method is proved to satisfy a non-asymptotic oracle inequality with leading constant C > 1.
Another general approach was proposed by Arlot and Celisse (2009) in the context of multiple change-point detection. The idea is to perform model selection in two steps: First, gather the models (S_m)_{m∈M_n} into meta-models (S̃_D)_{D∈D_n}, where D_n denotes a set of indices such that Card(D_n) grows at most polynomially with n. Inside each meta-model S̃_D = ∪_{m∈M_n(D)} S_m, an estimator ŝ_D is chosen from the data by optimizing a given criterion, for instance the empirical contrast L_{P_n}(t), but other criteria can be used. Second, CV is used for choosing among the (ŝ_D)_{D∈D_n}. Simulation experiments show that this simple trick automatically takes into account the cardinality of M_n, even when the data are heteroscedastic, contrary to other model selection procedures built for exponential collections of models, which all assume homoscedasticity of the data.

9 Closed-form formulas and fast computation
Resampling strategies, like CV, are known to be time consuming. The naive implementation of CV has a computational complexity of B times the complexity of training each algorithm A, which is usually intractable for LPO, even with p = 1. The computational cost of VFCV or RLT can still be quite high when B > 10 in many practical problems. Nevertheless, closed-form formulas for CV estimators of the risk can be obtained in several frameworks, which greatly decreases the computational cost of CV.

In density estimation, closed-form formulas were originally derived by Rudemo (1982) and by Bowman (1984) for the LOO risk estimator of histogram and kernel estimators. These results have recently been extended by Celisse and Robin (2008) to the LPO risk estimator with the quadratic loss. Similar results are more generally available for projection estimators, as settled by Celisse (2008a). Intuitively, such formulas can be obtained provided the number N of values taken by the B = (n choose n_v) hold-out estimators of the risk, corresponding to the different data splittings, is at most polynomial in the sample size.
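As an example of such a closed-form formula, the LOO least-squares criterion of a histogram can be computed from the bin counts alone; the sketch below uses the classical formula that follows from the computations of Rudemo (1982), for data supported on [0, 1].

```python
import numpy as np

def histogram_loo_score(data, n_bins):
    """Closed-form LOO least-squares CV score of a histogram estimator on
    [0, 1] with n_bins equal bins: no explicit loop over the n leave-one-out
    estimators is needed, only the bin counts."""
    n, h = len(data), 1.0 / n_bins
    counts, _ = np.histogram(data, bins=n_bins, range=(0.0, 1.0))
    return 2.0 / ((n - 1) * h) - (n + 1) / (n ** 2 * (n - 1) * h) * np.sum(counts ** 2)

rng = np.random.default_rng(3)
x = rng.beta(2, 5, size=500)
scores = {m: histogram_loo_score(x, m) for m in (5, 10, 20, 40, 80)}
print("selected number of bins:", min(scores, key=scores.get))
```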

For least-squares estimators in linear regression, Zhang (1993) proved a closed-form formula for the LOO estimator of the risk. Similar results have been obtained by Wahba (1975, 1977) and by Craven and Wahba (1979) in the spline smoothing context as well. These papers led in particular to the definition of GCV (see Section 4.3.3) and related procedures, which are often used instead of CV (with a naive implementation) because of their small computational cost, as emphasized by Girard (1998).
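The best-known instance is the leverage (hat-matrix) identity for ordinary least squares, sketched below: the LOO mean squared error is obtained from a single fit, without refitting n times (this is the standard identity; the function name is ours).

```python
import numpy as np

def loo_mse_closed_form(X, y):
    """LOO mean squared error for least squares via the hat matrix
    H = X (X'X)^{-1} X': each LOO residual equals the ordinary residual
    divided by (1 - H_ii)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    residuals = y - H @ y
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
print(loo_mse_closed_form(X, y))
```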
Closed-form formulas for the LPO estimator of the risk were also obtained by Celisse (2008b) in regression for kernel and projection estimators, in particular for regressograms. An important property of these closed-form formulas is their additivity: For a regressogram associated with a partition (I_λ)_{λ∈Λ_m} of X, the LPO estimator of the risk can be written as a sum over λ ∈ Λ_m of terms which only depend on the observations (X_i, Y_i) such that X_i ∈ I_λ. Therefore, dynamic programming (Bellman and Dreyfus, 1962) can be used for minimizing the LPO estimator of the risk over the set of partitions of X into D pieces. As an illustration, Arlot and Celisse (2009) successfully applied this strategy in the change-point detection framework. Note that the same idea can be used with VFCV or RLT, but at a larger computational cost, since no closed-form formulas are available for these CV methods.

Finally, in frameworks where no closed-form formula can be proved, some efficient algorithms exist for avoiding to recompute L̂_HO(A; D_n; I_j^(t)) from scratch for each data splitting I_j^(t). These algorithms rely on updating formulas such as the ones by Ripley (1996) for LOO in linear and quadratic discriminant analysis; this approach makes LOO as expensive to compute as the empirical risk.
Very similar formulas are also available for LOO and the k-nearest neighbours estimator in classification (Daudin and Mary-Huard, 2008).

10 Conclusion: which cross-validation method for which problem?
This conclusion collects a few guidelines aiming at helping CV users, first in interpreting the results of CV, second in appropriately using CV for each specific problem.

10.1 The general picture


Drawing a general conclusion on CV methods is an impossible task because of the variety of frameworks where CV can be used, which induces a variety of behaviours of CV. Nevertheless, we can still point out the three main criteria to take into account when choosing a CV method for a particular model selection problem:

Bias: CV roughly estimates the risk of a model with a sample size n_t < n (see Section 5.1). Usually, this implies that CV overestimates the variance term compared to the bias term in the bias-variance decomposition (2) with sample size n.
When the goal is estimation and the signal-to-noise ratio (SNR) is large, the smaller the bias, usually the better, which is obtained by taking n_t ∼ n. Otherwise, CV can be asymptotically suboptimal. Nevertheless, when the goal is estimation and the SNR is small, keeping a small upward bias on the variance term often improves the performance, which is obtained by taking n_t ∼ λn with λ ∈ (0, 1). See Section 6.
When the goal is identification, a large bias is often needed, which is obtained by taking n_t ≪ n; depending on the framework, larger values of n_t can also lead to model consistency, see Section 7.

Variability: The variance of the CV estimator of the risk is usually a decreasing function of the number B of splits, for a fixed training sample size. When the number of splits is fixed, the variability of CV also depends on the training sample size n_t. Usually, CV is more variable when n_t is closer to n. However, when B is linked with n_t (as for VFCV or LPO), the variability of CV must be quantified precisely, which has been done in few frameworks only. The only general conclusion on this point is that the CV method with minimal variability seems strongly framework-dependent; see Section 5.2 for details.

Computational complexity: Unless closed-form formulas or analytic approximations are available (see Section 9), the complexity of CV is roughly proportional to the number of data splits: 1 for the hold-out, V for VFCV, B for RLT or MCCV, n for LOO, and (n choose p) for LPO.
The optimal trade-off between these three factors can be different for each problem, depending for instance on the computational complexity of each estimator, on specificities of the framework considered, and on the final user's trade-off between statistical performance and computational cost. Therefore, no optimal CV method can be pointed out before the final user's preferences have been taken into account.

Nevertheless, in density estimation, closed-form expressions of the LPO estimator have been derived by Celisse and Robin (2008) for histogram and kernel estimators, and by Celisse (2008a) for projection estimators. These expressions allow one to perform LPO without additional computational cost, which reduces the aforementioned trade-off to the easier bias-variability trade-off. In particular, Celisse and Robin (2008) proposed to choose p for LPO by minimizing a criterion defined as the sum of a squared bias term and a variance term (see also Politis et al., 1999, Chapter 9).

10.2 How should the splits be chosen?


For the hold-out, VFCV, and RLT, an important question is how to choose a particular sequence of data splits.
First, should this step be random and independent from D_n, or take into account some features of the problem or of the data? It is often recommended to take into account the structure of the data when choosing the splits. If the data are stratified, the proportions of the different strata should (approximately) be the same in the sample and in each training and validation sample. Besides, the training samples should be chosen so that ŝ_m(D_n^(t)) is well defined for every training set; in the regressogram case, this led Arlot (2008c) and Arlot and Celisse (2009) to choose the splitting scheme carefully. In supervised classification, practitioners usually choose the splits so that the proportion of each class is the same in every validation sample as in the whole sample (see the sketch below). Nevertheless, Breiman and Spector (1992) made simulation experiments in regression for comparing several splitting strategies. No significant improvement was reported from taking into account the stratification of the data when choosing the splits.
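A minimal sketch of the stratified splitting practice just described (the function is ours): each class is shuffled separately and dealt out to the V folds in turn, so that class proportions are nearly identical across folds.

```python
import numpy as np

def stratified_fold_assignment(labels, V, seed=0):
    """Assign each observation to one of V folds so that every class is
    spread (almost) evenly across the folds."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        folds[idx] = np.arange(len(idx)) % V
    return folds

y = np.array([0] * 30 + [1] * 10)
folds = stratified_fold_assignment(y, V=5)
print([np.bincount(y[folds == v], minlength=2) for v in range(5)])
```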
Another question related to the choice of (I_j^(t))_{1≤j≤B} is whether the I_j^(t) should be independent (like MCCV), slightly dependent (like RLT), or strongly dependent (like VFCV). It seems intuitive that giving similar roles to all data points in the B training and validation tasks should yield more reliable results than other methods. This intuition may explain why VFCV is much more used than RLT or MCCV. Similarly, Shao (1993) proposed a CV method called BICV, where every point and every pair of points appear in the same number of splits, see Section 4.3.2. Nevertheless, most recent theoretical results on the various CV procedures are not accurate enough to distinguish which one may be the best splitting strategy: This remains a widely open theoretical question.
Note finally that the additional variability due to the choice of a sequence of data splits was quantified empirically by Jonathan et al. (2000) and theoretically by Celisse and Robin (2008) for VFCV.

10.3 V-fold cross-validation


VFCV is certainly the most popular CV procedure, in particular because of its mild computational cost. Nevertheless, the question of choosing V remains widely open, even if indications can be given towards an appropriate choice.
A specific feature of VFCV (as well as of exhaustive strategies) is that choosing V uniquely determines the size of the training set, n_t = n(V − 1)/V, and the number of splits, B = V, hence the computational cost. Contradictory phenomena then occur.

On the one hand, the bias of VFCV decreases with V, since n_t = n(1 − 1/V) observations are used in the training set. On the other hand, the variance of VFCV decreases with V for small values of V, whereas the LOO (V = n) is known to suffer from a high variance in several frameworks such as classification or density estimation. Note however that the variance of VFCV is minimal for V = n in some frameworks like linear regression (see Section 5.2). Furthermore, estimating the variance of VFCV from the data is a difficult problem in general; see Section 5.2.3.

When the goal of model selection is estimation, it is often reported in the literature that the optimal V is between 5 and 10, because the statistical performance does not increase much for larger values of V, and averaging over 5 or 10 splits remains computationally feasible (Hastie et al., 2001, Section 7.10). Even if this claim is clearly true for many problems, the conclusion of this survey is that better statistical performance can sometimes be obtained with other values of V, for instance depending on the SNR value.
When the SNR is large, the asymptotic comparison of CV procedures recalled in Section 6.2 can be trusted: LOO performs (nearly) unbiased risk estimation, hence is asymptotically optimal, whereas VFCV with V = O(1) is suboptimal. On the contrary, when the SNR is small, overpenalization can improve the performance. Therefore, VFCV with V < n can yield a smaller risk than LOO thanks to its bias, and despite its variance when V is small (see the simulation experiments by Arlot, 2008c). Furthermore, other CV procedures like RLT can be interesting alternatives to VFCV, since they allow the bias (through n_t) to be chosen independently from B, which mainly governs the variance. Another possible alternative is V-fold penalization, which is related to corrected VFCV (see Section 4.3.3).
When the goal of model selection is identification, the main drawback of VFCV is that n_t ≪ n is often required for choosing the true model consistently (see Section 7), whereas VFCV does not allow n_t < n/2. Depending on the framework, different (empirical) recommendations for choosing V can be found in the literature. In ordered variable selection, the largest V seems to be the best, with V = 10 providing results close to the optimal ones (Zhang, 1993). On the contrary, Dietterich (1998) and Alpaydin (1999) recommend V = 2 for choosing the best learning procedure among two candidates.

10.4 Future research


Perhaps the most important direction for future research would be to provide, in each specific framework, precise quantitative measures of the variance of CV estimators of the risk, depending on n_t, the number of splits, and how the splits are chosen. Up to now, only a few precise results have been obtained in this direction, for some specific CV methods in linear regression or density estimation (see Section 5.2). Proving similar results in other frameworks and for more general CV methods would greatly help to choose a CV method for any given model selection problem.
More generally, most theoretical results are not precise enough to make any distinction between the hold-out and CV methods having the same training sample size n_t, because they are equivalent at first order. Second-order terms do matter for realistic values of n, which shows the dramatic need for theory that takes into account the variance of CV when comparing CV methods such as VFCV and RLT with n_t = n(V − 1)/V but B ≠ V.

References
Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267-281. Akadémiai Kiadó, Budapest.

Allen, D. M. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127.

Alpaydin, E. (1999). Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neur. Comp., 11(8):1885-1892.

Anderson, R. L., Allen, D. M., and Cady, F. B. (1972). Selection of predictor variables in linear multiple regression. In Bancroft, T. A., editor, Statistical Papers in Honor of George W. Snedecor. Iowa State University Press, Iowa.

Arlot, S. (2007). Resampling and Model Selection. PhD thesis, University Paris-Sud 11. oai:tel.archives-ouvertes.fr:tel-00198803_v1.

Arlot, S. (2008a). Model selection by resampling penalization. Electronic Journal of Statistics. To appear. oai:hal.archives-ouvertes.fr:hal-00262478_v1.

Arlot, S. (2008b). Suboptimality of penalties proportional to the dimension for model selection in heteroscedastic regression. arXiv:0812.3141.

Arlot, S. (2008c). V-fold cross-validation improved: V-fold penalization. arXiv:0802.0566v2.

Arlot, S. and Celisse, A. (2009). Segmentation in the mean of heteroscedastic data via cross-validation. arXiv:0902.3977v2.

Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res., 10:245-279 (electronic).

Baraud, Y. (2002). Model selection for regression on a random design. ESAIM Probab. Statist., 6:127-146 (electronic).

Barron, A., Birgé, L., and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301-413.

Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48:85-113.

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist., 33(4):1497-1537.

Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res., 3(Spec. Issue Comput. Learn. Theory):463-482.
Bellman, R. E. and Dreyfus, S. E. (1962). Applied Dynamic Programming. Princeton.

Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of K-fold cross-validation. J. Mach. Learn. Res., 5:1089-1105 (electronic).

Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's FPE criterion. Biometrika, 64(3):547-551.

Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203-268.

Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33-73.

Blanchard, G. and Massart, P. (2006). Discussion: "Local Rademacher complexities and oracle inequalities in risk minimization" [Ann. Statist. 34 (2006), no. 6, 2593-2656] by V. Koltchinskii. Ann. Statist., 34(6):2664-2671.

Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM Probab. Stat., 9:323-375 (electronic).

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Machine Learning Research, 2:499-526.

Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353-360.

Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist., 24(6):2350-2383.

Breiman, L. (1998). Arcing classifiers. Ann. Statist., 26(3):801-849. With discussion and a rejoinder by the author.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth Statistics/Probability Series. Wadsworth Advanced Books and Software, Belmont, CA.

Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case. International Statistical Review, 60(3):291-319.

Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503-514.

Burman, P. (1990). Estimation of optimal transformations using v-fold cross validation and repeated learning-testing methods. Sankhyā Ser. A, 52(3):314-345.

Burman, P., Chow, E., and Nolan, D. (1994). A cross-validatory method for dependent data. Biometrika, 81(2):351-358.

Burman, P. and Nolan, D. (1992). Data-dependent estimation of prediction functions. J. Time Ser. Anal., 13(3):189-207.
Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference. Springer-Verlag, New York, second edition. A practical information-theoretic approach.

Cao, Y. and Golubev, Y. (2006). On oracle inequalities related to smoothing splines. Math. Methods Statist., 15(4):398-414.

Celisse, A. (2008a). Density estimation via cross-validation: Model selection point of view. Technical report, arXiv:0811.0802.

Celisse, A. (2008b). Model Selection Via Cross-Validation in Density Estimation, Regression and Change-Points Detection. PhD thesis, University Paris-Sud 11. http://tel.archives-ouvertes.fr/tel-00346320/en/.

Celisse, A. and Robin, S. (2008). Nonparametric density estimation by exact leave-p-out cross-validation. Computational Statistics and Data Analysis, 52(5):2350-2368.

Chow, Y. S., Geman, S., and Wu, L. D. (1987). Consistent cross-validated density estimation. Ann. Statist., 11:25-38.

Chu, C.-K. and Marron, J. S. (1991). Comparison of two bandwidth selectors with dependent errors. Ann. Statist., 19(4):1906-1918.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31(4):377-403.

Dalelane, C. (2005). Exact oracle inequality for sharp adaptive kernel density estimator. Technical report, arXiv.

Daudin, J.-J. and Mary-Huard, T. (2008). Estimation of the conditional risk in classification: The swapping method. Comput. Stat. Data Anal., 52(6):3220-3232.

Davison, A. C. and Hall, P. (1992). On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika, 79(2):279-284.

Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics (New York). Springer-Verlag, New York.

Devroye, L. and Wagner, T. J. (1979). Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601-604.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neur. Comp., 10(7):1895-1924.

Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316-331.

Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81(394):461-470.
Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. J. Amer. Statist. Assoc., 99(467):619-642. With comments and a rejoinder by the author.

Efron, B. and Morris, C. (1973). Combining possibly related estimation problems (with discussion). J. R. Statist. Soc. B, 35:379.

Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: the .632+ bootstrap method. J. Amer. Statist. Assoc., 92(438):548-560.

Fromont, M. (2007). Model selection by bootstrap penalization for classification. Mach. Learn., 66(2-3):165-207.

Geisser, S. (1974). A predictive approach to the random effect model. Biometrika, 61(1):101-107.

Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70:320-328.

Girard, D. A. (1998). Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression. Ann. Statist., 26(1):315-334.

Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press, Cambridge, MA, USA.

Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer-Verlag, New York.

Hall, P. (1983). Large sample optimality of least squares cross-validation in density estimation. Ann. Statist., 11(4):1156-1174.

Hall, P. (1987). On Kullback-Leibler loss and density estimation. The Annals of Statistics, 15(4):1491-1519.

Hall, P., Lahiri, S. N., and Polzehl, J. (1995). On bandwidth choice in nonparametric regression with both short- and long-range dependent errors. Ann. Statist., 23(6):1921-1936.

Hall, P., Marron, J. S., and Park, B. U. (1992). Smoothed cross-validation. Probab. Theory Related Fields, 92(1):1-20.

Hall, P. and Schucany, W. R. (1989). A local cross-validation algorithm. Statist. Probab. Lett., 8(2):109-117.

Härdle, W. (1984). How to determine the bandwidth of some nonlinear smoothers in practice. In Robust and Nonlinear Time Series Analysis (Heidelberg, 1983), volume 26 of Lecture Notes in Statist., pages 163-184. Springer, New York.

Härdle, W., Hall, P., and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? J. Amer. Statist. Assoc., 83(401):86-101. With comments by David W. Scott and Iain Johnstone and a reply by the authors.
Hart, J. D. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. J. Roy. Statist. Soc. Ser. B, 56(3):529-542.

Hart, J. D. and Vieu, P. (1990). Data-driven bandwidth choice for density estimation based on dependent data. Ann. Statist., 18(2):873-890.

Hart, J. D. and Wehrly, T. E. (1986). Kernel regression estimation using repeated measurements data. J. Amer. Statist. Assoc., 81(396):1080-1088.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York. Data mining, inference, and prediction.

Herzberg, A. M. and Tsukanov, A. V. (1986). A note on modifications of the jackknife criterion for model selection. Utilitas Math., 29:209-216.

Herzberg, P. A. (1969). The parameters of cross-validation. Psychometrika, 34:Monograph Supplement.

Hesterberg, T. C., Choi, N. H., Meier, L., and Fraley, C. (2008). Least angle and ℓ1 penalized regression: A review. Statistics Surveys, 2:61-93 (electronic).

Hills, M. (1966). Allocation rules and their error rates. J. Royal Statist. Soc. Series B, 28(1):1-31.

Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35:73-101.

Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2):297-307.

John, P. W. M. (1971). Statistical Design and Analysis of Experiments. The Macmillan Co., New York.

Jonathan, P., Krzanowski, W. J., and McCarthy, W. V. (2000). On the use of cross-validation to assess performance in multivariate prediction. Stat. and Comput., 10:209-229.

Kearns, M., Mansour, Y., Ng, A. Y., and Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7-50.

Kearns, M. and Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11:1427-1453.

Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47(5):1902-1914.

Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593-2656.

Lachenbruch, P. A. and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10(1):1-11.

Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. J. Educ. Psychol., 22:45-55.
Leung, D., Marriott, F., and Wu, E. (1993). Bandwidth selection in robust smoothing. J. Nonparametr. Statist., 2:333-339.

Leung, D. H.-Y. (2005). Cross-validation in nonparametric regression with outliers. Ann. Statist., 33(5):2291-2310.

Li, K.-C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. Ann. Statist., 13(4):1352-1377.

Li, K.-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15(3):958-975.

Mallows, C. L. (1973). Some comments on C_p. Technometrics, 15:661-675.

Markatou, M., Tian, H., Biswas, S., and Hripcsak, G. (2005). Analysis of variance of cross-validation estimators of the generalization error. J. Mach. Learn. Res., 6:1127-1168 (electronic).

Massart, P. (2007). Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6-23, 2003. With a foreword by Jean Picard.

Molinaro, A. M., Simon, R., and Pfeiffer, R. M. (2005). Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21(15):3301-3307.

Mosteller, F. and Tukey, J. W. (1968). Data analysis, including statistics. In Lindzey, G. and Aronson, E., editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley.

Nadeau, C. and Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52:239-281.

Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist., 12(2):758-765.

Opsomer, J., Wang, Y., and Yang, Y. (2001). Nonparametric regression with correlated errors. Statist. Sci., 16(2):134-153.

Picard, R. R. and Cook, R. D. (1984). Cross-validation of regression models. J. Amer. Statist. Assoc., 79(387):575-583.

Politis, D. N., Romano, J. P., and Wolf, M. (1999). Subsampling. Springer Series in Statistics. Springer-Verlag, New York.

Quenouille, M. H. (1949). Approximate tests of correlation in time-series. J. Roy. Statist. Soc. Ser. B, 11:68-84.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press.

Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416-431.

Ronchetti, E., Field, C., and Blanchard, W. (1997). Robust linear model selection by cross-validation. J. Amer. Statist. Assoc., 92:1017-1023.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9:65-78.

Sauvé, M. (2009). Histogram selection in non-Gaussian regression. ESAIM: Probability and Statistics, 13:70-86.

Schuster, E. F. and Gregory, G. G. (1981). On the consistency of maximum likelihood nonparametric density estimators. In Eddy, W. F., editor, Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, pages 295-298. Springer-Verlag, New York.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6(2):461-464.

Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88(422):486-494.

Shao, J. (1996). Bootstrap model selection. J. Amer. Statist. Assoc., 91(434):655-665.

Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica, 7(2):221-264. With comments and a rejoinder by the author.

Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika, 71(1):43-49.

Simon, F. (1971). Prediction Methods in Criminology, volume 7.

Stone, C. (1984). An asymptotically optimal window selection rule for kernel density estimates. The Annals of Statistics, 12(4):1285-1297.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111-147. With discussion by G. A. Barnard, A. C. Atkinson, L. K. Chan, A. P. Dawid, F. Downton, J. Dickey, A. G. Baker, O. Barndorff-Nielsen, D. R. Cox, S. Geisser, D. Hinkley, R. R. Hocking, and A. S. Young, and with a reply by the authors.

Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64(1):29-35.

Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Comm. Statist. A - Theory Methods, 7(1):13-26.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Royal Statist. Soc. Series B, 58(1):267-288.

van der Laan, M. J. and Dudoit, S. (2003). Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Working Paper 130, U.C. Berkeley Division of Biostatistics. Available at http://www.bepress.com/ucbbiostat/paper130.

van der Laan, M. J., Dudoit, S., and Keles, S. (2004). Asymptotic optimality of likelihood-based cross-validation. Stat. Appl. Genet. Mol. Biol., 3:Art. 4, 27 pp. (electronic).
van der Laan, M. J., Dudoit, S., and van der Vaart, A. W. (2006). The cross-validated adaptive epsilon-net estimator. Statist. Decisions, 24(3):373-395.

van der Vaart, A. W., Dudoit, S., and van der Laan, M. J. (2006). Oracle inequalities for multi-fold cross validation. Statist. Decisions, 24(3):351-371.

van Erven, T., Grünwald, P. D., and de Rooij, S. (2008). Catching up faster by switching sooner: A prequential solution to the AIC-BIC dilemma. arXiv:0807.1005.

Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, New York. Translated from the Russian by Samuel Kotz.

Vapnik, V. N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons Inc., New York. A Wiley-Interscience Publication.

Vapnik, V. N. and Chervonenkis, A. Y. (1974). Teoriya raspoznavaniya obrazov. Statisticheskie problemy obucheniya. Izdat. Nauka, Moscow. Theory of Pattern Recognition (in Russian).

Wahba, G. (1975). Periodic splines for spectral density estimation: The use of cross validation for determining the degree of smoothing. Communications in Statistics, 4:125-142.

Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis, 14(4):651-667.

Wegkamp, M. (2003). Model selection in nonparametric regression. The Annals of Statistics, 31(1):252-273.

Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937-950.

Yang, Y. (2006). Comparing learning methods for classification. Statist. Sinica, 16(2):635-657.

Yang, Y. (2007). Consistency of cross validation for comparing regression procedures. Ann. Statist., 35(6):2450-2473.

Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist., 21(1):299-313.