
C_p, AIC, BIC
Three Criteria for Model Selection
Rachel Fan
Statistics Department
Columbia University
September 16, 2011
Outline
1 Mallows C_p
2 Akaike Information Criteria (AIC)
3 Bayesian Information Criteria (BIC)
4 Comparison between AIC and BIC
Linear Model
Y = X\beta + \varepsilon
Y_{n\times 1} = (y_1, \dots, y_n)^T
X_{n\times(k+1)} = (1, x_1, \dots, x_k)
\beta_{(k+1)\times 1} = (\beta_0, \dots, \beta_k)^T
\varepsilon_{n\times 1} = (\varepsilon_1, \dots, \varepsilon_n)^T, with independent errors, E(\varepsilon_i) = 0, V(\varepsilon_i) = \sigma^2
E(Y) = X\beta
Notation
K^+ = \{0, 1, \dots, k\}, the set of indices
P \subseteq K^+; Q = K^+ \setminus P
|P|: the number of elements in P; let |P| = p and |Q| = q, so p + q = k + 1
\hat\beta_P = X_P^- Y: the LS estimator of \beta using the subset of X with indices in P
X_P^- has zeroes in the rows corresponding to Q, and the remaining rows contain the matrix (Z_P^T Z_P)^{-1} Z_P^T, where Z_P is obtained from X by deleting the columns corresponding to Q
\hat Y_P = X\hat\beta_P = X X_P^- Y = H_P Y
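
A minimal numerical sketch of this notation, assuming simulated data; the design, coefficients, and subset P below are illustrative choices, not from the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1); column 0 is the intercept
beta = np.array([1.0, 2.0, 0.0, -1.5, 0.0])                 # true beta, some entries zero
y = X @ beta + rng.normal(size=n)                           # Y = X beta + eps, V(eps_i) = 1

P = [0, 1, 3]                                      # a subset of K+ = {0, ..., k}
Z_P = X[:, P]                                      # Z_P: X with the columns in Q deleted
beta_P, *_ = np.linalg.lstsq(Z_P, y, rcond=None)   # the LS estimator beta_hat_P
yhat_P = Z_P @ beta_P                              # equals H_P y, the projection onto span(Z_P)
```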
Scaled sum of squared errors
\Gamma_P = \|\hat Y_P - E(Y)\|^2 / \sigma^2, a measure of prediction adequacy
E\|\hat Y_P - E(Y)\|^2 = V_P + B_P
V_P = \mathrm{tr}(V(\hat Y_P)) = \mathrm{tr}(H_P)\sigma^2 = p\sigma^2
B_P = \|E(Y) - E(\hat Y_P)\|^2 = \beta^T X^T (I - H_P) X\beta
E(\Gamma_P) = p + B_P/\sigma^2
Mallows C_p
C_P := SSE_P / MSE_{k+1} - n + 2p
SSE_P = \|\hat Y_P - Y\|^2
E(SSE_P) = (n - p)\sigma^2 + B_P
MSE_P = SSE_P / (n - p)
If the full model contains all relevant variables, E(MSE_{k+1}) \approx \sigma^2
E(C_P) \approx p + B_P/\sigma^2 = E(\Gamma_P)
C_P is an estimate of \Gamma_P
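
A hedged sketch of computing C_P over candidate subsets, continuing the simulated X, y, k above; the helper mallows_cp is an illustrative name, not from the slides.

```python
import numpy as np
from itertools import combinations

def mallows_cp(X, y, P, k):
    """C_P = SSE_P / MSE_{k+1} - n + 2p, with the intercept at index 0."""
    n = len(y)
    resid_full = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    mse_full = resid_full @ resid_full / (n - (k + 1))       # MSE_{k+1}
    Z = X[:, list(P)]
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return (resid @ resid) / mse_full - n + 2 * len(P)

# score every subset containing the intercept; adequate subsets have C_P close to |P|
for r in range(1, k + 2):
    for P in combinations(range(k + 1), r):
        if 0 in P:
            print(P, round(mallows_cp(X, y, P, k), 2))
```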
Properties of Mallows C_p
C_P := SSE_P / MSE_{k+1} - n + 2p
If the P-subset model is adequate, SSE_P \approx (n - p)\sigma^2 and C_P \approx p
C_{K^+} = (n - k - 1) - n + 2(k + 1) = k + 1
If |P'| = p + 1 and P \subset P', then
C_{P'} - C_P = 2 - SS / MSE_{k+1}
where SS is the contribution to SSR by the (p+1)th variable
Assume \varepsilon_i \sim N(0, \sigma^2); then SS / MSE_{k+1} \sim F_{1,\,n-k-1} = t_{n-k-1}^2
If the additional variable is unimportant, then B_{P'} \approx B_P, E(SS) \approx \sigma^2, and so E(C_{P'} - C_P) \approx 1
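
A quick numeric check of the two identities above, reusing X, y, k and mallows_cp from the sketches earlier; illustrative only.

```python
# full model: C_{K+} = (n - k - 1) - n + 2(k + 1) = k + 1 exactly
full = tuple(range(k + 1))
print(mallows_cp(X, y, full, k))          # equals k + 1 = 5

# nested pair P in P': the difference is 2 - SS / MSE_{k+1}
print(mallows_cp(X, y, (0, 1, 3), k) - mallows_cp(X, y, (0, 1), k))
```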
C_P plot
Figure: C_P plot with independent variables. P is an adequate subset (E(C_P) = p), p = k - 2, and K^+ \setminus P = \{1, 2, 3\} are unimportant. Every non-zero element of \beta is significant.
C_P plot
Figure: Variables 1, 2, 3 are highly explanatory and highly correlated, and the variables in P are also explanatory.
C_P plot
Figure: C_P plot. Variables 1 and 2 are jointly explanatory but separately not, and the variables in P are also explanatory, where |P| = k - 4.
Model
x_1, x_2, \dots, x_n \overset{iid}{\sim} g(x)
True model: g(x)
Candidate parametric model: f(x|\theta), \theta \in \Theta_p
MLE of \theta: \hat\theta
Fitted model: f(x|\hat\theta)
Kullback-Leibler Information
I(g; f(\cdot|\theta)) := E\left( \log \frac{g(x)}{f(x|\theta)} \right) = S(g; g) - S(g; f(\cdot|\theta))
S(g; f(\cdot|\theta)) = E(\log f(x|\theta)) = \int \log f(x|\theta)\, g(x)\, dx
I represents the separation between g and f
I \ge 0, and equality holds iff f = g a.e.
The best fitting model minimizes I(g; f(\cdot|\theta)), or equivalently, maximizes S(g; f(\cdot|\theta))
S(g; f(\cdot|\theta)) cannot be computed, since g(x) is unknown.
By the SLLN,
\frac{1}{n} \sum_{i=1}^n \log f(x_i|\theta) \to S(g; f(\cdot|\theta))
Maximizing the average log-likelihood leads to the MLE \hat\theta
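
A small simulation of the SLLN statement above: the average log-likelihood approaches S(g; f(·|θ)). Taking g = N(0, 1) and f(·|θ) = N(θ, 1) is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta = 0.5
x = rng.normal(size=100_000)                  # x_i iid from g = N(0, 1)

avg_loglik = norm.logpdf(x, loc=theta).mean() # (1/n) sum_i log f(x_i | theta)

# For these normals, S(g; f(.|theta)) = -0.5*log(2*pi) - 0.5*(1 + theta^2)
S = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + theta**2)
print(avg_loglik, S)                          # close for large n
```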
Derivation of AIC
Consider g(x) = f(x|\theta_0), \theta_0 \in \Theta_k
Denote I(g; f(\cdot|\theta)) and S(g; f(\cdot|\theta)) by I(\theta_0, \theta) and S(\theta_0, \theta)
Assume \theta_0 \notin \Theta_p (p < k), and let \theta^* = \arg\max_{\theta \in \Theta_p} S(\theta_0, \theta)
Suppose \theta^* is sufficiently close to \theta_0
Try to find the model that minimizes E(I(\theta_0, \hat\theta))
Derivation of AIC
E(2n I(\theta_0, \hat\theta)) = E\left[ 2n\left( S(\theta_0, \theta_0) - S(\theta_0, \hat\theta) \right) \right]
= E\left[ 2n\left( S(\theta_0, \theta_0) - S(\theta_0, \theta^*) + S(\theta_0, \theta^*) - S(\theta_0, \hat\theta) \right) \right]
\approx E\left[ n\|\theta^* - \theta_0\|_I^2 + n\|\hat\theta - \theta^*\|_I^2 \right] \to n\|\theta^* - \theta_0\|_I^2 + p
where \|\theta^* - \theta_0\|_I^2 = (\theta^* - \theta_0)^T I(\theta_0)(\theta^* - \theta_0) and \|\hat\theta - \theta^*\|_I^2 = (\hat\theta - \theta^*)^T I(\theta^*)(\hat\theta - \theta^*)
The last limit follows from \sqrt{n}(\hat\theta - \theta^*) \overset{d}{\to} N(0, I^{-1}(\theta^*))
Derivation of AIC
Estimate n\|\theta^* - \theta_0\|_I^2 by
2\left( \sum_{i=1}^n \log f(x_i|\theta_0) - \sum_{i=1}^n \log f(x_i|\hat\theta) \right)
This needs a bias correction: add p.
Therefore, an asymptotically unbiased estimator of E(2n I(\theta_0, \hat\theta)) is
2\left( \sum_{i=1}^n \log f(x_i|\theta_0) - \sum_{i=1}^n \log f(x_i|\hat\theta) \right) + 2p
Since the term involving \theta_0 does not depend on the candidate model, minimizing E(I(\theta_0, \hat\theta)) is equivalent to minimizing
AIC := -2 \log(\text{maximum likelihood}) + 2p
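
A minimal sketch of the AIC formula in practice, assuming a Gaussian model fitted by maximum likelihood; the data and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=200)

mu_hat, sigma_hat = x.mean(), x.std()       # MLEs for N(mu, sigma^2)
loglik = norm.logpdf(x, loc=mu_hat, scale=sigma_hat).sum()

p = 2                                       # number of fitted parameters
aic = -2 * loglik + 2 * p
print(aic)
```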
Model
x_1, x_2, \dots, x_n \overset{iid}{\sim} f(x|\theta)
True model: f(x|\theta) = \exp\{\theta\, y(x) - b(\theta)\}, where \theta \in \mathbb{R}^K
Candidate models: M_1, M_2, \dots, M_{2^K}, where M_j is a k_j-dimensional linear submanifold of the K-dimensional space
Prior: P(M = M_j) = \alpha_j; \theta_j | M_j \sim \mu_j
Posterior: f(M_j, \theta_j|x) = \frac{f(x|\theta_j)\,\alpha_j\, d\mu_j(\theta_j)}{m(x)}, where m(x) is the marginal density of x
f(M_j|x) = \frac{\alpha_j \int f(x|\theta_j)\, d\mu_j(\theta_j)}{m(x)}
Find the model that gives the highest posterior probability.
BIC
S(Y, n, j) := \log \int_{M_j} \exp\left( n(\theta Y - b(\theta)) \right) d\mu_j(\theta)
Proposition: For fixed Y and j, as n \to \infty,
S(Y, n, j) = n \sup_\theta (\theta Y - b(\theta)) - \frac{1}{2} k_j \log n + R
where the remainder R = R(Y, n, j) is bounded in n for fixed Y and j.
BIC := -2 \log(\text{maximum likelihood}) + p \log n
where p is the number of parameters in the candidate model
Select the model that minimizes BIC (equivalently, maximizes S(Y, n, j))
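
Continuing the Gaussian AIC sketch above, BIC swaps the 2p penalty for p log n; again an illustrative sketch, not the original authors' code.

```python
# uses x, loglik, p, and aic from the AIC sketch above
n = len(x)
bic = -2 * loglik + p * np.log(n)
print(aic, bic)   # for n >= 8 the BIC penalty exceeds the AIC penalty
```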
BIC
Consider the general situation in which f(x|\theta) need not be an exponential family distribution.
-2 \log f(M_j|x) = 2 \log m(x) \underbrace{-\, 2 \log \alpha_j - 2 \log \int f(x|\theta_j)\, d\mu_j(\theta_j)}_{\tilde S(M_j|x)}
Select the model that minimizes \tilde S(M_j|x)
BIC
By a second order Taylor expansion,
\log f(x|\theta_j) = \log L(\theta_j) \approx \log L(\hat\theta_j) - \frac{1}{2} (\theta_j - \hat\theta_j)^T \left[ n \hat I(\hat\theta_j) \right] (\theta_j - \hat\theta_j)
where
\hat I(\hat\theta_j) = -\frac{1}{n} \left. \frac{\partial^2 \log L(\theta_j)}{\partial \theta_j^2} \right|_{\theta_j = \hat\theta_j}
If we have the noninformative prior d\mu_j(\theta_j) = d\theta_j, then
\int L(\theta_j)\, d\mu_j(\theta_j) \approx L(\hat\theta_j) \int \exp\left( -\frac{1}{2} (\theta_j - \hat\theta_j)^T \left[ n \hat I(\hat\theta_j) \right] (\theta_j - \hat\theta_j) \right) d\theta_j = L(\hat\theta_j) \left( \frac{2\pi}{n} \right)^{k_j/2} \left| \hat I(\hat\theta_j) \right|^{-1/2}
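
A numerical sanity check of the Laplace approximation above in one dimension. The Bernoulli likelihood and flat prior are assumptions chosen for illustration.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=100)
n, s = len(x), x.sum()

loglik = lambda th: s * np.log(th) + (n - s) * np.log(1 - th)
th_hat = s / n                                            # MLE
# observed information per observation at the MLE
I_hat = (s / th_hat**2 + (n - s) / (1 - th_hat)**2) / n

exact, _ = quad(lambda th: np.exp(loglik(th)), 0, 1)      # int L(theta) d theta
laplace = np.exp(loglik(th_hat)) * np.sqrt(2 * np.pi / (n * I_hat))
print(exact, laplace)                                     # close for moderate n
```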
BIC
Therefore,
\tilde S(M_j|x) \approx -2 \log \alpha_j - 2 \log \left[ L(\hat\theta_j) \left( \frac{2\pi}{n} \right)^{k_j/2} \left| \hat I(\hat\theta_j) \right|^{-1/2} \right]
= -2 \log \alpha_j - 2 \log L(\hat\theta_j) + k_j \log \frac{n}{2\pi} + \log \left| \hat I(\hat\theta_j) \right|
Ignoring the terms that remain bounded as n \to \infty, we obtain
\tilde S(M_j|x) \approx -2 \log L(\hat\theta_j) + k_j \log n =: BIC
Comparison between AIC and BIC
AIC = -2 \log(\text{maximum likelihood}) + 2p
BIC = -2 \log(\text{maximum likelihood}) + p \log n
BIC is consistent yet not asymptotically efficient; AIC is asymptotically efficient but not consistent.
Consistency: Suppose the true model has finite dimension and is nested in the candidate collection under consideration. A consistent criterion asymptotically selects the true model with probability one.
Efficiency: Suppose the true model has infinite dimension and therefore lies outside the candidate collection under consideration. An asymptotically efficient criterion asymptotically selects the model that minimizes the mean squared error of prediction.
Comparison between AIC and BIC
AIC = -2 \log(\text{maximum likelihood}) + 2p
BIC = -2 \log(\text{maximum likelihood}) + p \log n
The penalty term of BIC is more stringent than that of AIC: for n \ge 8, p \log n exceeds 2p.
Therefore, BIC favors smaller models than AIC. A quick check of the crossover appears below.
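
A one-liner confirming the n ≥ 8 crossover claimed above.

```python
import numpy as np

for n in (7, 8, 20, 100):
    print(n, np.log(n) > 2)   # p*log(n) > 2p exactly when log(n) > 2, i.e. n >= 8
```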
References
Mallows, C. L. (1973), "Some Comments on C_p", Technometrics 15(4): 661-675.
Akaike, H. (1974), "A New Look at the Statistical Model Identification", IEEE Transactions on Automatic Control AC-19(6): 716-723.
Schwarz, G. (1978), "Estimating the Dimension of a Model", The Annals of Statistics 6(2): 461-464.
Proof of Proposition
Proposition: For fixed Y and j, as n \to \infty,
S(Y, n, j) = n \sup_\theta (\theta Y - b(\theta)) - \frac{1}{2} k_j \log n + R
where the remainder R = R(Y, n, j) is bounded as n goes to \infty for fixed Y and j.
Lemma 1. The proposition holds when \theta Y - b(\theta) = A - \lambda\|\theta - \theta_0\|^2, where \lambda > 0, \theta_0 is a fixed vector in M_j, and \mu_j is the Lebesgue measure on M_j.
Lemma 2. If two bounded positive random variables U and V agree on the set where either exceeds \varepsilon, for some 0 < \varepsilon < \sup U, then \log E(U^n) - \log E(V^n) \to 0 as n \to \infty.
Lemma 3. For some 0 < \varepsilon < e^A, where A = \sup_\theta (\theta Y - b(\theta)), a vector \theta_0, and some positive \lambda_1 and \lambda_2, the following holds wherever \exp(\theta Y - b(\theta)) > \varepsilon:
A - \lambda_1 \|\theta - \theta_0\|^2 < \theta Y - b(\theta) < A - \lambda_2 \|\theta - \theta_0\|^2
Proof of Lemma 1
Lemma 1. The proposition holds when \theta Y - b(\theta) = A - \lambda\|\theta - \theta_0\|^2, where \lambda > 0, \theta_0 is a fixed vector in M_j, and \mu_j is the Lebesgue measure on M_j.
Proof: The Gaussian integral gives
S(Y, n, j) = \log\left( \alpha_j e^{nA} \left( \frac{\pi}{\lambda n} \right)^{k_j/2} \right) = nA - \frac{1}{2} k_j \log n + \frac{1}{2} k_j \log \frac{\pi}{\lambda} + \log \alpha_j
and \sup_\theta \left( A - \lambda\|\theta - \theta_0\|^2 \right) = A.
This establishes the proposition, with R = \frac{1}{2} k_j \log \frac{\pi}{\lambda} + \log \alpha_j.
Proof of Lemma 2
Lemma 2. If two bounded positive random variables U and V agree on the set where either exceeds \varepsilon, for some 0 < \varepsilon < \sup U, then \log E(U^n) - \log E(V^n) \to 0 as n \to \infty.
Proof: It suffices to show that this holds for a V that vanishes wherever U \le \varepsilon. In this case 0 \le U^n - V^n \le \varepsilon^n, and therefore
E(V^n) \le E(U^n) \le E(V^n) + \varepsilon^n = E(V^n)\left( 1 + \frac{\varepsilon^n}{E(V^n)} \right)
From (E(V^n))^{1/n} \to \sup V and \sup V = \sup U > \varepsilon, we know \varepsilon/(E(V^n))^{1/n} < 1 in the limit, hence \frac{\varepsilon^n}{E(V^n)} \to 0, and hence \log\left( 1 + \frac{\varepsilon^n}{E(V^n)} \right) \to 0, which establishes the result.
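
A tiny numeric illustration of Lemma 2, with two discrete random variables that agree above ε = 0.5; the specific values are assumptions chosen for the example.

```python
import numpy as np

# U takes values {0.2, 0.9} with equal probability; V agrees with U above
# eps = 0.5 and vanishes below it
vals_U = np.array([0.2, 0.9])
vals_V = np.array([0.0, 0.9])
probs = np.array([0.5, 0.5])

for n in (1, 5, 20, 50):
    EU = (vals_U**n * probs).sum()
    EV = (vals_V**n * probs).sum()
    print(n, np.log(EU) - np.log(EV))   # -> 0 as n grows
```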
Proof of Lemma 3
Lemma 3. For some 0 < \varepsilon < e^A, where A = \sup_\theta (\theta Y - b(\theta)), a vector \theta_0, and some positive \lambda_1 and \lambda_2, the following holds wherever \exp(\theta Y - b(\theta)) > \varepsilon:
A - \lambda_1\|\theta - \theta_0\|^2 < \theta Y - b(\theta) < A - \lambda_2\|\theta - \theta_0\|^2
Proof: Since \nabla^2 b(\theta) = \mathrm{Cov}(y), it is positive definite. Therefore \theta Y - b(\theta) is strictly concave and attains its maximum. Let \theta_0 = \arg\max_\theta (\theta Y - b(\theta)). A second order Taylor expansion at \theta_0 yields
\theta Y - b(\theta) \approx A - \frac{1}{2} (\theta - \theta_0)^T \nabla^2 b(\theta_0) (\theta - \theta_0)
Choose \lambda_1 and \lambda_2 so that 2\lambda_1 is larger and 2\lambda_2 is smaller than every eigenvalue of \nabla^2 b(\theta_0); the two bounds then hold in a neighborhood of \theta_0. It is now easy to choose \varepsilon < e^A so that \varepsilon bounds \exp(\theta Y - b(\theta)) outside that neighborhood.