
C_p, AIC, BIC
Three Criteria for Model Selection
Rachel Fan
Statistics Department
Columbia University
September 16, 2011
Outline
1 Mallows C_p
2 Akaike Information Criteria (AIC)
3 Bayesian Information Criteria (BIC)
4 Comparison between AIC and BIC
Linear Model
Y = X\beta + \varepsilon
Y_{n\times 1} = (y_1, \dots, y_n)^T
X_{n\times(k+1)} = (1, x_1, \dots, x_k)
\beta_{(k+1)\times 1} = (\beta_0, \dots, \beta_k)^T
\varepsilon_{n\times 1} = (\varepsilon_1, \dots, \varepsilon_n)^T, with independent errors, E(\varepsilon_i) = 0, V(\varepsilon_i) = \sigma^2
E(Y) = X\beta
Notation
K^+ = \{0, 1, \dots, k\}, the set of indices
P \subseteq K^+; Q = K^+ \setminus P
|P|: the number of elements in P; let |P| = p and |Q| = q, so p + q = k + 1
\hat\beta_P = X_P^- Y: the LS estimator of \beta using the subset of X with indices in P
X_P^- has zeroes in the rows corresponding to Q, and the remaining rows contain the matrix (Z_P^T Z_P)^{-1} Z_P^T, where Z_P is obtained from X by deleting the columns corresponding to Q
\hat Y_P = X\hat\beta_P = X X_P^- Y = H_P Y
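
A minimal numerical sketch of this notation, assuming simulated data; the design, coefficients, and subset P below are illustrative choices, not from the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1); column 0 is the intercept
beta = np.array([1.0, 2.0, 0.0, -1.5, 0.0])                 # true beta, some entries zero
y = X @ beta + rng.normal(size=n)                           # Y = X beta + eps, V(eps_i) = 1

P = [0, 1, 3]                                      # a subset of K+ = {0, ..., k}
Z_P = X[:, P]                                      # Z_P: X with the columns in Q deleted
beta_P, *_ = np.linalg.lstsq(Z_P, y, rcond=None)   # the LS estimator beta_hat_P
yhat_P = Z_P @ beta_P                              # equals H_P y, the projection onto span(Z_P)
```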
Scaled sum of squared errors
\Gamma_P = \|\hat Y_P - E(Y)\|^2 / \sigma^2, a measure of prediction adequacy
E\|\hat Y_P - E(Y)\|^2 = V_P + B_P
V_P = \mathrm{tr}(V(\hat Y_P)) = \mathrm{tr}(H_P)\sigma^2 = p\sigma^2
B_P = \|E(Y) - E(\hat Y_P)\|^2 = \beta^T X^T (I - H_P) X\beta
E(\Gamma_P) = p + B_P/\sigma^2
Mallows C_p
C_P := SSE_P / MSE_{k+1} - n + 2p
SSE_P = \|\hat Y_P - Y\|^2
E(SSE_P) = (n - p)\sigma^2 + B_P
MSE_P = SSE_P / (n - p)
If the full model contains all relevant variables, E(MSE_{k+1}) \approx \sigma^2
E(C_P) \approx p + B_P/\sigma^2 = E(\Gamma_P)
C_P is an estimate of \Gamma_P
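
A hedged sketch of computing C_P over candidate subsets, continuing the simulated X, y, k above; the helper mallows_cp is an illustrative name, not from the slides.

```python
import numpy as np
from itertools import combinations

def mallows_cp(X, y, P, k):
    """C_P = SSE_P / MSE_{k+1} - n + 2p, with the intercept at index 0."""
    n = len(y)
    resid_full = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    mse_full = resid_full @ resid_full / (n - (k + 1))       # MSE_{k+1}
    Z = X[:, list(P)]
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return (resid @ resid) / mse_full - n + 2 * len(P)

# score every subset containing the intercept; adequate subsets have C_P close to |P|
for r in range(1, k + 2):
    for P in combinations(range(k + 1), r):
        if 0 in P:
            print(P, round(mallows_cp(X, y, P, k), 2))
```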
Properties of Mallows C_p
C_P := SSE_P / MSE_{k+1} - n + 2p
If the P-subset model is adequate, SSE_P \approx (n - p)\sigma^2 and C_P \approx p
C_{K^+} = (n - k - 1) - n + 2(k + 1) = k + 1
If |P'| = p + 1 and P \subset P', then
C_{P'} - C_P = 2 - SS / MSE_{k+1}
where SS is the contribution to SSR by the (p+1)th variable
Assume \varepsilon_i \sim N(0, \sigma^2); then SS / MSE_{k+1} \sim F_{1,\,n-k-1} = t_{n-k-1}^2
If the additional variable is unimportant, then B_{P'} \approx B_P, E(SS) \approx \sigma^2, and so E(C_{P'} - C_P) \approx 1
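
A quick numeric check of the two identities above, reusing X, y, k and mallows_cp from the sketches earlier; illustrative only.

```python
# full model: C_{K+} = (n - k - 1) - n + 2(k + 1) = k + 1 exactly
full = tuple(range(k + 1))
print(mallows_cp(X, y, full, k))          # equals k + 1 = 5

# nested pair P in P': the difference is 2 - SS / MSE_{k+1}
print(mallows_cp(X, y, (0, 1, 3), k) - mallows_cp(X, y, (0, 1), k))
```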
C_P plot
Figure: C_P plot with independent variables. P is an adequate subset (E(C_P) = p), p = k - 2, and K^+ \setminus P = \{1, 2, 3\} are unimportant. Every non-zero element of \beta is significant.
C_P plot
Figure: Variables 1, 2, 3 are highly explanatory and highly correlated, and the variables in P are also explanatory.
C_P plot
Figure: C_P plot. Variables 1 and 2 are jointly explanatory but separately not, and the variables in P are also explanatory, where |P| = k - 4.
Model
x_1, x_2, \dots, x_n \overset{iid}{\sim} g(x)
True model: g(x)
Candidate parametric model: f(x|\theta), \theta \in \Theta_p
MLE of \theta: \hat\theta
Fitted model: f(x|\hat\theta)
Kullback-Leibler Information
I(g; f(\cdot|\theta)) := E\left( \log \frac{g(x)}{f(x|\theta)} \right) = S(g; g) - S(g; f(\cdot|\theta))
S(g; f(\cdot|\theta)) = E(\log f(x|\theta)) = \int \log f(x|\theta)\, g(x)\, dx
I represents the separation between g and f
I \ge 0, and equality holds iff f = g a.e.
The best fitting model minimizes I(g; f(\cdot|\theta)), or equivalently, maximizes S(g; f(\cdot|\theta))
S(g; f(\cdot|\theta)) cannot be computed, since g(x) is unknown.
By the SLLN,
\frac{1}{n} \sum_{i=1}^n \log f(x_i|\theta) \to S(g; f(\cdot|\theta))
Maximizing the average log-likelihood leads to the MLE \hat\theta
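
A small simulation of the SLLN statement above: the average log-likelihood approaches S(g; f(·|θ)). Taking g = N(0, 1) and f(·|θ) = N(θ, 1) is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta = 0.5
x = rng.normal(size=100_000)                  # x_i iid from g = N(0, 1)

avg_loglik = norm.logpdf(x, loc=theta).mean() # (1/n) sum_i log f(x_i | theta)

# For these normals, S(g; f(.|theta)) = -0.5*log(2*pi) - 0.5*(1 + theta^2)
S = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + theta**2)
print(avg_loglik, S)                          # close for large n
```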
Derivation of AIC
Consider g(x) = f(x|\theta_0), \theta_0 \in \Theta_k
Denote I(g; f(\cdot|\theta)) and S(g; f(\cdot|\theta)) by I(\theta_0, \theta) and S(\theta_0, \theta)
Assume \theta_0 \notin \Theta_p (p < k), and let \theta^* = \arg\max_{\theta \in \Theta_p} S(\theta_0, \theta)
Suppose \theta^* is sufficiently close to \theta_0
Try to find the model that minimizes E(I(\theta_0, \hat\theta))
Derivation of AIC
E(2n I(\theta_0, \hat\theta)) = E\left[ 2n\left( S(\theta_0, \theta_0) - S(\theta_0, \hat\theta) \right) \right]
= E\left[ 2n\left( S(\theta_0, \theta_0) - S(\theta_0, \theta^*) + S(\theta_0, \theta^*) - S(\theta_0, \hat\theta) \right) \right]
\approx E\left[ n\|\theta^* - \theta_0\|_I^2 + n\|\hat\theta - \theta^*\|_I^2 \right] \to n\|\theta^* - \theta_0\|_I^2 + p
where \|\theta^* - \theta_0\|_I^2 = (\theta^* - \theta_0)^T I(\theta_0)(\theta^* - \theta_0) and \|\hat\theta - \theta^*\|_I^2 = (\hat\theta - \theta^*)^T I(\theta^*)(\hat\theta - \theta^*)
The last limit follows from \sqrt{n}(\hat\theta - \theta^*) \overset{d}{\to} N(0, I^{-1}(\theta^*))
Derivation of AIC
Estimate n\|\theta^* - \theta_0\|_I^2 by
2\left( \sum_{i=1}^n \log f(x_i|\theta_0) - \sum_{i=1}^n \log f(x_i|\hat\theta) \right)
This needs a bias correction: add p.
Therefore, an asymptotically unbiased estimator of E(2n I(\theta_0, \hat\theta)) is
2\left( \sum_{i=1}^n \log f(x_i|\theta_0) - \sum_{i=1}^n \log f(x_i|\hat\theta) \right) + 2p
Since the term involving \theta_0 does not depend on the candidate model, minimizing E(I(\theta_0, \hat\theta)) is equivalent to minimizing
AIC := -2 \log(\text{maximum likelihood}) + 2p
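
A minimal sketch of the AIC formula in practice, assuming a Gaussian model fitted by maximum likelihood; the data and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=200)

mu_hat, sigma_hat = x.mean(), x.std()       # MLEs for N(mu, sigma^2)
loglik = norm.logpdf(x, loc=mu_hat, scale=sigma_hat).sum()

p = 2                                       # number of fitted parameters
aic = -2 * loglik + 2 * p
print(aic)
```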
Model
x_1, x_2, \dots, x_n \overset{iid}{\sim} f(x|\theta)
True model: f(x|\theta) = \exp\{\theta\, y(x) - b(\theta)\}, where \theta \in \mathbb{R}^K
Candidate models: M_1, M_2, \dots, M_{2^K}, where M_j is a k_j-dimensional linear submanifold of the K-dimensional space
Prior: P(M = M_j) = \alpha_j; \theta_j | M_j \sim \mu_j
Posterior: f(M_j, \theta_j|x) = \frac{f(x|\theta_j)\,\alpha_j\, d\mu_j(\theta_j)}{m(x)}, where m(x) is the marginal density of x
f(M_j|x) = \frac{\alpha_j \int f(x|\theta_j)\, d\mu_j(\theta_j)}{m(x)}
Find the model that gives the highest posterior probability.
BIC
S(Y, n, j) := \log \int_{M_j} \exp\left( n(\theta Y - b(\theta)) \right) d\mu_j(\theta)
Proposition: For fixed Y and j, as n \to \infty,
S(Y, n, j) = n \sup_\theta (\theta Y - b(\theta)) - \frac{1}{2} k_j \log n + R
where the remainder R = R(Y, n, j) is bounded in n for fixed Y and j.
BIC := -2 \log(\text{maximum likelihood}) + p \log n
where p is the number of parameters in the candidate model
Select the model that minimizes BIC (equivalently, maximizes S(Y, n, j))
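
Continuing the Gaussian AIC sketch above, BIC swaps the 2p penalty for p log n; again an illustrative sketch, not the original authors' code.

```python
# uses x, loglik, p, and aic from the AIC sketch above
n = len(x)
bic = -2 * loglik + p * np.log(n)
print(aic, bic)   # for n >= 8 the BIC penalty exceeds the AIC penalty
```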
BIC
Consider the general situation in which f(x|\theta) need not be an exponential family distribution.
-2 \log f(M_j|x) = 2 \log m(x) \underbrace{-\, 2 \log \alpha_j - 2 \log \int f(x|\theta_j)\, d\mu_j(\theta_j)}_{\tilde S(M_j|x)}
Select the model that minimizes \tilde S(M_j|x)
BIC
By a second order Taylor expansion,
\log f(x|\theta_j) = \log L(\theta_j) \approx \log L(\hat\theta_j) - \frac{1}{2} (\theta_j - \hat\theta_j)^T \left[ n \hat I(\hat\theta_j) \right] (\theta_j - \hat\theta_j)
where
\hat I(\hat\theta_j) = -\frac{1}{n} \left. \frac{\partial^2 \log L(\theta_j)}{\partial \theta_j^2} \right|_{\theta_j = \hat\theta_j}
If we have the noninformative prior d\mu_j(\theta_j) = d\theta_j, then
\int L(\theta_j)\, d\mu_j(\theta_j) \approx L(\hat\theta_j) \int \exp\left( -\frac{1}{2} (\theta_j - \hat\theta_j)^T \left[ n \hat I(\hat\theta_j) \right] (\theta_j - \hat\theta_j) \right) d\theta_j = L(\hat\theta_j) \left( \frac{2\pi}{n} \right)^{k_j/2} \left| \hat I(\hat\theta_j) \right|^{-1/2}
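
A numerical sanity check of the Laplace approximation above in one dimension. The Bernoulli likelihood and flat prior are assumptions chosen for illustration.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=100)
n, s = len(x), x.sum()

loglik = lambda th: s * np.log(th) + (n - s) * np.log(1 - th)
th_hat = s / n                                            # MLE
# observed information per observation at the MLE
I_hat = (s / th_hat**2 + (n - s) / (1 - th_hat)**2) / n

exact, _ = quad(lambda th: np.exp(loglik(th)), 0, 1)      # int L(theta) d theta
laplace = np.exp(loglik(th_hat)) * np.sqrt(2 * np.pi / (n * I_hat))
print(exact, laplace)                                     # close for moderate n
```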
BIC
Therefore,
\tilde S(M_j|x) \approx -2 \log \alpha_j - 2 \log \left[ L(\hat\theta_j) \left( \frac{2\pi}{n} \right)^{k_j/2} \left| \hat I(\hat\theta_j) \right|^{-1/2} \right]
= -2 \log \alpha_j - 2 \log L(\hat\theta_j) + k_j \log \frac{n}{2\pi} + \log \left| \hat I(\hat\theta_j) \right|
Ignoring the terms that remain bounded as n \to \infty, we obtain
\tilde S(M_j|x) \approx -2 \log L(\hat\theta_j) + k_j \log n =: BIC
Comparison between AIC and BIC
AIC = -2 \log(\text{maximum likelihood}) + 2p
BIC = -2 \log(\text{maximum likelihood}) + p \log n
BIC is consistent yet not asymptotically efficient; AIC is asymptotically efficient but not consistent.
Consistency: Suppose the true model has finite dimension and is nested in the candidate collection under consideration. A consistent criterion asymptotically selects the true model with probability one.
Efficiency: Suppose the true model has infinite dimension and therefore lies outside the candidate collection under consideration. An asymptotically efficient criterion asymptotically selects the model that minimizes the mean squared error of prediction.
Comparison between AIC and BIC
AIC = -2 \log(\text{maximum likelihood}) + 2p
BIC = -2 \log(\text{maximum likelihood}) + p \log n
The penalty term of BIC is more stringent than that of AIC: for n \ge 8, p \log n exceeds 2p.
Therefore, BIC favors smaller models than AIC. A quick check of the crossover appears below.
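
A one-liner confirming the n ≥ 8 crossover claimed above.

```python
import numpy as np

for n in (7, 8, 20, 100):
    print(n, np.log(n) > 2)   # p*log(n) > 2p exactly when log(n) > 2, i.e. n >= 8
```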
References
Mallows, C. L. (1973), "Some Comments on C_p", Technometrics 15(4): 661-675.
Akaike, H. (1974), "A New Look at the Statistical Model Identification", IEEE Transactions on Automatic Control AC-19(6): 716-723.
Schwarz, G. (1978), "Estimating the Dimension of a Model", The Annals of Statistics 6(2): 461-464.
Proof of Proposition
Proposition: For fixed Y and j, as n \to \infty,
S(Y, n, j) = n \sup_\theta (\theta Y - b(\theta)) - \frac{1}{2} k_j \log n + R
where the remainder R = R(Y, n, j) is bounded as n goes to \infty for fixed Y and j.
Lemma 1. The proposition holds when \theta Y - b(\theta) = A - \lambda\|\theta - \theta_0\|^2, where \lambda > 0, \theta_0 is a fixed vector in M_j, and \mu_j is the Lebesgue measure on M_j.
Lemma 2. If two bounded positive random variables U and V agree on the set where either exceeds \varepsilon, for some 0 < \varepsilon < \sup U, then \log E(U^n) - \log E(V^n) \to 0 as n \to \infty.
Lemma 3. For some 0 < \varepsilon < e^A, where A = \sup_\theta (\theta Y - b(\theta)), a vector \theta_0, and some positive \lambda_1 and \lambda_2, the following holds wherever \exp(\theta Y - b(\theta)) > \varepsilon:
A - \lambda_1 \|\theta - \theta_0\|^2 < \theta Y - b(\theta) < A - \lambda_2 \|\theta - \theta_0\|^2
Proof of Lemma 1
Lemma 1. The proposition holds when \theta Y - b(\theta) = A - \lambda\|\theta - \theta_0\|^2, where \lambda > 0, \theta_0 is a fixed vector in M_j, and \mu_j is the Lebesgue measure on M_j.
Proof: The Gaussian integral gives
S(Y, n, j) = \log\left( \alpha_j e^{nA} \left( \frac{\pi}{\lambda n} \right)^{k_j/2} \right) = nA - \frac{1}{2} k_j \log n + \frac{1}{2} k_j \log \frac{\pi}{\lambda} + \log \alpha_j
and \sup_\theta \left( A - \lambda\|\theta - \theta_0\|^2 \right) = A.
This establishes the proposition, with R = \frac{1}{2} k_j \log \frac{\pi}{\lambda} + \log \alpha_j.
Proof of Lemma 2
Lemma 2. If two bounded positive random variables U and V agree on the set where either exceeds \varepsilon, for some 0 < \varepsilon < \sup U, then \log E(U^n) - \log E(V^n) \to 0 as n \to \infty.
Proof: It suffices to show that this holds for a V that vanishes wherever U \le \varepsilon. In this case 0 \le U^n - V^n \le \varepsilon^n, and therefore
E(V^n) \le E(U^n) \le E(V^n) + \varepsilon^n = E(V^n)\left( 1 + \frac{\varepsilon^n}{E(V^n)} \right)
From (E(V^n))^{1/n} \to \sup V and \sup V = \sup U > \varepsilon, we know \varepsilon/(E(V^n))^{1/n} < 1 in the limit, hence \frac{\varepsilon^n}{E(V^n)} \to 0, and hence \log\left( 1 + \frac{\varepsilon^n}{E(V^n)} \right) \to 0, which establishes the result.
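
A tiny numeric illustration of Lemma 2, with two discrete random variables that agree above ε = 0.5; the specific values are assumptions chosen for the example.

```python
import numpy as np

# U takes values {0.2, 0.9} with equal probability; V agrees with U above
# eps = 0.5 and vanishes below it
vals_U = np.array([0.2, 0.9])
vals_V = np.array([0.0, 0.9])
probs = np.array([0.5, 0.5])

for n in (1, 5, 20, 50):
    EU = (vals_U**n * probs).sum()
    EV = (vals_V**n * probs).sum()
    print(n, np.log(EU) - np.log(EV))   # -> 0 as n grows
```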
Proof of Lemma 3
Lemma 3. For some 0 < \varepsilon < e^A, where A = \sup_\theta (\theta Y - b(\theta)), a vector \theta_0, and some positive \lambda_1 and \lambda_2, the following holds wherever \exp(\theta Y - b(\theta)) > \varepsilon:
A - \lambda_1\|\theta - \theta_0\|^2 < \theta Y - b(\theta) < A - \lambda_2\|\theta - \theta_0\|^2
Proof: Since \nabla^2 b(\theta) = \mathrm{Cov}(y), it is positive definite. Therefore \theta Y - b(\theta) is strictly concave and attains its maximum. Let \theta_0 = \arg\max_\theta (\theta Y - b(\theta)). A second order Taylor expansion at \theta_0 yields
\theta Y - b(\theta) \approx A - \frac{1}{2} (\theta - \theta_0)^T \nabla^2 b(\theta_0) (\theta - \theta_0)
Choose \lambda_1 and \lambda_2 so that 2\lambda_1 is larger and 2\lambda_2 is smaller than every eigenvalue of \nabla^2 b(\theta_0); the two bounds then hold in a neighborhood of \theta_0. It is now easy to choose \varepsilon < e^A so that \varepsilon bounds \exp(\theta Y - b(\theta)) outside that neighborhood.