A. Learning kernels
B. Cross-Validation
1. The cross-validation (CV) solution is obtained as follows. Suppose the learner receives an i.i.d. sample $S$ of size $m \ge 1$. He randomly divides $S$ into a sample $S_1$ of size $(1-\alpha)m$ and a sample $S_2$ of size $\alpha m$, where $\alpha \in (0, 1)$, with $\alpha$ typically small. $S_1$ is used for training, $S_2$ for validation. For any $k \in \mathbb{N}$, let $\widehat{h}_k$ denote the solution of ERM run on $S_1$ using hypothesis set $H_k$. The learner then uses sample $S_2$ to return the CV solution
\[
f_{\mathrm{CV}} = \operatorname*{argmin}_{k \in \mathbb{N}} \widehat{R}_{S_2}(\widehat{h}_k).
\]
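To make the selection rule concrete, here is a minimal Python sketch of the procedure just described, assuming a user-supplied ERM routine and a finite (truncated) list of hypothesis sets; the names `erm` and `hypothesis_sets` are hypothetical stand-ins, not from the source.

```python
import numpy as np

def cross_validation_select(S, erm, hypothesis_sets, alpha=0.1, seed=None):
    """Return the CV solution f_CV = argmin_k R_hat_{S2}(h_k).

    S: list of (x, y) pairs, an i.i.d. sample of size m.
    erm(H_k, S1): hypothetical routine returning the ERM solution over H_k on S1.
    hypothesis_sets: finite list standing in for H_1, H_2, ... (truncated for code).
    """
    rng = np.random.default_rng(seed)
    m = len(S)
    perm = rng.permutation(m)
    split = round((1 - alpha) * m)
    S1 = [S[i] for i in perm[:split]]   # training sample, size (1 - alpha) m
    S2 = [S[i] for i in perm[split:]]   # validation sample, size alpha m

    def empirical_error(h, sample):
        # zero-one empirical error R_hat_sample(h)
        return sum(h(x) != y for x, y in sample) / len(sample)

    # Run ERM in each H_k on S1, then pick the candidate minimizing validation error.
    candidates = [erm(H_k, S1) for H_k in hypothesis_sets]
    return min(candidates, key=lambda h: empirical_error(h, S2))
```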
Hypothesis $\widehat{h}_k$ is fixed conditioned on $S_1$. Furthermore, sample $S_2$ is independent from sample $S_1$; therefore, by Hoeffding's inequality, we can bound the conditional probability by
\[
\mathbb{P}\bigg[\Big|R(\widehat{h}_k) - \widehat{R}_{S_2}(\widehat{h}_k)\Big| > \epsilon + \sqrt{\frac{\log k}{\alpha m}} \;\bigg|\; S_1\bigg]
\le 2\, e^{-2\alpha m \left(\epsilon + \sqrt{\log k/(\alpha m)}\right)^2}
\le 2\, e^{-2\alpha m \epsilon^2 - 2\log k}
= \frac{2}{k^2}\, e^{-2\alpha m \epsilon^2}.
\]
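The second inequality holds because expanding the square and dropping the nonnegative cross term gives
\[
2\alpha m \Big(\epsilon + \sqrt{\tfrac{\log k}{\alpha m}}\Big)^2
= 2\alpha m \epsilon^2 + 4\epsilon\sqrt{\alpha m \log k} + 2\log k
\;\ge\; 2\alpha m \epsilon^2 + 2\log k .
\]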
Replacing this bound in (1) and summing over $k$, we obtain
\[
\mathbb{P}\bigg[\sup_{k \ge 1}\Big(\big|R(\widehat{h}_k) - \widehat{R}_{S_2}(\widehat{h}_k)\big| - \sqrt{\tfrac{\log k}{\alpha m}}\Big) > \epsilon\bigg]
\le \sum_{k \ge 1} \frac{2}{k^2}\, e^{-2\alpha m \epsilon^2}
= \frac{\pi^2}{3}\, e^{-2\alpha m \epsilon^2}
< 4\, e^{-2\alpha m \epsilon^2}.
\]
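The last step uses $\sum_{k \ge 1} 2/k^2 = \pi^2/3 \approx 3.29 < 4$; a throwaway Python check (not from the source):

```python
import math
# Partial sum of sum_{k >= 1} 2/k^2, whose limit is pi^2/3 ≈ 3.2899 < 4.
print(sum(2 / k**2 for k in range(1, 10**6)))  # ≈ 3.289866
print(math.pi**2 / 3)                          # ≈ 3.289868
```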
2. Let $R(f_{\mathrm{SRM}}, S_1)$ be the generalization error of the SRM solution using a sample $S_1$ of size $(1-\alpha)m$ and $R(f_{\mathrm{CV}}, S)$ the generalization error of the cross-validation solution using a sample $S$ of size $m$. Use the previous question to prove that for any $\delta > 0$, with probability at least $1-\delta$ the following holds:
\[
R(f_{\mathrm{CV}}, S) - R(f_{\mathrm{SRM}}, S_1)
\le 2\sqrt{\frac{\log\frac{4}{\delta}}{2\alpha m}}
+ 2\sqrt{\frac{\log \max\big(k(f_{\mathrm{CV}}), k(f_{\mathrm{SRM}})\big)}{\alpha m}},
\]
where, as for the notation used in class, for any $h$, $k(h)$ denotes the smallest index of a hypothesis set containing $h$. Comment on the bound derived: point out both the usefulness it suggests for CV and its possible drawback in some bad cases.
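To see how the right-hand side behaves, here is a small Python sketch that simply evaluates it; the function name and the parameter `k_max` (standing in for $\max(k(f_{\mathrm{CV}}), k(f_{\mathrm{SRM}}))$) are hypothetical:

```python
import math

def cv_srm_gap_bound(m, alpha, delta, k_max):
    """Evaluate 2*sqrt(log(4/delta)/(2*alpha*m)) + 2*sqrt(log(k_max)/(alpha*m))."""
    return (2 * math.sqrt(math.log(4 / delta) / (2 * alpha * m))
            + 2 * math.sqrt(math.log(k_max) / (alpha * m)))

# The gap degrades only logarithmically in k_max ...
print(cv_srm_gap_bound(m=10_000, alpha=0.1, delta=0.05, k_max=10))
print(cv_srm_gap_bound(m=10_000, alpha=0.1, delta=0.05, k_max=10_000))
# ... but grows like 1/sqrt(alpha) as the validation fraction shrinks.
print(cv_srm_gap_bound(m=10_000, alpha=0.01, delta=0.05, k_max=10))
```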
3. Solution: From the fact that $\widehat{R}_{S_1}(\widehat{h}_{k+1}) < \widehat{R}_{S_1}(\widehat{h}_k)$ it follows that $\widehat{R}_{S_1}(\widehat{h}_{k+1}) \le \widehat{R}_{S_1}(\widehat{h}_k) - \frac{1}{m}$, since empirical errors on $S_1$, a sample of $(1-\alpha)m \le m$ points, are integer multiples of $\frac{1}{(1-\alpha)m} \ge \frac{1}{m}$. Thus, by induction, $\widehat{R}_{S_1}(\widehat{h}_n) = 0$ for all $n \ge m+1$. This implies that $\widehat{h}_n = \widehat{h}_{m+1}$ for all $n \ge m+1$, and therefore we may assume that $k(f_{\mathrm{CV}}) \le m+1$; since the complexity of $H_k$ increases with $k$, we also have $k(f_{\mathrm{SRM}}) \le m+1$. In view of this, we obtain the more explicit bound
\[
R(f_{\mathrm{CV}}, S) - R(f_{\mathrm{SRM}}, S_1)
\le 2\sqrt{\frac{\log\frac{4}{\delta}}{2\alpha m}}
+ 2\sqrt{\frac{\log(m+1)}{\alpha m}}.
\]
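For illustration, this explicit bound can be evaluated directly (values arbitrary, snippet not from the source):

```python
import math

m, alpha, delta = 10_000, 0.1, 0.05
# Explicit bound: k(f_CV), k(f_SRM) <= m + 1, so k_max = m + 1.
bound = (2 * math.sqrt(math.log(4 / delta) / (2 * alpha * m))
         + 2 * math.sqrt(math.log(m + 1) / (alpha * m)))
print(bound)  # ≈ 0.29 for these values
```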
C. CRF
In class, we discussed the learning algorithms for CRF in the case of bigram
features.
Each node is of the form $(y_{k-n+1}, \ldots, y_k, k)$, and there will be an edge between nodes $(y'_{k-n}, \ldots, y'_{k-1}, k-1)$ and $(y_{k-n+1}, \ldots, y_k, k)$ if and only if
\[
y'_i = y_i \quad \text{for all } i \in \{k-n+1, \ldots, k-1\}.
\]
The number of nodes in each column is $r^{n-1}$. Moreover, each node in column $k-1$ can be matched with exactly $r$ nodes in column $k$. Therefore each column has $r^n$ edges. A similar argument shows that for $k \le n-1$ the $k$-th column has $r^k$ edges and the $0$-th column has $r$ edges. Therefore the whole graph has
\[
\big(l - (n-1)\big)\, r^n + \sum_{k=1}^{n-1} r^k + r
= \big(l - (n-1)\big)\, r^n + r\,\frac{r^{n-1} - 1}{r - 1} + r
\]
edges.
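A quick brute-force check of the geometric-series simplification above (values of $r$ and $n$ arbitrary, snippet not from the source):

```python
# Verify sum_{k=1}^{n-1} r^k == r * (r^(n-1) - 1) / (r - 1) for small r, n.
for r in (2, 3, 5):
    for n in (2, 3, 4, 6):
        lhs = sum(r**k for k in range(1, n))
        rhs = r * (r**(n - 1) - 1) // (r - 1)   # exact integer division
        assert lhs == rhs, (r, n, lhs, rhs)
print("geometric-series identity verified")
```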