
Mehryar Mohri

Advanced Machine Learning 2015


Courant Institute of Mathematical Sciences
Homework assignment 1
February 27, 2015
Due: March 16, 2015

A. Learning kernels

Consider the learning kernel optimization problem based on SVM:
\[
\min_{\mu \in \Delta} \; \max_{\alpha} \; 2\,\alpha^\top \mathbf{1} - \alpha^\top Y^\top K_\mu Y \alpha
\]
subject to: $0 \leq \alpha \leq C \;\wedge\; \alpha^\top y = 0$,

where $K_\mu = \sum_{j=1}^{n} \mu_j K_j$, $\Delta = \{\mu \colon \|\mu\|_\infty \leq 1,\ \mu \geq 0\}$, and where for the rest the notation
is the one used in the lecture slides. Show that its solution coincides with
the SVM solution for the uniform combination kernel.
Solution: Let $f(\mu, \alpha) = 2\,\alpha^\top \mathbf{1} - \sum_{j=1}^{n} \mu_j\, \alpha^\top Y^\top K_j Y \alpha$ and let $\mathcal{C} = \{\alpha \colon 0 \leq \alpha \leq C \wedge \alpha^\top y = 0\}$. The function $f$ is linear in $\mu$ and therefore convex in $\mu$. Since each $K_j$ is positive definite, each quadratic form $\alpha \mapsto \alpha^\top Y^\top K_j Y \alpha$ is convex and enters $f$ with a negative sign, so $f$ is a concave function of $\alpha$. Furthermore, $\mathcal{C}$ and $\Delta$ are both compact convex sets. By the minimax theorem we therefore have
\[
\min_{\mu \in \Delta} \max_{\alpha \in \mathcal{C}} f(\mu, \alpha) = \max_{\alpha \in \mathcal{C}} \min_{\mu \in \Delta} f(\mu, \alpha).
\]
Using the definition of $f$ we see that
\[
\min_{\mu \in \Delta} f(\mu, \alpha)
= \min_{\mu \in \Delta} \Big( 2\,\alpha^\top \mathbf{1} - \sum_{j=1}^{n} \mu_j\, \alpha^\top Y^\top K_j Y \alpha \Big)
= 2\,\alpha^\top \mathbf{1} - \max_{\mu \in \Delta} \sum_{j=1}^{n} \mu_j\, \alpha^\top Y^\top K_j Y \alpha .
\]
Since $\alpha^\top Y^\top K_j Y \alpha \geq 0$ for all $j$, the above maximum is attained at $\mu_j = 1$ for all $j$. Therefore,
\[
\min_{\mu \in \Delta} \max_{\alpha \in \mathcal{C}} f(\mu, \alpha)
= \max_{\alpha \in \mathcal{C}} \; 2\,\alpha^\top \mathbf{1} - \alpha^\top Y^\top \Big( \sum_{j=1}^{n} K_j \Big) Y \alpha ,
\]
which is exactly the SVM dual problem for the uniform combination kernel $\sum_{j=1}^{n} K_j$, so the solution coincides with the SVM solution for that kernel.
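As a quick numerical sanity check (not part of the required solution), the following Python sketch draws random positive semidefinite base kernels and verifies that, over the box $\Delta$, the inner objective is minimized at $\mu = (1, \ldots, 1)$, i.e., at the uniform combination; the kernel construction and the choice of $\alpha$ are arbitrary illustrations.
\begin{verbatim}
import numpy as np

# Check: with PSD base kernels K_j, the coefficients alpha^T Y K_j Y alpha are
# nonnegative, so f(mu, alpha) = 2 alpha^T 1 - sum_j mu_j alpha^T Y K_j Y alpha
# is minimized over the box Delta = {0 <= mu_j <= 1} at mu = (1, ..., 1).
rng = np.random.default_rng(0)
m, n_kernels, C = 20, 4, 1.0

Ks = []
for _ in range(n_kernels):
    A = rng.standard_normal((m, m))
    Ks.append(A @ A.T)                              # PSD base kernel by construction
y = rng.choice([-1.0, 1.0], size=m)
Y = np.diag(y)

def f(mu, alpha):
    quad = sum(mu_j * (alpha @ Y @ K @ Y @ alpha) for mu_j, K in zip(mu, Ks))
    return 2.0 * alpha.sum() - quad

alpha = rng.uniform(0.0, C, size=m)                 # any alpha >= 0 works for this check
mu_uniform = np.ones(n_kernels)
for _ in range(1000):                               # random points of the box Delta
    mu = rng.uniform(0.0, 1.0, size=n_kernels)
    assert f(mu_uniform, alpha) <= f(mu, alpha) + 1e-9
print("minimum over the box attained at mu = 1; value:", f(mu_uniform, alpha))
\end{verbatim}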

B. Cross-Validation

The objective of this problem is to derive a learning bound for cross-validation


comparing its performance to that of SRM. Let $(H_k)_{k \in \mathbb{N}}$ be a countable sequence of hypothesis sets with increasing complexities.

The cross-validation (CV) solution is obtained as follows. Suppose the learner receives an i.i.d. sample $S$ of size $m \geq 1$. He randomly divides $S$ into a sample $S_1$ of size $(1-\alpha)m$ and a sample $S_2$ of size $\alpha m$, where $\alpha$ is in $(0, 1)$, with $\alpha$ typically small. $S_1$ is used for training, $S_2$ for validation. For any $k \in \mathbb{N}$, let $\widehat{h}_k$ denote the solution of ERM run on $S_1$ using hypothesis set $H_k$. The learner then uses sample $S_2$ to return the CV solution $f_{\mathrm{CV}} = \operatorname*{argmin}_{k \in \mathbb{N}} \widehat{R}_{S_2}(\widehat{h}_k)$.
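For concreteness (this is only an illustration, not part of the problem), here is a minimal Python sketch of this selection procedure in which $H_k$ is taken to be degree-$k$ polynomial threshold classifiers on the real line and a least-squares fit is used as a stand-in for exact ERM; the data distribution and all numerical choices are assumptions made for the example.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def zero_one_error(h, X, y):
    return np.mean(h(X) != y)

# Toy data: x in [-1, 1], labels given by the sign of a wavy function, plus noise.
m, alpha = 1000, 0.2
X = rng.uniform(-1.0, 1.0, size=m)
y = np.sign(np.sin(4.0 * X))
y[rng.random(m) < 0.1] *= -1.0                     # 10% label noise

# Split S into S1 (training, size (1 - alpha) m) and S2 (validation, size alpha m).
perm = rng.permutation(m)
n1 = int((1 - alpha) * m)
X1, y1 = X[perm[:n1]], y[perm[:n1]]
X2, y2 = X[perm[n1:]], y[perm[n1:]]

def fit_poly_classifier(deg):
    # Stand-in for ERM over H_deg: least-squares polynomial fit, then threshold.
    coeffs = np.polyfit(X1, y1, deg)
    return lambda x: np.where(np.polyval(coeffs, x) >= 0.0, 1.0, -1.0)

# One hypothesis h_k trained on S1 per hypothesis set H_k, then the CV solution
# minimizes the empirical error on S2 over k.
hs = {k: fit_poly_classifier(k) for k in range(1, 13)}
val_err = {k: zero_one_error(hs[k], X2, y2) for k in hs}
k_cv = min(val_err, key=val_err.get)
f_cv = hs[k_cv]
print("CV selects k =", k_cv, "with validation error", val_err[k_cv])
\end{verbatim}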

1. Prove the following inequality:
\[
\Pr\Big[ \sup_{k \geq 1} \Big( R(\widehat{h}_k) - \widehat{R}_{S_2}(\widehat{h}_k) \Big) > \epsilon + \sqrt{\tfrac{\log k}{\alpha m}} \Big] \leq 4\, e^{-2 \alpha m \epsilon^2}.
\]

Solution: By the union bound we have
\begin{align}
\Pr\Big[ \sup_{k \geq 1} \Big( R(\widehat{h}_k) - \widehat{R}_{S_2}(\widehat{h}_k) \Big) > \epsilon + \sqrt{\tfrac{\log k}{\alpha m}} \Big]
&\leq \sum_{k=1}^{\infty} \Pr\Big[ R(\widehat{h}_k) - \widehat{R}_{S_2}(\widehat{h}_k) > \epsilon + \sqrt{\tfrac{\log k}{\alpha m}} \Big] \nonumber \\
&= \sum_{k=1}^{\infty} \mathbb{E}\Big[ \Pr\Big[ R(\widehat{h}_k) - \widehat{R}_{S_2}(\widehat{h}_k) > \epsilon + \sqrt{\tfrac{\log k}{\alpha m}} \,\Big|\, S_1 \Big] \Big]. \tag{1}
\end{align}
Hypothesis $\widehat{h}_k$ is fixed conditioned on $S_1$. Furthermore, sample $S_2$ is independent of sample $S_1$; therefore, by Hoeffding's inequality applied to the $\alpha m$ points of $S_2$, we can bound the conditional probability by
\[
\Pr\Big[ R(\widehat{h}_k) - \widehat{R}_{S_2}(\widehat{h}_k) > \epsilon + \sqrt{\tfrac{\log k}{\alpha m}} \,\Big|\, S_1 \Big]
\leq 2\, e^{-2 \alpha m \big(\epsilon + \sqrt{\log k / (\alpha m)}\big)^2}
\leq 2\, e^{-2 \alpha m \epsilon^2 - 2 \log k}
= \frac{2}{k^2}\, e^{-2 \alpha m \epsilon^2},
\]
where the second inequality uses $(a+b)^2 \geq a^2 + b^2$ for $a, b \geq 0$. Replacing this bound in (1) and summing over $k$ we obtain
\[
\Pr\Big[ \sup_{k \geq 1} \Big( R(\widehat{h}_k) - \widehat{R}_{S_2}(\widehat{h}_k) \Big) > \epsilon + \sqrt{\tfrac{\log k}{\alpha m}} \Big]
\leq \sum_{k=1}^{\infty} \frac{2}{k^2}\, e^{-2 \alpha m \epsilon^2}
= \frac{\pi^2}{3}\, e^{-2 \alpha m \epsilon^2}
< 4\, e^{-2 \alpha m \epsilon^2}.
\]
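The inequality can also be sanity-checked by simulation. In the sketch below (my own illustration), the hypotheses $\widehat{h}_k$ are replaced by fixed Bernoulli loss rates, so that the conditioning on $S_1$ is trivial; the loss rates and the values of $\alpha$, $m$, $\epsilon$ are arbitrary.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

alpha, m, eps = 0.2, 500, 0.15
n2 = int(alpha * m)                                # size of the validation sample S2
K = 50                                             # truncate the supremum to k <= K
true_risk = rng.uniform(0.1, 0.5, size=K)          # stand-ins for R(h_k), fixed given S1
slack = np.sqrt(np.log(np.arange(1, K + 1)) / (alpha * m))

trials, exceed = 20000, 0
for _ in range(trials):
    # Empirical risks on a fresh S2: each is an average of n2 Bernoulli losses.
    emp_risk = rng.binomial(n2, true_risk) / n2
    if np.max(true_risk - emp_risk - slack) > eps:
        exceed += 1

print("estimated tail probability:", exceed / trials)
print("bound 4 exp(-2 alpha m eps^2):", 4 * np.exp(-2 * alpha * m * eps ** 2))
\end{verbatim}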

2. Let $R(f_{\mathrm{SRM}}, S_1)$ be the generalization error of the SRM solution using a sample $S_1$ of size $(1-\alpha)m$ and $R(f_{\mathrm{CV}}, S)$ the generalization error of the cross-validation solution using a sample $S$ of size $m$. Use the previous question to prove that for any $\delta > 0$, with probability at least $1 - \delta$ the following holds:
\[
R(f_{\mathrm{CV}}, S) - R(f_{\mathrm{SRM}}, S_1) \leq 2\sqrt{\frac{\log\frac{4}{\delta}}{2\alpha m}} + 2\sqrt{\frac{\log \max(k(f_{\mathrm{CV}}), k(f_{\mathrm{SRM}}))}{\alpha m}},
\]
where, as for the notation used in class, for any $h$, $k(h)$ denotes the smallest index of a hypothesis set containing $h$. Comment on the bound derived: point out both the usefulness it suggests for CV and its possible drawback in some bad cases.

Solution: In view of the previous bound we know that with probability at least $1 - \delta$ the following holds:
\begin{align*}
R(f_{\mathrm{CV}}, S)
&\leq \widehat{R}_{S_2}(f_{\mathrm{CV}}) + \sqrt{\frac{\log\frac{4}{\delta}}{2\alpha m}} + \sqrt{\frac{\log k(f_{\mathrm{CV}})}{\alpha m}} \\
&\leq \widehat{R}_{S_2}(f_{\mathrm{SRM}}) + \sqrt{\frac{\log\frac{4}{\delta}}{2\alpha m}} + \sqrt{\frac{\log k(f_{\mathrm{CV}})}{\alpha m}} \\
&\leq R(f_{\mathrm{SRM}}, S_1) + 2\sqrt{\frac{\log\frac{4}{\delta}}{2\alpha m}} + \sqrt{\frac{\log k(f_{\mathrm{CV}})}{\alpha m}} + \sqrt{\frac{\log k(f_{\mathrm{SRM}})}{\alpha m}} \\
&\leq R(f_{\mathrm{SRM}}, S_1) + 2\sqrt{\frac{\log\frac{4}{\delta}}{2\alpha m}} + 2\sqrt{\frac{\log \max(k(f_{\mathrm{CV}}), k(f_{\mathrm{SRM}}))}{\alpha m}},
\end{align*}
where we have used the definition of $f_{\mathrm{CV}}$ as a minimizer of $\widehat{R}_{S_2}$ over the $\widehat{h}_k$ for the second inequality, and the third inequality follows again from the previously derived bound. The bound tells us that the generalization error of the classifier obtained through cross-validation will be close to that of the classifier obtained through SRM trained on $S_1$. However, we are ultimately interested in the SRM solution obtained by training on all $m$ points, and the error of training on only $(1-\alpha)m$ points could be, in some bad cases, much worse than the error of training on $m$ points.
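To get a feel for the magnitude of the two penalty terms, the short snippet below evaluates them for a few illustrative values of $m$ (the values of $\alpha$, $\delta$ and of the indices are arbitrary assumptions).
\begin{verbatim}
import numpy as np

# Penalty terms of the bound
#   R(f_CV, S) - R(f_SRM, S1)
#     <= 2 sqrt(log(4/delta) / (2 alpha m)) + 2 sqrt(log(max(k_CV, k_SRM)) / (alpha m))
# for a few illustrative sample sizes.
alpha, delta, k_max = 0.1, 0.05, 100               # k_max stands for max(k(f_CV), k(f_SRM))
for m in (1_000, 10_000, 100_000):
    term_conf = 2 * np.sqrt(np.log(4 / delta) / (2 * alpha * m))
    term_index = 2 * np.sqrt(np.log(k_max) / (alpha * m))
    print(f"m = {m:>7}: confidence term = {term_conf:.4f}, index term = {term_index:.4f}")
\end{verbatim}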
3. Suppose that $\widehat{R}_{S_1}(\widehat{h}_{k+1}) < \widehat{R}_{S_1}(\widehat{h}_k)$ for all $k$ such that $\widehat{R}_{S_1}(\widehat{h}_k) > 0$, and that $\widehat{R}_{S_1}(\widehat{h}_{k+1}) \leq \widehat{R}_{S_1}(\widehat{h}_k)$ otherwise. Show that we can then restrict the analysis to the $H_k$s with $k \leq m + 1$ and give a more explicit guarantee similar to that of the previous question.

Solution: Since the empirical error on $S_1$ takes values that are integer multiples of $\frac{1}{(1-\alpha)m} \geq \frac{1}{m}$, the fact that $\widehat{R}_{S_1}(\widehat{h}_{k+1}) < \widehat{R}_{S_1}(\widehat{h}_k)$ implies $\widehat{R}_{S_1}(\widehat{h}_{k+1}) \leq \widehat{R}_{S_1}(\widehat{h}_k) - \frac{1}{m}$. Thus, by induction, $\widehat{R}_{S_1}(\widehat{h}_n) = 0$ for all $n \geq m+1$. This implies that we may take $\widehat{h}_n = \widehat{h}_{m+1}$ for all $n \geq m+1$, and therefore we may assume that $k(f_{\mathrm{CV}}) \leq m+1$; since the complexity of $H_k$ increases with $k$, we also have $k(f_{\mathrm{SRM}}) \leq m+1$. In view of this, we obtain the more explicit bound
\[
R(f_{\mathrm{CV}}, S) - R(f_{\mathrm{SRM}}, S_1) \leq 2\sqrt{\frac{\log\frac{4}{\delta}}{2\alpha m}} + 2\sqrt{\frac{\log(m+1)}{\alpha m}}.
\]

C. CRF

In class, we discussed the learning algorithms for CRF in the case of bigram
features.

1. Write the expression of the features in the case of $n$-grams (arbitrary $n \geq 1$).

Solution: Following the notation used in class, we can write the expression for $n$-grams as
\[
\Psi(x, y) = \big( \psi(x, 1, y_1), \ldots, \psi(x, k, y_{k-n+1}, \ldots, y_k), \ldots, \psi(x, l, y_{l-n+1}, \ldots, y_l) \big).
\]
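As an illustration (not part of the required answer), the following sketch instantiates these components as sparse indicator features of the label $n$-gram ending at each position, aggregated into a single count vector, which is one common concrete representation; the specific feature templates are my own choices.
\begin{verbatim}
from collections import Counter

def ngram_features(x, y, n):
    # Psi(x, y) represented as a sparse count vector: one indicator per label
    # n-gram ending at each position k (shorter grams near the start of the
    # sequence, mirroring the psi(x, 1, y_1) component), plus a simple
    # input-label feature.
    feats = Counter()
    for k in range(len(y)):
        gram = tuple(y[max(0, k - n + 1): k + 1])   # labels y_{k-n+1}, ..., y_k
        feats[("label-gram", gram)] += 1
        feats[("emission", x[k], y[k])] += 1
    return feats

# Example: trigram features (n = 3) for a short tagged sentence.
x = ["the", "dog", "barks"]
y = ["DET", "NOUN", "VERB"]
print(ngram_features(x, y, n=3))
\end{verbatim}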

2. Describe explicitly the key graph-based algorithms in the case of $n$-grams. What is the running-time complexity of these algorithms?

Solution: In order to find
\[
\operatorname*{argmax}_{y \in \mathcal{Y}^l} \; w \cdot \Psi(x, y)
= \operatorname*{argmax}_{y \in \mathcal{Y}^l} \; \sum_{k=1}^{l} w \cdot \psi(x, k, y_{k-n+1}, \ldots, y_k),
\]
where $\mathcal{Y}$ denotes the label set and $r = |\mathcal{Y}|$, we use again a single-source shortest-distance algorithm. In the case of $n$-grams the corresponding graph has
\[
r^n\,(l - (n-1)) + r\,\frac{r^{n-1}-1}{r-1} + r
\]
edges. In order to see this, notice that nodes in column $k$ are of the form
\[
(y_{k-n+2}, \ldots, y_k, k),
\]
and there is an edge between nodes $(y'_{k-n+1}, \ldots, y'_{k-1}, k-1)$ and $(y_{k-n+2}, \ldots, y_k, k)$ if and only if
\[
y'_i = y_i \quad \text{for all } i \in \{k-n+2, \ldots, k-1\}.
\]
This edge corresponds to the dot product $w \cdot \psi(x, k, y'_{k-n+1}, y_{k-n+2}, \ldots, y_k)$.


It is easy to verify that for $k \geq n-1$ the number of nodes in each column is $r^{n-1}$. Moreover, each node in column $k-1$ can be matched with exactly $r$ nodes in column $k$; therefore each such column has $r^n$ edges. A similar argument shows that for $k < n-1$ column $k$ has $r^k$ edges and the $0$-th column has $r$ edges. Therefore the whole graph has
\[
(l - (n-1))\,r^n + \sum_{k=1}^{n-1} r^k + r = (l - (n-1))\,r^n + r\,\frac{r^{n-1}-1}{r-1} + r
\]

edges. Using a linear-time single-source shortest-distance algorithm over this acyclic graph, it follows that the complexity of our algorithm is in $O(l\, r^n)$.
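For concreteness, here is a minimal Python sketch (my own illustration) of the decoding step: a dynamic program over states given by the last $n-1$ labels, i.e., the shortest-distance computation on the graph described above; it examines $O(l\,r^n)$ transitions, and the scoring function is an arbitrary stand-in for $w \cdot \psi(x, k, y_{k-n+1}, \ldots, y_k)$.
\begin{verbatim}
import random

def decode(score, l, labels, n):
    # argmax over label sequences y_1 .. y_l of sum_k score(k, (y_{k-n+1}, ..., y_k)),
    # by dynamic programming over states = last (n - 1) labels.  The number of
    # transitions examined is O(l * r^n), with r = len(labels).
    best = {(): (0.0, [])}                          # state -> (best score, best prefix)
    for k in range(1, l + 1):
        new_best = {}
        for state, (s, prefix) in best.items():
            for yk in labels:
                gram = (state + (yk,))[-n:]         # label n-gram ending at position k
                new_state = (state + (yk,))[-(n - 1):] if n > 1 else ()
                cand = s + score(k, gram)
                if new_state not in new_best or cand > new_best[new_state][0]:
                    new_best[new_state] = (cand, prefix + [yk])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Example with r = 3 labels, l = 6, n = 3 and random stand-in scores for
# w . psi(x, k, y_{k-n+1}, ..., y_k), memoized so repeated queries agree.
random.seed(0)
labels = ["A", "B", "C"]
weights = {}
def score(k, gram):
    return weights.setdefault((k, gram), random.uniform(-1.0, 1.0))

print(decode(score, l=6, labels=labels, n=3))
\end{verbatim}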
