
CSE 291: Unsupervised learning Spring 2008

Lecture 7: Spectral methods


7.1 Linear algebra review
7.1.1 Eigenvalues and eigenvectors
Definition 1. A d × d matrix M has eigenvalue λ if there is a d-dimensional vector u ≠ 0 for which Mu = λu. This u is an eigenvector corresponding to λ.

In other words, the linear transformation M maps vector u into the same direction. It is interesting that any linear transformation necessarily has directional fixed points of this kind. The following chain of equivalences helps in understanding this:

    λ is an eigenvalue of M
    ⟺ there exists u ≠ 0 with Mu = λu
    ⟺ there exists u ≠ 0 with (M - λI)u = 0
    ⟺ (M - λI) is singular (that is, not invertible)
    ⟺ det(M - λI) = 0.

Now, det(M - λI) is a polynomial of degree d in λ. As such it has d roots (although some of them might be complex). This explains the existence of eigenvalues.
A case of great interest is when M is real-valued and symmetric, because then the eigenvalues are real.
Theorem 2. Let M be any real symmetric d × d matrix. Then:

1. M has d real eigenvalues λ_1, ..., λ_d (not necessarily distinct).

2. There is a set of d corresponding eigenvectors u_1, ..., u_d that constitute an orthonormal basis of R^d, that is, u_i · u_j = δ_ij for all i, j.
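As a quick aside (not part of the original notes' development), Theorem 2 is easy to see in action with NumPy: numpy.linalg.eigh is the routine specialized for symmetric matrices, and it returns a real spectrum together with an orthonormal eigenbasis. The matrix below is an arbitrary made-up example.

    import numpy as np

    # Any matrix of the form A + A^T is real and symmetric.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    M = A + A.T

    # eigh is meant for symmetric (Hermitian) matrices: it returns real eigenvalues
    # in ascending order and orthonormal eigenvectors as the columns of its second output.
    eigvals, eigvecs = np.linalg.eigh(M)

    print(eigvals)                                       # d real eigenvalues
    print(np.allclose(eigvecs.T @ eigvecs, np.eye(4)))   # True: the eigenvectors are orthonormal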
7.1.2 Spectral decomposition
The spectral decomposition recasts a matrix in terms of its eigenvalues and eigenvectors. This representation
turns out to be enormously useful.
Theorem 3. Let M be a real symmetric d × d matrix with eigenvalues λ_1, ..., λ_d and corresponding orthonormal eigenvectors u_1, ..., u_d. Then:

1. M = Q Λ Q^T, where Q is the d × d matrix whose columns are the eigenvectors u_1, u_2, ..., u_d, and Λ = diag(λ_1, ..., λ_d) is the diagonal matrix whose entries are the corresponding eigenvalues.

2. M = ∑_{i=1}^d λ_i u_i u_i^T.
Proof. A general proof strategy is to observe that M represents a linear transformation x ↦ Mx on R^d, and as such, is completely determined by its behavior on any set of d linearly independent vectors. For instance, {u_1, ..., u_d} are linearly independent, so any d × d matrix N that satisfies N u_i = M u_i (for all i) is necessarily identical to M.
Let's start by verifying (1). For practice, we'll do this two different ways.
Method One: For any i, we have

    Q Λ Q^T u_i = Q Λ e_i = Q λ_i e_i = λ_i Q e_i = λ_i u_i = M u_i.

Thus Q Λ Q^T = M.
Method Two: Since the u_i are orthonormal, we have Q^T Q = I. Thus Q is invertible, with Q^{-1} = Q^T; whereupon Q Q^T = I. For any i,

    Q^T M Q e_i = Q^T M u_i = Q^T λ_i u_i = λ_i Q^T u_i = λ_i e_i = Λ e_i.

Thus Λ = Q^T M Q, which implies M = Q Λ Q^T.
Now for (2). Again we use the same proof strategy. For any j,

    ( ∑_i λ_i u_i u_i^T ) u_j = λ_j u_j = M u_j.

Hence M = ∑_i λ_i u_i u_i^T.
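To make Theorem 3 concrete, here is a small numerical check (an illustrative sketch of my own, with an arbitrary matrix) that both forms of the spectral decomposition reproduce M.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 5))
    M = (A + A.T) / 2                        # a real symmetric matrix

    lam, Q = np.linalg.eigh(M)               # eigenvalues lam[i], eigenvectors as columns of Q

    # Form 1: M = Q diag(lam) Q^T
    M1 = Q @ np.diag(lam) @ Q.T

    # Form 2: M = sum_i lam_i u_i u_i^T (a sum of rank-one outer products)
    M2 = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(5))

    print(np.allclose(M, M1), np.allclose(M, M2))   # True True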
7.1.3 Positive semidefinite matrices
We now introduce an important subclass of real symmetric matrices.
Definition 4. A real symmetric d × d matrix M is positive semidefinite (denoted M ⪰ 0) if z^T M z ≥ 0 for all z ∈ R^d. It is positive definite (denoted M ≻ 0) if z^T M z > 0 for all nonzero z ∈ R^d.
Example 5. Consider any random vector X ∈ R^d, and let μ = EX and S = E[(X - μ)(X - μ)^T] denote its mean and covariance, respectively. Then S ⪰ 0 because for any z ∈ R^d,

    z^T S z = z^T E[(X - μ)(X - μ)^T] z = E[(z^T (X - μ))((X - μ)^T z)] = E[(z · (X - μ))^2] ≥ 0.
Positive (semi)definiteness is easily characterized in terms of eigenvalues.

Theorem 6. Let M be a real symmetric d × d matrix. Then:

1. M is positive semidefinite iff all its eigenvalues λ_i ≥ 0.

2. M is positive definite iff all its eigenvalues λ_i > 0.
Proof. Let's prove (1) (the second is similar). Let λ_1, ..., λ_d be the eigenvalues of M, with corresponding eigenvectors u_1, ..., u_d.

First, suppose M ⪰ 0. Then for all i, λ_i = u_i^T M u_i ≥ 0.

Conversely, suppose that all the λ_i ≥ 0. Then for any z ∈ R^d, we have

    z^T M z = z^T ( ∑_{i=1}^d λ_i u_i u_i^T ) z = ∑_{i=1}^d λ_i (z · u_i)^2 ≥ 0.
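Example 5 and Theorem 6 fit together nicely in code: a sample covariance matrix is positive semidefinite, so its eigenvalues should all be nonnegative up to floating-point noise. A minimal sketch, using made-up correlated data:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((1000, 3)) @ rng.standard_normal((3, 3))   # 1000 correlated samples in R^3

    S = np.cov(X, rowvar=False)          # sample covariance (each row of X is one observation)
    eigvals = np.linalg.eigvalsh(S)      # eigenvalues of the symmetric matrix S

    # Theorem 6(1): S is positive semidefinite iff every eigenvalue is >= 0.
    print(eigvals)
    print(np.all(eigvals >= -1e-10))     # True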
7.1.4 The Rayleigh quotient
One of the reasons why eigenvalues are so useful is that they constitute the optimal solution of a very basic
quadratic optimization problem.
Theorem 7. Let M be a real symmetric d × d matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d, and corresponding eigenvectors u_1, ..., u_d. Then:

    max_{‖z‖=1} z^T M z = max_{z≠0} (z^T M z)/(z^T z) = λ_1

    min_{‖z‖=1} z^T M z = min_{z≠0} (z^T M z)/(z^T z) = λ_d

and these are realized at z = u_1 and z = u_d, respectively.
Proof. Denote the spectral decomposition by M = Q Λ Q^T. Then:

    max_{z≠0} (z^T M z)/(z^T z)
      = max_{z≠0} (z^T Q Λ Q^T z)/(z^T Q Q^T z)        (since Q Q^T = I)
      = max_{y≠0} (y^T Λ y)/(y^T y)                    (writing y = Q^T z)
      = max_{y≠0} (λ_1 y_1^2 + ... + λ_d y_d^2)/(y_1^2 + ... + y_d^2)
      ≤ λ_1,

where equality is attained in the last step when y = e_1, that is, z = Q e_1 = u_1. The argument for the minimum is identical.
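The Rayleigh-quotient characterization is easy to sanity-check numerically: no unit vector z should beat the top eigenvalue, and the top eigenvector should attain it. A small sketch with an arbitrary symmetric matrix:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((6, 6))
    M = (A + A.T) / 2

    lam, U = np.linalg.eigh(M)              # ascending order, so lam[-1] is lambda_1
    top_val, top_vec = lam[-1], U[:, -1]

    Z = rng.standard_normal((6, 10000))
    Z /= np.linalg.norm(Z, axis=0)          # 10000 random unit vectors, one per column

    rayleigh = np.einsum('ij,ij->j', Z, M @ Z)          # z^T M z for each column z
    print(rayleigh.max() <= top_val + 1e-12)            # True: nothing exceeds lambda_1
    print(np.isclose(top_vec @ M @ top_vec, top_val))   # True: u_1 attains it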
Example 8. Suppose random vector X ∈ R^d has mean μ and covariance matrix M. Then z^T M z represents the variance of X in direction z:

    var(z^T X) = E[(z^T (X - μ))^2] = E[z^T (X - μ)(X - μ)^T z] = z^T M z.

Theorem 7 tells us that the direction of maximum variance is u_1, and that of minimum variance is u_d.
Continuing with this example, suppose that we are interested in the k-dimensional subspace (of R^d) that has the most variance. How can this be formalized?

To start with, we will think of a linear projection from R^d to R^k as a function x ↦ P^T x, where P^T is a k × d matrix with P^T P = I_k. The last condition simply says that the rows of the projection matrix are orthonormal.
When a random vector X ∈ R^d is subjected to such a projection, the resulting k-dimensional vector has covariance matrix

    cov(P^T X) = E[P^T (X - μ)(X - μ)^T P] = P^T M P.

Often we want to summarize the variance by just a single number rather than an entire matrix; in such cases, we typically use the trace of this matrix, and we write var(P^T X) = tr(P^T M P). This is also equal to E‖P^T X - P^T μ‖^2. With this terminology established, we can now determine the projection P^T that maximizes this variance.
Theorem 9. Let M be a real symmetric d × d matrix as in Theorem 7. Pick any k ≤ d.

    max_{P ∈ R^{d×k}, P^T P = I}  tr(P^T M P) = λ_1 + ... + λ_k

    min_{P ∈ R^{d×k}, P^T P = I}  tr(P^T M P) = λ_{d-k+1} + ... + λ_d.

These are realized when the columns of P span the k-dimensional subspace spanned by {u_1, ..., u_k} and {u_{d-k+1}, ..., u_d}, respectively.
Proof. We will prove the result for the maximum; the other case is symmetric. Let p_1, ..., p_k denote the columns of P. Then

    tr(P^T M P) = ∑_{i=1}^k p_i^T M p_i = ∑_{i=1}^k p_i^T ( ∑_{j=1}^d λ_j u_j u_j^T ) p_i = ∑_{j=1}^d λ_j ∑_{i=1}^k (p_i · u_j)^2.

We will show that this quantity is at most λ_1 + ... + λ_k. To this end, let z_j denote ∑_{i=1}^k (p_i · u_j)^2; clearly it is nonnegative. We will show that ∑_j z_j = k and that each z_j ≤ 1; the desired bound is then immediate.

First,

    ∑_{j=1}^d z_j = ∑_{i=1}^k ∑_{j=1}^d (p_i · u_j)^2 = ∑_{i=1}^k ∑_{j=1}^d p_i^T u_j u_j^T p_i = ∑_{i=1}^k p_i^T Q Q^T p_i = ∑_{i=1}^k ‖p_i‖^2 = k.

To upper-bound an individual z_j, start by extending the k orthonormal vectors p_1, ..., p_k to a full orthonormal basis p_1, ..., p_d of R^d. Then

    z_j = ∑_{i=1}^k (p_i · u_j)^2 ≤ ∑_{i=1}^d (p_i · u_j)^2 = ∑_{i=1}^d u_j^T p_i p_i^T u_j = ‖u_j‖^2 = 1.

It then follows that

    tr(P^T M P) = ∑_{j=1}^d λ_j z_j ≤ λ_1 + ... + λ_k,

and equality holds when p_1, ..., p_k span the same space as u_1, ..., u_k.
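Theorem 9 can be verified the same way: packing the top k eigenvectors into P makes tr(P^T M P) equal to λ_1 + ... + λ_k, and a random orthonormal P does no better. A hedged sketch:

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((8, 8))
    M = (A + A.T) / 2
    k = 3

    lam, U = np.linalg.eigh(M)                   # ascending eigenvalues
    P_best = U[:, -k:]                           # top-k eigenvectors as columns
    print(np.isclose(np.trace(P_best.T @ M @ P_best), lam[-k:].sum()))   # True

    # Any other d x k matrix with orthonormal columns (here obtained via QR) does no better.
    P_rand, _ = np.linalg.qr(rng.standard_normal((8, k)))
    print(np.trace(P_rand.T @ M @ P_rand) <= lam[-k:].sum() + 1e-12)     # True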
7.2 Principal component analysis
Let X ∈ R^d be a random vector. We wish to find the single direction that captures as much as possible of the variance of X. Formally: we want p ∈ R^d (the direction) such that ‖p‖ = 1, so as to maximize var(p^T X).
Theorem 10. The solution to this optimization problem is to make p the principal eigenvector of cov(X).
Proof. Denote μ = EX and S = cov(X) = E[(X - μ)(X - μ)^T]. For any p ∈ R^d, the projection p^T X has mean E[p^T X] = p^T μ and variance

    var(p^T X) = E[(p^T X - p^T μ)^2] = E[p^T (X - μ)(X - μ)^T p] = p^T S p.

By Theorem 7, this is maximized (over all unit-length p) when p is the principal eigenvector of S.
Likewise, the k-dimensional subspace that captures as much as possible of the variance of X is simply the subspace spanned by the top k eigenvectors of cov(X); call these u_1, ..., u_k.

Projection onto these eigenvectors is called principal component analysis (PCA). It can be used to reduce the dimension of the data from d to k. Here are the steps:

• Compute the mean μ and covariance matrix S of the data X.
• Compute the top k eigenvectors u_1, ..., u_k of S.
• Project X ↦ P^T X, where P^T is the k × d matrix whose rows are u_1, ..., u_k.
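The three steps translate almost line-for-line into NumPy. The sketch below is my own minimal illustration (the function name, the toy data, and the choice to subtract the mean before projecting are mine; centering does not change the variance captured):

    import numpy as np

    def pca_project(X, k):
        """Project the rows of X (one sample per row) onto the top-k principal components."""
        mu = X.mean(axis=0)                  # step 1: mean ...
        S = np.cov(X, rowvar=False)          #         ... and covariance
        lam, U = np.linalg.eigh(S)           # step 2: eigenvectors (eigenvalues in ascending order)
        P = U[:, -k:][:, ::-1]               #         keep the top k, largest eigenvalue first
        return (X - mu) @ P, P               # step 3: x -> P^T (x - mu), applied to every row

    # Toy usage: 500 points in R^5 that mostly vary within a 2-dimensional subspace.
    rng = np.random.default_rng(5)
    X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 5)) \
        + 0.05 * rng.standard_normal((500, 5))
    Y, P = pca_project(X, k=2)
    print(Y.shape, P.shape)                  # (500, 2) (5, 2)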
7.2.1 The best approximating affine subspace
We've seen one optimality property of PCA. Here's another: it is the k-dimensional affine subspace that best approximates X, in the sense that the expected squared distance from X to the subspace is minimized.
Let's formalize the problem. A k-dimensional affine subspace is given by a displacement p_o ∈ R^d and a set of (orthonormal) basis vectors p_1, ..., p_k ∈ R^d. The subspace itself is then {p_o + α_1 p_1 + ... + α_k p_k : α_i ∈ R}.

[Figure: an affine subspace displaced from the origin by p_o, with basis vectors p_1 and p_2.]
The projection of X ∈ R^d onto this subspace is P P^T X + p_o, where P^T is the k × d matrix whose rows are p_1, ..., p_k. Thus, the expected squared distance from X to this subspace is E‖X - (P P^T X + p_o)‖^2. We wish to find the subspace for which this is minimized.
Theorem 11. Let μ and S denote the mean and covariance of X, respectively. The solution of this optimization problem is to choose p_1, ..., p_k to be the top k eigenvectors of S and to set p_o = (I - P P^T)μ.
Proof. Fix any matrix P; the choice of p_o that minimizes E‖X - (P P^T X + p_o)‖^2 is (by calculus) p_o = E[X - P P^T X] = (I - P P^T)μ.
Now let's optimize P. Our cost function is

    E‖X - (P P^T X + p_o)‖^2 = E‖(I - P P^T)(X - μ)‖^2 = E‖X - μ‖^2 - E‖P^T (X - μ)‖^2,

where the second step is simply an invocation of the Pythagorean theorem (note that ‖P P^T (X - μ)‖ = ‖P^T (X - μ)‖, since the columns of P are orthonormal).

[Figure: the vector X - μ decomposed into its projection P P^T (X - μ) onto the subspace and the orthogonal remainder.]

Therefore, we need to maximize E‖P^T (X - μ)‖^2 = var(P^T X), and we've already seen how to do this in Theorem 10 and the ensuing discussion.
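As a numerical sanity check of Theorem 11 (my own sketch, written with the P P^T form of the projection), the PCA subspace should leave a smaller mean squared residual than any other subspace of the same dimension, for instance a random one:

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.standard_normal((2000, 3)) @ rng.standard_normal((3, 6))   # data in R^6
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    k = 2

    def mean_sq_residual(P):
        """Empirical E||X - (P P^T X + p_o)||^2 with the optimal offset p_o = (I - P P^T) mu."""
        closest = (X - mu) @ P @ P.T + mu            # closest point of the affine subspace, row by row
        return np.mean(np.sum((X - closest) ** 2, axis=1))

    P_pca = np.linalg.eigh(S)[1][:, -k:]             # top-k eigenvectors of the covariance
    P_rnd = np.linalg.qr(rng.standard_normal((6, k)))[0]
    print(mean_sq_residual(P_pca) <= mean_sq_residual(P_rnd))   # True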
7.2.2 The projection that best preserves interpoint distances
Suppose we want to find the k-dimensional projection that minimizes the expected distortion in interpoint distances. More precisely, we want to find the k × d projection matrix P^T (with P^T P = I_k) such that, for i.i.d. random vectors X and Y, the expected squared distortion E[‖X - Y‖^2 - ‖P^T X - P^T Y‖^2] is minimized (of course, the term in brackets is always nonnegative).
Theorem 12. The solution is to make the rows of P^T the top k eigenvectors of cov(X).
Proof. This time we want to maximize

    E‖P^T X - P^T Y‖^2 = 2 E‖P^T X - P^T μ‖^2 = 2 var(P^T X),

and once again we're back to our original problem.
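The identity used in this proof, E‖P^T X - P^T Y‖^2 = 2 var(P^T X) for i.i.d. X and Y, can also be checked by simulation. A rough Monte Carlo sketch (the distribution and sample size are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(10)
    d, k, n = 5, 2, 200000
    C = rng.standard_normal((d, d))
    S = C @ C.T                                            # a covariance matrix
    X = rng.multivariate_normal(np.zeros(d), S, size=n)    # i.i.d. draws of X
    Y = rng.multivariate_normal(np.zeros(d), S, size=n)    # an independent copy Y

    P = np.linalg.eigh(S)[1][:, -k:]                       # any orthonormal P works; use the top-k eigenvectors
    lhs = np.mean(np.sum(((X - Y) @ P) ** 2, axis=1))      # empirical E||P^T X - P^T Y||^2
    print(lhs, 2 * np.trace(P.T @ S @ P))                  # the two numbers should nearly agree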
This is emphatically not the same as finding the linear transformation (that is, not necessarily a projection) P^T for which E| ‖X - Y‖^2 - ‖P^T X - P^T Y‖^2 | is minimized. The random projection method that we saw earlier falls in this latter camp, because it consists of a projection followed by a scaling by √(d/k).
7.2.3 A prelude to k-means clustering
Suppose that for random vector X ∈ R^d, the optimal k-means centers are μ*_1, ..., μ*_k, with cost

    opt = E‖X - (nearest μ*_i)‖^2.

If instead, we project X into the k-dimensional PCA subspace, and find the best k centers μ_1, ..., μ_k in that subspace, how bad can these centers be?

Theorem 13. cost(μ_1, ..., μ_k) ≤ 2 · opt.
Proof. Without loss of generality EX = 0, and write the projection onto the k-dimensional PCA subspace (as a map from R^d to R^d) as X ↦ P P^T X. Since μ_1, ..., μ_k are the best centers for P P^T X, it follows that

    E‖P P^T X - (nearest μ_i)‖^2 ≤ E‖P P^T (X - (nearest μ*_i))‖^2 ≤ E‖X - (nearest μ*_i)‖^2 = opt.

Let X ↦ A A^T X denote projection onto the subspace spanned by μ*_1, ..., μ*_k. From our earlier result on approximating affine subspaces, we know that E‖X - P P^T X‖^2 ≤ E‖X - A A^T X‖^2. Thus

    cost(μ_1, ..., μ_k) = E‖X - (nearest μ_i)‖^2
      = E‖P P^T X - (nearest μ_i)‖^2 + E‖X - P P^T X‖^2     (Pythagorean theorem)
      ≤ opt + E‖X - A A^T X‖^2
      ≤ opt + E‖X - (nearest μ*_i)‖^2
      = 2 · opt.
7.3 Singular value decomposition
For any real symmetric d × d matrix M, we can find its eigenvalues λ_1 ≥ ... ≥ λ_d and corresponding orthonormal eigenvectors u_1, ..., u_d, and write

    M = ∑_{i=1}^d λ_i u_i u_i^T.
The best rank-k approximation to M is

    M_k = ∑_{i=1}^k λ_i u_i u_i^T,

in the sense that this minimizes ‖M - M_k‖_F^2 over all rank-k matrices. (Here ‖·‖_F denotes the Frobenius norm; it is the same as the L_2 norm if you imagine the matrix rearranged into a very long vector.)

In many applications, M_k is an adequate approximation of M even for fairly small values of k. And it is conveniently compact, of size O(kd).
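For a concrete feel, the following sketch (my own, using a positive semidefinite example so that the largest eigenvalues are also the largest in magnitude) forms M_k from the truncated eigendecomposition and watches the Frobenius error shrink as k grows:

    import numpy as np

    rng = np.random.default_rng(7)
    B = rng.standard_normal((6, 6))
    M = B @ B.T                               # symmetric positive semidefinite

    lam, U = np.linalg.eigh(M)
    lam, U = lam[::-1], U[:, ::-1]            # reorder so eigenvalues are decreasing

    for k in range(1, 7):
        Mk = U[:, :k] @ np.diag(lam[:k]) @ U[:, :k].T     # M_k = sum_{i<=k} lam_i u_i u_i^T
        print(k, np.linalg.norm(M - Mk, 'fro') ** 2)      # squared Frobenius error, decreasing in k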
But what if we are dealing with a matrix M that is not square; say it is m × n with m ≤ n?

[Figure: a short, wide m × n matrix M.]

To find a compact approximation in such cases, we look at M^T M or M M^T, which are square. Eigendecompositions of these matrices lead to a good representation of M.
7.3.1 The relationship between M M^T and M^T M
Lemma 14. M^T M and M M^T are symmetric positive semidefinite matrices.
Proof. We'll do M^T M; the other is similar. First off, it is symmetric:

    (M^T M)_{ij} = ∑_k (M^T)_{ik} M_{kj} = ∑_k M_{ki} M_{kj} = ∑_k (M^T)_{jk} M_{ki} = (M^T M)_{ji}.

Next, M^T M ⪰ 0 since for any z ∈ R^n, we have z^T M^T M z = ‖Mz‖^2 ≥ 0.
Which one should we use, M^T M or M M^T? Well, they are of different sizes, n × n and m × m respectively.

[Figure: the m × m matrix M M^T alongside the larger n × n matrix M^T M.]

Ideally, we'd prefer to deal with the smaller of the two, M M^T, especially since eigenvalue computations are expensive. Fortunately, it turns out the two matrices have the same (non-zero) eigenvalues!
Lemma 15. If λ is an eigenvalue of M^T M with eigenvector u, then

either: (i) λ is an eigenvalue of M M^T with eigenvector Mu,

or: (ii) λ = 0 and Mu = 0.
Proof. Say λ ≠ 0; we'll prove that condition (i) holds. First of all, M^T M u = λu ≠ 0, so certainly Mu ≠ 0. It is an eigenvector of M M^T with eigenvalue λ, since

    M M^T (Mu) = M(M^T M u) = M(λu) = λ(Mu).

Next, suppose λ = 0; we'll establish condition (ii). Notice that

    ‖Mu‖^2 = u^T M^T M u = u^T (M^T M u) = u^T (λu) = 0.

Thus it must be the case that Mu = 0.
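Lemma 15 is easy to observe numerically: the nonzero eigenvalues of M^T M and M M^T coincide, and the larger matrix simply pads the spectrum with zeros. A quick sketch with an arbitrary rectangular matrix:

    import numpy as np

    rng = np.random.default_rng(8)
    M = rng.standard_normal((3, 7))            # m = 3, n = 7, so m <= n

    big = np.linalg.eigvalsh(M.T @ M)          # n eigenvalues of the larger matrix (ascending)
    small = np.linalg.eigvalsh(M @ M.T)        # m eigenvalues of the smaller matrix (ascending)

    print(np.allclose(big[-3:], small))        # True: the top m eigenvalues agree
    print(np.allclose(big[:-3], 0))            # True: the remaining n - m eigenvalues are ~0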
7.3.2 A spectral decomposition for rectangular matrices
Let's summarize the consequences of Lemma 15. We have two square matrices, a large one (M^T M) of size n × n and a smaller one (M M^T) of size m × m. Let the eigenvalues of the large matrix be λ_1 ≥ λ_2 ≥ ... ≥ λ_n, with corresponding orthonormal eigenvectors u_1, ..., u_n. From the lemma, we know that at most m of the eigenvalues are nonzero.
The smaller matrix M M^T has eigenvalues λ_1, ..., λ_m, and corresponding orthonormal eigenvectors v_1, ..., v_m. The lemma suggests that v_i = M u_i; this is certainly a valid set of eigenvectors, but they are not necessarily normalized to unit length. So instead we set

    v_i = M u_i / ‖M u_i‖ = M u_i / √(u_i^T M^T M u_i) = M u_i / √λ_i.
This finally gives us the singular value decomposition, a spectral decomposition for general matrices.

Theorem 16. Let M be a rectangular m × n matrix with m ≤ n. Define λ_i, u_i, v_i as above. Then

    M = Q_1 Σ Q_2^T,

where Q_1 is the m × m matrix whose columns are v_1, v_2, ..., v_m; the matrix Σ is the m × n matrix with diagonal entries √λ_1, √λ_2, ..., √λ_m and zeros elsewhere; and Q_2^T is the n × n matrix whose rows are u_1^T, u_2^T, ..., u_n^T.
Proof. We will check that Σ = Q_1^T M Q_2. By our proof strategy from Theorem 3, it is enough to verify that both sides have the same effect on e_i for all 1 ≤ i ≤ n. For any such i,

    Q_1^T M Q_2 e_i = Q_1^T M u_i,

which equals Q_1^T (√λ_i v_i) = √λ_i e_i when i ≤ m, and equals 0 when i > m (since then M u_i = 0). In either case this is exactly Σ e_i.
The alternative form of the singular value decomposition is

    M = ∑_{i=1}^m √λ_i v_i u_i^T,

which immediately yields a rank-k approximation

    M_k = ∑_{i=1}^k √λ_i v_i u_i^T.

As in the square case, M_k is the rank-k matrix that minimizes ‖M - M_k‖_F^2.
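NumPy's numpy.linalg.svd returns precisely the ingredients of Theorem 16, with the singular values σ_i = √λ_i on the diagonal, so both the factorization and the rank-k truncation can be checked directly. A brief sketch:

    import numpy as np

    rng = np.random.default_rng(9)
    M = rng.standard_normal((4, 9))                          # m x n with m <= n

    V, sigma, Ut = np.linalg.svd(M, full_matrices=False)     # V ~ Q_1, sigma ~ sqrt(lambda_i), Ut ~ rows u_i^T
    print(np.allclose(M, V @ np.diag(sigma) @ Ut))           # True: M = Q_1 Sigma Q_2^T

    # The squared singular values are the eigenvalues of M M^T (Lemma 15 / Theorem 16).
    print(np.allclose(np.sort(sigma ** 2), np.linalg.eigvalsh(M @ M.T)))   # True

    # Rank-k approximation M_k = sum_{i<=k} sigma_i v_i u_i^T.
    k = 2
    Mk = V[:, :k] @ np.diag(sigma[:k]) @ Ut[:k, :]
    print(np.linalg.norm(M - Mk, 'fro') ** 2)                # equals the sum of the discarded sigma_i^2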