
CS434a/541a: Pattern Recognition

Prof. Olga Veksler

Lecture 8
Today

Continue with Dimensionality Reduction


Last lecture: PCA
This lecture: Fisher Linear Discriminant
Data Representation vs. Data Classification
PCA finds the most accurate data representation
in a lower dimensional space
Project data in the directions of maximum variance
However the directions of maximum variance may
be useless for classification
[Figure: data from two classes that is separable in 2D becomes non-separable after applying PCA to each class]

Fisher Linear Discriminant: project to a line which
preserves the direction useful for data classification
Fisher Linear Discriminant
Main idea: find projection to a line s.t. samples
from different classes are well separated

Example in 2D

[Figure: left, a bad line to project to (classes are mixed up); right, a good line to project to (classes are well separated)]
Fisher Linear Discriminant
Suppose we have 2 classes and d-dimensional
samples x1,…,xn where
n1 samples come from the first class
n2 samples come from the second class
consider projection on a line
Let the line direction be given by unit vector v

The scalar v^t x_i is the distance of the
projection of x_i from the origin
Thus v^t x_i is the projection of x_i onto a
one-dimensional subspace
Fisher Linear Discriminant
Thus the projection of sample xi onto a line in
direction v is given by vtxi
How to measure separation between projections of
different classes?
Let µ~1 and µ~2 be the means of projections of
classes 1 and 2
Let µ1 and µ2 be the means of classes 1 and 2
µ~1 − µ~2 seems like a good measure
\tilde{\mu}_1 = \frac{1}{n_1}\sum_{x_i \in C_1} v^t x_i = v^t \frac{1}{n_1}\sum_{x_i \in C_1} x_i = v^t \mu_1

similarly, \tilde{\mu}_2 = v^t \mu_2
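As a concrete illustration, here is a minimal numpy sketch (not from the lecture; the data and the direction v below are illustrative) that projects samples onto a unit direction and checks that the mean of the projections equals the projection of the class mean:

import numpy as np

# samples of one class, one row per d-dimensional sample (illustrative data)
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 5.0], [5.0, 5.0]])

v = np.array([1.0, 1.0])
v = v / np.linalg.norm(v)     # unit vector giving the line direction

y1 = X1 @ v                   # projections v^t x_i, one scalar per sample
mu1 = X1.mean(axis=0)         # class mean before projection

print(y1.mean(), v @ mu1)     # equal: the mean of the projections is v^t mu_1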
Fisher Linear Discriminant
How good is µ~1 − µ~2 as a measure of separation?
The larger µ~1 − µ~2, the better is the expected separation

[Figure: the same two-class data projected onto two different lines; the projections onto the vertical axis have means µ~1 and µ~2, the projections onto the horizontal axis have means µ1 and µ2]

the vertical axis is a better line to project to for class
separability than the horizontal axis
however µ1 − µ2 > µ~1 − µ~2
Fisher Linear Discriminant
The problem with µ~1 − µ~2 is that it does not
consider the variance of the classes

[Figure: the same data; the projections onto the vertical axis (means µ~1, µ~2) have small variance within each class, while the projections onto the horizontal axis (means µ1, µ2) have large variance]
Fisher Linear Discriminant
We need to normalize µ~1 − µ~2 by a factor which is
proportional to variance
Have samples z_1,…,z_n. Sample mean is

\mu_z = \frac{1}{n}\sum_{i=1}^{n} z_i

Define their scatter as

s = \sum_{i=1}^{n} (z_i - \mu_z)^2

Thus scatter is just sample variance multiplied by n


scatter measures the same thing as variance, the spread
of data around the mean
scatter is just on a different scale than variance

[Figure: two sets of 1D samples around their means, one with larger scatter and one with smaller scatter]
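A quick numerical check of this relationship (illustrative numpy, not from the slides): scatter equals n times the biased sample variance.

import numpy as np

z = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # illustrative 1D samples
n = len(z)

scatter = np.sum((z - z.mean()) ** 2)    # s = sum of squared deviations from the mean
variance = z.var()                       # biased sample variance (divides by n)

print(scatter, n * variance)             # the two agree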
Fisher Linear Discriminant
Fisher Solution: normalize µ~1 − µ~2 by scatter

Let y_i = v^t x_i, i.e. the y_i's are the projected samples

Scatter for projected samples of class 1 is


\tilde{s}_1^2 = \sum_{y_i \in \text{Class 1}} (y_i - \tilde{\mu}_1)^2

Scatter for projected samples of class 2 is


\tilde{s}_2^2 = \sum_{y_i \in \text{Class 2}} (y_i - \tilde{\mu}_2)^2
Fisher Linear Discriminant
We need to normalize by both scatter of class 1 and
scatter of class 2
Thus the Fisher linear discriminant is to project onto the line
in the direction v which maximizes

J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

want the projected means to be far from each other
want the scatter in class 1 to be as small as possible, i.e. samples of class 1 cluster around the projected mean µ~1
want the scatter in class 2 to be as small as possible, i.e. samples of class 2 cluster around the projected mean µ~2
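In code, the criterion can be evaluated directly from the projected samples (a sketch with my own function name, using the example data that appears later in the lecture):

import numpy as np

def fisher_criterion(v, X1, X2):
    """J(v) = (mu~1 - mu~2)^2 / (s~1^2 + s~2^2) for projection direction v."""
    y1, y2 = X1 @ v, X2 @ v                                  # projected samples
    numerator = (y1.mean() - y2.mean()) ** 2                 # squared distance of projected means
    denominator = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
    return numerator / denominator

X1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
X2 = np.array([[1, 0], [2, 1], [3, 1], [3, 2], [5, 3], [6, 5]], dtype=float)
print(fisher_criterion(np.array([1.0, 0.0]), X1, X2))        # poor direction, small J
print(fisher_criterion(np.array([-0.65, 0.73]), X1, X2))     # Fisher direction, much larger J

Note that J(v) does not change if v is rescaled, so only the direction of v matters.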
Fisher Linear Discriminant
J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}
If we find v which makes J(v) large, we are
guaranteed that the classes are well separated
the projected means µ~1 and µ~2 are far from each other
a small s~1 implies that the projected samples of class 1 are clustered around their projected mean
a small s~2 implies that the projected samples of class 2 are clustered around their projected mean
Fisher Linear Discriminant Derivation
J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}
All we need to do now is to express J explicitly as a
function of v and maximize it
straightforward, but needs linear algebra and calculus
Define the separate class scatter matrices S1 and
S2 for classes 1 and 2. These measure the scatter
of original samples xi (before projection)
S_1 = \sum_{x_i \in \text{Class 1}} (x_i - \mu_1)(x_i - \mu_1)^t

S_2 = \sum_{x_i \in \text{Class 2}} (x_i - \mu_2)(x_i - \mu_2)^t
Fisher Linear Discriminant Derivation
Now define the within-class scatter matrix

S_W = S_1 + S_2

Recall that

\tilde{s}_1^2 = \sum_{y_i \in \text{Class 1}} (y_i - \tilde{\mu}_1)^2

Using y_i = v^t x_i and \tilde{\mu}_1 = v^t \mu_1:

\tilde{s}_1^2 = \sum_{x_i \in \text{Class 1}} (v^t x_i - v^t \mu_1)^2

= \sum_{x_i \in \text{Class 1}} \left(v^t (x_i - \mu_1)\right)\left((x_i - \mu_1)^t v\right)

= v^t \left[\sum_{x_i \in \text{Class 1}} (x_i - \mu_1)(x_i - \mu_1)^t\right] v = v^t S_1 v
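A numerical sanity check of this identity (illustrative, using the class 1 data from the example later in the lecture): the scatter of the projected samples equals v^t S_1 v for any direction v.

import numpy as np

X1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)   # class 1 samples
v = np.array([-0.65, 0.73])                      # any direction works for the identity

mu1 = X1.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)                   # S_1 = sum (x_i - mu_1)(x_i - mu_1)^t

y = X1 @ v                                       # projected samples y_i = v^t x_i
print(np.sum((y - y.mean()) ** 2), v @ S1 @ v)   # the two numbers agree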
Fisher Linear Discriminant Derivation
Similarly, \tilde{s}_2^2 = v^t S_2 v
Therefore \tilde{s}_1^2 + \tilde{s}_2^2 = v^t S_1 v + v^t S_2 v = v^t S_W v
Define the between-class scatter matrix

S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t

S_B measures the separation between the means of the two
classes (before projection)
Let's rewrite the separation of the projected means

(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (v^t \mu_1 - v^t \mu_2)^2
= v^t (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t v
= v^t S_B v
Fisher Linear Discriminant Derivation
Thus our objective function can be written:

J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{v^t S_B v}{v^t S_W v}

Maximize J(v) by taking the derivative w.r.t. v and
setting it to 0:

\frac{d}{dv} J(v) = \frac{\left(\frac{d}{dv} v^t S_B v\right) v^t S_W v - v^t S_B v \left(\frac{d}{dv} v^t S_W v\right)}{(v^t S_W v)^2}

= \frac{(2 S_B v)\, v^t S_W v - (2 S_W v)\, v^t S_B v}{(v^t S_W v)^2} = 0
Fisher Linear Discriminant Derivation
Need to solve v^t S_W v\,(S_B v) - v^t S_B v\,(S_W v) = 0

Dividing through by v^t S_W v:

\frac{v^t S_W v\,(S_B v)}{v^t S_W v} - \frac{v^t S_B v\,(S_W v)}{v^t S_W v} = 0

S_B v - \frac{v^t S_B v}{v^t S_W v}\,(S_W v) = 0

The ratio \frac{v^t S_B v}{v^t S_W v} is a scalar; call it \lambda. Then

S_B v = \lambda S_W v

a generalized eigenvalue problem
Fisher Linear Discriminant Derivation
S_B v = \lambda S_W v

If S_W has full rank (the inverse exists), we can convert
this to a standard eigenvalue problem

S_W^{-1} S_B v = \lambda v

But S_B x, for any vector x, points in the same
direction as \mu_1 - \mu_2:

S_B x = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t x = (\mu_1 - \mu_2)\left((\mu_1 - \mu_2)^t x\right) = \alpha (\mu_1 - \mu_2)

Thus we can solve the eigenvalue problem immediately:

v = S_W^{-1}(\mu_1 - \mu_2)

since

S_W^{-1} S_B \left[S_W^{-1}(\mu_1 - \mu_2)\right] = S_W^{-1}\left[\alpha(\mu_1 - \mu_2)\right] = \alpha \left[S_W^{-1}(\mu_1 - \mu_2)\right]

i.e. v = S_W^{-1}(\mu_1 - \mu_2) is an eigenvector of S_W^{-1} S_B with eigenvalue \lambda = \alpha
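A minimal numpy sketch of this closed-form solution (the function name is mine, not from the lecture):

import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction v = S_W^{-1} (mu_1 - mu_2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # class 1 scatter matrix
    S2 = (X2 - mu2).T @ (X2 - mu2)          # class 2 scatter matrix
    SW = S1 + S2                            # within-class scatter
    v = np.linalg.solve(SW, mu1 - mu2)      # solves S_W v = mu_1 - mu_2 without forming the inverse
    return v / np.linalg.norm(v)            # the scale of v is irrelevant; normalize for convenience

Solving the linear system is numerically preferable to forming S_W^{-1} explicitly, although both give the same direction when S_W has full rank.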
Fisher Linear Discriminant Example
Data
Class 1 has 5 samples c1=[(1,2),(2,3),(3,3),(4,5),(5,5)]
Class 2 has 6 samples c2=[(1,0),(2,1),(3,1),(3,2),(5,3),(6,5)]
Arrange data in 2 separate matrices
c_1 = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{bmatrix} \qquad c_2 = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 3 & 1 \\ 3 & 2 \\ 5 & 3 \\ 6 & 5 \end{bmatrix}

Notice that PCA performs very


poorly on this data because the
direction of largest variance is not
helpful for classification
Fisher Linear Discriminant Example
First compute the mean for each class
\mu_1 = \text{mean}(c_1) = [3 \;\; 3.6] \qquad \mu_2 = \text{mean}(c_2) = [3.3 \;\; 2]

Compute scatter matrices S_1 and S_2 for each class

S_1 = 4 \cdot \text{cov}(c_1) = \begin{bmatrix} 10 & 8.0 \\ 8.0 & 7.2 \end{bmatrix} \qquad S_2 = 5 \cdot \text{cov}(c_2) = \begin{bmatrix} 17.3 & 16 \\ 16 & 16 \end{bmatrix}

Within-class scatter:

S_W = S_1 + S_2 = \begin{bmatrix} 27.3 & 24 \\ 24 & 23.2 \end{bmatrix}

It has full rank, so we don't have to solve for eigenvalues

The inverse of S_W is

S_W^{-1} = \text{inv}(S_W) = \begin{bmatrix} 0.39 & -0.41 \\ -0.41 & 0.47 \end{bmatrix}

Finally, the optimal line direction v:

v = S_W^{-1}(\mu_1 - \mu_2) = \begin{bmatrix} -0.79 \\ 0.89 \end{bmatrix}
Fisher Linear Discriminant Example

Notice, as long as the line has the right direction, its
exact position does not matter

The last step is to compute the actual 1D projections y.
Let's do it separately for each class (the v used below is a
rescaled version of the direction computed above; only the
direction matters)

Y_1 = v^t c_1^t = [-0.65 \;\; 0.73] \begin{bmatrix} 1 & 2 & 3 & 4 & 5 \\ 2 & 3 & 3 & 5 & 5 \end{bmatrix} = [0.81 \;\; \cdots \;\; 0.4]

Y_2 = v^t c_2^t = [-0.65 \;\; 0.73] \begin{bmatrix} 1 & 2 & 3 & 3 & 5 & 6 \\ 0 & 1 & 1 & 2 & 3 & 5 \end{bmatrix} = [-0.65 \;\; \cdots \;\; -0.25]
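The whole example can be reproduced with a few lines of numpy (a sketch, not the lecture's code; up to rounding it gives the same scatter matrices, and a direction proportional to the one used above):

import numpy as np

c1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
c2 = np.array([[1, 0], [2, 1], [3, 1], [3, 2], [5, 3], [6, 5]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)        # [3, 3.6] and approx. [3.33, 2]
S1 = (len(c1) - 1) * np.cov(c1, rowvar=False)      # 4 * cov(c1)
S2 = (len(c2) - 1) * np.cov(c2, rowvar=False)      # 5 * cov(c2)
SW = S1 + S2                                       # approx. [[27.3, 24], [24, 23.2]]

v = np.linalg.inv(SW) @ (mu1 - mu2)                # approx. [-0.79, 0.89]
print(v)

Y1, Y2 = c1 @ v, c2 @ v                            # projected 1D samples for each class
print(Y1)
print(Y2)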
Multiple Discriminant Analysis (MDA)
Can generalize FLD to multiple classes
In case of c classes, can reduce dimensionality to
1, 2, 3,…, c-1 dimensions
Project sample x_i to a linear subspace: y_i = V^t x_i
V is called the projection matrix
Multiple Discriminant Analysis (MDA)
Let n_i be the number of samples of class i,
µ_i be the sample mean of class i,
and µ be the total mean of all samples

\mu_i = \frac{1}{n_i}\sum_{x \in \text{class } i} x \qquad \mu = \frac{1}{n}\sum_{x_i} x_i

Objective function:

J(V) = \frac{\det(V^t S_B V)}{\det(V^t S_W V)}

The within-class scatter matrix S_W is

S_W = \sum_{i=1}^{c} S_i = \sum_{i=1}^{c} \sum_{x_k \in \text{class } i} (x_k - \mu_i)(x_k - \mu_i)^t

The between-class scatter matrix S_B is

S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^t

Its maximum rank is c - 1
Multiple Discriminant Analysis (MDA)
J(V) = \frac{\det(V^t S_B V)}{\det(V^t S_W V)}

First solve the generalized eigenvalue problem:

S_B v = \lambda S_W v

There are at most c - 1 nonzero eigenvalues (since the rank of S_B is at most c - 1)


Let v1, v2 ,…, vc-1 be the corresponding eigenvectors
The optimal projection matrix V to a subspace of
dimension k is given by the eigenvectors
corresponding to the largest k eigenvalues
Thus can project to a subspace of dimension at
most c-1
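A sketch of MDA using scipy's generalized symmetric eigensolver (function and variable names are mine; it assumes S_W is nonsingular):

import numpy as np
from scipy.linalg import eigh

def mda_projection(X, labels, k):
    """Projection matrix V (d x k) from the top-k generalized eigenvectors of
    S_B v = lambda S_W v; k should be at most c - 1."""
    d = X.shape[1]
    mu = X.mean(axis=0)                                  # total mean of all samples
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
        SB += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
    eigvals, eigvecs = eigh(SB, SW)                      # generalized eigenvalue problem
    order = np.argsort(eigvals)[::-1]                    # largest eigenvalues first
    return eigvecs[:, order[:k]]

Samples are then projected with Y = X @ V, giving a representation with at most c - 1 dimensions.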
FDA and MDA Drawbacks
Reduces dimension only to at most k = c - 1 (unlike PCA)
For complex data, projection to even the best line may
result in non-separable projected samples
Will fail:
1. If J(v) is always 0: happens when µ1 = µ2 (PCA performs reasonably well here)
2. If J(v) is always small: classes have large overlap when projected to any line (PCA will also fail)
