Fisher Linear Discriminant
The problem with maximizing only the distance between the projected means $|\tilde\mu_1 - \tilde\mu_2|$ is that it does not consider the variance of the classes.
[Figure: projections of two classes onto a line; the projected means $\tilde\mu_1$ and $\tilde\mu_2$ are well separated, but one class has small variance and the other large variance, so the projected samples still overlap.]
Fisher Linear Discriminant
We need to normalize $|\tilde\mu_1 - \tilde\mu_2|$ by a factor which is proportional to variance.

Have samples $z_1, \dots, z_n$. The sample mean is
$$\mu_z = \frac{1}{n} \sum_{i=1}^{n} z_i$$

Define their scatter as
$$s = \sum_{i=1}^{n} (z_i - \mu_z)^2$$

Thus scatter is just the sample variance multiplied by $n$: scatter measures the same thing as variance, the spread of data around the mean; scatter is just on a different scale than variance.

[Figure: two datasets with larger scatter and smaller scatter.]
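The relation between scatter and variance is easy to check numerically; a minimal sketch (the 1D samples are made up for illustration):

```python
import numpy as np

z = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up 1D samples
n = len(z)

mu_z = z.mean()                      # sample mean
scatter = np.sum((z - mu_z) ** 2)    # scatter: sum of squared deviations
var = z.var()                        # biased sample variance (divides by n)

# scatter is exactly n times the (biased) sample variance
assert np.isclose(scatter, n * var)
print(scatter)  # 32.0
```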
Fisher Linear Discriminant
Fisher Solution: normalize $|\tilde\mu_1 - \tilde\mu_2|$ by scatter.

Let $y_i = v^t x_i$, i.e. the $y_i$'s are the projected samples.

The scatter for projected samples of class 1 is
$$\tilde s_1^2 = \sum_{y_i \in \text{Class 1}} (y_i - \tilde\mu_1)^2$$

The scatter for projected samples of class 2 is
$$\tilde s_2^2 = \sum_{y_i \in \text{Class 2}} (y_i - \tilde\mu_2)^2$$
Fisher Linear Discriminant
We need to normalize by both the scatter of class 1 and the scatter of class 2. Thus the Fisher linear discriminant is to project on the line in the direction $v$ which maximizes
$$J(v) = \frac{(\tilde\mu_1 - \tilde\mu_2)^2}{\tilde s_1^2 + \tilde s_2^2}$$

- numerator: we want the projected means to be far from each other
- denominator: we want the scatter in class 1 to be as small as possible, i.e. samples of class 1 cluster around the projected mean $\tilde\mu_1$, and likewise we want the scatter in class 2 small, with samples of class 2 clustered around $\tilde\mu_2$
Fisher Linear Discriminant
$$J(v) = \frac{(\tilde\mu_1 - \tilde\mu_2)^2}{\tilde s_1^2 + \tilde s_2^2}$$

If we find $v$ which makes $J(v)$ large, we are guaranteed that the classes are well separated:

- a large numerator means the projected means are far from each other
- small $\tilde s_1$ implies that the projected samples of class 1 are clustered around the projected mean $\tilde\mu_1$
- small $\tilde s_2$ implies that the projected samples of class 2 are clustered around the projected mean $\tilde\mu_2$
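The criterion $J(v)$ can be evaluated directly from its definition; a minimal numpy sketch (the two classes and the candidate directions are made-up illustration data):

```python
import numpy as np

# made-up 2D samples for two classes
c1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
c2 = np.array([[4.0, 0.0], [5.0, 1.0], [6.0, 1.0]])

def J(v, c1, c2):
    """Fisher criterion: squared distance of projected means over summed scatter."""
    y1, y2 = c1 @ v, c2 @ v           # projected samples
    m1, m2 = y1.mean(), y2.mean()     # projected means
    s1 = np.sum((y1 - m1) ** 2)       # scatter of class 1
    s2 = np.sum((y2 - m2) ** 2)       # scatter of class 2
    return (m1 - m2) ** 2 / (s1 + s2)

# a direction that separates the classes scores far higher than a poor one
print(J(np.array([1.0, -1.0]), c1, c2))  # 18.75
print(J(np.array([1.0, 1.0]), c1, c2))   # much smaller
```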
Fisher Linear Discriminant Derivation
All we need to do now is to express $J$ explicitly as a function of $v$ and maximize it. This is straightforward, but needs linear algebra and calculus.

$$J(v) = \frac{(\tilde\mu_1 - \tilde\mu_2)^2}{\tilde s_1^2 + \tilde s_2^2}$$

Define the separate class scatter matrices $S_1$ and $S_2$ for classes 1 and 2. These measure the scatter of the original samples $x_i$ (before projection):

$$S_1 = \sum_{x_i \in \text{Class 1}} (x_i - \mu_1)(x_i - \mu_1)^t, \qquad S_2 = \sum_{x_i \in \text{Class 2}} (x_i - \mu_2)(x_i - \mu_2)^t$$
Fisher Linear Discriminant Derivation
Now define the within-class scatter matrix
$$S_W = S_1 + S_2$$

Recall that
$$\tilde s_1^2 = \sum_{y_i \in \text{Class 1}} (y_i - \tilde\mu_1)^2$$

Using $y_i = v^t x_i$ and $\tilde\mu_1 = v^t \mu_1$:

$$\tilde s_1^2 = \sum_{y_i \in \text{Class 1}} (v^t x_i - v^t \mu_1)^2 = \sum_{y_i \in \text{Class 1}} v^t (x_i - \mu_1)(x_i - \mu_1)^t v = v^t S_1 v$$
Fisher Linear Discriminant Derivation
Similarly, $\tilde s_2^2 = v^t S_2 v$. Therefore
$$\tilde s_1^2 + \tilde s_2^2 = v^t S_1 v + v^t S_2 v = v^t S_W v$$

Let's rewrite the separation of the projected means:
$$(\tilde\mu_1 - \tilde\mu_2)^2 = (v^t \mu_1 - v^t \mu_2)^2 = v^t (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t v = v^t S_B v$$

where we define the between-class scatter matrix
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t$$

$S_B$ measures the separation between the means of the two classes (before projection).
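Both identities, $\tilde s_k^2 = v^t S_k v$ and $(\tilde\mu_1 - \tilde\mu_2)^2 = v^t S_B v$, can be sanity-checked numerically; a sketch with randomly generated (made-up) samples and an arbitrary direction:

```python
import numpy as np

rng = np.random.default_rng(0)
c1 = rng.normal([0, 0], 1.0, size=(20, 2))   # made-up class 1 samples
c2 = rng.normal([3, 1], 1.0, size=(25, 2))   # made-up class 2 samples
v = np.array([1.0, -0.5])                    # arbitrary projection direction

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
S1 = (c1 - mu1).T @ (c1 - mu1)               # class 1 scatter matrix
S2 = (c2 - mu2).T @ (c2 - mu2)               # class 2 scatter matrix
SB = np.outer(mu1 - mu2, mu1 - mu2)          # between-class scatter matrix

y1, y2 = c1 @ v, c2 @ v                      # projected samples
s1_sq = np.sum((y1 - y1.mean()) ** 2)        # projected scatter, class 1
s2_sq = np.sum((y2 - y2.mean()) ** 2)        # projected scatter, class 2

assert np.isclose(s1_sq, v @ S1 @ v)                         # s1^2 = v^t S1 v
assert np.isclose(s2_sq, v @ S2 @ v)                         # s2^2 = v^t S2 v
assert np.isclose((y1.mean() - y2.mean()) ** 2, v @ SB @ v)  # (m1-m2)^2 = v^t SB v
```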
Fisher Linear Discriminant Derivation
Thus our objective function can be written
$$J(v) = \frac{(\tilde\mu_1 - \tilde\mu_2)^2}{\tilde s_1^2 + \tilde s_2^2} = \frac{v^t S_B v}{v^t S_W v}$$

Maximize $J(v)$ by taking the derivative w.r.t. $v$ and setting it to 0:

$$\frac{d}{dv} J(v) = \frac{\left(\frac{d}{dv}\, v^t S_B v\right) v^t S_W v \;-\; v^t S_B v \left(\frac{d}{dv}\, v^t S_W v\right)}{(v^t S_W v)^2} = \frac{2 S_B v \,(v^t S_W v) - 2 S_W v \,(v^t S_B v)}{(v^t S_W v)^2} = 0$$
Fisher Linear Discriminant Derivation
Need to solve
$$S_B v \,(v^t S_W v) - S_W v \,(v^t S_B v) = 0$$

Dividing by $v^t S_W v$:
$$S_B v - \frac{v^t S_B v}{v^t S_W v}\, S_W v = 0$$

With the scalar $\lambda = \frac{v^t S_B v}{v^t S_W v}$ this becomes
$$S_B v = \lambda S_W v$$

a generalized eigenvalue problem.
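The generalized problem $S_B v = \lambda S_W v$ can also be solved numerically, e.g. with `scipy.linalg.eigh` (which handles the symmetric generalized case); a sketch on made-up data, useful as a check against the closed form:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
c1 = rng.normal([0, 0], [1.0, 0.3], size=(30, 2))  # made-up class 1 samples
c2 = rng.normal([2, 1], [1.0, 0.3], size=(30, 2))  # made-up class 2 samples

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
SW = (c1 - mu1).T @ (c1 - mu1) + (c2 - mu2).T @ (c2 - mu2)
SB = np.outer(mu1 - mu2, mu1 - mu2)

# eigh(a, b) solves a v = lambda b v for symmetric a, positive definite b;
# eigenvalues come back in ascending order, so the last column is the maximizer
w, V = eigh(SB, SW)
v = V[:, -1]

# the top eigenvector matches the closed-form direction inv(SW) @ (mu1 - mu2)
v_closed = np.linalg.solve(SW, mu1 - mu2)
cos = abs(v @ v_closed) / (np.linalg.norm(v) * np.linalg.norm(v_closed))
print(cos)  # ~ 1.0
```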
Fisher Linear Discriminant Derivation
If $S_W$ has full rank (the inverse exists), we can convert this to a standard eigenvalue problem:
$$S_W^{-1} S_B v = \lambda v$$

But $S_B x$, for any vector $x$, points in the same direction as $\mu_1 - \mu_2$:
$$S_B x = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t x = (\mu_1 - \mu_2)\left[(\mu_1 - \mu_2)^t x\right] = \alpha \,(\mu_1 - \mu_2)$$

since $(\mu_1 - \mu_2)^t x = \alpha$ is a scalar. Thus we can solve the eigenvalue problem immediately:
$$v = S_W^{-1}(\mu_1 - \mu_2)$$
$$S_W^{-1} S_B \left[S_W^{-1}(\mu_1 - \mu_2)\right] = S_W^{-1}\left[\alpha\,(\mu_1 - \mu_2)\right] = \alpha \left[S_W^{-1}(\mu_1 - \mu_2)\right]$$
Fisher Linear Discriminant Example
Data:

- Class 1 has 5 samples: $c_1 = [(1,2),(2,3),(3,3),(4,5),(5,5)]$
- Class 2 has 6 samples: $c_2 = [(1,0),(2,1),(3,1),(3,2),(5,3),(6,5)]$

Arrange the data in 2 separate matrices:
$$c_1 = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{pmatrix}, \qquad c_2 = \begin{pmatrix} 1 & 0 \\ 2 & 1 \\ 3 & 1 \\ 3 & 2 \\ 5 & 3 \\ 6 & 5 \end{pmatrix}$$

Notice that PCA performs very poorly on this data because the direction of largest variance is not helpful for classification.
Fisher Linear Discriminant Example
First compute the mean for each class:
$$\mu_1 = \text{mean}(c_1) = \begin{bmatrix} 3 & 3.6 \end{bmatrix}, \qquad \mu_2 = \text{mean}(c_2) = \begin{bmatrix} 3.3 & 2 \end{bmatrix}$$

Compute the scatter matrices $S_1$ and $S_2$ for each class:
$$S_1 = 4\,\text{cov}(c_1) = \begin{pmatrix} 10 & 8.0 \\ 8.0 & 7.2 \end{pmatrix}, \qquad S_2 = 5\,\text{cov}(c_2) = \begin{pmatrix} 17.3 & 16 \\ 16 & 16 \end{pmatrix}$$

Within-class scatter:
$$S_W = S_1 + S_2 = \begin{pmatrix} 27.3 & 24 \\ 24 & 23.2 \end{pmatrix}$$

It has full rank, so we don't have to solve for eigenvalues. The inverse of $S_W$ is
$$S_W^{-1} = \text{inv}(S_W) = \begin{pmatrix} 0.39 & -0.41 \\ -0.41 & 0.47 \end{pmatrix}$$

Finally, the optimal line direction $v$:
$$v = S_W^{-1}(\mu_1 - \mu_2) = \begin{pmatrix} -0.79 \\ 0.89 \end{pmatrix}$$
Fisher Linear Discriminant Example
Notice that as long as the line has the right direction, its exact position does not matter.

The last step is to compute the actual 1D projections $y$; let's do it separately for each class. Since only the direction matters, $v$ can be rescaled, here to $v = \begin{bmatrix} -0.65 & 0.73 \end{bmatrix}$:

$$Y_1 = v^t c_1^t = \begin{bmatrix} -0.65 & 0.73 \end{bmatrix} \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 2 & 3 & 3 & 5 & 5 \end{pmatrix} = \begin{bmatrix} 0.81 & 0.89 & 0.24 & 1.05 & 0.40 \end{bmatrix}$$

$$Y_2 = v^t c_2^t = \begin{bmatrix} -0.65 & 0.73 \end{bmatrix} \begin{pmatrix} 1 & 2 & 3 & 3 & 5 & 6 \\ 0 & 1 & 1 & 2 & 3 & 5 \end{pmatrix} = \begin{bmatrix} -0.65 & -0.57 & -1.22 & -0.49 & -1.06 & -0.25 \end{bmatrix}$$

The projected samples of class 1 are all positive and those of class 2 are all negative, so the two classes are well separated on this line.
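The whole example can be reproduced in a few lines of numpy via the closed form (the slide's printed values are rounded, so the numbers here may differ slightly in the last digits):

```python
import numpy as np

c1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
c2 = np.array([[1, 0], [2, 1], [3, 1], [3, 2], [5, 3], [6, 5]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)      # [3, 3.6] and [3.33, 2]
S1 = (c1 - mu1).T @ (c1 - mu1)                   # class 1 scatter matrix
S2 = (c2 - mu2).T @ (c2 - mu2)                   # class 2 scatter matrix
SW = S1 + S2                                     # [[27.3, 24], [24, 23.2]]

v = np.linalg.solve(SW, mu1 - mu2)               # optimal direction inv(SW)(mu1 - mu2)
v /= np.linalg.norm(v)                           # rescale; only the direction matters

Y1, Y2 = c1 @ v, c2 @ v                          # 1D projections of each class
print(v)                                         # ~ [-0.66, 0.75]
print(Y1.min() > Y2.max())                       # True: the classes are separated
```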
Multiple Discriminant Analysis (MDA)
Can generalize FLD to multiple classes. In the case of $c$ classes, we can reduce the dimensionality to 1, 2, 3, ..., or $c-1$ dimensions. Project sample $x_i$ to a linear subspace: $y_i = V^t x_i$, where $V$ is called the projection matrix.
Multiple Discriminant Analysis (MDA)
Objective function:
$$J(V) = \frac{\det(V^t S_B V)}{\det(V^t S_W V)}$$

Let $n_i$ be the number of samples of class $i$, let $\mu_i$ be the sample mean of class $i$, and let $\mu$ be the total mean of all samples:
$$\mu_i = \frac{1}{n_i} \sum_{x_k \in \text{class } i} x_k, \qquad \mu = \frac{1}{n} \sum_{i} x_i$$

The within-class scatter matrix $S_W$ is
$$S_W = \sum_{i=1}^{c} S_i = \sum_{i=1}^{c} \sum_{x_k \in \text{class } i} (x_k - \mu_i)(x_k - \mu_i)^t$$

The between-class scatter matrix $S_B$ is
$$S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^t$$

Its maximum rank is $c - 1$.
Multiple Discriminant Analysis (MDA)
$$J(V) = \frac{\det(V^t S_B V)}{\det(V^t S_W V)}$$

First solve the generalized eigenvalue problem:
$$S_B v = \lambda S_W v$$

There are at most $c-1$ distinct solution eigenvalues. Let $v_1, v_2, \dots, v_{c-1}$ be the corresponding eigenvectors. The optimal projection matrix $V$ to a subspace of dimension $k$ is given by the eigenvectors corresponding to the largest $k$ eigenvalues. Thus we can project to a subspace of dimension at most $c-1$.
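The steps above can be sketched in numpy/scipy on made-up 3-class data (class means and sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
# made-up data: c = 3 classes in 4 dimensions, 30 samples each
classes = [rng.normal(m, 0.5, size=(30, 4))
           for m in ([0, 0, 0, 0], [2, 0, 1, 0], [0, 2, 0, 1])]

mu = np.vstack(classes).mean(axis=0)             # total mean of all samples
SW = np.zeros((4, 4))
SB = np.zeros((4, 4))
for ci in classes:
    mi = ci.mean(axis=0)                         # class mean
    SW += (ci - mi).T @ (ci - mi)                # within-class scatter
    SB += len(ci) * np.outer(mi - mu, mi - mu)   # between-class scatter

# generalized eigenproblem SB v = lambda SW v; eigenvalues ascend,
# so take the top c - 1 = 2 eigenvectors as the projection matrix V
w, V = eigh(SB, SW)
Vk = V[:, ::-1][:, :2]                           # projection matrix (4 x 2)

Y = np.vstack(classes) @ Vk                      # project all samples to 2D
print(Y.shape)  # (90, 2)
```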
FDA and MDA Drawbacks
Can reduce dimension only to $k \le c - 1$ (unlike PCA). For complex data, projection to even the best line may result in unseparable projected samples.

Will fail:

1. If $J(v)$ is always 0: happens if $\mu_1 = \mu_2$. (PCA performs reasonably well here.)
2. If $J(v)$ is always small: the classes have large overlap when projected onto any line. (PCA also fails in this case.)