
Generalized Discriminant Analysis

Chapter 12
Outline
• Flexible Discriminant Analysis (FDA)
• Penalized Discriminant Analysis (PDA)
• Mixture Discriminant Analysis (MDA)

Linear Discriminant Analysis
• According to the Bayes optimal classification mentioned in Chapter 2, the posterior probabilities are needed:
  posterior probability: $\Pr(G \mid X)$
• Assume:
  $f_k(x)$ — class-conditional density of $X$ in class $G = k$
  $\pi_k$ — prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$
• Bayes' theorem gives us the discriminant:
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$
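A minimal numerical sketch of this rule in Python (the two Gaussian class-conditional densities, priors, and shared covariance below are assumed toy values, not taken from the text):

import numpy as np
from scipy.stats import multivariate_normal

# Assumed toy setup: two classes in R^2 with Gaussian class-conditional densities.
priors = np.array([0.6, 0.4])                         # pi_k, summing to 1
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]  # class means (illustrative)
cov = np.eye(2)                                       # shared covariance, for simplicity

def posteriors(x):
    """Pr(G = k | X = x) = f_k(x) * pi_k / sum_l f_l(x) * pi_l."""
    f = np.array([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means])
    unnorm = f * priors
    return unnorm / unnorm.sum()

print(posteriors(np.array([1.0, 0.5])))               # posterior probability of each class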
Linear Discriminant Analysis
• Multivariate Gaussian density:
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}}\; e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}$$
• Comparing two classes $k$ and $l$, assume a common covariance matrix $\Sigma_k = \Sigma$ for all $k$:
$$\log \frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l}
= \log \frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)$$
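Because the log-odds are linear in x under the common-covariance assumption, classification reduces to comparing the linear discriminant scores $\delta_k(x) = x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k$. A small sketch, with all parameters assumed rather than estimated:

import numpy as np

def lda_scores(x, means, cov, priors):
    """Linear discriminant scores delta_k(x); the largest score gives the predicted class."""
    cov_inv = np.linalg.inv(cov)
    return np.array([x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
                     for mu, pi in zip(means, priors)])

# Assumed toy parameters: two classes in R^2 sharing one covariance matrix.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.5, 0.5]
print(np.argmax(lda_scores(np.array([1.0, 0.5]), means, cov, priors)))  # predicted class index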
Virtues and Failings of LDA
• Simple prototype (centroid) classifier
  – New observation classified into the class with the closest centroid
  – But uses Mahalanobis distance
• Simple decision rules based on linear decision boundaries
• Estimated Bayes classifier for Gaussian class conditionals
  – But data might not be Gaussian
• Provides a low-dimensional view of the data
  – Using discriminant functions as coordinates
• Often produces the best classification results
  – Simplicity and low variance in estimation
Virtues and Failings of LDA
• LDA may fail in a number of situations
  – Often linear boundaries fail to separate classes
  – With large N, may estimate a quadratic decision boundary
  – May want to model even more irregular (non-linear) boundaries
• A single prototype per class may not be sufficient
• May have many (correlated) predictors for digitized analog signals
  – Too many parameters estimated with high variance, and the performance suffers
  – May want to regularize
Generalization of LDA
• Flexible Discriminant Analysis (FDA)
  – LDA in an enlarged space of predictors via basis expansions
• Penalized Discriminant Analysis (PDA)
  – With too many predictors, do not want to expand the set: already too large
  – Fit an LDA model with coefficients penalized to be smooth/coherent in the spatial domain
  – With a large number of predictors, could use penalized FDA
• Mixture Discriminant Analysis (MDA)
  – Model each class by a mixture of two or more Gaussians with different centroids, all sharing the same covariance matrix
  – Allows for subspace reduction
Flexible Discriminant Analysis
• Suppose $\theta : \mathcal{G} \mapsto \mathbb{R}^1$ is a function that assigns scores to the classes, such that the transformed class labels are optimally predicted by linear regression on $X$.
• Training data: $(x_i, g_i),\; i = 1, 2, \ldots, N$
• Objective function:
$$\min_{\beta,\,\theta} \sum_{i=1}^{N} \left(\theta(g_i) - x_i^T \beta\right)^2$$
Flexible Discriminant Analysis
• More generally, we define $L$ independent scorings for the class labels, $\theta_1, \theta_2, \ldots, \theta_L$, and corresponding linear maps $\eta_l(X) = X^T \beta_l,\; l = 1, 2, \ldots, L$.
• Objective function (average squared residual):
$$ASR = \frac{1}{N} \sum_{l=1}^{L} \sum_{i=1}^{N} \left(\theta_l(g_i) - x_i^T \beta_l\right)^2$$

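A direct transcription of the ASR criterion, assuming the scores and coefficients are already collected into matrices (the array shapes are this sketch's convention):

import numpy as np

def asr(theta, X, beta):
    """Average squared residual over the L scorings.
    theta: (N, L) matrix whose column l holds theta_l(g_i)
    X:     (N, p) predictor matrix
    beta:  (p, L) coefficient matrix whose column l is beta_l
    """
    N = X.shape[0]
    resid = theta - X @ beta      # residuals theta_l(g_i) - x_i^T beta_l
    return np.sum(resid ** 2) / N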
Flexible Discriminant Analysis

• Linear regression on derived responses for the K-class problem
  – Define indicator variables for each class (K in all)
  – Use the indicator functions as responses to create a set of Y variables
• Obtain mutually orthogonal score functions as discriminant (canonical) variables
• Classify into the nearest class centroid in the space of fitted scores $\eta_l(x) = x^T \beta_l$
  – Mahalanobis distance of a test point x to the k-th class centroid

Flexible Discriminant Analysis
• We can replace the linear regression fits $\eta_l(x) = x^T \beta_l$ by non-parametric fits, e.g. generalized additive fits, spline functions, MARS models, etc., with a regularizer or kernel regression, and possibly reduced-rank regression.
• Mahalanobis distance of a test point x to the k-th class centroid:
$$\delta_J(x, \hat{\mu}_k) = \sum_{l=1}^{K-1} w_l \left(\hat{\eta}_l(x) - \bar{\eta}_l^{\,k}\right)^2 + D(x)$$
where
  – $\bar{\eta}_l^{\,k}$ = average of $\hat{\eta}_l(x_i)$ in class k
  – $D(x)$ does not depend on k
  – $r_l^2$ = residual mean square of the l-th optimal score, and $w_l = 1 / \left[r_l^2 (1 - r_l^2)\right]$

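A sketch of the resulting classification rule in the fitted score space; the arrays eta_x, eta_bar, and r2 are assumed to come from the regression step above:

import numpy as np

def fda_classify(eta_x, eta_bar, r2):
    """Assign x to the class with the smallest weighted distance in score space.
    eta_x:   (K-1,)   fitted scores eta_hat_l(x) for the test point
    eta_bar: (K, K-1) class means of the fitted scores
    r2:      (K-1,)   residual mean square of the l-th optimal score
    """
    w = 1.0 / (r2 * (1.0 - r2))                      # w_l = 1 / (r_l^2 (1 - r_l^2))
    dist = ((eta_x - eta_bar) ** 2 * w).sum(axis=1)  # D(x) is constant in k, so it is dropped
    return int(np.argmin(dist))                      # index of the nearest class centroid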
Computation of FDA
1. Multivariate nonparametric regression
2. Optimal scores
3. Update the model from step 1 using the optimal
scores

Computing the FDA Estimates
• For the classes $g_i$ we define an $N \times K$ indicator response matrix $Y$, such that $y_{ik} = 1$ if $g_i = k$, and $y_{ik} = 0$ otherwise.
• Procedure:
  1. Multivariate nonparametric regression. Fit a multiresponse, adaptive nonparametric regression of Y on X, giving fitted values $\hat{Y}$. Let $S_\lambda$ be the linear operator that fits the final chosen model, and $\eta(x)$ be the vector of fitted regression functions.
  2. Optimal scores. Compute the eigen-decomposition of $Y^T \hat{Y} = Y^T S_\lambda Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_\pi \Theta = I$. Here $D_\pi = Y^T Y / N$ is a diagonal matrix of the estimated class prior probabilities.
  3. Update the model from step 1 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta(x)$.
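A rough numpy sketch of the three steps, with plain multiresponse linear regression standing in for the adaptive nonparametric fit; that substitution, the 0-based integer labels in g, and the use of scipy.linalg.eigh for the generalized eigenproblem are choices of this sketch, not prescribed by the text:

import numpy as np
from scipy.linalg import eigh, lstsq

def fda_optimal_scores(X, g, K):
    """Optimal scoring for FDA; g holds integer class labels 0, ..., K-1."""
    N = X.shape[0]
    # Indicator response matrix Y: y_ik = 1 if g_i = k, else 0.
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0

    # Step 1: multiresponse regression of Y on X (with intercept); Yhat = S Y.
    X1 = np.hstack([np.ones((N, 1)), X])
    B, *_ = lstsq(X1, Y)
    Yhat = X1 @ B

    # Step 2: optimal scores. Generalized eigenproblem (Y^T Yhat) Theta = D_pi Theta Lambda,
    # with eigenvectors normalized so that Theta^T D_pi Theta = I.
    D_pi = (Y.T @ Y) / N
    evals, Theta = eigh(Y.T @ Yhat / N, D_pi)
    order = np.argsort(evals)[::-1]          # keep the leading scores first
    Theta = Theta[:, order]

    # Step 3: update the fitted functions, eta(x) <- Theta^T eta(x), i.e. rotate the coefficients.
    return Theta, B @ Theta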
Speech Recognition Data
• K = 11 classes
  – spoken vowel sounds
• p = 10 predictors extracted from digitized speech
• FDA uses adaptive additive-spline regression (BRUTO in S-plus)
• FDA/MARS uses multivariate adaptive regression splines; degree = 2 allows pairwise products

LDA Vs. FDA/BRUTO

Penalized Discriminant Analysis
PDA is a regularized discriminant analysis on an enlarged set of predictors via a basis expansion h(x):
$$ASR\left(\{\theta_l, \beta_l\}_{l=1}^{L}\right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left(\theta_l(g_i) - h^T(x_i)\,\beta_l\right)^2 + \lambda\, \beta_l^T \Omega\, \beta_l \right]$$
• The choice of Ω depends on the problem; if we use $\eta_l(x) = h(x)^T \beta_l$, a smoothing constraint may be imposed on $\eta_l$ through Ω.
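For fixed scores $\theta_l$, each $\beta_l$ in the penalized criterion has a closed-form generalized-ridge solution; a minimal sketch (H, theta_l, Omega, and lam are assumed inputs built elsewhere):

import numpy as np

def pda_coefficients(H, theta_l, Omega, lam):
    """Minimize sum_i (theta_l(g_i) - h(x_i)^T beta)^2 + lam * beta^T Omega beta.
    H:       (N, M) matrix of basis expansions h(x_i)^T
    theta_l: (N,)   scores theta_l(g_i)
    Omega:   (M, M) penalty matrix
    """
    return np.linalg.solve(H.T @ H + lam * Omega, H.T @ theta_l)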
Penalized Discriminant Analysis
• PDA enlarges the predictors to h(x).
• Use LDA in the enlarged space, with the penalized Mahalanobis distance
$$D(x, \mu) = \left(h(x) - h(\mu)\right)^T \left(\Sigma_W + \lambda\Omega\right)^{-1} \left(h(x) - h(\mu)\right),$$
with $\Sigma_W$ the within-class covariance.

Penalized Discriminant Analysis

• Decompose the classification subspace using the penalized metric:
$$\max_{u}\; u^T \Sigma_{Bet}\, u \quad \text{subject to} \quad u^T \left(\Sigma_W + \lambda\Omega\right) u = 1$$

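This is a generalized eigenproblem, $\Sigma_{Bet}\, u = \mu\, (\Sigma_W + \lambda\Omega)\, u$, and scipy.linalg.eigh with two matrix arguments returns eigenvectors normalized exactly to the constraint $u^T(\Sigma_W + \lambda\Omega)u = 1$; a sketch with all matrices assumed given:

import numpy as np
from scipy.linalg import eigh

def penalized_canonical_variates(Sigma_Bet, Sigma_W, Omega, lam):
    """Directions maximizing u^T Sigma_Bet u subject to u^T (Sigma_W + lam*Omega) u = 1."""
    evals, U = eigh(Sigma_Bet, Sigma_W + lam * Omega)
    order = np.argsort(evals)[::-1]          # leading canonical directions first
    return evals[order], U[:, order]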
USPS Digit Recognition

Digit Recognition: LDA vs. PDA

PDA Canonical Variates

Mixture Discriminant Analysis
• The class-conditional densities are modeled as mixtures of Gaussians
  – Possibly a different number of components in each class
  – Estimate the centroids and mixing proportions in each subclass by maximizing the joint likelihood P(G, X)
  – EM algorithm for MLE
• Could use penalized estimation

Mixture Discriminant Analysis
• A Gaussian mixture model for the k-th class (with $R_k$ subclasses, mixing proportions $\pi_{kj}$ summing to 1, and a covariance $\Sigma$ shared by all subclasses):
$$P(X \mid G = k) = \sum_{j=1}^{R_k} \pi_{kj}\, \phi(X; \mu_{kj}, \Sigma)$$
• The posterior, with class priors $\Pi_k$:
$$\Pr(G = k \mid X = x) = \frac{\Pi_k \sum_{j=1}^{R_k} \pi_{kj}\, \phi(x; \mu_{kj}, \Sigma)}{\sum_{l} \Pi_l \sum_{j=1}^{R_l} \pi_{lj}\, \phi(x; \mu_{lj}, \Sigma)}$$

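A direct sketch of the mixture class densities and the posterior (Pi, mix, mus, and cov are assumed to be supplied, e.g. by an EM fit):

import numpy as np
from scipy.stats import multivariate_normal

def mda_posteriors(x, Pi, mix, mus, cov):
    """Posterior class probabilities under Gaussian-mixture class densities.
    Pi:  (K,)  class priors Pi_k
    mix: list of length K; mix[k] holds the mixing proportions pi_kj of class k
    mus: list of length K; mus[k] holds the subclass centroids mu_kj of class k
    cov: (p, p) covariance shared by every subclass
    """
    class_dens = np.array([
        sum(w * multivariate_normal.pdf(x, mean=m, cov=cov)
            for w, m in zip(mix[k], mus[k]))
        for k in range(len(Pi))
    ])
    unnorm = np.asarray(Pi) * class_dens
    return unnorm / unnorm.sum()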
Mixture Discriminant Analysis
• Maximum likelihood: maximize the joint log-likelihood
$$\sum_{k=1}^{K} \sum_{g_i = k} \log\left[ \Pi_k \sum_{j=1}^{R_k} \pi_{kj}\, \phi(x_i; \mu_{kj}, \Sigma) \right]$$

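The same ingredients give the joint log-likelihood that the EM algorithm of the previous slide maximizes; a sketch with the same assumed inputs as above:

import numpy as np
from scipy.stats import multivariate_normal

def mda_log_likelihood(X, g, Pi, mix, mus, cov):
    """sum_k sum_{g_i = k} log[ Pi_k * sum_j pi_kj * phi(x_i; mu_kj, Sigma) ]."""
    ll = 0.0
    for x, k in zip(X, g):
        dens = sum(w * multivariate_normal.pdf(x, mean=m, cov=cov)
                   for w, m in zip(mix[k], mus[k]))
        ll += np.log(Pi[k] * dens)
    return ll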
FDA and MDA

Waveform Signal with Additive Gaussian Noise

Class 1: $X_j = U\, h_1(j) + (1-U)\, h_2(j) + \epsilon_j$

Class 2: $X_j = U\, h_1(j) + (1-U)\, h_3(j) + \epsilon_j$

Class 3: $X_j = U\, h_2(j) + (1-U)\, h_3(j) + \epsilon_j$

where $j = 1, \ldots, 21$, $U \sim \mathrm{Unif}(0,1)$, and $\epsilon_j$ is Gaussian noise, with

$h_1(j) = \max(6 - |j - 11|,\ 0)$

$h_2(j) = h_1(j - 4)$

$h_3(j) = h_1(j + 4)$
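A small simulation sketch of this setup (the standard-normal noise, random seed, and sample layout are assumptions of the sketch):

import numpy as np

rng = np.random.default_rng(0)
j = np.arange(1, 22)                        # j = 1, ..., 21

h1 = np.maximum(6 - np.abs(j - 11), 0)
h2 = np.maximum(6 - np.abs(j - 15), 0)      # h2(j) = h1(j - 4)
h3 = np.maximum(6 - np.abs(j - 7), 0)       # h3(j) = h1(j + 4)

def waveform_sample(cls):
    """Draw one 21-dimensional waveform observation from class 1, 2, or 3."""
    u = rng.uniform()                       # U ~ Unif(0, 1)
    eps = rng.standard_normal(21)           # additive Gaussian noise epsilon_j (assumed standard normal)
    a, b = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}[cls]
    return u * a + (1 - u) * b + eps

X = np.vstack([waveform_sample(c) for c in (1, 2, 3)])   # one sample per class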

Waveform Data Results

The End

