
Generalized Discriminant Analysis

Chapter 12
Outline
• Flexible Discriminant Analysis (FDA)
• Penalized Discriminant Analysis (PDA)
• Mixture Discriminant Analysis (MDA)

Linear Discriminant Analysis
• According to the Bayes optimal classification mentioned in Chapter 2, the posterior probabilities are needed:
  posterior probability: $\Pr(G \mid X)$
• Assume:
  $f_k(x)$ — class-conditional density of $X$ in class $G = k$
  $\pi_k$ — prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$
• Bayes' theorem gives us the discriminant:
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$
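A minimal numerical sketch of this rule in Python (the two Gaussian class-conditional densities, priors, and shared covariance below are assumed toy values, not taken from the text):

import numpy as np
from scipy.stats import multivariate_normal

# Assumed toy setup: two classes in R^2 with Gaussian class-conditional densities.
priors = np.array([0.6, 0.4])                         # pi_k, summing to 1
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]  # class means (illustrative)
cov = np.eye(2)                                       # shared covariance, for simplicity

def posteriors(x):
    """Pr(G = k | X = x) = f_k(x) * pi_k / sum_l f_l(x) * pi_l."""
    f = np.array([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means])
    unnorm = f * priors
    return unnorm / unnorm.sum()

print(posteriors(np.array([1.0, 0.5])))               # posterior probability of each class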
Linear Discriminant Analysis
• Multivariate Gaussian density:
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}}\; e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}$$
• Comparing two classes $k$ and $l$, assume a common covariance matrix $\Sigma_k = \Sigma$ for all $k$:
$$\log \frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l}
= \log \frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)$$
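Because the log-odds are linear in x under the common-covariance assumption, classification reduces to comparing the linear discriminant scores $\delta_k(x) = x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k$. A small sketch, with all parameters assumed rather than estimated:

import numpy as np

def lda_scores(x, means, cov, priors):
    """Linear discriminant scores delta_k(x); the largest score gives the predicted class."""
    cov_inv = np.linalg.inv(cov)
    return np.array([x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
                     for mu, pi in zip(means, priors)])

# Assumed toy parameters: two classes in R^2 sharing one covariance matrix.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.5, 0.5]
print(np.argmax(lda_scores(np.array([1.0, 0.5]), means, cov, priors)))  # predicted class index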
Virtues and Failings of LDA
• Simple prototype (centroid) classifier
  – New observation classified into the class with the closest centroid
  – But uses Mahalanobis distance
• Simple decision rules based on linear decision boundaries
• Estimated Bayes classifier for Gaussian class conditionals
  – But data might not be Gaussian
• Provides a low-dimensional view of the data
  – Using discriminant functions as coordinates
• Often produces the best classification results
  – Simplicity and low variance in estimation
Virtues and Failings of LDA
• LDA may fail in a number of situations
  – Often linear boundaries fail to separate classes
  – With large N, may estimate a quadratic decision boundary
  – May want to model even more irregular (non-linear) boundaries
• A single prototype per class may not be sufficient
• May have many (correlated) predictors for digitized analog signals
  – Too many parameters estimated with high variance, and the performance suffers
  – May want to regularize
Generalization of LDA
• Flexible Discriminant Analysis (FDA)
  – LDA in an enlarged space of predictors via basis expansions
• Penalized Discriminant Analysis (PDA)
  – With too many predictors, do not want to expand the set: already too large
  – Fit an LDA model with coefficients penalized to be smooth/coherent in the spatial domain
  – With a large number of predictors, could use penalized FDA
• Mixture Discriminant Analysis (MDA)
  – Model each class by a mixture of two or more Gaussians with different centroids, all sharing the same covariance matrix
  – Allows for subspace reduction
Flexible Discriminant Analysis
• Suppose $\theta : \mathcal{G} \mapsto \mathbb{R}^1$ is a function that assigns scores to the classes, such that the transformed class labels are optimally predicted by linear regression on $X$.
• Training data: $(x_i, g_i),\; i = 1, 2, \ldots, N$
• Objective function:
$$\min_{\beta,\,\theta} \sum_{i=1}^{N} \left(\theta(g_i) - x_i^T \beta\right)^2$$
Flexible Discriminant Analysis
• More generally, we define $L$ independent scorings for the class labels, $\theta_1, \theta_2, \ldots, \theta_L$, and corresponding linear maps $\eta_l(X) = X^T \beta_l,\; l = 1, 2, \ldots, L$.
• Objective function (average squared residual):
$$ASR = \frac{1}{N} \sum_{l=1}^{L} \sum_{i=1}^{N} \left(\theta_l(g_i) - x_i^T \beta_l\right)^2$$

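A direct transcription of the ASR criterion, assuming the scores and coefficients are already collected into matrices (the array shapes are this sketch's convention):

import numpy as np

def asr(theta, X, beta):
    """Average squared residual over the L scorings.
    theta: (N, L) matrix whose column l holds theta_l(g_i)
    X:     (N, p) predictor matrix
    beta:  (p, L) coefficient matrix whose column l is beta_l
    """
    N = X.shape[0]
    resid = theta - X @ beta      # residuals theta_l(g_i) - x_i^T beta_l
    return np.sum(resid ** 2) / N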
Flexible Discriminant Analysis

• Linear regression on derived responses for the K-class problem
  – Define indicator variables for each class (K in all)
  – Use the indicator functions as responses to create a set of Y variables
• Obtain mutually orthogonal score functions as discriminant (canonical) variables
• Classify into the nearest class centroid in the space of fitted scores $\eta_l(x) = x^T \beta_l$
  – Mahalanobis distance of a test point x to the k-th class centroid

Flexible Discriminant Analysis
• We can replace the linear regression fits $\eta_l(x) = x^T \beta_l$ by non-parametric fits, e.g. generalized additive fits, spline functions, MARS models, etc., with a regularizer or kernel regression, and possibly reduced-rank regression.
• Mahalanobis distance of a test point x to the k-th class centroid:
$$\delta_J(x, \hat{\mu}_k) = \sum_{l=1}^{K-1} w_l \left(\hat{\eta}_l(x) - \bar{\eta}_l^{\,k}\right)^2 + D(x)$$
where
  – $\bar{\eta}_l^{\,k}$ = average of $\hat{\eta}_l(x_i)$ in class k
  – $D(x)$ does not depend on k
  – $r_l^2$ = residual mean square of the l-th optimal score, and $w_l = 1 / \left[r_l^2 (1 - r_l^2)\right]$

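A sketch of the resulting classification rule in the fitted score space; the arrays eta_x, eta_bar, and r2 are assumed to come from the regression step above:

import numpy as np

def fda_classify(eta_x, eta_bar, r2):
    """Assign x to the class with the smallest weighted distance in score space.
    eta_x:   (K-1,)   fitted scores eta_hat_l(x) for the test point
    eta_bar: (K, K-1) class means of the fitted scores
    r2:      (K-1,)   residual mean square of the l-th optimal score
    """
    w = 1.0 / (r2 * (1.0 - r2))                      # w_l = 1 / (r_l^2 (1 - r_l^2))
    dist = ((eta_x - eta_bar) ** 2 * w).sum(axis=1)  # D(x) is constant in k, so it is dropped
    return int(np.argmin(dist))                      # index of the nearest class centroid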
Computation of FDA
1. Multivariate nonparametric regression
2. Optimal scores
3. Update the model from step 1 using the optimal
scores

Computing the FDA Estimates
• For the classes $g_i$ we define an $N \times K$ indicator response matrix $Y$, such that $y_{ik} = 1$ if $g_i = k$, and $y_{ik} = 0$ otherwise.
• Procedure:
  1. Multivariate nonparametric regression. Fit a multiresponse, adaptive nonparametric regression of Y on X, giving fitted values $\hat{Y}$. Let $S_\lambda$ be the linear operator that fits the final chosen model, and $\eta(x)$ be the vector of fitted regression functions.
  2. Optimal scores. Compute the eigen-decomposition of $Y^T \hat{Y} = Y^T S_\lambda Y$, where the eigenvectors $\Theta$ are normalized: $\Theta^T D_\pi \Theta = I$. Here $D_\pi = Y^T Y / N$ is a diagonal matrix of the estimated class prior probabilities.
  3. Update the model from step 1 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta(x)$.
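A rough numpy sketch of the three steps, with plain multiresponse linear regression standing in for the adaptive nonparametric fit; that substitution, the 0-based integer labels in g, and the use of scipy.linalg.eigh for the generalized eigenproblem are choices of this sketch, not prescribed by the text:

import numpy as np
from scipy.linalg import eigh, lstsq

def fda_optimal_scores(X, g, K):
    """Optimal scoring for FDA; g holds integer class labels 0, ..., K-1."""
    N = X.shape[0]
    # Indicator response matrix Y: y_ik = 1 if g_i = k, else 0.
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0

    # Step 1: multiresponse regression of Y on X (with intercept); Yhat = S Y.
    X1 = np.hstack([np.ones((N, 1)), X])
    B, *_ = lstsq(X1, Y)
    Yhat = X1 @ B

    # Step 2: optimal scores. Generalized eigenproblem (Y^T Yhat) Theta = D_pi Theta Lambda,
    # with eigenvectors normalized so that Theta^T D_pi Theta = I.
    D_pi = (Y.T @ Y) / N
    evals, Theta = eigh(Y.T @ Yhat / N, D_pi)
    order = np.argsort(evals)[::-1]          # keep the leading scores first
    Theta = Theta[:, order]

    # Step 3: update the fitted functions, eta(x) <- Theta^T eta(x), i.e. rotate the coefficients.
    return Theta, B @ Theta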
Speech Recognition Data
• K = 11 classes
  – spoken vowel sounds
• p = 10 predictors extracted from digitized speech
• FDA uses adaptive additive-spline regression (BRUTO in S-plus)
• FDA/MARS uses multivariate adaptive regression splines; degree = 2 allows pairwise products

LDA Vs. FDA/BRUTO

Penalized Discriminant Analysis
PDA is a regularized discriminant analysis on an enlarged set of predictors via a basis expansion h(x):
$$ASR\left(\{\theta_l, \beta_l\}_{l=1}^{L}\right) = \frac{1}{N} \sum_{l=1}^{L} \left[ \sum_{i=1}^{N} \left(\theta_l(g_i) - h^T(x_i)\,\beta_l\right)^2 + \lambda\, \beta_l^T \Omega\, \beta_l \right]$$
• The choice of Ω depends on the problem; if we use $\eta_l(x) = h(x)^T \beta_l$, a smoothing constraint may be imposed on $\eta_l$ through Ω.
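For fixed scores $\theta_l$, each $\beta_l$ in the penalized criterion has a closed-form generalized-ridge solution; a minimal sketch (H, theta_l, Omega, and lam are assumed inputs built elsewhere):

import numpy as np

def pda_coefficients(H, theta_l, Omega, lam):
    """Minimize sum_i (theta_l(g_i) - h(x_i)^T beta)^2 + lam * beta^T Omega beta.
    H:       (N, M) matrix of basis expansions h(x_i)^T
    theta_l: (N,)   scores theta_l(g_i)
    Omega:   (M, M) penalty matrix
    """
    return np.linalg.solve(H.T @ H + lam * Omega, H.T @ theta_l)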
Penalized Discriminant Analysis
• PDA enlarges the predictors to h(x).
• Use LDA in the enlarged space, with the penalized Mahalanobis distance
$$D(x, \mu) = \left(h(x) - h(\mu)\right)^T \left(\Sigma_W + \lambda\Omega\right)^{-1} \left(h(x) - h(\mu)\right),$$
with $\Sigma_W$ the within-class covariance.

Penalized Discriminant Analysis

• Decompose the classification subspace using the penalized metric:
$$\max_{u}\; u^T \Sigma_{Bet}\, u \quad \text{subject to} \quad u^T \left(\Sigma_W + \lambda\Omega\right) u = 1$$

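This is a generalized eigenproblem, $\Sigma_{Bet}\, u = \mu\, (\Sigma_W + \lambda\Omega)\, u$, and scipy.linalg.eigh with two matrix arguments returns eigenvectors normalized exactly to the constraint $u^T(\Sigma_W + \lambda\Omega)u = 1$; a sketch with all matrices assumed given:

import numpy as np
from scipy.linalg import eigh

def penalized_canonical_variates(Sigma_Bet, Sigma_W, Omega, lam):
    """Directions maximizing u^T Sigma_Bet u subject to u^T (Sigma_W + lam*Omega) u = 1."""
    evals, U = eigh(Sigma_Bet, Sigma_W + lam * Omega)
    order = np.argsort(evals)[::-1]          # leading canonical directions first
    return evals[order], U[:, order]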
USPS Digit Recognition

Digit Recognition: LDA vs. PDA

PDA Canonical Variates

Mixture Discriminant Analysis
• The class-conditional densities are modeled as mixtures of Gaussians
  – Possibly a different number of components in each class
  – Estimate the centroids and mixing proportions in each subclass by maximizing the joint likelihood P(G, X)
  – EM algorithm for MLE
• Could use penalized estimation

Mixture Discriminant Analysis
• A Gaussian mixture model for the k-th class (with $R_k$ subclasses, mixing proportions $\pi_{kj}$ summing to 1, and a covariance $\Sigma$ shared by all subclasses):
$$P(X \mid G = k) = \sum_{j=1}^{R_k} \pi_{kj}\, \phi(X; \mu_{kj}, \Sigma)$$
• The posterior, with class priors $\Pi_k$:
$$\Pr(G = k \mid X = x) = \frac{\Pi_k \sum_{j=1}^{R_k} \pi_{kj}\, \phi(x; \mu_{kj}, \Sigma)}{\sum_{l} \Pi_l \sum_{j=1}^{R_l} \pi_{lj}\, \phi(x; \mu_{lj}, \Sigma)}$$

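A direct sketch of the mixture class densities and the posterior (Pi, mix, mus, and cov are assumed to be supplied, e.g. by an EM fit):

import numpy as np
from scipy.stats import multivariate_normal

def mda_posteriors(x, Pi, mix, mus, cov):
    """Posterior class probabilities under Gaussian-mixture class densities.
    Pi:  (K,)  class priors Pi_k
    mix: list of length K; mix[k] holds the mixing proportions pi_kj of class k
    mus: list of length K; mus[k] holds the subclass centroids mu_kj of class k
    cov: (p, p) covariance shared by every subclass
    """
    class_dens = np.array([
        sum(w * multivariate_normal.pdf(x, mean=m, cov=cov)
            for w, m in zip(mix[k], mus[k]))
        for k in range(len(Pi))
    ])
    unnorm = np.asarray(Pi) * class_dens
    return unnorm / unnorm.sum()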
Mixture Discriminant Analysis
• Maximum likelihood: maximize the joint log-likelihood
$$\sum_{k=1}^{K} \sum_{g_i = k} \log\left[ \Pi_k \sum_{j=1}^{R_k} \pi_{kj}\, \phi(x_i; \mu_{kj}, \Sigma) \right]$$

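The same ingredients give the joint log-likelihood that the EM algorithm of the previous slide maximizes; a sketch with the same assumed inputs as above:

import numpy as np
from scipy.stats import multivariate_normal

def mda_log_likelihood(X, g, Pi, mix, mus, cov):
    """sum_k sum_{g_i = k} log[ Pi_k * sum_j pi_kj * phi(x_i; mu_kj, Sigma) ]."""
    ll = 0.0
    for x, k in zip(X, g):
        dens = sum(w * multivariate_normal.pdf(x, mean=m, cov=cov)
                   for w, m in zip(mix[k], mus[k]))
        ll += np.log(Pi[k] * dens)
    return ll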
FDA and MDA

Waveform Signal with Additive Gaussian Noise

Class 1: $X_j = U\, h_1(j) + (1-U)\, h_2(j) + \epsilon_j$

Class 2: $X_j = U\, h_1(j) + (1-U)\, h_3(j) + \epsilon_j$

Class 3: $X_j = U\, h_2(j) + (1-U)\, h_3(j) + \epsilon_j$

where $j = 1, \ldots, 21$, $U \sim \mathrm{Unif}(0,1)$, and $\epsilon_j$ is Gaussian noise, with

$h_1(j) = \max(6 - |j - 11|,\ 0)$

$h_2(j) = h_1(j - 4)$

$h_3(j) = h_1(j + 4)$
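A small simulation sketch of this setup (the standard-normal noise, random seed, and sample layout are assumptions of the sketch):

import numpy as np

rng = np.random.default_rng(0)
j = np.arange(1, 22)                        # j = 1, ..., 21

h1 = np.maximum(6 - np.abs(j - 11), 0)
h2 = np.maximum(6 - np.abs(j - 15), 0)      # h2(j) = h1(j - 4)
h3 = np.maximum(6 - np.abs(j - 7), 0)       # h3(j) = h1(j + 4)

def waveform_sample(cls):
    """Draw one 21-dimensional waveform observation from class 1, 2, or 3."""
    u = rng.uniform()                       # U ~ Unif(0, 1)
    eps = rng.standard_normal(21)           # additive Gaussian noise epsilon_j (assumed standard normal)
    a, b = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}[cls]
    return u * a + (1 - u) * b + eps

X = np.vstack([waveform_sample(c) for c in (1, 2, 3)])   # one sample per class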

Waveform Data Results

The End

