
9. Hierarchical Clustering
• Often, clusters are not simply disjoint, but may have subclusters, which in turn have sub-subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters
• The first is a partition into n clusters, each one containing exactly one sample
• The second is a partition into n-1 clusters, the third into n-2, and so on, until the n-th, in which there is only one cluster containing all of the samples
• At level k in the sequence, c = n - k + 1


• Given any two samples x and x', they will be grouped together at some level, and once they are grouped at level k, they remain grouped at all higher levels
• Hierarchical clustering is naturally represented by a tree called a dendrogram
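As an illustration (not part of the original slides), the sketch below builds a dendrogram for a small synthetic data set; it assumes SciPy and Matplotlib are available, and the choice of average linkage is arbitrary here (the cluster-distance measures are discussed a few slides below).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)),    # one compact cloud of samples
               rng.normal(5, 1, (10, 2))])   # a second, well-separated cloud

Z = linkage(X, method='average')  # merge history: each row is (i, j, distance, new size)
dendrogram(Z)                     # tree of merges; heights are the dissimilarity levels
plt.ylabel('dissimilarity level')
plt.show()
```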

• The similarity values may help to determine whether the groupings are natural or forced, but if the values are evenly distributed, no information can be gained
• Another representation is based on Venn diagrams


• Hierarchical clustering can be divided into agglomerative and divisive approaches

• Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by successively merging clusters

• Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters


Agglomerative hierarchical clustering

• The procedure terminates when the specified number of clusters has been obtained, and it returns the clusters as sets of points, rather than a mean or representative vector for each cluster. Terminating at c = 1 yields a dendrogram.
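A minimal from-scratch sketch of this procedure, assuming Euclidean distance and the nearest-neighbor cluster distance d_min introduced on the next slide; the function name and interface are illustrative, not from the text.

```python
import numpy as np

def agglomerate(X, c):
    """Basic agglomerative procedure: start from n singletons, merge until c clusters remain."""
    clusters = [[i] for i in range(len(X))]            # n singleton clusters (index lists)
    while len(clusters) > c:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # nearest-neighbor distance d_min between clusters a and b
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]                      # merge the two nearest clusters
        del clusters[b]
    return [X[idx] for idx in clusters]                 # clusters returned as sets of points
```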

• At any level, the distance between the nearest clusters can provide the dissimilarity value for that level
• To find the nearest clusters, one can use one of the following measures:
$d_{\min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \lVert x - x' \rVert$   (nearest neighbor)

$d_{\max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \lVert x - x' \rVert$   (farthest neighbor)

$d_{\mathrm{avg}}(D_i, D_j) = \dfrac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \lVert x - x' \rVert$

$d_{\mathrm{mean}}(D_i, D_j) = \lVert m_i - m_j \rVert$
These measures behave quite similarly if the clusters are hyperspherical and well separated.
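The hypothetical helpers below transcribe the four measures literally into NumPy, for two clusters Di and Dj given as (n_i, d) and (n_j, d) arrays of row vectors; they are written for clarity rather than speed.

```python
import numpy as np

def d_min(Di, Dj):    # nearest-neighbor distance
    return min(np.linalg.norm(x - xp) for x in Di for xp in Dj)

def d_max(Di, Dj):    # farthest-neighbor distance
    return max(np.linalg.norm(x - xp) for x in Di for xp in Dj)

def d_avg(Di, Dj):    # average of all pairwise distances
    return sum(np.linalg.norm(x - xp) for x in Di for xp in Dj) / (len(Di) * len(Dj))

def d_mean(Di, Dj):   # distance between the cluster means m_i and m_j
    return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))
```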


Nearest-neighbor algorithm
• When dmin is used, the algorithm is called the nearest-neighbor algorithm
• If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm
• If the data points are considered nodes of a graph, with edges forming a path between the nodes in the same subset Di, then merging Di and Dj corresponds to adding an edge between the nearest pair of nodes in Di and Dj
• The resulting graph never contains a closed loop, so it is a tree; if all subsets are linked, we have a spanning tree
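A hedged single-linkage sketch using SciPy (the library choice is an assumption, not part of the slides): fcluster with criterion='distance' returns the flat clusters obtained when merging stops as soon as the nearest-cluster distance d_min exceeds the threshold t.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (15, 2)),   # two synthetic groups
               rng.normal(4, 0.5, (15, 2))])

Z = linkage(X, method='single')                      # nearest-neighbor (d_min) merging
labels = fcluster(Z, t=1.0, criterion='distance')    # stop once d_min exceeds the threshold
print(labels)                                        # cluster label for each sample
```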

• The use of dmin as the distance measure in agglomerative clustering generates a minimal spanning tree
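This connection can be checked numerically: the n-1 merge distances produced by single-linkage agglomeration should coincide with the edge weights of the minimal spanning tree built on the complete graph of pairwise distances. The sketch below assumes SciPy and synthetic data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))

merge_heights = np.sort(linkage(X, method='single')[:, 2])  # the n-1 single-linkage merge distances
mst = minimum_spanning_tree(squareform(pdist(X)))           # MST of the complete distance graph
print(np.allclose(merge_heights, np.sort(mst.data)))        # expected to print True
```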

• Chaining effect: a defect of this distance measure (right side of the figure on the next slide)


Problem! [figure omitted: the right side shows the chaining effect]


The farthest neighbor algorithm


• When dmax is used, the algorithm is called the farthest-neighbor algorithm
• If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm
• This method discourages the growth of elongated clusters
• In the terminology of graph theory, every cluster constitutes a complete subgraph, and the distance between two clusters is determined by the most distant nodes in the two clusters

• When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters

• All the procedures involving minima or maxima are sensitive to outliers; the use of dmean or davg is a natural compromise
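A small comparison sketch (SciPy assumed): 'single', 'complete', 'average' and 'centroid' linkage correspond to dmin, dmax, davg and dmean respectively; the single bridging point added below is meant to show how the minimum- and maximum-based procedures can react to one stray sample differently from the averaged ones.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2)),
               [[2.5, 2.5]]])                 # a single point bridging the two groups

for method in ('single', 'complete', 'average', 'centroid'):
    labels = fcluster(linkage(X, method=method), t=2, criterion='maxclust')  # ask for 2 clusters
    sizes = np.bincount(labels)[1:]           # how many samples each cluster receives
    print(method, sizes, 'bridge point in cluster', labels[-1])
```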



The problem of the number of clusters

• Typically, the number of clusters is known
• When it is not, there are several ways to proceed
• When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering with c = 1, c = 2, c = 3, etc., and examine how the criterion changes
• Another approach is to set a threshold for the creation of a new cluster
• These approaches are similar to the model selection procedures typically used to determine the topology and number of states (e.g., clusters, parameters) of a model, given a specific application
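A sketch of the repeat-with-increasing-c approach, using the sum-of-squared-error criterion J_e as an assumed example of a criterion function (the slides do not prescribe one); a sharp flattening of J_e as c grows suggests a natural number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.4, (25, 2)) for m in (0, 4, 8)])  # three synthetic groups
Z = linkage(X, method='average')

for c in range(1, 7):
    labels = fcluster(Z, t=c, criterion='maxclust')   # cut the hierarchy into c clusters
    J_e = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
              for k in np.unique(labels))             # sum-of-squared-error criterion
    print(c, round(J_e, 1))   # look for the c where J_e stops dropping sharply
```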


13. Component Analysis (also Chap 3)

• Combine features to reduce the dimension of the feature space
• Linear combinations are simple to compute and analytically tractable
• Project high-dimensional data onto a lower-dimensional space
• Two classical approaches for finding an “optimal” linear transformation:
  • PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
  • MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense (a generalization of Fisher’s Linear Discriminant for two classes)


• PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
• The scatter matrix of the cloud of samples is, up to a factor of n, the same as the maximum-likelihood estimate of the covariance matrix
  • Unlike a class covariance matrix, however, the scatter matrix includes samples from all classes!
• The least-squares projection solution (maximum scatter) is simply the subspace defined by the d′ < d eigenvectors of the covariance matrix that correspond to its d′ largest eigenvalues
  • Because the scatter matrix is real and symmetric, the eigenvectors are orthogonal
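A minimal NumPy sketch of this recipe: center the data, form the scatter matrix, and keep the eigenvectors with the d′ largest eigenvalues; names such as pca_project and d_prime are illustrative.

```python
import numpy as np

def pca_project(X, d_prime):
    """Project (n, d) data X onto its d' principal directions."""
    Xc = X - X.mean(axis=0)                 # center the cloud of samples
    S = Xc.T @ Xc                           # scatter matrix (real and symmetric)
    eigvals, eigvecs = np.linalg.eigh(S)    # orthogonal eigenvectors, ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d_prime]]  # keep the d' largest
    return Xc @ W                           # d'-dimensional representation
```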


Fisher Linear Discriminant


• While PCA seeks directions efficient for representation, discriminant analysis seeks directions efficient for discrimination
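A brief sketch of Fisher's linear discriminant for two classes, assuming the within-class scatter matrix S_W is nonsingular; the projection direction is w = S_W^{-1}(m1 - m2).

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w = S_W^{-1} (m1 - m2) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m1 - m2)                          # direction maximizing separation
    return w / np.linalg.norm(w)
```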


Multiple Discriminant Analysis


• Generalization of Fisher’s Linear Discriminant for more than two classes
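A hedged sketch of MDA for c classes: the projection directions are the leading eigenvectors of S_W^{-1} S_B, of which at most c - 1 are useful; the interface (a list of per-class sample arrays) is an assumption for illustration.

```python
import numpy as np

def mda_directions(classes, d_prime):
    """Leading eigenvectors of S_W^{-1} S_B for a list of (n_k, d) class arrays."""
    m = np.vstack(classes).mean(axis=0)                          # overall mean
    S_W = sum((Xk - Xk.mean(axis=0)).T @ (Xk - Xk.mean(axis=0)) for Xk in classes)
    S_B = sum(len(Xk) * np.outer(Xk.mean(axis=0) - m, Xk.mean(axis=0) - m) for Xk in classes)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))  # generalized eigenproblem
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:d_prime]].real                      # at most c-1 useful directions
```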
