
To appear, AAAI-2000: Workshop of Artificial Intelligence for Web Search, July 2000

Impact of Similarity Measures on Web-page Clustering


Alexander Strehl, Joydeep Ghosh, and Raymond Mooney

The University of Texas at Austin, Austin, TX, 78712-1084, USA


Email: strehl@ece.utexas.edu, ghosh@ece.utexas.edu, and mooney@cs.utexas.edu

Abstract

Clustering of web documents enables (semi-)automated categorization and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as Yahoo that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning), on high dimensional sparse data representing web documents. Performance is measured against a human-imposed classification into news categories and industry categories. We conduct a number of experiments and use t-tests to assure statistical significance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
Introduction

The increasing size and dynamic content of the world wide web has created a need for automated organization of web-pages. Document clusters can provide a structure for organizing large bodies of text for efficient browsing and searching. For this purpose, a web-page is typically represented as a vector consisting of the suitably normalized frequency counts of words or terms. Each document contains only a small percentage of all the words ever used in the web. If we consider each document as a multi-dimensional vector and then try to cluster documents based on their word contents, the problem differs from classic clustering scenarios in several ways. Document clustering data is high dimensional, characterized by a highly sparse word-document matrix with positive ordinal attribute values and a significant amount of outliers.

Copyright 2000, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Clustering has been widely studied in several disciplines, especially since the early 60's (Hartigan 1975). Some classic approaches include partitional methods such as k-means, hierarchical agglomerative clustering, unsupervised Bayes, and soft, statistical mechanics based techniques. Most classical techniques, and even fairly recent ones proposed in the data mining community (Clarans, Dbscan, Birch, Clique, Cure, WaveCluster etc. (Rastogi & Shim 1999)), are based on distances between the samples in the original vector space. Thus they are faced with the "curse of dimensionality" and the associated sparsity issues when dealing with very high dimensional data. Indeed, often, the performance of such clustering algorithms is demonstrated only on illustrative 2-dimensional examples.

When documents are represented by a bag of words, the resulting document-word matrix typically represents data in 1000+ dimensions. Several noteworthy attempts have emerged to efficiently cluster documents that are represented in such high dimensional space.¹ In (Dhillon & Modha 1999), the authors present a spherical k-means algorithm for document clustering. Graph-based clustering approaches that attempt to avoid the curse of dimensionality by transforming the problem formulation include (Karypis, Han, & Kumar 1999; Boley et al. 1999; Strehl & Ghosh 2000). Note that such methods use a variety of similarity (or distance) measures reported in the literature, and we are unaware of any solid comparative study across different similarity measures.

In this paper, we first compare similarity measures analytically and illustrate their semantics geometrically. Secondly, we propose an experimental methodology to compare high dimensional clusterings based on mutual information, entropy, and purity. We conduct a series of experiments on Yahoo news pages to evaluate the performance and cluster quality of four similarity measures (Euclidean, cosine, Pearson correlation, extended Jaccard) in combination with five algorithms (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning).

¹ There is also substantial work on categorizing such documents. Here, since at least some of the documents have labels, a variety of supervised or semi-supervised techniques can be used (Mooney & Roy 1999; Yang 1999).

Figure 1: Overview of a similarity based clustering framework.
Let n be the number of objects (web-pages) in the data and d the number of features (words, terms) for each sample $\mathbf{x}_j$ with $j \in \{1,\ldots,n\}$. The input data can be represented by a $d \times n$ word-document matrix $\mathbf{X}$ with the j-th column representing the sample $\mathbf{x}_j$. Hard clustering² assigns a label $\lambda_j$ to each d-dimensional sample $\mathbf{x}_j$, such that similar samples tend to get the same label. The number of distinct labels is k, the desired number of clusters. In general the labels are treated as nominals with no inherent order, though in some cases, such as self-organizing feature maps (SOFMs) or top-down recursive graph-bisection, the labeling may contain extra ordering information. Let $\mathcal{C}_\ell$ denote the set of all objects in the ℓ-th cluster ($\ell \in \{1,\ldots,k\}$), with $\mathbf{x}_j \in \mathcal{C}_\ell \Leftrightarrow \lambda_j = \ell$ and $n_\ell = |\mathcal{C}_\ell|$. Figure 1 gives an overview of a batch clustering process from a set of raw object descriptions $\mathcal{X}$ via the vector space description $\mathbf{X}$ and similarity space description $\mathbf{S}$ to the cluster labels $\lambda$: $(\mathcal{X} \in \mathcal{I}) \rightarrow (\mathbf{X} \in \mathcal{F} \subseteq \mathbb{R}^{d \times n}) \rightarrow (\mathbf{S} \in \mathcal{S} = [0,1]^{n \times n} \subseteq \mathbb{R}^{n \times n}) \rightarrow (\lambda \in \mathcal{O} = \{1,\ldots,k\}^n)$. The next section briefly describes the compared algorithms.
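To make the vector-space representation concrete, here is a minimal Python sketch (my own illustration with made-up toy documents, not code from the paper) that builds a small d x n word-count matrix X whose j-th column is the term-count vector of document j.

```python
import numpy as np

# Toy corpus: in the paper each "document" is a web-page's bag of words.
docs = ["stock market rose", "market fell on oil news", "oil stock news"]

tokens = [doc.split() for doc in docs]
vocab = sorted({w for t in tokens for w in t})   # the d distinct terms
index = {w: i for i, w in enumerate(vocab)}

n, d = len(docs), len(vocab)
X = np.zeros((d, n))                             # d x n word-document matrix
for j, t in enumerate(tokens):
    for w in t:
        X[index[w], j] += 1                      # raw term counts (column j = sample x_j)

print(vocab)
print(X)
```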

Algorithms

Random Baseline

As a baseline for comparing algorithms, we use clustering labels drawn from a uniform random distribution over the integers from 1 to k. The complexity of this algorithm is O(n).
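A minimal sketch of this baseline (my own illustration, not the authors' code):

```python
import numpy as np

def random_baseline(n, k, seed=0):
    """Assign each of the n samples a cluster label drawn uniformly from 1..k."""
    rng = np.random.default_rng(seed)
    return rng.integers(1, k + 1, size=n)   # O(n)

labels = random_baseline(n=966, k=20)
```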
Self-organizing Feature Map

We use a 1-dimensional SOFM as proposed by Kohonen (Kohonen 1995). To generate k clusters we use k cells in a line topology and train the network for m = 5000 epochs or 10 minutes (whichever comes first). All experiments are run on a dual processor 450 MHz Pentium, and for this clustering technique we use the SOFM implementation in the Matlab neural network tool-box. The resulting network is subsequently used to generate the label vector λ from the index of the most activated neuron for each sample. The complexity of this incremental algorithm is O(n·d·k·m) and is mostly determined by the number of epochs m and the number of samples n.
² In soft clustering, a record can belong to multiple clusters with different degrees of "association" (Kumar & Ghosh 1999).

Generalized k-means

We also employed the well-known k-means algorithm and three variations of it using non-Euclidean distance measures. The k-means algorithm is an iterative algorithm to minimize the least squares error criterion (Duda & Hart 1973). A cluster $\mathcal{C}_\ell$ is represented by its center $\mu_\ell$, the mean of all samples in $\mathcal{C}_\ell$. The centers are initialized with a random selection of k data objects. Each sample is then labeled with the index of the nearest or most similar center. In the following subsections we will describe four different semantics for closeness or similarity $s(\mathbf{x}_a, \mathbf{x}_b)$ of two objects $\mathbf{x}_a$ and $\mathbf{x}_b$. Subsequent re-computing of the mean for each cluster and re-assigning the cluster labels is iterated until convergence to a fixed labeling after m iterations. The complexity of this algorithm is O(n·d·k·m).
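The loop below is a sketch of generalized k-means with a pluggable similarity function (my own simplification, not the authors' implementation): assignment uses whichever similarity is supplied, while the centers are still updated as cluster means.

```python
import numpy as np

def generalized_kmeans(X, k, sim, max_iter=100, seed=0):
    """X: d x n data matrix (columns are samples x_j).
    sim(A, B): pairwise similarity between columns of A and columns of B,
    returned as a (#cols A) x (#cols B) matrix."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    centers = X[:, rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        new_labels = sim(X, centers).argmax(axis=1)   # assign to most similar center
        if np.array_equal(new_labels, labels):
            break                                     # converged to a fixed labeling
        labels = new_labels
        for ell in range(k):                          # recompute each cluster mean
            members = X[:, labels == ell]
            if members.shape[1] > 0:
                centers[:, ell] = members.mean(axis=1)
    return labels, centers

def cosine_sim(A, B):
    """Pairwise cosine similarity between the columns of A and the columns of B."""
    An = A / (np.linalg.norm(A, axis=0, keepdims=True) + 1e-12)
    Bn = B / (np.linalg.norm(B, axis=0, keepdims=True) + 1e-12)
    return An.T @ Bn
```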

Weighted Graph Partitioning

The objects to be clustered can be viewed as a set of vertices $\mathcal{V}$. Two web-pages $\mathbf{x}_a$ and $\mathbf{x}_b$ (or vertices $v_a$ and $v_b$) are connected with an undirected edge of positive weight $s(\mathbf{x}_a, \mathbf{x}_b)$, or $(a, b, s(\mathbf{x}_a, \mathbf{x}_b)) \in \mathcal{E}$. The cardinality of the set of edges $|\mathcal{E}|$ equals the number of non-zero similarities between all pairs of samples. A set of edges whose removal partitions a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ into k pairwise disjoint sub-graphs $\mathcal{G}_\ell = (\mathcal{V}_\ell, \mathcal{E}_\ell)$ is called an edge separator. Our objective is to find such a separator with a minimum sum of edge weights. While striving for the minimum cut objective, the number of objects in each cluster has to be kept approximately equal. We decided to use Opossum (Strehl & Ghosh 2000), which produces balanced (equal sized) clusters from the similarity matrix using multi-level multi-constraint graph partitioning (Karypis & Kumar 1998). Balanced clusters are desirable because each cluster represents an equally important share of the data. However, some natural classes may not be of equal size. By using a higher number of clusters we can account for multi-modal classes (e.g., the XOR problem) and clusters can be merged at a later stage. The most expensive step in this O(n²·d) technique is the computation of the n × n similarity matrix. In document clustering, sparsity can be induced by looking only at the v strongest edges or at the subgraph induced by pruning all edges except the v nearest neighbors for each vertex. Sparsity makes this approach feasible for large data-sets. In web-page clustering sparsity is induced by all non-Euclidean similarities proposed in this paper, and may be increased by a thresholding criterion.
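The graph-construction and sparsification step can be sketched as follows (my own illustration; the multi-level multi-constraint partitioning itself is delegated to Opossum/METIS and is not reproduced here). Only the v strongest edges per vertex are retained.

```python
import numpy as np

def sparse_similarity_graph(X, sim, v=20):
    """Build an n x n similarity (edge-weight) matrix over the columns of X and
    keep, for each vertex, only its v most similar neighbors (symmetrized)."""
    S = sim(X, X)                        # n x n similarities: the O(n^2 d) step
    np.fill_diagonal(S, 0.0)             # no self-edges
    n = S.shape[0]
    keep = np.zeros_like(S, dtype=bool)
    nn = np.argsort(-S, axis=1)[:, :v]   # indices of the v strongest edges per row
    rows = np.repeat(np.arange(n), v)
    keep[rows, nn.ravel()] = True
    keep |= keep.T                       # keep an edge if either endpoint selected it
    return np.where(keep, S, 0.0)        # sparsified edge-weight matrix
```

The resulting weighted graph would then be handed to a balanced graph partitioner; with cosine, correlation, or extended Jaccard similarity many entries are already zero, which is the sparsity effect described above.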

Hyper-graph Partitioning

A hyper-graph is a graph whose edges can connect more than two vertices (hyper-edges). The clustering problem is then formulated as finding the minimum cut of a hyper-graph. A minimum cut is the removal of the set of hyper-edges (with minimum edge weight) that separates the hyper-graph into k unconnected components. Again, an object $\mathbf{x}_j$ maps to a vertex $v_j$. Each word (feature) maps to a hyper-edge connecting all vertices with non-zero frequency count of this word. The weight of this hyper-edge is chosen to be the total number of occurrences in the data-set. Hence, the importance of a hyper-edge during partitioning is proportional to the occurrence of the corresponding word. The minimum cut of this hyper-graph into k unconnected components gives the desired clustering. We employ the hMetis package for partitioning. An advantage of this approach is that the clustering problem can be mapped to a graph problem without the explicit computation of similarity, which makes this approach computationally efficient with O(n·d·k), assuming a (close to) linear performing hyper-graph partitioner. However, sample-wise frequency information gets lost in this formulation since there is only one weight associated with a hyper-edge.
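A sketch of the hyper-graph construction (my own illustration; the actual partitioning is delegated to the hMetis package, whose invocation is not shown): each word becomes a hyper-edge over the documents containing it, weighted by its total occurrence count.

```python
import numpy as np

def build_hypergraph(X):
    """X: d x n word-document count matrix.
    Returns a list of (weight, vertex_list) hyper-edges, one per word that
    occurs in at least two documents."""
    hyperedges = []
    for word_idx in range(X.shape[0]):
        docs = np.nonzero(X[word_idx, :])[0]      # documents containing this word
        if len(docs) >= 2:                        # an edge needs at least two vertices
            weight = int(X[word_idx, :].sum())    # total occurrences in the data-set
            hyperedges.append((weight, docs.tolist()))
    return hyperedges
```

Note how the per-document counts disappear: only membership and a single edge weight survive, which is exactly the information loss mentioned above.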

Similarity Measures

Metric Distances

The Minkowski distances $L_p(\mathbf{x}_a, \mathbf{x}_b) = \left( \sum_{i=1}^{d} |x_{i,a} - x_{i,b}|^p \right)^{1/p}$ are the standard metrics for geometrical problems. For p = 1 (p = 2) we obtain the Manhattan (Euclidean) distance. For Euclidean space, we chose to relate distances d and similarities s using $s = e^{-d^2}$. Consequently, we define Euclidean [0,1] normalized similarity as $s^{(E)}(\mathbf{x}_a, \mathbf{x}_b) = e^{-\|\mathbf{x}_a - \mathbf{x}_b\|_2^2}$, which has important properties (as we will see in the discussion) that the commonly adopted $s(\mathbf{x}_a, \mathbf{x}_b) = 1/(1 + \|\mathbf{x}_a - \mathbf{x}_b\|_2)$ lacks.

Cosine Measure

Similarity can also be defined by the angle or cosine of the angle between two vectors. The cosine measure is given by $s^{(C)}(\mathbf{x}_a, \mathbf{x}_b) = \frac{\mathbf{x}_a^\dagger \mathbf{x}_b}{\|\mathbf{x}_a\|_2 \, \|\mathbf{x}_b\|_2}$ and captures a scale invariant understanding of similarity. An even stronger property is that the cosine similarity does not depend on the length: $s^{(C)}(\alpha \mathbf{x}_a, \mathbf{x}_b) = s^{(C)}(\mathbf{x}_a, \mathbf{x}_b)$ for $\alpha > 0$. This allows documents with the same composition but different totals to be treated identically, which makes this the most popular measure for text documents. Also, due to this property, samples can be normalized to the unit sphere for more efficient processing (Dhillon & Modha 1999).

Pearson Correlation

In collaborative filtering, correlation is often used to predict a feature from a highly similar mentor group of objects whose features are known. The [0,1] normalized Pearson correlation is defined as $s^{(P)}(\mathbf{x}_a, \mathbf{x}_b) = \frac{1}{2} \left( \frac{(\mathbf{x}_a - \bar{x}_a)^\dagger (\mathbf{x}_b - \bar{x}_b)}{\|\mathbf{x}_a - \bar{x}_a\|_2 \, \|\mathbf{x}_b - \bar{x}_b\|_2} + 1 \right)$, where $\bar{x}$ denotes the average feature value of $\mathbf{x}$ over all dimensions.


Extended Jaccard Similarity

The binary Jaccard coefficient measures the ratio of the number of shared attributes (words) of $\mathbf{x}_a$ and $\mathbf{x}_b$ to the number possessed by $\mathbf{x}_a$ or $\mathbf{x}_b$. It is often used in retail market-basket applications. Jaccard similarity can be extended to continuous or discrete non-negative features using $s^{(J)}(\mathbf{x}_a, \mathbf{x}_b) = \frac{\mathbf{x}_a^\dagger \mathbf{x}_b}{\|\mathbf{x}_a\|_2^2 + \|\mathbf{x}_b\|_2^2 - \mathbf{x}_a^\dagger \mathbf{x}_b}$ (Strehl & Ghosh 2000).
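To make the four definitions concrete, the following sketch (my own, directly following the formulas above) computes each [0,1] normalized similarity for a pair of non-negative feature vectors, using the two example points from the discussion below ($\mathbf{x}_1 = (3\ 1)^\dagger$, $\mathbf{x}_2 = (1\ 2)^\dagger$).

```python
import numpy as np

def sim_euclidean(a, b):
    return np.exp(-np.sum((a - b) ** 2))          # s^(E) = exp(-||a-b||_2^2)

def sim_cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def sim_pearson(a, b):
    ac, bc = a - a.mean(), b - b.mean()
    r = float(ac @ bc) / (np.linalg.norm(ac) * np.linalg.norm(bc))
    return 0.5 * (r + 1.0)                        # shifted from [-1,1] to [0,1]

def sim_ext_jaccard(a, b):
    dot = float(a @ b)
    return dot / (a @ a + b @ b - dot)

x1 = np.array([3.0, 1.0])
x2 = np.array([1.0, 2.0])
for f in (sim_euclidean, sim_cosine, sim_pearson, sim_ext_jaccard):
    print(f.__name__, round(f(x1, x2), 3))
```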

Discussion

Clearly, if clusters are to be meaningful, the similarity measure should be invariant to transformations natural to the problem domain. Also, normalization may strongly affect clustering in a positive or negative way. The features have to be chosen carefully to be on comparable scales, and similarity has to reflect the underlying semantics for the given task.

Euclidean similarity is translation invariant but scale variant, while cosine is translation variant but scale invariant. The extended Jaccard has aspects of both properties, as illustrated in figure 2. Iso-similarity lines at s = 0.25, 0.5 and 0.75 for points $\mathbf{x}_1 = (3\ 1)^\dagger$ and $\mathbf{x}_2 = (1\ 2)^\dagger$ are shown for Euclidean, cosine, and the extended Jaccard. For cosine similarity only the 4 (out of 12) lines that are in the positive quadrant are plotted. The dashed line marks the locus of equal similarity to $\mathbf{x}_1$ and $\mathbf{x}_2$, which always passes through the origin for cosine and extended Jaccard similarity.

In Euclidean space, iso-similarities are concentric hyper-spheres around the considered sample point (vector). Due to the finite range of similarity, the radius decreases hyperbolically as s increases linearly. The radius is constant for a given similarity regardless of the center-point. The only location with similarity of 1 is the considered point itself, and no location at finite distance has a similarity of 0 (no sparsity). Using the cosine measure renders the iso-similarities to be hyper-cones, all having their apex at the origin and axis aligned with the given sample (vector). Locations with similarity 1 are on the 1-dimensional sub-space defined by this axis, and the locus of points with similarity 0 is the hyper-plane perpendicular to this axis.

For the extended Jaccard similarity, the iso-similarities are non-concentric hyper-spheres. The only location with s = 1 is the point itself. The hyper-sphere radius increases with the distance of the considered point from the origin, so that longer vectors turn out to be more tolerant in terms of similarity than smaller vectors. Sphere radius also increases with similarity, and as s approaches 0 the radius becomes infinite. The resulting iso-similarity surface is the hyper-plane perpendicular to the considered point through the origin. Thus, for s → 0, extended Jaccard behaves like the cosine measure, and for s → 1, it behaves like the Euclidean distance.

Figure 2: Properties of various similarity measures (panels: Euclidean, cosine, extended Jaccard). The extended Jaccard adopts the middle ground between Euclidean and cosine based similarity.

In traditional Euclidean k-means clustering the optimal cluster representative $\mathbf{c}_\ell$ minimizes the sum of squared error (SSE) criterion, i.e.

$\mathbf{c}_\ell = \arg\min_{\mathbf{z} \in \mathcal{F}} \sum_{\mathbf{x}_j \in \mathcal{C}_\ell} \|\mathbf{x}_j - \mathbf{z}\|_2^2 .$   (1)

In the following, we show how this convex distance-based objective can be translated and extended to similarity space. Consider the generalized objective function $f(\mathcal{C}_\ell, \mathbf{z})$ given a cluster $\mathcal{C}_\ell$ and a representative $\mathbf{z}$: $f(\mathcal{C}_\ell, \mathbf{z}) = \sum_{\mathbf{x}_j \in \mathcal{C}_\ell} d(\mathbf{x}_j, \mathbf{z})^2 = \sum_{\mathbf{x}_j \in \mathcal{C}_\ell} \|\mathbf{x}_j - \mathbf{z}\|_2^2$. Mapping from distances to similarities yields $f(\mathcal{C}_\ell, \mathbf{z}) = \sum_{\mathbf{x}_j \in \mathcal{C}_\ell} -\log\left(s(\mathbf{x}_j, \mathbf{z})\right)$, and therefore $f(\mathcal{C}_\ell, \mathbf{z}) = -\log \prod_{\mathbf{x}_j \in \mathcal{C}_\ell} s(\mathbf{x}_j, \mathbf{z})$. Finally, we transform the objective using a strictly monotonic decreasing function: instead of minimizing $f(\mathcal{C}_\ell, \mathbf{z})$, we maximize $f'(\mathcal{C}_\ell, \mathbf{z}) = e^{-f(\mathcal{C}_\ell, \mathbf{z})}$. Thus, in similarity space $\mathcal{S}$, the least squared error representative $\mathbf{c}_\ell \in \mathcal{F}$ for a cluster $\mathcal{C}_\ell$ satisfies

$\mathbf{c}_\ell = \arg\max_{\mathbf{z} \in \mathcal{F}} \prod_{\mathbf{x}_j \in \mathcal{C}_\ell} s(\mathbf{x}_j, \mathbf{z}) .$   (2)

Using the concave evaluation function $f'$, we can obtain optimal representatives for non-Euclidean similarity spaces. The values of the evaluation function $f'(\{\mathbf{x}_1, \mathbf{x}_2\}, \mathbf{z})$ are used to shade the background in figure 2. In a maximum likelihood interpretation, we constructed the distance-similarity transformation such that $p(\mathbf{z} \mid \mathbf{c}_\ell) \propto s(\mathbf{z}, \mathbf{c}_\ell)$. Consequently, we can use the dual interpretations of probabilities in similarity space and errors in distance space.
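As a concrete reading of equation (2), the following sketch (my own illustration) evaluates $f'(\{\mathbf{x}_1, \mathbf{x}_2\}, \mathbf{z}) = s(\mathbf{x}_1, \mathbf{z}) \cdot s(\mathbf{x}_2, \mathbf{z})$ on a grid of candidate representatives z, which is the quantity used to shade the background of figure 2; the grid point with the largest product is the (approximate) optimal representative for that similarity.

```python
import numpy as np

def sim_euclidean(a, b):
    return np.exp(-np.sum((a - b) ** 2))          # s = exp(-d^2), as defined above

def best_representative(points, sim, grid_max=4.0, steps=200):
    """Grid search for the z maximizing f'(C, z) = prod_j s(x_j, z)."""
    axis = np.linspace(0.0, grid_max, steps)
    best_z, best_score = None, -np.inf
    for zx in axis:
        for zy in axis:
            z = np.array([zx, zy])
            score = np.prod([sim(p, z) for p in points])
            if score > best_score:
                best_z, best_score = z, score
    return best_z, best_score

x1, x2 = np.array([3.0, 1.0]), np.array([1.0, 2.0])
z_opt, _ = best_representative([x1, x2], sim_euclidean)
# For Euclidean similarity the product is maximized (approximately, up to grid
# resolution) at the mean of the points, consistent with equation (1); for other
# similarities the optimum lies elsewhere.
```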

Experimental Evaluation

Methodology

We conducted experiments with all five algorithms, using four variants each for k-means and graph partitioning, yielding eleven techniques in total. Since clustering is unsupervised, success is generally measured by a cost criterion. Standard cost functions, such as the sum of squared distances from the cluster representative, depend on the similarity (or distance) measure employed. They cannot be used to compare techniques that use different similarity measures. However, in situations where the pages are categorized (labelled) by an external source, there is a plausible way out! Given g categories (classes) $\mathcal{K}_h$ ($h \in \{1,\ldots,g\}$, $\mathbf{x}_j \in \mathcal{K}_h \Leftrightarrow \kappa_j = h$), we use the "true" classification labels $\kappa$ to evaluate the performance. For evaluating a single cluster, we use purity and entropy, while the entire clustering is evaluated using mutual information.

Let $n_\ell^{(h)}$ denote the number of objects in cluster $\mathcal{C}_\ell$ that are classified to be in category h as given by $\kappa$. Cluster $\mathcal{C}_\ell$'s purity can be defined as

$\varphi^{(P)}(\mathcal{C}_\ell) = \frac{1}{n_\ell} \max_h \big( n_\ell^{(h)} \big) .$   (3)
Purity can be interpreted as the classification rate under the assumption that all samples of a cluster are predicted to be members of the actual dominant class for that cluster. Alternatively, we also use [0,1] entropy, which is defined for a g-class problem as

$\varphi^{(E)}(\mathcal{C}_\ell) = -\sum_{h=1}^{g} \frac{n_\ell^{(h)}}{n_\ell} \log\!\left( \frac{n_\ell^{(h)}}{n_\ell} \right) \Big/ \log(g) .$   (4)

Entropy is a more comprehensive measure than purity since, rather than just considering the number of objects "in" and "not in" the most frequent class, it considers the entire distribution.

While the above two criteria are suitable for measuring a single cluster's quality, they are biased to favor smaller clusters. In fact, for both these criteria, the globally optimal value is trivially reached when each cluster is a single sample! Consequently, for the overall (not cluster-wise) performance evaluation, we use a measure based on mutual information:

$\varphi^{(M)}(\lambda, \kappa) = \frac{1}{n} \sum_{\ell=1}^{k} \sum_{h=1}^{g} n_\ell^{(h)} \, \frac{\log\!\left( \frac{n_\ell^{(h)} \, n}{\sum_{i=1}^{k} n_i^{(h)} \, \sum_{i=1}^{g} n_\ell^{(i)}} \right)}{\log(k \cdot g)}$   (5)

Mutual information is a symmetric measure for the degree of dependency between the clustering and the categorization. Unlike correlation, mutual information also takes higher order dependencies into account. We use the symmetric mutual information criterion because it successfully captures how related the labeling and categorizations are without a bias towards smaller clusters. Since these performance measures are affected by the distribution of the data (e.g., a priori sizes), we normalize the performance by that of the corresponding random clustering, and interpret the resulting ratio as "performance lift".
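The three measures can be computed directly from the k x g contingency counts $n_\ell^{(h)}$; the sketch below is my own rendering of equations (3)-(5), not the authors' code.

```python
import numpy as np

def contingency(labels, classes, k, g):
    """n[ell, h] = number of objects with cluster label ell+1 and class label h+1."""
    n = np.zeros((k, g))
    for ell, h in zip(labels, classes):
        n[ell - 1, h - 1] += 1
    return n

def purity(row):                          # eq. (3), one cluster (row of counts)
    return row.max() / row.sum()

def entropy(row, g):                      # eq. (4), one cluster, normalized to [0, 1]
    p = row[row > 0] / row.sum()
    return float(-(p * np.log(p)).sum() / np.log(g))

def mutual_information(n):                # eq. (5), whole clustering
    N = n.sum()
    col = n.sum(axis=0)                   # class totals, sum_i n_i^(h)
    row = n.sum(axis=1)                   # cluster totals, sum_i n_ell^(i)
    k, g = n.shape
    total = 0.0
    for ell in range(k):
        for h in range(g):
            if n[ell, h] > 0:
                total += n[ell, h] * np.log(n[ell, h] * N / (row[ell] * col[h]))
    return total / (N * np.log(k * g))
```

The performance lift reported in figures 3 and 4 then divides this value by the mutual information obtained by a random clustering of the same data.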

Findings on Industry Web-page Data

From the Yahoo industry web-page data (CMU Web KB project (Craven et al. 1998)), the following ten industry sectors were selected: airline, computer hardware, electronic instruments and controls, forestry and wood products, gold and silver, mobile homes and rvs, oil well services and equipment, railroad, software and programming, trucking. Each industry contributes about 10% of the pages. The frequencies of 2896 different words that are not in a standard English stop-list (e.g., a, and, are, ...) and do occur on average between 0.01 and 0.1 times per page were extracted from the HTML. Word location was not considered. This data is far less clean than, e.g., the Reuters data. Documents vary significantly in length, some are in the wrong category, some are out-dated or have little content (e.g., are mostly images). Also, the hub pages that Yahoo refers to are usually top-level branch pages. These tend to have more similar bag-of-words content across different classes (e.g., contact information, search windows, welcome messages) than news-content oriented pages. Sample sizes of 50, 100, 200, 400, and 800 were used for clustering 966 documents from the above 10 categories. The number of clusters k was set to 20 and each setting was run 10 times (each technique gets the same data) to capture the random variation in results.
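The feature-selection step described here can be sketched as follows (my own simplification; the stop-list contents and tokenization details are placeholders): keep only words outside the stop-list whose average per-page count falls between 0.01 and 0.1.

```python
import numpy as np

def select_features(X, vocab, stoplist, low=0.01, high=0.1):
    """X: d x n raw word-count matrix; vocab: list of the d words.
    Keep words outside the stop-list whose mean per-page count is in (low, high)."""
    mean_freq = X.mean(axis=1)                    # average count per page
    keep = [i for i, w in enumerate(vocab)
            if w not in stoplist and low < mean_freq[i] < high]
    return X[keep, :], [vocab[i] for i in keep]

stoplist = {"a", "and", "are", "the"}             # placeholder stop-list
# X_sel, vocab_sel = select_features(X, vocab, stoplist)   # reusing X, vocab from the first sketch
```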
Figure 3 shows the results of the 550 experiments. In table 1, the t-test results indicate that graph partitioning with cosine similarity performs best, closely followed by the other two non-Euclidean measures. The second tier are the non-Euclidean k-means variations. Hyper-graph partitioning performs reasonably well. The SOFM and Euclidean similarity with k-means, as well as with graph partitioning, fail to capture the relationships of the high dimensional data.
Figure 3: Performance lift (normalized to the random baseline) on Yahoo industry web-page data of mutual information φ^(M) for various sample sizes n (one panel per technique: Random, SOFM, HGP, kM Eucl, kM Cosi, kM Corr, kM XJac, GP Eucl, GP Cosi, GP Corr, GP XJac). The bars indicate ±2 standard deviations.
Findings on News Web-pages

The 20 original Yahoo news categories in the data are Business, Entertainment (no sub-category, art, cable, culture, film, industry, media, multimedia, music, online, people, review, stage, television, variety), Health, Politics, Sports, Technology, and correspond to κ = 1,...,20, respectively. The data is publicly available from ftp://ftp.cs.umn.edu/dept/users/boley/ (K1 series) and was used in (Boley et al. 1999). The raw 21839 × 2340 word-document matrix consists of the non-normalized occurrence frequencies of stemmed words, using Porter's suffix stripping algorithm (Frakes 1992). Pruning all words that occur less than 0.01 or more than 0.10 times on average, because they are insignificant (e.g., abdrazakof) or too generic (e.g., new), respectively, results in d = 2903.

Figure 4: Performance lift (normalized to the random baseline) on Yahoo news web-page data of mutual information φ^(M) for various sample sizes n (one panel per technique: Random, SOFM, HGP, kM Eucl, kM Cosi, kM Corr, kM XJac, GP Eucl, GP Cosi, GP Corr, GP XJac). The bars indicate ±2 standard deviations.
Sample sizes of 50, 100, 200, 400, and 800 were used for clustering 2340 documents from the above 20 categories. The number of clusters k was set to 40 and each setting was run 10 times (each technique gets the same sub-sample) to capture the random variation in results. We chose 40 clusters, two times the number of categories, since this seemed to be the more natural number of clusters as indicated by preliminary runs and visualization. Using a greater number of clusters than classes can be viewed as allowing multi-modal distributions for each class. For example, in an XOR-like problem, there are two classes, but four clusters. 550 clustering runs were conducted (figure 4) and the results were evaluated in 55 one-sided t-tests (table 2) for the n = 800 sample level.
Non-Euclidean graph partitioning approaches work best on this data. The top performing similarity measures are extended Jaccard and cosine. We initially expected cosine to perform better than the extended Jaccard

Technique | φ^(M) | GP Cosi | GP Corr | GP XJac | kM XJac | kM Cosi | kM Corr | HGP   | SOFM  | kM Eucl | GP Eucl | Random
GP Cosi   | 0.192 | -       | 0.998   | 0.997   | 1.000   | 1.000   | 1.000   | 1.000 | 1.000 | 1.000   | 1.000   | 1.000
GP Corr   | 0.177 |         | -       | -       | 0.957   | 0.967   | 0.968   | 1.000 | 1.000 | 1.000   | 1.000   | 1.000
GP XJac   | 0.177 |         |         | -       | -       | 0.956   | 0.958   | 1.000 | 1.000 | 1.000   | 1.000   | 1.000
kM XJac   | 0.168 |         |         |         | -       | -       | -       | 1.000 | 1.000 | 1.000   | 1.000   | 1.000
kM Cosi   | 0.167 |         |         |         |         | -       | -       | 1.000 | 1.000 | 1.000   | 1.000   | 1.000
kM Corr   | 0.167 |         |         |         |         |         | -       | 1.000 | 1.000 | 1.000   | 1.000   | 1.000
HGP       | 0.107 |         |         |         |         |         |         | -     | 1.000 | 1.000   | 1.000   | 1.000
SOFM      | 0.061 |         |         |         |         |         |         |       | -     | 1.000   | 1.000   | 1.000
kM Eucl   | 0.039 |         |         |         |         |         |         |       |       | -       | 0.971   | 1.000
GP Eucl   | 0.034 |         |         |         |         |         |         |       |       |         | -       | 1.000
Random    | 0.021 |         |         |         |         |         |         |       |       |         |         | -

Table 1: Industry web-page data with n = 966, d = 2896, g = 6, and k = 20. Comparison of 10 trials of techniques at 800 samples in terms of φ^(M) performance and t-test results (confidences below 0.950 are marked with '-').
Technique | φ^(M) | GP XJac | GP Cosi | GP Corr | HGP   | kM XJac | kM Corr | kM Cosi | SOFM  | GP Eucl | Random | kM Eucl
GP XJac   | 0.240 | -       | -       | 0.991   | 1.000 | 1.000   | 1.000   | 1.000   | 1.000 | 1.000   | 1.000  | 1.000
GP Cosi   | 0.240 |         | -       | 0.990   | 1.000 | 1.000   | 1.000   | 1.000   | 1.000 | 1.000   | 1.000  | 1.000
GP Corr   | 0.234 |         |         | -       | 1.000 | 1.000   | 1.000   | 1.000   | 1.000 | 1.000   | 1.000  | 1.000
HGP       | 0.185 |         |         |         | -     | -       | 0.982   | 0.988   | 1.000 | 1.000   | 1.000  | 1.000
kM XJac   | 0.184 |         |         |         |       | -       | 0.992   | 0.994   | 1.000 | 1.000   | 1.000  | 1.000
kM Corr   | 0.178 |         |         |         |       |         | -       | -       | 1.000 | 1.000   | 1.000  | 1.000
kM Cosi   | 0.178 |         |         |         |       |         |         | -       | 1.000 | 1.000   | 1.000  | 1.000
SOFM      | 0.150 |         |         |         |       |         |         |         | -     | 1.000   | 1.000  | 1.000
GP Eucl   | 0.114 |         |         |         |       |         |         |         |       | -       | 1.000  | 1.000
Random    | 0.066 |         |         |         |       |         |         |         |       |         | -      | 1.000
kM Eucl   | 0.046 |         |         |         |       |         |         |         |       |         |        | -

Table 2: News web-page data with n = 2340, d = 2903, g = 20, and k = 40. Comparison of 10 trials of techniques at 800 samples in terms of φ^(M) performance and t-test results (confidences below 0.950 are marked with '-').
and correlation due to its length invariance. The middle-ground viewpoint of extended Jaccard seems to be successful in web-page as well as market-basket applications. Correlation is only marginally worse in terms of average performance. Hyper-graph partitioning is in the third tier, outperforming all generalized k-means algorithms except for the extended Jaccard. All Euclidean techniques including SOFM performed very poorly. Surprisingly, SOFM and graph partitioning were still able to do significantly better than random despite the limited expressiveness of Euclidean similarity. Euclidean k-means performed even worse than random in terms of entropy and equivalent to random in terms of purity (not shown).
Table 3 shows the results of the best performing Opossum clustering (Strehl & Ghosh 2000). For each cluster the dominant category, its purity and entropy are given along with the top three descriptive and discriminative words. Descriptiveness is defined as occurrence frequency (a notion similar to singular item-set support). Discriminative terms for a cluster have the highest occurrence multipliers compared to the average document (similar to the notion of singular item-set lift). Unlike the Yahoo categories, which vary in size from 9 to 494 pages (!), all our clusters are well-balanced: each contains between 57 and 60 pages. Health (K_H) turned out to be the category most clearly identified. This could have been expected since its language separates quite distinctively from the others. However, our clustering is better than just matching the Yahoo given labels, because it distinguishes more precisely. For example, there are detected sub-classes such as HIV (C_9) and genetics-related (C_10) pages. Cluster 8, for example, is described by our system through the terms vaccin, strain, antibiot, indicating an infection-related cluster. Similarly, in the Entertainment-people category, our algorithm identifies a cluster dealing with Princess Diana's car accident (C_24) and funeral (C_25). Some smaller categories, such as Entertainment no-sub-category pages (K_E), have been absorbed into more meaningful clusters. Most Technology pages are found in cluster C_12. Interestingly, documents from the technology category have also been grouped with an Entertainment-online dominated and a Business dominated cluster, indicating an overlap of topics. Clusterings of this quality may be used to build a fully automated web-page categorization engine yielding cleaner-cut groups than currently seen.
Concluding Remarks

The key contribution of this work lies in providing a framework for comparing several clustering approaches across a variety of similarity spaces. The results indicate that graph partitioning is better suited for word-frequency based clustering of web documents than generalized k-means, hyper-graph partitioning, and SOFM. The search procedure implicit in graph partitioning is far less local than the hill-climbing approach of k-means. Moreover, it also provides a way to obtain balanced clusters and exhibits a lower variance in results. Metric distances such as Euclidean are not appropriate for high dimensional, sparse domains. Cosine, correlation and extended Jaccard measures are successful in capturing the similarities implicitly indicated by manual categorizations, as seen for example in Yahoo.
Acknowledgments: This research was supported in part by NSF grant ECS-9900353. We thank Inderjit Dhillon for helpful comments.
C_ℓ | K_ĥ | φ^(P)   | φ^(E)   | top 3 descriptive terms          | top 3 discriminative terms
1   | H   | 45.76%  | 0.60533 | abus, label, addict              | mckinnei, grammer, addict
2   | H   | 35.09%  | 0.68312 | suicid, israel, jet              | broccoli, tractor, weizman
3   | H   | 74.14%  | 0.19081 | surgeri, arteri, kidnei          | vander, stent, pippen
4   | H   | 88.14%  | 0.15943 | fda, safeti, seate               | cartilag, latex, fda
5   | H   | 100.00% | 0       | smok, smoker, lung               | nonsmok, clozapin, prostat
6   | H   | 100.00% | 0       | weight, pregnanc, obes           | insulin, calor, heparin
7   | H   | 100.00% | 0       | breast, vitamin, diet            | miner, fatti, estrogen
8   | H   | 100.00% | 0       | vaccin, strain, antibiot         | aureu, vancomycin, influenza
9   | H   | 100.00% | 0       | hiv, depress, immun              | chemotherapi, hiv, radiosurgeri
10  | H   | 100.00% | 0       | mutat, genet, protein            | chromosom, mutat, prion
11  | o   | 60.00%  | 0.34318 | apple, intel, electron           | gorman, ibm, compaq
12  | T   | 63.79%  | 0.44922 | java, advertis, sun              | nader, lucent, java
13  | P   | 18.97%  | 0.81593 | miami, fc, contract              | panama, pirat, trump
14  | P   | 56.90%  | 0.47765 | appeal, suprem, justic           | iraqi, nato, suprem
15  | P   | 84.75%  | 0.20892 | republican, committe, reform     | teamster, government, reno
16  | S   | 88.14%  | 0.17215 | smith, coach, marlin             | oriol, homer, marlin
17  | S   | 70.18%  | 0.35609 | goal, yard, pass                 | touchdown, defenseman, yard
18  | i   | 43.10%  | 0.59554 | usa, murdoch, channel            | biondi, viacom, pearson
19  | B   | 73.33%  | 0.28802 | cent, quarter, revenu            | ahmanson, loral, gm
20  | B   | 82.76%  | 0.24307 | dow, greenspan, rose             | dow, greenspan, treasuri
21  | p   | 54.24%  | 0.52572 | notabl, canadian, magazin        | stamp, notabl, polanski
22  | f   | 26.32%  | 0.73406 | opera, bing, draw                | bing, pageant, lange
23  | u   | 46.55%  | 0.59758 | bestsell, weekli, hardcov        | hardcov, paperback, bestsell
24  | p   | 64.41%  | 0.43644 | crash, paparazzi, pari           | merced, stephan, manslaught
25  | p   | 45.76%  | 0.59875 | funer, royal, prince             | buckingham, grief, spencer
26  | mu  | 22.41%  | 0.69697 | meredith, classic, spice         | burgess, meredith, espn
27  | t   | 23.73%  | 0.69376 | radio, prodigi, station          | cybercast, prodigi, fox
28  | mu  | 53.45%  | 0.39381 | concert, band, stage             | bowie, ballad, solo
29  | p   | 68.33%  | 0.29712 | showbiz, academi, south          | cape, showbiz, calendar
30  | p   | 32.76%  | 0.63372 | albert, stone, tour              | jagger, marv, forcibl
31  | f   | 77.59%  | 0.30269 | script, miramax, sequel          | sequel, bon, cameron
32  | f   | 76.27%  | 0.30759 | cast, shoot, opposit             | showtim, cast, duvall
33  | r   | 43.10%  | 0.49547 | tom, theater, writer             | cusack, selleck, rep
34  | r   | 64.41%  | 0.37305 | script, scot, tom                | nichola, horse, ira
35  | r   | 93.22%  | 0.10628 | camera, pic, sound               | juliett, narr, costum
36  | S   | 48.28%  | 0.48187 | japanes, sec, hingi              | porsche, hingi, quarterfin
37  | t   | 39.66%  | 0.51945 | nomin, hbo, winner               | miniseri, hbo, kim
38  | t   | 55.17%  | 0.42107 | king, sitcom, dreamwork          | winfrei, dreamwork, oprah
39  | f   | 76.67%  | 0.29586 | weekend, gross, movi             | gross, weekend, monti
40  | t   | 70.69%  | 0.34018 | household, timeslot, slot        | denot, timeslot, datelin

[Confusion matrix of the 40 clusters versus the 20 Yahoo categories (right half of Table 3) omitted.]
Table 3: Best clustering for k = 40 using Opossum and extended Jaccard similarity on 2340 Yahoo news pages. Cluster evaluations, their descriptive and discriminative terms (left) as well as the confusion matrix (right).
References

Boley, D.; Gini, M.; Gross, R.; Han, E.; Hastings, K.; Karypis, G.; Kumar, V.; Mobasher, B.; and Moore, J. 1999. Partitioning-based clustering for web document categorization. Decision Support Systems 27:329-341.

Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.; Mitchell, T.; Nigam, K.; and Slattery, S. 1998. Learning to extract symbolic knowledge from the world wide web. In AAAI98, 509-516.

Dhillon, I. S., and Modha, D. S. 1999. Concept decompositions for large sparse text data using clustering. Technical Report RJ 10147, IBM Almaden Research Center. To appear in Machine Learning.

Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. New York: Wiley.

Frakes, W. 1992. Stemming algorithms. In Frakes, W., and Baeza-Yates, R., eds., Information Retrieval: Data Structures and Algorithms. New Jersey: Prentice Hall. 131-160.

Hartigan, J. A. 1975. Clustering Algorithms. New York: Wiley.

Karypis, G., and Kumar, V. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal of Scientific Computing 20(1):359-392.

Karypis, G.; Han, E.-H.; and Kumar, V. 1999. Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer 32(8):68-75.

Kohonen, T. 1995. Self-Organizing Maps. Berlin, Heidelberg: Springer. (Second Extended Edition 1997).

Kumar, S., and Ghosh, J. 1999. GAMLS: A generalized framework for associative modular learning systems. In Proceedings of the Applications and Science of Computational Intelligence II, 24-34.

Mooney, R. J., and Roy, L. 1999. Content-based book recommending using learning for text categorization. In Proceedings of the SIGIR-99 Workshop on Recommender Systems: Algorithms and Evaluation.

Rastogi, R., and Shim, K. 1999. Scalable algorithms for mining large databases. In Han, J., ed., KDD-99 Tutorial Notes. ACM.

Strehl, A., and Ghosh, J. 2000. Value-based customer grouping from large retail data-sets. In Proceedings of the SPIE Conference on Data Mining and Knowledge Discovery, 24-25 April 2000, Orlando, Florida, USA, volume 4057, 33-42. SPIE. To appear.

Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval.
