Abstract
Copyright 2000, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
To appear, AAAI-2000: Workshop of Artificial Intelligence for Web Search, July 2000
[Figure: network architecture with inputs x_1, ..., x_n; remainder of the figure not recoverable]
neuron for each sample. The complexity of this incremental algorithm is O(n·d·k·m) and mostly determined by the number of epochs m and samples n.
Generalized Algorithms
Random Baseline
As a baseline for comparing algorithms, we use clustering labels drawn from a uniform random distribution over the integers from 1 to k. The complexity of this algorithm is O(n).
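As a concrete illustration, here is a minimal sketch of this baseline (assuming NumPy; the function name is ours):

```python
import numpy as np

def random_baseline(n, k, seed=None):
    """Assign each of n samples a cluster label drawn uniformly from 1..k."""
    rng = np.random.default_rng(seed)
    return rng.integers(1, k + 1, size=n)  # one pass over the samples, O(n)
```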
Self-organizing Feature Map
k-means
Hyper-graph Partitioning
a hyper-graph. A minimum-cut is the removal of the set of hyper-edges (with minimum edge weight) that separates the hyper-graph into k unconnected components. Again, an object x maps to a vertex v. Each word (feature) maps to a hyper-edge connecting all vertices with a non-zero frequency count of this word. The weight of this hyper-edge is chosen to be the total number of occurrences in the data-set. Hence, the importance of a hyper-edge during partitioning is proportional to the occurrence of the corresponding word. The minimum cut of this hyper-graph into k unconnected components gives the desired clustering. We employ the hMetis package for partitioning. An advantage of this approach is that the clustering problem can be mapped to a graph problem without the explicit computation of similarity, which makes this approach computationally efficient with O(n·d·k) assuming a (close to) linearly performing hyper-graph partitioner. However, sample-wise frequency information gets lost in this formulation since there is only one weight associated with a hyper-edge.
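The word-to-hyper-edge mapping can be materialized as an hMetis input file. The following is a sketch under our own assumptions: `counts` is a dense document-term count matrix, and the .hgr conventions (header line "|E| |V| fmt" with fmt=1 for weighted hyper-edges, 1-indexed vertices) are as we recall them from the hMetis manual; verify against your hMetis version.

```python
import numpy as np

def write_hmetis_input(counts, path):
    """Write a weighted hyper-graph file for hMetis.

    counts: (n_docs, n_words) term-frequency matrix. Each word becomes a
    hyper-edge connecting all documents (vertices) with a non-zero count;
    its weight is the word's total number of occurrences in the data-set.
    """
    n, d = counts.shape
    edges = []
    for j in range(d):
        verts = np.nonzero(counts[:, j])[0] + 1   # hMetis vertices are 1-indexed
        if len(verts) > 1:                        # a hyper-edge needs >= 2 vertices
            edges.append((int(counts[:, j].sum()), verts))
    with open(path, "w") as f:
        f.write(f"{len(edges)} {n} 1\n")          # |edges| |vertices| fmt=1 (edge weights)
        for w, verts in edges:
            f.write(" ".join(map(str, [w, *verts])) + "\n")
```

The resulting file can then be handed to the hMetis binaries to compute the k-way minimum-cut partition.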
Similarity Measures
Metric Distances
The Minkowski distances $L_p(x_a, x_b) = \left( \sum_i |x_{i,a} - x_{i,b}|^p \right)^{1/p}$ are the standard metrics for geometrical problems. For p = 1 (p = 2) we obtain the Manhattan (Euclidean) distance. For Euclidean space, we chose to relate distances d and similarities s using $s = e^{-d^2}$. Consequently, we define Euclidean [0, 1] normalized similarity as $s^{(E)}(x_a, x_b) = e^{-\|x_a - x_b\|_2^2}$, which has important properties (as we will see in the discussion) that the commonly adopted $s(x_a, x_b) = 1/(1 + \|x_a - x_b\|_2)$ lacks.
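For concreteness, here is a sketch of this Euclidean [0, 1] normalized similarity together with the measures discussed next. The function names are ours, and the (s + 1)/2 rescaling of Pearson correlation into [0, 1] is one common normalization we assume rather than the paper's exact definition.

```python
import numpy as np

def sim_euclidean(xa, xb):
    """s(E) = exp(-||xa - xb||_2^2), mapping distances into (0, 1]."""
    return np.exp(-np.sum((xa - xb) ** 2))

def sim_cosine(xa, xb):
    """Standard cosine measure: cosine of the angle between the vectors."""
    return xa @ xb / (np.linalg.norm(xa) * np.linalg.norm(xb))

def sim_pearson(xa, xb):
    """Pearson correlation, shifted from [-1, 1] into [0, 1] (our scaling)."""
    return (np.corrcoef(xa, xb)[0, 1] + 1) / 2

def sim_ext_jaccard(xa, xb):
    """Extended Jaccard: reduces to the set Jaccard for binary vectors."""
    dot = xa @ xb
    return dot / (xa @ xa + xb @ xb - dot)
```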
Cosine Measure
Pearson Correlation
Discussion
[Figure 2: similarity landscapes, with panels for the Euclidean, Cosine, and Extended Jaccard measures]
In traditional Euclidean k-means clustering the optimal cluster representative $c_\ell$ minimizes the sum of squared error (SSE) criterion, i.e.

$$c_\ell = \arg\min_{z \in \mathcal{F}} \sum_{x_j \in C_\ell} \|x_j - z\|_2^2. \tag{1}$$
In the following, we show how this convex distance-based objective can be translated and extended to similarity space. Consider the generalized objective function $f(C_\ell, z)$ given a cluster $C_\ell$ and a representative $z$: $f(C_\ell, z) = \sum_{x_j \in C_\ell} d(x_j, z)^2 = \sum_{x_j \in C_\ell} \|x_j - z\|_2^2$.
Mapping from distances to similarities yields $f(C_\ell, z) = \sum_{x_j \in C_\ell} -\log(s(x_j, z))$, and therefore $f(C_\ell, z) = -\log \prod_{x_j \in C_\ell} s(x_j, z)$. Finally, we transform the objective using a strictly monotonic decreasing function: instead of minimizing $f(C_\ell, z)$, we maximize $f'(C_\ell, z) = e^{-f(C_\ell, z)}$. Thus, in similarity space $\mathcal{S}$, the least squared error representative $c_\ell \in \mathcal{F}$ for a cluster $C_\ell$ satisfies

$$c_\ell = \arg\max_{z \in \mathcal{F}} \prod_{x_j \in C_\ell} s(x_j, z). \tag{2}$$
Using the concave evaluation function $f'$, we can obtain optimal representatives for non-Euclidean similarity spaces. The values of the evaluation function $f'(\{x_1, x_2\}, z)$ are used to shade the background in figure 2. In a maximum likelihood interpretation, we constructed the distance-similarity transformation such that $p(z \mid c_\ell) \sim s(z, c_\ell)$. Consequently, we can use the dual interpretations of probabilities in similarity space and errors in distance space.
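A minimal sketch of equation (2) follows, assuming the candidate set $\mathcal{F}$ is the finite set of cluster members (a medoid-style restriction we adopt for illustration; the text leaves $\mathcal{F}$ abstract). The function name and the log-domain computation are ours.

```python
import numpy as np

def best_representative(cluster, candidates, sim):
    """Return the candidate z maximizing prod_j s(x_j, z) over the cluster,
    computed as a sum of logs for numerical stability (equation (2)).
    Requires sim(x, z) > 0, which holds e.g. for s = exp(-||x - z||^2)."""
    scores = [sum(np.log(sim(x, z)) for x in cluster) for z in candidates]
    return candidates[int(np.argmax(scores))]
```

Note that with $s(x, z) = e^{-\|x - z\|^2}$, maximizing the product is exactly minimizing the SSE restricted to the candidate set, matching the duality above.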
Experimental Evaluation
Methodology
We conducted experiments with all five algorithms, using four variants each for k-means and graph partitioning, yielding eleven techniques in total. Since clustering is unsupervised, success is generally measured by a cost criterion. Standard cost functions, such as the sum of squared distances from the cluster representative, depend on the similarity (or distance) measure employed. They cannot be used to compare techniques that use different similarity measures. However, in situations where the pages are categorized (labelled) by an external source, there is a plausible way out! Given g categories (classes) $K_h$ ($h \in \{1, \dots, g\}$, $x \in K_h \Leftrightarrow \kappa(x) = h$), we use the "true" classification labels $\kappa$ to evaluate the performance. For evaluating a single cluster, we use purity and entropy, while the entire clustering is evaluated using mutual information.

Let $n_h^{(\ell)}$ denote the number of objects in cluster $C_\ell$ that are classified to be h as given by $\kappa$. Cluster $C_\ell$'s purity can be defined as

$$\lambda^{(P)}(C_\ell) = \frac{1}{n^{(\ell)}} \max_h \left( n_h^{(\ell)} \right). \tag{3}$$
Purity can be interpreted as the classification rate under the assumption that all samples of a cluster are predicted to be members of the actual dominant class for that cluster. Alternatively, we also use [0, 1] entropy, which is defined for a g-class problem as

$$\lambda^{(E)}(C_\ell) = -\sum_{h=1}^{g} \frac{n_h^{(\ell)}}{n^{(\ell)}} \log\!\left( \frac{n_h^{(\ell)}}{n^{(\ell)}} \right) \Big/ \log(g). \tag{4}$$
$$\lambda^{(M)}(\Lambda, \kappa) = \frac{1}{n} \sum_{\ell=1}^{k} \sum_{h=1}^{g} n_h^{(\ell)} \, \frac{\log\!\left( \dfrac{n_h^{(\ell)} \, n}{\sum_{i=1}^{k} n_h^{(i)} \, \sum_{i=1}^{g} n_i^{(\ell)}} \right)}{\log(k \cdot g)} \tag{5}$$
Mutual information is a symmetric measure for the degree of dependency between the clustering and the categorization. Unlike correlation, mutual information also takes higher-order dependencies into account. We use the symmetric mutual information criterion because it successfully captures how related the labeling and categorizations are without a bias towards smaller clusters.
Since these performance measures are affected by the distribution of the data (e.g., a priori sizes), we normalize the performance by that of the corresponding random clustering, and interpret the resulting ratio as "performance lift".
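The three criteria and the lift normalization can be sketched as follows, assuming the confusion counts $n_h^{(\ell)}$ are given as a k-by-g matrix (rows = clusters, columns = classes); the function names and matrix layout are our assumptions.

```python
import numpy as np

def purity(row):
    """Eq. (3): fraction of one cluster belonging to its dominant class."""
    return row.max() / row.sum()

def entropy(row, g):
    """Eq. (4): [0, 1] class entropy of one cluster (g = number of classes)."""
    p = row[row > 0] / row.sum()
    return -(p * np.log(p)).sum() / np.log(g)

def mutual_information(counts):
    """Eq. (5): [0, 1] mutual information between the clustering (k rows)
    and the categorization (g columns), normalized by log(k*g)."""
    k, g = counts.shape
    n = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # cluster sizes n^(l)
    col = counts.sum(axis=0, keepdims=True)   # class sizes
    ratio = np.where(counts > 0, counts * n / (row * col), 1.0)  # avoid log(0)
    return (counts * np.log(ratio)).sum() / (n * np.log(k * g))

# Performance lift: score of a technique relative to the random baseline,
# e.g. lift = mutual_information(counts_technique) / mutual_information(counts_random)
```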
equipment, railroad, software and programming, trucking. Each industry contributes about 10% of the pages.
[Figure: λ^(M) performance over the number of samples (log scale) for Random, SOFM, HGP, kM, and XJac]
Technique   λ^(M)   GP Cosi  GP Corr  GP XJa  kM XJa  kM Cosi  kM Corr  HGP     SOFM    kM Eucl  GP Eucl  Random
GP Cosi     0.192   -
GP Corr     0.177   0.998    -
GP XJa      0.177   0.997    -        -
kM XJa      0.168   1.000    0.957    -       -
kM Cosi     0.167   1.000    0.967    0.956   -       -
kM Corr     0.167   1.000    0.968    0.958   -       -        -
HGP         0.107   1.000    1.000    1.000   1.000   1.000    1.000    -
SOFM        0.061   1.000    1.000    1.000   1.000   1.000    1.000    1.000   -
kM Eucl     0.039   1.000    1.000    1.000   1.000   1.000    1.000    1.000   1.000   -
GP Eucl     0.034   1.000    1.000    1.000   1.000   1.000    1.000    1.000   1.000   0.971    -
Random      0.021   1.000    1.000    1.000   1.000   1.000    1.000    1.000   1.000   1.000    1.000    -

Table 1: Industry web-page data with n = 966, d = 2896, g = 6, and k = 20. Comparison of 10 trials of techniques at 800 samples in terms of λ^(M) performance and t-test results (confidences below 0.950 are marked with '-').
Technique   λ^(M)   GP XJa  GP Cosi  GP Corr  HGP     kM XJa  kM Corr  kM Cosi  SOFM    GP Eucl  Random  kM Eucl
GP XJa      0.240   -
GP Cosi     0.240   -       -
GP Corr     0.234   0.991   0.990    -
HGP         0.185   1.000   1.000    1.000    -
kM XJa      0.184   1.000   1.000    1.000    -       -
kM Corr     0.178   1.000   1.000    1.000    0.982   0.992   -
kM Cosi     0.178   1.000   1.000    1.000    0.988   0.994   -        -
SOFM        0.150   1.000   1.000    1.000    1.000   1.000   1.000    1.000    -
GP Eucl     0.114   1.000   1.000    1.000    1.000   1.000   1.000    1.000    1.000   -
Random      0.066   1.000   1.000    1.000    1.000   1.000   1.000    1.000    1.000   1.000    -
kM Eucl     0.046   1.000   1.000    1.000    1.000   1.000   1.000    1.000    1.000   1.000    1.000   -

Table 2: News web-page data with n = 2340, d = 2903, g = 20, and k = 40. Comparison of 10 trials of techniques at 800 samples in terms of λ^(M) performance and t-test results (confidences below 0.950 are marked with '-').
Jaccard and correlation due to its length invariance. The middle-ground viewpoint of extended Jaccard seems to be successful in web-page as well as market-basket applications. Correlation is only marginally worse in terms of average performance. Hyper-graph partitioning is in the third tier, outperforming all generalized k-means algorithms except for the extended Jaccard. All Euclidean techniques including SOFM performed very poorly. Surprisingly, SOFM and graph partitioning were still able to do significantly better than random despite the limited expressiveness of Euclidean similarity. Euclidean k-means performed even worse than random in terms of entropy and equivalent to random in terms of purity (not shown).
Table 3 shows the results of the best performing Opossum clustering (Strehl & Ghosh 2000). For each cluster the dominant category, its purity and entropy are given along with the top three descriptive and discriminative words. Descriptiveness is defined as occurrence frequency (a notion similar to singular item-set support). Discriminative terms for a cluster have the highest occurrence multipliers compared to the average document (similar to the notion of singular item-set lift). Unlike the Yahoo categories, which vary in size from 9 to 494 pages (!), all our clusters are well-balanced: each contains between 57 and 60 pages. Health (K_H) turned out to be the category most clearly identified. This could have been expected since its language separates quite distinctively from the others. However, our clustering is better than just matching the Yahoo-given labels, because it distinguishes more precisely. For example, there are detected sub-classes such as Hiv (C_9) and genetics-related (C_10) pages. Cluster
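The two term-ranking notions just defined can be sketched as follows: descriptive terms are ranked by occurrence frequency inside the cluster, discriminative terms by the multiplier of that frequency over the average document. The matrix layout and helper name are our assumptions.

```python
import numpy as np

def top_terms(counts, members, vocab, m=3):
    """Rank a cluster's terms two ways: 'descriptive' by occurrence
    frequency within the cluster, 'discriminative' by the occurrence
    multiplier relative to the average document in the collection."""
    cluster_freq = counts[members].mean(axis=0)   # per-term frequency in the cluster
    corpus_freq = counts.mean(axis=0) + 1e-12     # per-term frequency overall
    descriptive = np.argsort(cluster_freq)[::-1][:m]
    discriminative = np.argsort(cluster_freq / corpus_freq)[::-1][:m]
    return [vocab[i] for i in descriptive], [vocab[i] for i in discriminative]
```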
C_ℓ  K_ĥ    λ^(P)     λ^(E)
1    H     45.76%   0.60533
2    H     35.09%   0.68312
3    H     74.14%   0.19081
4    H     88.14%   0.15943
5    H    100.00%   0
6    H    100.00%   0
7    H    100.00%   0
8    H    100.00%   0
9    H    100.00%   0
10   H    100.00%   0
11   o     60.00%   0.34318
12   T     63.79%   0.44922
13   P     18.97%   0.81593
14   P     56.90%   0.47765
15   P     84.75%   0.20892
16   S     88.14%   0.17215
17   S     70.18%   0.35609
18   i     43.10%   0.59554
19   B     73.33%   0.28802
20   B     82.76%   0.24307
21   p     54.24%   0.52572
22   f     26.32%   0.73406
23   u     46.55%   0.59758
24   p     64.41%   0.43644
25   p     45.76%   0.59875
26   mu    22.41%   0.69697
27   t     23.73%   0.69376
28   mu    53.45%   0.39381
29   p     68.33%   0.29712
30   p     32.76%   0.63372
31   f     77.59%   0.30269
32   f     76.27%   0.30759
33   r     43.10%   0.49547
34   r     64.41%   0.37305
35   r     93.22%   0.10628
36   S     48.28%   0.48187
37   t     39.66%   0.51945
38   t     55.17%   0.42107
39   f     76.67%   0.29586
40   t     70.69%   0.34018

[The top descriptive and discriminative terms and the cluster-versus-category confusion matrix (right half of the original table) are not recoverable from the extracted text.]

Table 3: Best clustering for k = 40 using Opossum and extended Jaccard similarity on 2340 Yahoo news pages. Cluster evaluations, their descriptive and discriminative terms (left) as well as the confusion matrix (right).
part by NSF grant ECS-9900353. We thank Inderjit Dhillon for helpful comments.
References
Boley, D.; Gini, M.; Gross, R.; Han, E.; Hastings, K.; Karypis, G.; Kumar, V.; Mobasher, B.; and Moore, J. 1999. Partitioning-based clustering for web document categorization. Decision Support Systems 27:329-341.

Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.; Mitchell, T.; Nigam, K.; and Slattery, S. 1998. Learning to extract symbolic knowledge from the world wide web. In AAAI98, 509-516.

Dhillon, I. S., and Modha, D. S. 1999. Concept decompositions for large sparse text data using clustering. Technical Report RJ 10147, IBM Almaden Research Center. To appear in Machine Learning.

Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. New York: Wiley.

Frakes, W. 1992. Stemming algorithms. In Frakes, W., and Baeza-Yates, R., eds., Information Retrieval: Data Structures and Algorithms. New Jersey: Prentice Hall. 131-160.

Hartigan, J. A. 1975. Clustering Algorithms. New York: Wiley.

Karypis, G., and Kumar, V. 1998. A fast and high quality multilevel scheme for partitioning irreg-