2. Applying the triangle inequality

Our approach to accelerating $k$-means is based on the triangle inequality: for any three points $x$, $y$, and $z$, $d(x,z) \le d(x,y) + d(y,z)$. This is the only "black box" property that all distance metrics possess.

The difficulty is that the triangle inequality gives upper bounds, but we need lower bounds to avoid calculations. Let $x$ be a point and let $b$ and $c$ be centers; we need to know that $d(x,c) \ge d(x,b)$ in order to avoid calculating the actual value of $d(x,c)$.

The following two lemmas show how to use the triangle inequality to obtain useful lower bounds.

Lemma 1: Let $x$ be a point and let $b$ and $c$ be centers. If $d(b,c) \ge 2d(x,b)$, then $d(x,c) \ge d(x,b)$.

Proof: We know that $d(b,c) \le d(b,x) + d(x,c)$. So $d(b,c) - d(x,b) \le d(x,c)$. Consider the left-hand side: $d(b,c) - d(x,b) \ge 2d(x,b) - d(x,b) = d(x,b)$. So $d(x,b) \le d(x,c)$.
Lemma 2: Let $x$ be a point and let $b$ and $c$ be centers. Then $d(x,c) \ge \max\{0,\; d(x,b) - d(b,c)\}$.

Proof: We know that $d(x,b) \le d(x,c) + d(b,c)$, so $d(x,b) - d(b,c) \le d(x,c)$. Also, $d(x,c) \ge 0$.

Note that Lemmas 1 and 2 are true for any three points, not just for a point and two centers, and the statement of Lemma 2 can be strengthened in various ways.
We use Lemma 1 as follows. Let $x$ be any data point, let $c$ be the center to which $x$ is currently assigned, and let $c'$ be any other center. The lemma says that if $\frac{1}{2} d(c,c') \ge d(x,c)$, then $d(x,c') \ge d(x,c)$. In this case, it is not necessary to calculate $d(x,c')$.

Suppose that we do not know $d(x,c)$ exactly, but we do know an upper bound $u$ such that $u \ge d(x,c)$. Then we need to compute $d(x,c)$ and $d(x,c')$ only if $u > \frac{1}{2} d(c,c')$.

If $u \le \frac{1}{2} \min_{c' \ne c} d(c,c')$, then the point $x$ must remain assigned to the center $c$, and all distance calculations for $x$ can be avoided.
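To make this pruning test concrete, here is a minimal sketch in Python with NumPy (our own illustration, not the paper's Matlab code; the function and variable names are ours), assuming Euclidean distance:

    import numpy as np

    def point_is_settled(u_x, assigned, center_dists):
        """Check whether point x can skip all distance calculations.

        u_x          -- upper bound u(x) on d(x, c(x))
        assigned     -- index of the center c(x) currently assigned to x
        center_dists -- k x k matrix of inter-center distances d(c, c')
        """
        # Lemma 1: if u(x) <= (1/2) min_{c' != c} d(c, c'), then x must
        # remain assigned to c, and no distance involving x is needed.
        others = np.delete(center_dists[assigned], assigned)
        return u_x <= 0.5 * others.min()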
Lemma 2 is applied as follows. Let $x$ be any data point, let $b$ be any center, and let $b'$ be the previous version of the same center. (That is, suppose the centers are numbered 1 through $k$, and $b$ is center number $j$; then $b'$ is center number $j$ in the previous iteration.) Suppose that in the previous iteration we knew a lower bound $l'$ such that $d(x,b') \ge l'$. Then we can infer a lower bound $l$ for the current iteration:

$d(x,b) \ge \max\{0,\; d(x,b') - d(b,b')\} \ge \max\{0,\; l' - d(b,b')\} = l.$

Informally, if $l'$ is a good approximation to the previous distance between $x$ and the $j$th center, and this center has moved only a small distance, then $l$ is a good approximation to the updated distance.

The algorithm below is the first $k$-means variant that uses lower bounds, as far as we know. It is also the first algorithm that carries over varying information from one $k$-means iteration to the next. According to the authors of (Kanungo et al., 2000): "The most obvious source of inefficiency in [our] algorithm is that it passes no information from one stage to the next. Presumably in the later stages of Lloyd's algorithm, as the centers are converging to their final positions, one would expect that the vast majority of the data points have the same closest center from one stage to the next. A good algorithm would exploit this coherence to improve running time." The algorithm in this paper achieves this goal. One previous algorithm also reuses information from one $k$-means iteration in the next, but that method, due to (Judd et al., 1998), does not carry over lower or upper bounds.

Suppose $u(x)$ is an upper bound on the distance between $x$ and the center $c(x)$ to which $x$ is currently assigned, and suppose $l(x,c)$ is a lower bound on the distance between $x$ and some other center $c$. If $u(x) \le l(x,c)$ then $d(x,c(x)) \le u(x) \le l(x,c) \le d(x,c)$, so it is necessary to calculate neither $d(x,c(x))$ nor $d(x,c)$.

Note that it will never be necessary in this iteration of the accelerated method to compute $d(x,c)$, but it may be necessary to compute $d(x,c(x))$ exactly, because of some other center $c''$ for which $u(x) \le l(x,c'')$ is not true.
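The bound maintenance that Lemma 2 justifies is cheap to apply. The following sketch (again ours, in NumPy; we assume a vector drift holding the distance each center moved) updates all bounds at once:

    import numpy as np

    def update_bounds(l, u, assigned, drift):
        """Apply Lemma 2 after the centers move.

        l        -- n x k matrix of lower bounds l(x, c)
        u        -- length-n vector of upper bounds u(x)
        assigned -- length-n integer vector; assigned[i] is the index of c(x)
        drift    -- length-k vector; drift[j] = d(old center j, new center j)
        """
        # Each lower bound shrinks by at most the drift of its center:
        # d(x, b) >= max(0, l'(x, b) - d(b, b')).
        l = np.maximum(l - drift[np.newaxis, :], 0.0)
        # The upper bound grows by at most the drift of the assigned center.
        u = u + drift[assigned]
        return l, u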
3. The new algorithm

Putting the observations above together, the accelerated $k$-means algorithm is as follows.

First, pick initial centers. Set the lower bound $l(x,c) = 0$ for each point $x$ and center $c$. Assign each $x$ to its closest initial center $c(x) = \mathrm{argmin}_c\, d(x,c)$, using Lemma 1 to avoid redundant distance calculations. Each time $d(x,c)$ is computed, set $l(x,c) = d(x,c)$. Assign upper bounds $u(x) = \min_c d(x,c)$.

Next, repeat until convergence:

1. For all centers $c$ and $c'$, compute $d(c,c')$. For all centers $c$, compute $s(c) = \frac{1}{2} \min_{c' \ne c} d(c,c')$.

2. Identify all points $x$ such that $u(x) \le s(c(x))$.

3. For all remaining points $x$ and centers $c$ such that
(i) $c \ne c(x)$ and
(ii) $u(x) > l(x,c)$ and
(iii) $u(x) > \frac{1}{2} d(c(x),c)$:

3a. If $r(x)$ then compute $d(x,c(x))$ and assign $r(x) = \text{false}$. Otherwise, $d(x,c(x)) = u(x)$.

3b. If $d(x,c(x)) > l(x,c)$ or $d(x,c(x)) > \frac{1}{2} d(c(x),c)$ then:
compute $d(x,c)$;
if $d(x,c) < d(x,c(x))$ then assign $c(x) = c$.

4. For each center $c$, let $m(c)$ be the mean of the points assigned to $c$.

5. For each point $x$ and center $c$, assign
$l(x,c) = \max\{ l(x,c) - d(c, m(c)),\; 0 \}$.

6. For each point $x$, assign
$u(x) = u(x) + d(m(c(x)), c(x))$;
$r(x) = \text{true}$.

7. Replace each center $c$ by $m(c)$.
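For concreteness, the following is a compact Python/NumPy rendering of the algorithm above. It is a sketch under the assumption of Euclidean distance; the paper's own implementation is in Matlab, and this version favors readability over the vectorization discussed later.

    import numpy as np

    def elkan_kmeans(X, centers, max_iter=100):
        """Sketch of the accelerated k-means algorithm (Euclidean case)."""
        n, k = X.shape[0], centers.shape[0]
        # Initialization: compute all point-center distances once, so the
        # initial bounds are exact. (Lemma 1 could prune some of these
        # calculations; the brute-force version keeps the sketch short.)
        D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        l = D.copy()                      # lower bounds l(x, c)
        a = D.argmin(axis=1)              # assignments c(x)
        u = D[np.arange(n), a]            # upper bounds u(x)
        r = np.zeros(n, dtype=bool)       # r(x): is u(x) possibly stale?

        for _ in range(max_iter):
            # Step 1: inter-center distances and s(c).
            cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :],
                                axis=2)
            s = 0.5 * np.where(np.eye(k, dtype=bool), np.inf, cc).min(axis=1)
            for i in range(n):
                if u[i] <= s[a[i]]:       # step 2: point cannot move
                    continue
                for c in range(k):        # step 3: conditions (i)-(iii)
                    if (c == a[i] or u[i] <= l[i, c]
                            or u[i] <= 0.5 * cc[a[i], c]):
                        continue
                    if r[i]:              # step 3a: refresh u(x) if stale
                        u[i] = np.linalg.norm(X[i] - centers[a[i]])
                        l[i, a[i]] = u[i]
                        r[i] = False
                    # Step 3b: repeat checks (ii) and (iii), then compute.
                    if u[i] > l[i, c] or u[i] > 0.5 * cc[a[i], c]:
                        l[i, c] = np.linalg.norm(X[i] - centers[c])
                        if l[i, c] < u[i]:
                            a[i] = c
                            u[i] = l[i, c]
            # Step 4: new candidate centers are the cluster means
            # (an empty cluster keeps its old center).
            m = np.vstack([X[a == c].mean(axis=0) if np.any(a == c)
                           else centers[c] for c in range(k)])
            # Steps 5 and 6: update the bounds using Lemma 2.
            drift = np.linalg.norm(m - centers, axis=1)
            l = np.maximum(l - drift[None, :], 0.0)
            u = u + drift[a]
            r[:] = True
            # Step 7: replace the centers; stop when nothing moved.
            centers = m
            if drift.max() == 0.0:
                break
        return centers, a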
In step (3), each time $d(x,c)$ is calculated for any $x$ and $c$, its lower bound is updated by assigning $l(x,c) = d(x,c)$. Similarly, $u(x)$ is updated whenever $r(x)$ is changed or $d(x,c(x))$ is computed. In step (3a), if $r(x)$ is true then $u(x)$ is out-of-date, i.e. it is possible that $u(x) \ne d(x,c(x))$. Otherwise, computing $d(x,c(x))$ is not necessary. Step (3b) repeats the checks from (ii) and (iii) in order to avoid computing $d(x,c)$ if possible.

The fundamental reason why the algorithm above is effective in reducing the number of distance calculations is that at the start of each iteration, the upper bounds $u(x)$ and the lower bounds $l(x,c)$ are tight for most points $x$ and centers $c$. If these bounds are tight at the start of one iteration, the updated bounds tend to be tight at the start of the next iteration, because the location of most centers changes only slightly, and hence the bounds change only slightly.
The initialization step of the algorithm assigns each point to its closest center immediately. This requires relatively many distance calculations, but it leads to exact upper bounds $u(x)$ for all $x$ and to exact lower bounds $l(x,c)$ for many $(x,c)$ pairs. An alternative initialization method is to start with each point arbitrarily assigned to one center. The initial values of $u(x)$ and $l(x,c)$ are then based on distances calculated to this center only. With this approach, the initial number of distance calculations is only $n$, but $u(x)$ and $l(x,c)$ are much less tight initially, so more distance calculations are required later. (After each iteration each point is always assigned correctly to its closest center, regardless of how inaccurate the lower and upper bounds are at the start of the iteration.) Informal experiments suggest that both initialization methods lead to about the same total number of distance calculations.

Logically, step (2) is redundant because its effect is achieved by condition (iii). Computationally, step (2) is beneficial because if it eliminates a point $x$ from further consideration, then comparing $u(x)$ to $l(x,c)$ for every $c$ separately is not necessary. Condition (iii) inside step (3) is beneficial despite step (2), because $u(x)$ and $c(x)$ may change during the execution of step (3).

We have implemented the algorithm above in Matlab. When step (3) is implemented with nested loops, the outer loop can be over $x$ or over $c$. For efficiency in Matlab and similar languages, the outer loop should be over $c$ since $k \ll n$ typically, and the inner loop should be replaced by vectorized code that operates on all relevant $x$ collectively.
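As an illustration of this strategy, the following NumPy sketch (ours; the helper name candidate_mask is an assumption, not part of the paper) screens all points against one center at a time, vectorizing conditions (i)-(iii):

    import numpy as np

    def candidate_mask(u, l, a, cc, s):
        """Vectorized screening: which (point, center) pairs survive
        step (2) and conditions (i)-(iii)?

        u: (n,) upper bounds; l: (n, k) lower bounds; a: (n,) assignments;
        cc: (k, k) inter-center distances; s: (k,) half of the minimum
        inter-center distance for each center.
        """
        n, k = l.shape
        active = u > s[a]                          # survivors of step (2)
        mask = np.zeros((n, k), dtype=bool)
        for c in range(k):                         # outer loop: k << n
            mask[:, c] = (active
                          & (a != c)               # (i)  c differs from c(x)
                          & (u > l[:, c])          # (ii) u(x) > l(x, c)
                          & (u > 0.5 * cc[a, c]))  # (iii)
        return mask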
Step (4) computes the new location of each cluster center $c$. Setting $m(c)$ to be the mean of the points assigned to $c$ is appropriate when the distance metric in use is Euclidean distance. Otherwise, $m(c)$ may be defined differently. For example, with $k$-medians the new center of each cluster is a representative member of the cluster.
4. Experimental results

This section reports the results of running the new algorithm on six large datasets, five of which are high-dimensional. The datasets are described in Table 1, while Table 2 gives the results.

Table 1. The datasets used in the experiments.

    name      cardinality  dimensionality  description
    birch     100000       2               10 by 10 grid of Gaussian clusters, DS1 in (Zhang et al., 1996)
    covtype   150000       54              remote soil cover measurements, after (Moore, 2000)
    kddcup    95413        56              KDD Cup 1998 data, un-normalized
    mnist50   60000        50              random projection of NIST handwritten digit training data
    mnist784  60000        784             original NIST handwritten digit training data
    random    10000        1000            uniform random data

Table 2. Rows labeled "standard" and "fast" give the number of distance calculations performed by the unaccelerated $k$-means algorithm and by the new algorithm. Rows labeled "speedup" show how many times faster the new algorithm is, when the unit of measurement is distance calculations.
Our experimental design is similar to the design of (Moore, 2000), which is the best recent paper on speeding up the $k$-means algorithm for high-dimensional data. However, there is only one dataset used in (Moore, 2000) for which the raw data are available and enough information is given to allow the dataset to be reconstructed. This dataset is called "covtype." Therefore, we also use five other publicly available datasets. None of the datasets have missing data.

In order to make our results easier to reproduce, we use a fixed initialization for each dataset $D$. The first center is initialized to be the mean of $D$. Subsequent centers are initialized according to the "furthest first" heuristic: each new center is $\mathrm{argmax}_x \min_{c \in C} d(x,c)$, where $C$ is the set of initial centers chosen so far (Dasgupta, 2002).
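For reference, a minimal NumPy rendering of this initialization (ours; the function name is an assumption) is:

    import numpy as np

    def furthest_first_centers(X, k):
        """First center = mean of the data; each subsequent center is the
        point furthest from all centers chosen so far (Dasgupta, 2002)."""
        centers = [X.mean(axis=0)]
        # min_dist[i] tracks min_{c in C} d(x_i, c) for the current set C.
        min_dist = np.linalg.norm(X - centers[0], axis=1)
        for _ in range(k - 1):
            i = int(min_dist.argmax())          # argmax_x min_c d(x, c)
            centers.append(X[i])
            min_dist = np.minimum(min_dist,
                                  np.linalg.norm(X - X[i], axis=1))
        return np.vstack(centers)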
Following the practice of past research, we measure the performance of an algorithm on a dataset as the number of distance calculations required. All algorithms that accelerate $k$-means incur overhead to create and update auxiliary data structures. This means that speedup compared to $k$-means is always less in clock time than in number of distance calculations. Our algorithm reduces the number of distance calculations so dramatically that its overhead time is often greater than the time spent on distance calculations. However, the total execution time is always much less than the time required by standard $k$-means. The overhead of the $l$ and $u$ data structures will be much smaller with a C implementation than with the Matlab implementation used for the experiments reported here. For this reason, clock times are not reported.

Perhaps the most striking observation to be made from Table 2 is that the relative advantage of the new method increases with $k$. The number of distance calculations grows as $nke$ for standard $k$-means, where $e$ is the number of iterations.

A related observation is that for $k = 20$ and $k = 100$ we obtain a much better speedup than with the anchors method (Moore, 2000). The speedups reported by Moore for the "covtype" dataset are 24.8, 11.3, and 19.0 respectively for clustering with 3, 20, and 100 centers. The speedups we obtain are 8.60, 107, and 310. We conjecture that the improved speedup for $k \ge 20$ arises in part from using the actual cluster centers as adaptive "anchors," instead of using a set of anchors fixed in preprocessing. The worse speedup for $k = 3$ remains to be explained.

Another striking observation is that the new method remains effective even for data with very high dimensionality. Moore writes "If there is no underlying structure in the data (e.g. if it is uniformly distributed) there will be little or no acceleration in high dimensions no matter what we do. This gloomy view, supported by recent theoretical work in computational geometry (Indyk et al., 1999), means that we can only accelerate datasets that have interesting internal structure." While this claim is almost certainly true asymptotically as the dimension of a dataset tends to infinity, our results on the "random" dataset suggest that worthwhile speedup can still be obtained up to at least 1000 dimensions. As expected, the more clustered a dataset is, the greater the speedup obtained. Random projection makes clusters more Gaussian (Dasgupta, 2000), so speedup is better for the "mnist50" dataset than for the "mnist784" dataset.
5. Limitations and extensions

During each iteration of the algorithm proposed here, the lower bounds $l(x,c)$ are updated for all points $x$ and centers $c$. These updates take $O(nk)$ time, so the time complexity of the algorithm remains $O(nk)$ per iteration even when the number of distance calculations is far smaller. In some applications, for example vector quantization for compression, $k$ is large. For these applications the memory required to store the lower bounds $l(x,c)$ may be the dominant storage cost. However, the entire matrix $l(x,c)$ never needs to be kept in main memory. If the data are streamed into memory at each iteration from disk, then the $l(x,c)$ matrix can also be streamed, or computed if necessary.

When the algorithm is used with a distance function that is fast to evaluate, such as an $L_p$ norm, then in practice the time complexity of the algorithm is dominated by the bookkeeping used to avoid distance calculations. Therefore, future work should focus on reducing this overhead.

The point above is especially true because a Euclidean distance (or other $L_p$ distance) in $d$ dimensions can often be compared to a known minimum distance in $o(d)$ time. The simple idea is to stop evaluating the new squared distance when the sum of squares so far is greater than the known squared minimum distance. (This suggestion is usually ascribed to (Bei & Gray, 1985), but in fact it is first mentioned in (Cheng et al., 1984).) Distance calculations can be stopped even quicker if axes of high variation are considered first. Axes of maximum variation may be found by principal component analysis (PCA) (McNames, 2000), but the preprocessing cost of PCA may be prohibitive.
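A minimal sketch of this early-stopping idea (ours, in plain Python so that the early exit is explicit):

    def partial_squared_distance(x, y, best_sq):
        """Compute the squared distance between x and y, but stop early:
        return None as soon as the running sum of squares exceeds best_sq,
        the squared distance of the best candidate found so far."""
        total = 0.0
        for xi, yi in zip(x, y):
            diff = xi - yi
            total += diff * diff
            if total > best_sq:
                return None   # y is certainly not closer than the best
        return total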
At the end of each iteration, centers must be recomputed. Computing means takes $O(n)$ time, independent of $k$. This can be reduced to $O(\delta)$ time, where $\delta$ is the number of points assigned to a different center during the iteration. Typically $\delta \ll n$ in all except the first few iterations. As mentioned above, the algorithm of this paper can also be used when centers are not recomputed as means.
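One way to obtain the $O(\delta)$ behavior (a sketch under our own bookkeeping conventions, not code from the paper) is to maintain per-cluster sums and counts:

    import numpy as np

    def reassign_point(x, old_c, new_c, sums, counts):
        """Move point x from cluster old_c to new_c in O(d) time by
        maintaining per-cluster sums and counts, so that each new mean
        is sums[c] / counts[c] without a full O(n) recomputation."""
        sums[old_c] -= x
        counts[old_c] -= 1
        sums[new_c] += x
        counts[new_c] += 1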
During each iteration, distances between all centers must be recomputed, so the minimum number of distance computations per iteration is $k(k-1)/2$. For large $k$, as in vector quantization, this may be a dominant expense. Future research should investigate the best way to reduce this cost by computing approximations for inter-center distances that are large.
References

Dasgupta, S. (2002). Performance guarantees for hierarchical clustering. Fifteenth Annual Conference on Computational Learning Theory (COLT'02) (pp. 351–363). Springer Verlag.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Deng, K., & Moore, A. W. (1995). Multiresolution instance-based learning. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1233–1239). San Francisco: Morgan Kaufmann.

Faber, V. (1994). Clustering and the continuous k-means algorithm. Los Alamos Science, 138–144.

Farnstrom, F., Lewis, J., & Elkan, C. (2000). Scalability for clustering algorithms revisited. ACM SIGKDD Explorations, 2, 51–57.

Hamerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. Proceedings of the Eleventh International Conference on Information and Knowledge Management (pp. 600–607). McLean, Virginia, USA: ACM Press.

Montolio, P., Gasull, A., Monte, E., Torres, L., & Marques, F. (1992). Analysis and optimization of the k-means algorithm for remote sensing applications. In A. Sanfeliu (Ed.), Pattern recognition and image analysis, 155–170. World Scientific.

Moore, A. W. (2000). The anchors hierarchy: Using the triangle inequality to survive high dimensional data. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (pp. 397–405). Morgan Kaufmann.

Orchard, M. T. (1991). A fast nearest-neighbor search algorithm. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2297–2300). IEEE Computer Society Press.

Pelleg, D., & Moore, A. (1999). Accelerating exact k-means algorithms with geometric reasoning. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99) (pp. 277–281).

Phillips, S. J. (2002). Acceleration of k-means and related clustering algorithms. Fourth International Workshop on Algorithm Engineering and Experiments (ALENEX) (pp. 166–177). Springer Verlag.