Authorized licensed use limited to: SATHYABAMA UNIVERSITY. Downloaded on November 5, 2008 at 03:37 from IEEE Xplore. Restrictions apply.
detection, but it suffers from being order dependent and from having a tendency to produce large clusters [5]. Moreover, it processes the input documents only one at a time, which is limiting for large collections of Web documents.
Document3: Cat ate mouse too.
Lei et al. [6] proposed an improved incremental K-means for detecting events. In order to select initial cluster centers objectively, the algorithm utilizes a density function to initialize cluster centers. The [...]
most cluster algorithms can not process document [...]

[...] clusters $B_m$ and $B_n$, with sizes $|B_m|$ and $|B_n|$ respectively, and $|B_m \cap B_n|$ representing the number of documents common to both base clusters. The similarity of $B_m$ and $B_n$ is defined to be 1 if:
$|B_m \cap B_n| / |B_m| > 0.5$ and $|B_m \cap B_n| / |B_n| > 0.5$

The document frequency of each node df(n) is defined as the number of different documents that have traversed node n. For example, in Figure 2, df(a) = 2.
Chim et al. [7] also proposed that simply ignoring the stopwords becomes impractical in STC, because
suffix tree model is trying to keep the sequential order of each word in a document, and the same phrase or words might occur in different nodes of the suffix tree. Based on this idea, they proposed the definition of "stopnode", which applies the same idea as stopwords in the suffix tree similarity measure computation. A threshold idfthd of inverse document frequency (idf) is given to identify whether a node is a stopnode. Their experiments showed that the design is very efficient.
In our paper, we adopt the same idea of "stopnode" instead of stopwords, which also proves to be efficient.

5. Experimental results

5.1. Data preparation

We use the TDT [1] corpus, a benchmark for event detection, for our experiments. We choose the TDT2 English corpus, which contains news data collected daily from 6 news sources over a period of six months. Detection is an unsupervised classification task that does not involve training data, so the whole English corpus is used as evaluation data. More details of our dataset are listed in Table 1.

Table 1. Details of dataset
Time                          January - June, 1998
Number of Articles            1930
Number of Topics (Events)     70
Average articles per event    27

5.2. Evaluation measures

The TDT project has its own evaluation plan, i.e., detection performance is characterized in terms of the probability of miss and false alarm errors, and these error probabilities are then combined into a single detection cost by assigning costs to miss and false alarm errors. However, its tasks are not consistent with ours. Thus, we choose the same evaluation metrics as those in Yang et al. [2]. Table 2 illustrates the contingency table for a cluster-event pair, where a, b, c, d are document counts in the corresponding cells. Five evaluation measures are defined, including Miss, False alarm, Recall, Precision, and F-measure. To measure global performance, two averaging methods are used: micro-average, by summing the corresponding cells and then computing the five measures, and macro-average, by averaging the five measures over all events.

Table 2. Contingency table
                  in event    not in event
in cluster        a           b
not in cluster    c           d

In our paper, we use the micro-average and the macro-average of F-measure as our evaluation measures, and we compute both scores for each clustering result. The F-measure was originally defined by C. J. van Rijsbergen [9] as the harmonic mean of recall and precision. The measure is defined as follows if (a + b + c) > 0, otherwise undefined:

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2a}{2a + b + c}$$

where Precision = a/(a + b) if a + b > 0, otherwise undefined; Recall = a/(a + c) if a + c > 0, otherwise undefined.

5.3. Performance on the data set

To compare our approach with other algorithms, Yang et al.'s augmented GAC and the generally used KNN algorithm are chosen as baselines. Since GAC is a hierarchical clustering method, we stop when there are k clusters left, and run re-clustering 5 times, following the recommended settings in [2]. For KNN we report the results under the best threshold.
The original STC algorithm selects the 500 highest scoring base clusters for further cluster merging, but only the top 10 clusters are selected from the merged clusters as the final clustering result. Thus we also allowed GAC and KNN to generate 10 clusters in our experiments to make the comparison as fair as possible.
We use Java to implement all three algorithms, and use Excel to plot all figures. The results of the three approaches are compared in Figure 3, which shows
the performance of the three, and Figure 4, which shows the execution time of the three.

Figure 3. The performance of the three methods

Figure 4. The execution time of the three methods (Time/s)

STC clearly obtains the best results. We believe the main reason for STC's best performance is the STC model, which is most suitable for event detection. Furthermore, STC is the fastest algorithm among the three, which is owing to the direct-inserting policy of STC.

6. Conclusions and future work

Our work presented in this paper is mainly focused on improving the effectiveness of document clustering algorithms, which is a hotspot of event detection. Efficiency optimization of the algorithm has been a target of our current work. STC is a suitable algorithm for clustering in event detection, and has excellent features such as linear time complexity and incrementality. In this work we have shown a new event detection algorithm based on an improved STC algorithm. We improved STC in the following aspects: an improved method of merging base clusters, a new definition of cluster score, new cluster labels, and the implementation of "stopnode". We also demonstrate substantial results using the improved STC algorithm, which indicate the best performance and the shortest execution time compared to KNN and GAC.
In the course of this work, we encountered a number of interesting questions and hope to answer them in our future research. First, we are not satisfied with the threshold, although STC is not very sensitive to the threshold. Second, we will consider improving the STC algorithm by combining it with other algorithms, which may obtain better results.
The present work can be extended in a number of important directions. One is dictated by the multilingual nature of TDT: the STC algorithm should be capable of dealing with documents in multiple languages. We are very interested in applying the improved STC algorithm to Chinese.

References
[1] Topic detection and tracking (tdt) project. homepage: http://www.nist.gov/speech/tests/tdt/.
[2] Y. Yang and J. G. Carbonell et al. Learning Approaches for Detecting and Tracking News Events [J]. IEEE Intelligent Systems: Special Issue on Applications of Intelligent Information Retrieval, 1999, 14 (4): 32-43.
[3] Y. Yang and J.Z. et al. Topic-conditioned novelty detection. In Proc. of SIGKDD international conference on knowledge discovery and data mining, 2002.
[4] J. Allan. Topic detection and tracking: event-based information organization [M]. Dordrecht: Kluwer Academic Publishers, 2002.
[5] O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proc. of SIGIR'98, University of Washington, Seattle, USA, 1998.
[6] Z. Lei and L. Wu et al. Incremental K-means Method Based on Initialisation of Cluster Centers and Its Application in News Event Detection. Journal of the China Society for Scientific. 25 (3): 289-295, 2006.
[7] H. Chim and X. Deng. A New Suffix Tree Similarity Measure for Document Clustering. In Proc. of WWW'2007, Banff, Alberta, Canada, 2007.
[8] S. Branson and A. Greenberg. Clustering Web Search Results Using Suffix Tree Methods. homepage: http://stanford.edu/lass/archive/cs/cs276a/cs276a
[9] C. J. van Rijsbergen. Information Retrieval [M]. London: Butterworths, 1979.
[10] J.W. Yang. A Chinese Web Page Clustering Algorithm Based on the Suffix Tree. Wuhan University Journal of Natural Sciences [M]. 9 (5): 817-822, 2004.
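As a rough illustration of the base-cluster similarity rule of the original STC algorithm quoted in this paper (similarity 1 when the shared documents exceed half of each base cluster), the following Python sketch merges base clusters that are linked by similarity 1. It assumes that documents are represented by integer ids and that merging is done by taking connected components, as in the original STC; the names `similar` and `merge_base_clusters` and the sample data are hypothetical, and the sketch does not reproduce the improved merging method proposed in this paper.

```python
from typing import Dict, List, Set


def similar(bm: Set[int], bn: Set[int], threshold: float = 0.5) -> bool:
    """Binary similarity: True only if the overlap exceeds the
    threshold relative to the size of BOTH base clusters."""
    if not bm or not bn:
        return False
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold


def merge_base_clusters(base: List[Set[int]]) -> List[Set[int]]:
    """Merge base clusters connected by similarity 1 (connected
    components, computed with a small union-find)."""
    parent = list(range(len(base)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(base)):
        for j in range(i + 1, len(base)):
            if similar(base[i], base[j]):
                parent[find(i)] = find(j)

    merged: Dict[int, Set[int]] = {}
    for i, docs in enumerate(base):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())


# Hypothetical base clusters, given as sets of document ids.
base = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {8, 9, 10, 11}]
clusters = merge_base_clusters(base)
print(sorted(sorted(c) for c in clusters))
```

In this sample the first two base clusters share two of their three documents and are merged, while the last two share only one document, which is exactly half of the smaller cluster and therefore does not exceed the 0.5 threshold.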
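The evaluation measures of Section 5.2 can be sketched in a few lines of Python. The sketch below computes the F-measure from the contingency cells a, b, c and then the micro-average (sum the cells first) and the macro-average (average the per-event scores). The function names and the sample counts are hypothetical and not taken from the experiments; skipping events whose F-measure is undefined in the macro-average is also an assumption of this sketch.

```python
from typing import List, Optional, Tuple

# One contingency table per cluster-event pair: (a, b, c, d).
Cells = Tuple[int, int, int, int]


def f_measure(a: int, b: int, c: int) -> Optional[float]:
    """F = 2a / (2a + b + c); None when a + b + c == 0 (undefined)."""
    if a + b + c == 0:
        return None
    return 2 * a / (2 * a + b + c)


def micro_f(tables: List[Cells]) -> Optional[float]:
    """Micro-average: sum the corresponding cells, then compute F once."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    return f_measure(a, b, c)


def macro_f(tables: List[Cells]) -> Optional[float]:
    """Macro-average: compute F per pair, then average the defined values
    (assumption: undefined scores are skipped)."""
    scores = [f_measure(a, b, c) for a, b, c, _ in tables]
    defined = [s for s in scores if s is not None]
    return sum(defined) / len(defined) if defined else None


# Hypothetical counts for two cluster-event pairs (illustration only).
tables = [(20, 5, 7, 1898), (3, 1, 24, 1902)]
print(micro_f(tables), macro_f(tables))
```

The micro-average is dominated by large events because their cells contribute most to the sums, whereas the macro-average weights every event equally, which is why reporting both gives a fuller picture.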