Ming-Syan Chen, Jong Soo Park, and Philip S. Yu
cur, a forward reference path terminates. This resulting forward reference path is termed a maximal forward reference. After a maximal forward reference, the beginning of a new path, which is not linked to the previous traversal, has a null source node. Given a traversal sequence {(s1, d1), (s2, d2), ..., (sn, dn)} of a user, we shall map it into multiple subsequences, each of which represents a maximal forward reference.

The algorithm for finding all maximal forward references is given as follows. First, the traversal log database is sorted by user id's, resulting in a traversal path, {(s1, d1), (s2, d2), ..., (sn, dn)}, for each user, where the pairs (si, di) are ordered by time. Algorithm MF is then applied to each user path to determine all of its maximal forward references. Let DF denote the database that stores all the resulting maximal forward references.

Algorithm MF:

Step 1: Set i = 1 and string Y to null for initialization, where string Y is used to store the current forward reference path. Also, set the flag F = 1 to indicate a forward traversal.

Step 2: Let A = si and B = di.
If A is equal to null then
/* this is the beginning of a new traversal */
begin
Write out the current string Y (if not null) to the database DF;
Set string Y = B;
Go to Step 5.
end

Step 3: If B is equal to some reference (say the j-th reference) in string Y then
/* this is a cross-referencing back to a previous reference */
begin
If F is equal to 1 then write out string Y to database DF;
Discard all the references after the j-th one in string Y;
F = 0;
Go to Step 5.
end

Step 4: Otherwise, append B to the end of string Y.
/* we are continuing a forward traversal */
If F is equal to 0, set F = 1.

Step 5: Set i = i + 1. If the sequence is not completely scanned, then go to Step 2.

Consider the traversal scenario in Figure 1 for example. It can be verified that the first backward reference is encountered in the 4-th move (i.e., from D to C). At that point, the maximal forward reference ABCD is written to DF (by Step 3). In the next move (i.e., from C to B), although the first conditional statement in Step 3 is again true, nothing is written to DF since the flag F = 0, meaning that it is in a reverse traversal. The subsequent forward references will put ABEGH into the string Y, which is then written to DF when a reverse reference (from H to G) is encountered. The execution scenario by algorithm MF for the input in Figure 1 is given in Table 1.

Table 1: An example execution by algorithm MF.

move   string Y   output to DF
1      AB
2      ABC
3      ABCD
4      ABC        ABCD
5      AB
6      ABE
7      ABEG
8      ABEGH
9      ABEG       ABEGH
10     ABEGW
11     A          ABEGW
12     AO
13     AOU
14     AO         AOU
15     AOV        AOV (end)

3.2 Finding large reference sequences

Once the database containing all maximal forward references for all users, DF, is constructed, we can derive the frequent traversal patterns by identifying the frequently occurring reference sequences in DF. A sequence s1, ..., sn is said to contain r1, ..., rk as a consecutive subsequence if there exists an i such that si+j = rj, for 1 ≤ j ≤ k. A sequence of k references, r1, ..., rk, is called a large k-reference sequence if there are a sufficient number of users with maximal forward references in DF containing r1, ..., rk as a consecutive subsequence.

We shall describe below two algorithms for mining traversal patterns. The first one, called the full-scan (FS) algorithm, essentially utilizes the concept of DHP (i.e., hashing and pruning) while solving the discrepancy between traversal patterns and association rules.
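Algorithm MF above can be sketched in Python. The sketch below is illustrative rather than the authors' implementation; a traversal path is given as (source, destination) pairs, with a null source marking the beginning of a new path, exactly as in Step 2:

```python
def maximal_forward_references(path):
    """Algorithm MF (sketch): split one user's traversal path, given as
    (source, destination) pairs ordered by time, into maximal forward
    references."""
    df = []          # database DF of maximal forward references
    y = []           # string Y: the current forward reference path
    forward = True   # flag F: True while traversing forward
    for src, dst in path:
        if src is None:
            # Step 2: beginning of a new traversal, not linked to the previous one
            if y:
                df.append("".join(y))
            y = [dst]
            forward = True
            continue
        if dst in y:
            # Step 3: a cross-reference back to the j-th node in Y
            if forward:
                df.append("".join(y))
            y = y[: y.index(dst) + 1]
            forward = False
        else:
            # Step 4: continue the forward traversal
            y.append(dst)
            forward = True
    if y:
        # flush the last forward path when the log ends
        df.append("".join(y))
    return df
```

Running it on the moves tabulated in Table 1 reproduces the "output to DF" column.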
Although trimming the transaction database as it proceeds to later passes, FS is required to scan the transaction database in each pass. In contrast, by properly utilizing the candidate reference sequences, the second algorithm, referred to as the selective-scan (SS) algorithm, is improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required.

3.2.1 Algorithm on full scan (FS)

To describe algorithm FS, we shall first summarize the key ideas of the DHP algorithm. The details of DHP can be found in [10]. Recall that DHP has two major features in determining association rules: one is efficient generation of large itemsets and the other is effective reduction of the transaction database size after each scan. As shown in [10], by utilizing a hash technique, DHP is very efficient for the generation of candidate itemsets, in particular for the large 2-itemsets, thus greatly improving the performance bottleneck of the whole process. In addition, DHP employs effective pruning techniques to progressively reduce the transaction database size.

Recall that Lk represents the set of all large k-references and Ck is a set of candidate k-references. Ck is in general a superset of Lk. By scanning through DF, FS gets L1 and makes a hash table (i.e., H2) to count the number of occurrences of each 2-reference. Similarly to DHP, starting with k = 2, FS generates Ck based on the hash table count obtained in the previous pass, determines the set of large k-references, reduces the size of the database for the next pass, and makes a hash table to determine the candidate (k+1)-references. Note that as in mining association rules, a set of candidate references, Ck, can be generated from joining Lk-1 with itself, denoted by Lk-1 * Lk-1. However, due to the difference between traversal patterns and association rules, we modify this approach as follows. For any two distinct reference sequences in Lk-1, say r1, ..., rk-1 and s1, ..., sk-1, we join them together to form a k-reference sequence only if either r1, ..., rk-1 contains s1, ..., sk-2 or s1, ..., sk-1 contains r1, ..., rk-2 (i.e., after dropping the first element in one sequence and the last element in the other sequence, the remaining two (k-2)-references are identical). We note that when k is small (especially for the case of k = 2), deriving Ck by joining Lk-1 with itself will result in a very large number of candidate references, and the hashing technique is thus very helpful for such a case. As k increases, the size of Lk-1 * Lk-1 can decrease significantly. Same as in [10], we found that it is generally beneficial for FS to generate Ck directly from Lk-1 * Lk-1 (i.e., without using hashing) after k ≥ 3.

To count the occurrences of each k-reference in Ck to determine Lk, we need to scan through a trimmed version of database DF. From the set of maximal forward references, we determine, among the k-references in Ck, the large k-references. After the scan of the entire database, those k-references in Ck with count exceeding the threshold become Lk. If Lk is non-empty, the iteration continues for the next pass, i.e., pass k+1. Same as in DHP, every time the database is scanned, the database is trimmed by FS to improve the efficiency of future scans.

3.2.2 Algorithm on selective scan (SS)

Algorithm SS is similar to algorithm FS in that it also employs hashing and pruning techniques to reduce both CPU and I/O costs, but is different from the latter in that algorithm SS, by properly utilizing the information in candidate references from prior passes, is able to avoid database scans in some passes, thus further reducing the disk I/O cost. The method for SS to avoid some database scans and reduce disk I/O cost is described below. Recall that algorithm FS generates a small number of candidate 2-references by using a hashing technique. In fact, this small C2 can be used to generate the candidate 3-references. Clearly, a C3' generated from C2 * C2, instead of from L2 * L2, will have a size greater than |C3|, where C3 is generated from L2 * L2. However, if |C3'| is not much larger than |C3|, and both C2 and C3' can be stored in main memory, we can find L2 and L3 together when the next scan of the database is performed, thereby saving one round of database scan. It can be seen that, using this concept, one can determine all Lk's by as few as two scans of the database (i.e., one initial scan to determine L1 and a final scan to determine all other large reference sequences), assuming that Ck' for k ≥ 3 is generated from Ck-1' and all Ck' for k > 2 can be kept in memory.

Note that when the minimum support is relatively small or potentially large references are long, Ck and Lk could become large. If |Ck+1'| > |Ck'| for k ≥ 2, then it may cost too much CPU time to generate all subsequent Cj', j > k+1, from candidate sets of large references, since the size of Cj' may become huge quickly, thus compromising all the benefit from saving disk I/O cost. This fact suggests that a timely
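The per-pass counting step shared by FS and SS, scanning the maximal forward references and keeping the candidates whose support count reaches the threshold, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it counts each candidate once per containing maximal forward reference and omits the hash-table and database-trimming optimizations:

```python
def contains_consecutive(seq, ref):
    """True iff seq contains ref as a consecutive subsequence,
    i.e. ref appears in seq as one unbroken run."""
    k = len(ref)
    return any(seq[i:i + k] == ref for i in range(len(seq) - k + 1))

def large_references(df, candidates, min_count):
    """Determine L_k from C_k with one scan over the maximal forward
    references in DF (min_count is an absolute support threshold)."""
    counts = {c: 0 for c in candidates}
    for seq in df:
        for c in candidates:
            if contains_consecutive(seq, c):
                counts[c] += 1
    return {c for c, n in counts.items() if n >= min_count}
```

Here reference sequences are written as strings of single-character page names, matching the running example of Table 1.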
Table 2: Results from an example run by FS and SS.

                     k     1     2     3     4     5     6    (sec)
Algorithm FS   |Ck|             121    84    58    22     3
               |Lk|       94    91    84    58    21     3    19.48
Algorithm SS   |Ck|             121   144    58    22     3
               |Lk|       94    91    84    58    21     3    18.75

[Figure: the traversal tree used in the simulation — a root node; internal nodes, 3% of which have an internal jump (to any node); and leaf nodes, from which a move either goes back to the parent node (25%) or jumps to an internal node (75%). Edges carry probabilities p0 (back to the parent node), p1, ..., p4 (to the children nodes), and pj (the internal jump).]
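The difference between the FS and SS rows of Table 2 lies only in how candidates are generated: FS joins Lk-1 with itself, while SS may join the previous candidate set with itself (hence C3' of size 144 from C2 * C2 in the SS row, versus C3 of size 84 from L2 * L2 in the FS row). A minimal sketch of the modified join described in Section 3.2.1, again with reference sequences as strings of single-character page names (illustrative, not the authors' code):

```python
from itertools import product

def join_references(refs):
    """Join a set of (k-1)-reference sequences with itself: two distinct
    sequences r and s form the k-candidate r + s[-1] iff dropping the
    first element of r and the last element of s leaves the same
    (k-2)-sequence."""
    return {r + s[-1]
            for r, s in product(refs, repeat=2)
            if r != s and r[1:] == s[:-1]}
```

Because the function only assumes its input is a set of (k-1)-sequences, the same join produces Ck from Lk-1 (as in FS) or Ck' from Ck-1' (as in SS).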
Table 3: Meaning of various parameters.

H       The height of a traversal tree.
F       The number of child nodes, fanout.
        A parameter of a Zipf-like distribution.
HxPy    x is the height of a tree and y = |P|.
|D|     The number of reference paths.
Dk      Set of forward references for Lk.
Ck      Set of candidate k-reference sequences.
Lk      Set of large k-reference sequences.
|P|     Average size of the reference paths.
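The paragraph below normalizes the transition probabilities at each internal node of the simulated tree. The following sketch implements that normalization, assuming the child weights are drawn from an exponential distribution with unit mean (an assumption for illustration; the weight distribution is not fully specified in this excerpt):

```python
import random

def node_probabilities(n_children, p0, pj=0.0):
    """Sketch: child weights drawn from an exponential distribution with
    unit mean (an assumption) are normalized to sum to 1 - p0; if the
    node has an internal jump of weight pj, p0 and every child
    probability are rescaled by (1 - pj) so that all probabilities
    associated with the node still sum to one."""
    weights = [random.expovariate(1.0) for _ in range(n_children)]
    total = sum(weights)
    children = [(1.0 - p0) * w / total for w in weights]
    if pj > 0.0:
        p0 *= 1.0 - pj
        children = [p * (1.0 - pj) for p in children]
    return p0, children  # plus pj itself, if the node has a jump
```

With p0 = 0.1 and pj = 0.05, for example, the returned p0 is 0.095, the children sum to 0.855, and adding the jump weight 0.05 gives exactly one.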
unit mean, and is so normalized that the sum of the weights for all child nodes is equal to 1 - p0. If this internal node has an internal jump and the weight for this jump is pj, then p0 is changed to p0(1 - pj) and the corresponding probability for each child node is changed to pi(1 - pj), such that the sum of all the probabilities associated with this node remains one. When the path arrives at a leaf node, the next move is either to its parent node, backward (with probability 0.25), or to any internal node (with an aggregate probability of 0.75). Some internal nodes in the tree have internal jumps which can go to any other nodes. The number of internal nodes with internal jumps is denoted by NJ, which is set to 3% of all the internal nodes in general cases. Table 3 summarizes the meaning of the various parameters used in our simulations.

Figure 3: Execution Times for FS and SS. (Two graphs for H10P5.D200K: CPU time and I/O time, in seconds, versus minimum support ranging from 1.5 down to 0.25.)

Figure 3 represents the execution times of the two methods, FS and SS, when |D| = 200,000, NJ = 3%, and pj = 0.1. HxPy means that x is the height of a tree and y is the average size of the reference paths. D200K means that the number of reference paths is 200,000. A tree for H10 was obtained when the height of the tree is 10 and the fanout at each internal node is between 4 and 7. The root node has 7 child nodes. The number of internal nodes is 16,200 and the number of leaf nodes is 73,006. The number of internal nodes with internal jumps is thus 16,200 x NJ = 486. Note that the total number of nodes increases as the height of a tree increases. To make the experiment tractable, we reduced the fanout to 2-5 for the tree of H20 with height 20. This tree contained 616,595 internal nodes and 1,541,693 leaves. In Figure 3, the left graph of each HxPy.D200K represents the CPU time to find all the large reference sequences, and the right graph shows the I/O time to find them, where the disk I/O rate is set to 2 MB/sec and a 1 MB buffer
is used in main memory. It can be seen from Figure 3 that algorithm SS in general outperforms FS, and their performance difference becomes prominent when the I/O cost is taken into account. From our experiments, it was shown that both the CPU and I/O times of each method increase linearly as the database size increases. It can be seen that SS consistently outperforms FS as the database size increases.

5 Conclusion

In this paper, we have explored a new data mining capability which involves mining traversal patterns in an information-providing environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consisted of two steps. First, we derived algorithm MF to convert the original sequence of log data into a set of maximal forward references. By doing so, we filtered out the effect of some backward references and concentrated on mining meaningful user access sequences. Second, we developed algorithms to determine large reference sequences from the maximal forward references obtained. Two algorithms were devised for determining large reference sequences: one was based on some hashing and pruning techniques, and the other was further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. The performance of these two methods has been comparatively analyzed. It is shown that the option of selective scan is very advantageous, and algorithm SS thus in general outperformed algorithm FS. Sensitivity analysis on various parameters was conducted.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient Similarity Search in Sequence Databases. Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, October 1993.
[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An Interval Classifier for Database Mining Applications. Proceedings of the 18th International Conference on Very Large Data Bases, pages 560-573, August 1992.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pages 207-216, May 1993.
[4] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, September 1994.
[5] R. Agrawal and R. Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, pages 3-14, March 1995.
[6] T.M. Anwar, H.W. Beck, and S.B. Navathe. Knowledge Mining by Imprecise Querying: A Classification-Based Approach. Proceedings of the 8th International Conference on Data Engineering, pages 622-630, February 1992.
[7] J. December and N. Randall. The World Wide Web Unleashed. SAMS Publishing, 1994.
[8] J. Han, Y. Cai, and N. Cercone. Knowledge Discovery in Databases: An Attribute-Oriented Approach. Proceedings of the 18th International Conference on Very Large Data Bases, pages 547-559, August 1992.
[9] J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. Proceedings of the 21st International Conference on Very Large Data Bases, pages 420-431, September 1995.
[10] J.-S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pages 175-186, May 1995.
[11] J.-S. Park, M.-S. Chen, and P. S. Yu. Efficient Parallel Data Mining for Association Rules. Proceedings of the 4th International Conference on Information and Knowledge Management, November 29 - December 3, 1995.
[12] G. Piatetsky-Shapiro. Discovery, Analysis and Presentation of Strong Rules. Knowledge Discovery in Databases, pages 229-248, 1991.
[13] J.R. Quinlan. Induction of Decision Trees. Machine Learning, 1:81-106, 1986.
[14] J. T.-L. Wang, G.-W. Chirn, T.G. Marr, B. Shapiro, D. Shasha, and K. Zhang. Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results. Proceedings of ACM SIGMOD, Minneapolis, MN, pages 115-125, May 1994.