Ming-Syan Chen, Jong Soo Park, and Philip S. Yu
cur, a forward reference path terminates. This resulting forward reference path is termed a maximal forward reference. After a maximal forward reference, the beginning of a new path, which is not linked to the previous traversal, has a null source node. Given a traversal sequence {(s1, d1), (s2, d2), ..., (sn, dn)} of a user, we shall map it into multiple subsequences, each of which represents a maximal forward reference.

The algorithm for finding all maximal forward references is given as follows. First, the traversal log database is sorted by user id's, resulting in a traversal path, {(s1, d1), (s2, d2), ..., (sn, dn)}, for each user, where the pairs (si, di) are ordered by time. Algorithm MF is then applied to each user path to determine all of its maximal forward references. Let DF denote the database that stores all the resulting maximal forward references.

Algorithm MF:

Step 1: Set i = 1 and string Y to null for initialization, where string Y is used to store the current forward reference path. Also, set the flag F = 1 to indicate a forward traversal.

Step 2: Let A = si and B = di.
If A is equal to null then
/* this is the beginning of a new traversal */
begin
Write out the current string Y (if not null) to the database DF;
Set string Y = B;
Go to Step 5.
end

Step 3: If B is equal to some reference (say the j-th reference) in string Y then
/* this is a cross-referencing back to a previous reference */
begin
If F is equal to 1 then write out string Y to database DF;
Discard all the references after the j-th one in string Y;
F = 0;
Go to Step 5.
end

Step 4: Otherwise, append B to the end of string Y.
/* we are continuing a forward traversal */
If F is equal to 0, set F = 1.

Step 5: Set i = i + 1. If the sequence is not completely scanned, then go to Step 2.

Consider the traversal scenario in Figure 1 for example. It can be verified that the first backward reference is encountered in the 4-th move (i.e., from D to C). At that point, the maximal forward reference ABCD is written to DF (by Step 3). In the next move (i.e., from C to B), although the first conditional statement in Step 3 is again true, nothing is written to DF since the flag F = 0, meaning that it is in a reverse traversal. The subsequent forward references will put ABEGH into the string Y, which is then written to DF when a reverse reference (from H to G) is encountered. The execution scenario by algorithm MF for the input in Figure 1 is given in Table 1.

Table 1: An example execution by algorithm MF.

move   string Y   output to DF
1      AB
2      ABC
3      ABCD
4      ABC        ABCD
5      AB
6      ABE
7      ABEG
8      ABEGH
9      ABEG       ABEGH
10     ABEGW
11     A          ABEGW
12     AO
13     AOU
14     AO         AOU
15     AOV        AOV (end)

3.2 Finding large reference sequences

Once the database containing all maximal forward references for all users, DF, is constructed, we can derive the frequent traversal patterns by identifying the frequently occurring reference sequences in DF. A sequence s1, ..., sn is said to contain r1, ..., rk as a consecutive subsequence if there exists an i such that si+j = rj, for 1 ≤ j ≤ k. A sequence of k references, r1, ..., rk, is called a large k-reference sequence if there are a sufficient number of users with maximal forward references in DF containing r1, ..., rk as a consecutive subsequence.

We shall describe below two algorithms for mining traversal patterns. The first one, called the full-scan (FS) algorithm, essentially utilizes the concept of DHP (i.e., hashing and pruning) while solving the discrepancy between traversal patterns and association rules.
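Algorithm MF above can be sketched in Python. The sketch below is illustrative rather than the authors' implementation; a traversal path is given as (source, destination) pairs, with a null source marking the beginning of a new path, exactly as in Step 2:

```python
def maximal_forward_references(path):
    """Algorithm MF (sketch): split one user's traversal path, given as
    (source, destination) pairs ordered by time, into maximal forward
    references."""
    df = []          # database DF of maximal forward references
    y = []           # string Y: the current forward reference path
    forward = True   # flag F: True while traversing forward
    for src, dst in path:
        if src is None:
            # Step 2: beginning of a new traversal, not linked to the previous one
            if y:
                df.append("".join(y))
            y = [dst]
            forward = True
            continue
        if dst in y:
            # Step 3: a cross-reference back to the j-th node in Y
            if forward:
                df.append("".join(y))
            y = y[: y.index(dst) + 1]
            forward = False
        else:
            # Step 4: continue the forward traversal
            y.append(dst)
            forward = True
    if y:
        # flush the last forward path when the log ends
        df.append("".join(y))
    return df
```

Running it on the moves tabulated in Table 1 reproduces the "output to DF" column.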
Although trimming the transaction database as it proceeds to later passes, FS is required to scan the transaction database in each pass. In contrast, by properly utilizing the candidate reference sequences, the second algorithm, referred to as the selective-scan (SS) algorithm, is improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required.

3.2.1 Algorithm on full scan (FS)

To describe algorithm FS, we shall first summarize the key ideas of the DHP algorithm. The details of DHP can be found in [10]. Recall that DHP has two major features in determining association rules: one is efficient generation of large itemsets and the other is effective reduction of the transaction database size after each scan. As shown in [10], by utilizing a hash technique, DHP is very efficient for the generation of candidate itemsets, in particular for the large 2-itemsets, thus greatly improving the performance bottleneck of the whole process. In addition, DHP employs effective pruning techniques to progressively reduce the transaction database size.

Recall that Lk represents the set of all large k-references and Ck is a set of candidate k-references. Ck is in general a superset of Lk. By scanning through DF, FS gets L1 and makes a hash table (i.e., H2) to count the number of occurrences of each 2-reference. Similarly to DHP, starting with k = 2, FS generates Ck based on the hash table count obtained in the previous pass, determines the set of large k-references, reduces the size of the database for the next pass, and makes a hash table to determine the candidate (k+1)-references. Note that as in mining association rules, a set of candidate references, Ck, can be generated from joining Lk-1 with itself, denoted by Lk-1 * Lk-1. However, due to the difference between traversal patterns and association rules, we modify this approach as follows. For any two distinct reference sequences in Lk-1, say r1, ..., rk-1 and s1, ..., sk-1, we join them together to form a k-reference sequence only if either r1, ..., rk-1 contains s1, ..., sk-2 or s1, ..., sk-1 contains r1, ..., rk-2 (i.e., after dropping the first element in one sequence and the last element in the other sequence, the remaining two (k-2)-references are identical). We note that when k is small (especially for the case of k = 2), deriving Ck by joining Lk-1 with itself will result in a very large number of candidate references, and the hashing technique is thus very helpful for such a case. As k increases, the size of Lk-1 * Lk-1 can decrease significantly. Same as in [10], we found that it is generally beneficial for FS to generate Ck directly from Lk-1 * Lk-1 (i.e., without using hashing) after k ≥ 3.

To count the occurrences of each k-reference in Ck to determine Lk, we need to scan through a trimmed version of database DF. From the set of maximal forward references, we determine, among the k-references in Ck, the large k-references. After the scan of the entire database, those k-references in Ck with count exceeding the threshold become Lk. If Lk is non-empty, the iteration continues for the next pass, i.e., pass k+1. Same as in DHP, every time the database is scanned, the database is trimmed by FS to improve the efficiency of future scans.

3.2.2 Algorithm on selective scan (SS)

Algorithm SS is similar to algorithm FS in that it also employs hashing and pruning techniques to reduce both CPU and I/O costs, but is different from the latter in that algorithm SS, by properly utilizing the information in candidate references from prior passes, is able to avoid database scans in some passes, thus further reducing the disk I/O cost. The method for SS to avoid some database scans and reduce disk I/O cost is described below. Recall that algorithm FS generates a small number of candidate 2-references by using a hashing technique. In fact, this small C2 can be used to generate the candidate 3-references. Clearly, a C3' generated from C2 * C2, instead of from L2 * L2, will have a size greater than |C3|, where C3 is generated from L2 * L2. However, if |C3'| is not much larger than |C3|, and both C2 and C3' can be stored in main memory, we can find L2 and L3 together when the next scan of the database is performed, thereby saving one round of database scan. It can be seen that, using this concept, one can determine all Lk's by as few as two scans of the database (i.e., one initial scan to determine L1 and a final scan to determine all other large reference sequences), assuming that Ck' for k ≥ 3 is generated from Ck-1' and all Ck' for k > 2 can be kept in memory.

Note that when the minimum support is relatively small or potentially large references are long, Ck and Lk could become large. If |Ck+1'| > |Ck'| for k ≥ 2, then it may cost too much CPU time to generate all subsequent Cj', j > k+1, from candidate sets of large references, since the size of Cj' may become huge quickly, thus compromising all the benefit from saving disk I/O cost. This fact suggests that a timely
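The per-pass counting step shared by FS and SS, scanning the maximal forward references and keeping the candidates whose support count reaches the threshold, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it counts each candidate once per containing maximal forward reference and omits the hash-table and database-trimming optimizations:

```python
def contains_consecutive(seq, ref):
    """True iff seq contains ref as a consecutive subsequence,
    i.e. ref appears in seq as one unbroken run."""
    k = len(ref)
    return any(seq[i:i + k] == ref for i in range(len(seq) - k + 1))

def large_references(df, candidates, min_count):
    """Determine L_k from C_k with one scan over the maximal forward
    references in DF (min_count is an absolute support threshold)."""
    counts = {c: 0 for c in candidates}
    for seq in df:
        for c in candidates:
            if contains_consecutive(seq, c):
                counts[c] += 1
    return {c for c, n in counts.items() if n >= min_count}
```

Here reference sequences are written as strings of single-character page names, matching the running example of Table 1.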
Table 2: Results from an example run by FS and SS.

                     k     1     2     3     4     5     6    (sec)
Algorithm FS   |Ck|             121    84    58    22     3
               |Lk|       94    91    84    58    21     3    19.48
Algorithm SS   |Ck|             121   144    58    22     3
               |Lk|       94    91    84    58    21     3    18.75

[Figure: the traversal tree used in the simulation — a root node; internal nodes, 3% of which have an internal jump (to any node); and leaf nodes, from which a move either goes back to the parent node (25%) or jumps to an internal node (75%). Edges carry probabilities p0 (back to the parent node), p1, ..., p4 (to the children nodes), and pj (the internal jump).]
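The difference between the FS and SS rows of Table 2 lies only in how candidates are generated: FS joins Lk-1 with itself, while SS may join the previous candidate set with itself (hence C3' of size 144 from C2 * C2 in the SS row, versus C3 of size 84 from L2 * L2 in the FS row). A minimal sketch of the modified join described in Section 3.2.1, again with reference sequences as strings of single-character page names (illustrative, not the authors' code):

```python
from itertools import product

def join_references(refs):
    """Join a set of (k-1)-reference sequences with itself: two distinct
    sequences r and s form the k-candidate r + s[-1] iff dropping the
    first element of r and the last element of s leaves the same
    (k-2)-sequence."""
    return {r + s[-1]
            for r, s in product(refs, repeat=2)
            if r != s and r[1:] == s[:-1]}
```

Because the function only assumes its input is a set of (k-1)-sequences, the same join produces Ck from Lk-1 (as in FS) or Ck' from Ck-1' (as in SS).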
Table 3: Meaning of various parameters.

H       The height of a traversal tree.
F       The number of child nodes, fanout.
        A parameter of a Zipf-like distribution.
HxPy    x is the height of a tree and y = |P|.
|D|     The number of reference paths.
Dk      Set of forward references for Lk.
Ck      Set of candidate k-reference sequences.
Lk      Set of large k-reference sequences.
|P|     Average size of the reference paths.
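The paragraph below normalizes the transition probabilities at each internal node of the simulated tree. The following sketch implements that normalization, assuming the child weights are drawn from an exponential distribution with unit mean (an assumption for illustration; the weight distribution is not fully specified in this excerpt):

```python
import random

def node_probabilities(n_children, p0, pj=0.0):
    """Sketch: child weights drawn from an exponential distribution with
    unit mean (an assumption) are normalized to sum to 1 - p0; if the
    node has an internal jump of weight pj, p0 and every child
    probability are rescaled by (1 - pj) so that all probabilities
    associated with the node still sum to one."""
    weights = [random.expovariate(1.0) for _ in range(n_children)]
    total = sum(weights)
    children = [(1.0 - p0) * w / total for w in weights]
    if pj > 0.0:
        p0 *= 1.0 - pj
        children = [p * (1.0 - pj) for p in children]
    return p0, children  # plus pj itself, if the node has a jump
```

With p0 = 0.1 and pj = 0.05, for example, the returned p0 is 0.095, the children sum to 0.855, and adding the jump weight 0.05 gives exactly one.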
unit mean, and is so normalized that the sum of the weights for all child nodes is equal to 1 - p0. If this internal node has an internal jump and the weight for this jump is pj, then p0 is changed to p0(1 - pj) and the corresponding probability for each child node is changed to pi(1 - pj), such that the sum of all the probabilities associated with this node remains one. When the path arrives at a leaf node, the next move is either to its parent node, backward (with probability 0.25), or to any internal node (with an aggregate probability of 0.75). Some internal nodes in the tree have internal jumps which can go to any other nodes. The number of internal nodes with internal jumps is denoted by NJ, which is set to 3% of all the internal nodes in general cases. Table 3 summarizes the meaning of the various parameters used in our simulations.

Figure 3: Execution Times for FS and SS. (Two graphs for H10P5.D200K: CPU time and I/O time, in seconds, versus minimum support ranging from 1.5 down to 0.25.)

Figure 3 represents the execution times of the two methods, FS and SS, when |D| = 200,000, NJ = 3%, and pj = 0.1. HxPy means that x is the height of a tree and y is the average size of the reference paths. D200K means that the number of reference paths is 200,000. A tree for H10 was obtained when the height of the tree is 10 and the fanout at each internal node is between 4 and 7. The root node has 7 child nodes. The number of internal nodes is 16,200 and the number of leaf nodes is 73,006. The number of internal nodes with internal jumps is thus 16,200 x NJ = 486. Note that the total number of nodes increases as the height of a tree increases. To make the experiment tractable, we reduced the fanout to 2-5 for the tree of H20 with height 20. This tree contained 616,595 internal nodes and 1,541,693 leaves. In Figure 3, the left graph of each HxPy.D200K represents the CPU time to find all the large reference sequences, and the right graph shows the I/O time to find them, where the disk I/O rate is set to 2 MB/sec and a 1 MB buffer
is used in main memory. It can be seen from Figure 3 that algorithm SS in general outperforms FS, and their performance difference becomes prominent when the I/O cost is taken into account. From our experiments, it was shown that both the CPU and I/O times of each method increase linearly as the database size increases. It can be seen that SS consistently outperforms FS as the database size increases.

5 Conclusion

In this paper, we have explored a new data mining capability which involves mining traversal patterns in an information-providing environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consisted of two steps. First, we derived algorithm MF to convert the original sequence of log data into a set of maximal forward references. By doing so, we filtered out the effect of some backward references and concentrated on mining meaningful user access sequences. Second, we developed algorithms to determine large reference sequences from the maximal forward references obtained. Two algorithms were devised for determining large reference sequences: one was based on some hashing and pruning techniques, and the other was further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. The performance of these two methods has been comparatively analyzed. It is shown that the option of selective scan is very advantageous, and algorithm SS thus in general outperformed algorithm FS. Sensitivity analysis on various parameters was conducted.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient Similarity Search in Sequence Databases. Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, October 1993.
[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An Interval Classifier for Database Mining Applications. Proceedings of the 18th International Conference on Very Large Data Bases, pages 560-573, August 1992.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of ACM SIGMOD, pages 207-216, May 1993.
[4] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, September 1994.
[5] R. Agrawal and R. Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, pages 3-14, March 1995.
[6] T.M. Anwar, H.W. Beck, and S.B. Navathe. Knowledge Mining by Imprecise Querying: A Classification-Based Approach. Proceedings of the 8th International Conference on Data Engineering, pages 622-630, February 1992.
[7] J. December and N. Randall. The World Wide Web Unleashed. SAMS Publishing, 1994.
[8] J. Han, Y. Cai, and N. Cercone. Knowledge Discovery in Databases: An Attribute-Oriented Approach. Proceedings of the 18th International Conference on Very Large Data Bases, pages 547-559, August 1992.
[9] J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. Proceedings of the 21st International Conference on Very Large Data Bases, pages 420-431, September 1995.
[10] J.-S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. Proceedings of ACM SIGMOD, pages 175-186, May 1995.
[11] J.-S. Park, M.-S. Chen, and P. S. Yu. Efficient Parallel Data Mining for Association Rules. Proceedings of the 4th International Conference on Information and Knowledge Management, November 29 - December 3, 1995.
[12] G. Piatetsky-Shapiro. Discovery, Analysis and Presentation of Strong Rules. Knowledge Discovery in Databases, pages 229-248, 1991.
[13] J.R. Quinlan. Induction of Decision Trees. Machine Learning, 1:81-106, 1986.
[14] J. T.-L. Wang, G.-W. Chirn, T.G. Marr, B. Shapiro, D. Shasha, and K. Zhang. Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results. Proceedings of ACM SIGMOD, Minneapolis, MN, pages 115-125, May 1994.