conf(A ⇒ B) = supp(A ∪ B) / supp(A)    (2.1)
Confidence can be interpreted as an estimation of the conditional probability of finding the right-hand side of the rule among transactions containing the left-hand side of the rule, P(B|A).
- Lift, defined as

  lift(A ⇒ B) = supp(A ∪ B) / (supp(A) · supp(B))    (2.2)

  i.e. the ratio between the observed support and the expected support (if A and B were independent)
- Importance, defined in [2] as:
(2.3)
which acts as a measure of interestingness for a rule.
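As a quick illustration of these measures, the minimal Python sketch below computes support, confidence and lift for a candidate rule over a small list of example transactions; the function names and the data are ours, chosen only for the example (importance is omitted since it depends on the definition in [2]).

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    supp_a = support(antecedent, transactions)
    supp_rule = support(antecedent | consequent, transactions)
    supp_c = support(consequent, transactions)
    confidence = supp_rule / supp_a if supp_a else 0.0
    lift = confidence / supp_c if supp_c else 0.0
    return supp_rule, confidence, lift

# A toy set of shopping-basket transactions, one set of items per order
transactions = [
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
    {"Apples", "Butter", "Bread", "Pears"},
]
print(rule_measures({"Milk"}, {"Bread"}, transactions))   # (0.5, 1.0, 1.0)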
2.3.2 Classifications of association rules
As discussed in detail in [30], association rules can be classified based on different criteria. The most common rule classification systems are presented below and used throughout this thesis.
Rules can be classified:
- Based on the type of values handled in the rule. If a rule associates
the presence of certain items, it is a Boolean rule. If a rule describes
associations between quantitative items or attributes, it is a
quantitative association rule.
- Based on the dimensions of data involved in the rule: if the LHS component has a single predicate, referring to only one dimension, then the rule is called single-dimensional. Conversely, a rule referring to multiple dimensions is a multi-dimensional association rule.
- Based on the level of abstractions involved in the rule set. Items
used in the various predicates composing the rule may appear at
different levels of abstraction, for example:
(Age = 20-30) ⇒ Computer Games
(Age = 20-30) ⇒ Computer Software
Computer Software, in this example, is a higher level of abstraction
than Computer Games. Rule sets mined at different abstraction
levels consist of multilevel association rules. If all rules refer to the same abstraction level, then the set is said to contain single-level association rules.
2.3.3 The Market Basket Analysis problem
Association rules mining finds interesting associations among a large set of
items. A typical example is the market basket analysis. This process analyzes
customer buying habits by finding associations between items that are
frequently purchased together by customers (i.e. appear frequently together in the same transaction, or are frequently placed together in the same shopping basket).
Such data is typically represented in a database as a transaction table,
similar to Table 2-1.
Order Number    Model
SO51176         Milk
SO51176         Bread
SO51177         Bread
SO51177         Butter
SO51178         Milk
SO51178         Bread
SO51178         Butter
SO51179         Apples
SO51179         Butter
SO51179         Bread
SO51179         Pears
Table 2-1 Representation of shopping basket data
For a data set organized like Table 2-1, the association rule concepts are mapped as below:
- The space of items (I) is the set of all distinct values in the Model
column
- A transaction identifier (TxId) is a distinct value in the Order Number
column.
- A transaction, identified by a transaction id (e.g. SO51176), is the set of distinct Model values that are associated with all occurrences of the specified transaction identifier.
- An itemset is a non-empty collection of distinct values in the Model
column
- A rule is, therefore, a logical statement like

  {M_1, …, M_k} ⇒ {M_k+1, …, M_m}    (2.4)

  where each M_i is an item
Learning a set of association rules from market basket analysis serves
multiple purposes, such as to describe the frequent itemsets or to generate
recommendations based on the shopping basket content. Generating
recommendations for a given shopping basket is generally a two-step
process:
- Identify the rules whose precondition matches the current shopping
basket content
- Sort these rules based on some rule property (confidence, lift or importance being the most frequently used such properties), then recommend those consequents at the top of the sorted list that are not already part of the shopping basket (a short sketch of this procedure follows).
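The sketch below illustrates this two-step procedure, keeping rules as (antecedent, consequent, confidence) triples; the function name, the rules and the confidence values are made up for the example.

def recommend(basket, rules, top_n=3):
    """Suggest items for a basket from (antecedent, consequent, confidence) rules."""
    basket = set(basket)
    # Step 1: keep rules whose antecedent is satisfied by the basket
    applicable = [r for r in rules if r[0] <= basket]
    # Step 2: sort by the chosen rule property (confidence here)
    applicable.sort(key=lambda r: r[2], reverse=True)
    suggestions = []
    for _, consequent, _ in applicable:
        for item in consequent:
            if item not in basket and item not in suggestions:
                suggestions.append(item)
    return suggestions[:top_n]

rules = [
    ({"Milk"}, {"Bread"}, 1.00),
    ({"Bread"}, {"Butter"}, 0.75),
    ({"Apples"}, {"Pears"}, 1.00),
]
print(recommend({"Milk", "Bread"}, rules))   # ['Butter']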
2.3.4 Itemsets and Rules in dense representation
The data in Table 2-1 can be thought of as a normalized representation of a (very wide) table, organized in attribute/value pairs, like below:
Tx Id Milk Bread Butter Apples Pears
SO51176 1 1 0 0 0
SO51177 0 1 1 0 0
SO51178 1 1 1 0 0
SO51179 0 1 1 1 1
Table 2-2 Shopping basket data as attribute/value pairs
For most attributes in Table 2-2, a value of 0 signifies the absence (and a
value of 1, the presence) of an item in a transaction.
The representation in Table 2-2 is not efficient for a large catalog and is typically impossible in most RDBMS, which handle only up to around 1,000 columns, as described in [31], [32]. However, this representation allows
adding new attributes to a transaction, attributes that are not necessarily
related to the items that are present in the shopping basket. For example,
demographic information about the customer or geographical information
about the store where the transaction has been recorded may be added to
the table. This information typically describes a different dimension of a transaction. (A discussion of multi-dimensional data warehouses is beyond the scope of this work, but [2] as well as [30] contain thorough discussions of the concepts.)
A representation such as the one in Table 2-2 is said to be dense, as all the
features are explicitly present in data, with specific values indicating the
presence (1) or absence (0) of an item in a transaction. By contrast, a
representation such as the one in Table 2-1 is said to be sparse, as features
are implied from the presence (or absence) of an item in a transaction.
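A short sketch of this correspondence, converting sparse transaction rows (as in Table 2-1) into the dense 0/1 layout of Table 2-2, is shown below; the function and variable names are illustrative only.

def sparse_to_dense(rows):
    """rows: (order_number, model) pairs, as in Table 2-1.
    Returns (item_list, {order: [0/1, ...]}) as in Table 2-2."""
    items = sorted({model for _, model in rows})
    dense = {}
    for order, model in rows:
        vector = dense.setdefault(order, [0] * len(items))
        vector[items.index(model)] = 1
    return items, dense

rows = [("SO51176", "Milk"), ("SO51176", "Bread"),
        ("SO51177", "Bread"), ("SO51177", "Butter")]
items, dense = sparse_to_dense(rows)
print(items)             # ['Bread', 'Butter', 'Milk']
print(dense["SO51176"])  # [1, 0, 1]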
2.3.5 Equivalence of dense and sparse representations
The dense and sparse representations of transactions are equivalent with
regard to rules and itemsets.
Let now A = {A_i} be the set of all the attributes that can be associated with a densely represented transaction, attributes which may span multiple dimensions (for the dense dataset, each attribute A_i is a column in the dataset).

Let V_i = {v_ij} be the set of all possible values of the A_i attribute.

Let an item be a certain state of an attribute, A_i = v_ij.
Under this definition of an item, the association rule concepts can be mapped onto the dense transaction space as below:
- The space of items (I) is the set of all distinct attribute/value pairs
- A transaction identifier (TxId) is a distinct value in the Order Number
column.
- A transaction, identified by a transaction id (e.g. SO51176), is the set of attribute/value pairs defining the transaction row.
- An itemset is a non-empty collection of attribute/value pairs.
With these concepts, a rule, as defined in equation (2.4), becomes a logical statement like:

{A_1 = v_1, …, A_i = v_i} ⇒ {A_j = v_j, …, A_n = v_n}    (2.5)
It is interesting to notice that, in the particular case when the number of
items in the consequent is exactly 1, an association rule becomes a
predictive rule, as it can be used to predict, with a certain confidence, the
value of a single attribute. As we will show in Section 3.1.4 below,
association rules may be employed, in commercial software packages, to
produce predictive rules, by mapping dense datasets using this
representation.
Note that range type columns (attributes) may have a very large number of states, so for such attributes the corresponding set of values V_i may have very high cardinality, leading to a very large number of 1-itemsets. Binning (discretization) is often used to reduce the number of states of an attribute.
Rules which apply to range intervals are called quantitative association rules
(as opposed to the Boolean association rules which deal with qualitative
statements). Srikant and Agrawal, in [33], introduce a method of fine-
partitioning the values of an attribute and then combining the adjacent
partitions as necessary. This work also introduces a modified version of the
apriori rule detection algorithm (described in detail below), a version which detects quantitative association rules.
In a typical industrial system, the transaction table used to store this information is likely to be significantly more complex. The item catalog may contain millions of distinct items, a fact that raises significant challenges in finding meaningful rules (more in Chapter 3, Methods for Rule Extraction). Also, in an industrial implementation, the transactions are
likely to be stored for analysis in a data warehouse, together with additional
related information, supporting multidimensional analysis of the data.
Dimensions associated with a transaction may include customer
information, time or geo-location information etc.
2.4 Fuzzy Rules
Fuzzy modeling is one of the techniques used for modeling nonlinear, uncertain, and complex systems. An important characteristic of
fuzzy models is the partitioning of the space of system variables into fuzzy
regions using fuzzy sets [34]. In each region, the characteristics of the
system can be simply described using a rule. A fuzzy model typically consists
of a rule base with a rule for each particular region. Fuzzy transitions
between these rules allow for the modeling of complex nonlinear systems
with a good global accuracy. One of the aspects that distinguish fuzzy
modeling from other black-box approaches like neural nets is that fuzzy
models are transparent to interpretation and analysis (to a certain degree).
However, the transparency of a fuzzy model is not achieved automatically.
A system can be described with a few rules using distinct and interpretable
fuzzy sets but also with a large number of highly overlapping fuzzy sets that
hardly allow for any interpretation.
2.4.1 Conceptualizing in Fuzzy Terms
Supposing that a particular concept is not well defined, a function can
be used to measure the grade to which an event is a member of that
concept. E.g., "today is a rainy day" may have a very low value for sunny
days, a higher value for an autumn day, and a very high value for a
torrential rain day.
This membership function is typically defined to have values in the [0,1]
space, with 0 meaning that the event does not belong at all to a concept,
and 1 meaning that an event completely belongs to a certain concept. Such a membership function may look like a Gaussian bell, a triangle, a trapezoid or, in general, take any shape with values in the [0,1] interval (see Figure 2-2).
Figure 2-2 Standard types of membership functions: crisp, trapezoidal, triangular, sigmoid, Z-function, Gaussian (from [34])
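As an illustration (not tied to the exact parameters used in the figure), the sketch below defines three of these standard membership function shapes; all parameter values are arbitrary examples.

import math

def triangular(x, a, b, c):
    """Triangular membership: 0 outside [a, c], 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: rises on [a,b], flat at 1 on [b,c], falls on [c,d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def gaussian(x, center, sigma):
    """Gaussian (bell-shaped) membership centered at `center`."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))

print(triangular(22.5, 20, 25, 30))     # 0.5
print(trapezoidal(27, 20, 25, 30, 35))  # 1.0
print(round(gaussian(25, 25, 5), 3))    # 1.0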
2.4.2 Fuzzy Modeling
Fuzzy modeling is a technique for modeling based on data. The result of this
modeling is a set of IF-THEN rules, with fuzzy predicates which establish
relations between relevant system variables. The fuzzy predicates are
associated with linguistic labels, so the model is in fact a qualitative
description of a system, with rules like:
IF temperature is moderate and volume is small THEN pressure is low
The meanings of the linguistic terms moderate, small and low are defined
by fuzzy sets in the domain of the respective system variables. Such models
are often called linguistic models.
Different types of linguistic models exist:
- The Mamdani model [35] uses linguistic rules with a fuzzy premise
part and a fuzzy consequent part
- The Takagi Sugeno (TS) model [36] uses rules that differ from
Mamdani models in that their consequents are mathematical
functions instead of fuzzy sets.
In a Mamdani model, the inference is the result of the rules that apply at a certain point. The rule base represents a static mapping between the
antecedent and the consequent.
The TS model is based on the idea that the rules in the model will have the following structure:

R_i: w_i (IF X_1 is A_i1 AND … AND X_n is A_in THEN Y_i = f_i(·))    (2.6)

Where:
- w_i is the rule weight (typically 1, but it can be adjusted)
- f_i is usually a linear function of the premise variables, x_1 … x_n
The inference (prediction) of a TS model is computed as

Y = ( Σ_{i=1..N} β_i · Y_i ) / ( Σ_{i=1..N} β_i )    (2.7)

i.e. the weighted average of the consequents of all the rules, where N is the number of rules, Y_i is the contribution of a certain rule and β_i is the degree of activation of the i-th rule's premise. Given the input X = (x_1, x_2, …, x_n), β_i is computed as below (the product of the membership functions for all the predicates of the current rule):

β_i = Π_{j=1..n} μ_{A_ij}(x_j)    (2.8)
Because of the linear structure of the rule consequents, well known
parameter estimation techniques (e.g. least squares) can be used to
estimate the consequent parameters.
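A compact sketch of this inference step, for a two-rule TS model with Gaussian membership functions and linear consequents, is given below; all parameter values are invented for the example.

import math

def gaussian_mf(x, center, sigma):
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))

def ts_inference(x, rules):
    """rules: list of (memberships, consequent) where memberships holds one
    (center, sigma) pair per input and consequent(x) returns the rule output Y_i."""
    num, den = 0.0, 0.0
    for memberships, consequent in rules:
        # degree of activation: product of the memberships (eq. 2.8)
        beta = 1.0
        for xj, (c, s) in zip(x, memberships):
            beta *= gaussian_mf(xj, c, s)
        num += beta * consequent(x)     # weighted consequent (eq. 2.7)
        den += beta
    return num / den if den else 0.0

rules = [
    ([(0.0, 1.0), (0.0, 1.0)], lambda x: 1.0 + 0.5 * x[0] + 0.2 * x[1]),
    ([(5.0, 1.0), (5.0, 1.0)], lambda x: 4.0 - 0.1 * x[0] + 0.3 * x[1]),
]
print(round(ts_inference([1.0, 0.5], rules), 3))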
3 Methods for Rule Extraction
In this chapter, we present the most commonly used methods for
extracting rules. Section 3.1 below presents some algorithms designed
specifically for rule extraction, such as apriori and FP-Growth. We discuss
some of the problems raised by these algorithms as well as solutions
identified for those problems. Next, in Section 3.3 we present some
techniques for extracting rules from patterns detected by other algorithms
and focus on rule extraction from neural networks, a topic of significant
interest in the next chapter. A special section, 3.2, describes the specifics of
rules analysis in Microsoft SQL Server.
3.1 Extraction of Association Rules
In this section we present some of the algorithms designed specifically for
the extraction of association rules as well as some results comparing the
real-life performance of various rule extraction algorithms.
3.1.1 The Apriori algorithm
Apriori is an influential algorithm for mining frequent itemsets for Boolean
association rules, introduced by Agrawal in [37]. The algorithm uses prior
knowledge of frequent itemset properties. Its purpose is to avoid counting
the support of every possible itemset derivable from I. Apriori exploits the
downward closure property of itemsets: if any n-itemset is frequent, then
all its subsets must also be frequent. Frequent, in this context, means that
the support (supp) of an itemset exceeds a minsup minimum support
parameter of the algorithm. Itemsets that appear less frequently than the
specified minimum support are considered infrequent and ignored by the
algorithm. An itemset generation and test algorithm that did not use the apriori property was also introduced by Agrawal in [29].
The apriori algorithm is initialized by counting the occurrences of each
individual item, therefore finding the frequencies for all itemsets of size 1.
The algorithm does this by scanning the data set and counting the support
of each item. The 1-itemsets with a frequency lower than minsup are
removed. The remaining 1-itemsets constitute L_1, the set of frequent 1-itemsets that are interesting for the algorithm.
Once initialized, the algorithm performs iteratively the following steps:
1. The join step: a set of candidate n-itemsets, C_n, is generated by joining L_{n-1} with itself. (By convention, apriori assumes that items within a transaction are sorted lexicographically.) The join is performed by the compound key represented by the first n-2 items in an itemset. Consider the (n-1)-itemsets A and B defined as below:

   A = {a_1, a_2, …, a_{n-2}, a_{n-1}}
   B = {b_1, b_2, …, b_{n-2}, b_{n-1}}    (3.1)

   A and B are joined if they share the join key, i.e. if

   (a_1 = b_1) ∧ (a_2 = b_2) ∧ … ∧ (a_{n-2} = b_{n-2}) ∧ (a_{n-1} < b_{n-1})    (3.2)

   As a result of joining A and B on the compound key, a new candidate n-itemset is produced and inserted in the C_n set of candidate n-itemsets:

   C = {a_1, a_2, …, a_{n-2}, a_{n-1}, b_{n-1}}    (3.3)
2. The prune step: the candidates in C_n that contain an (n-1)-subset which is not in L_{n-1} are removed; the supports of the remaining candidates are counted by scanning the database, and the candidates that meet minsup form the frequent set L_n:

   L_n = {c ∈ C_n | supp(c) ≥ minsup}    (3.4)
For each frequent itemset I, association rules can be generated like below (a short sketch of this procedure follows the list):
- Generate all non-empty strict subsets {S_i ⊂ I} of the itemset
- For every non-empty subset S_i, determine the confidence of the rule R_i: S_i ⇒ {I − S_i}:

  conf(R_i) = supp(I) / supp(S_i)    (3.5)

- If conf(R_i) > minconf, then add R_i to the set of rules
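The Python sketch below implements this rule-generation step for a single frequent itemset; the supports are assumed to be available in a dictionary, and all names are illustrative.

from itertools import combinations

def rules_from_itemset(itemset, support_of, minconf):
    """Generate rules S => I-S from one frequent itemset I.
    support_of: dict mapping frozenset -> support value."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(itemset, size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = support_of[itemset] / support_of[antecedent]   # eq. (3.5)
            if conf > minconf:
                rules.append((antecedent, consequent, conf))
    return rules

support_of = {
    frozenset({"Bread"}): 1.00,
    frozenset({"Butter"}): 0.75,
    frozenset({"Bread", "Butter"}): 0.75,
}
print(rules_from_itemset({"Bread", "Butter"}, support_of, minconf=0.9))
# [(frozenset({'Butter'}), frozenset({'Bread'}), 1.0)]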
The apriori method of detecting frequent itemsets may need to generate a huge number of candidate sets. For example, if there are 10,000 frequent items, the algorithm will need to generate more than 10^7 candidate 2-itemsets and then scan the database in order to test their occurrence frequencies. Some other issues raised by the apriori algorithm (and, in general, by any algorithm driven by a minsup parameter) are discussed in Section 3.1.4, which treats the problem of rare rules.
3.1.2 The FP-Growth algorithm
The Frequent Pattern Growth (FP-Growth) algorithm was introduced by
Jiawei Han in [38] and refined in [39], with the purpose of extracting the
complete set of frequent itemsets, without candidate generation.
Figure 3-2 An FP-Tree structure
The algorithm uses a novel data structure, called a Frequent pattern Tree
(FP-tree). An FP-tree is an extended prefix tree which stores information about frequent patterns. Only the frequent 1-items appear in the tree and the nodes are arranged in such a way that the frequently occurring items have better chances of node sharing than the less frequently occurring ones. An item header table can be built to facilitate the tree's traversal. Figure 3-2 presents such a tree, together with the associated item header table. Once an FP-tree is built, the problem of mining frequent patterns in a database is transformed into that of mining the FP-tree. Experiments [38] show that such
a tree may be orders of magnitude smaller than the dataset it represents.
The full algorithm for building the tree is presented in Appendix A: Key
Algorithms. While building the tree, the item header table is updated to
contain a node link (pointer) to the first occurrence of each item in the tree.
Any new occurrence of the item in the tree (as part of a different sub-tree)
ends up being linked to the previous occurrence, so that from the item
header table one can traverse all the tree occurrences of each individual
item.
Each transaction in the database is represented on one of the paths from
the FP-tree root to a tree leaf. Consequently, for each itemset α, any larger itemset suffixed by α may only appear on a path containing α.
The second step of the algorithm consists in mining the tree to extract the
frequent itemsets. Each 1-itemset, in reverse order of the frequency, is
considered as an initial suffix pattern. By traversing the linked list of
occurrences of the initial suffix pattern in the tree, a conditional pattern
base is created, consisting of full prefix paths in the FP tree that co-occur
with the current suffix pattern. The conditional pattern base is used to
create a conditional FP-tree. This conditional tree is then mined recursively.
All the detected patterns are concatenated with the original suffix pattern
used to create the conditional FP-tree.
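To make the construction step concrete, the simplified sketch below builds an FP-tree with the item header (node-link) table described above; it is not the full algorithm from [38] (the conditional mining step is omitted) and all class and variable names are ours.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, minsup_count):
    # Pass 1: count item frequencies and keep the frequent 1-items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    frequent = {i: c for i, c in freq.items() if c >= minsup_count}

    root = FPNode(None, None)
    header = defaultdict(list)      # item -> list of its nodes (node links)
    # Pass 2: insert each transaction, items ordered by descending frequency
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # extend the item's node-link chain
            child.count += 1
            node = child
    return root, header

transactions = [{"Milk", "Bread"}, {"Bread", "Butter"},
                {"Milk", "Bread", "Butter"}, {"Apples", "Butter", "Bread", "Pears"}]
root, header = build_fp_tree(transactions, minsup_count=2)
print({item: [n.count for n in nodes] for item, nodes in header.items()})
# {'Bread': [4], 'Milk': [1, 1], 'Butter': [3]}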
As opposed to Apriori, which performs restricted itemset generation and
testing, the frequent pattern mining algorithm performs only a restricted
testing of the itemsets. Also, the mining of the FP-tree is based on
partitioning, reducing dramatically the size of the conditional pattern base.
Refinements to the original FP-tree mining algorithm are proposed in [39], including a method to scale the FP-tree mining by using database projections. For an itemset a_i and a database DB, the a_i-projected database is based on DB and contains all transactions which contain a_i, after eliminating from them infrequent items (all items that appear after a_i in the list of frequent items).
Also included in [39] is a comparative analysis of the FP-growth algorithm
and an alternative database projection-based algorithm, TreeProjection
(described in Section 3.1.3). The FP-Growth algorithm is determined to be
more efficient both in terms of memory consumption and computational
complexity.
3.1.3 Other algorithms and a performance comparison
Partition, an algorithm introduced in 1995 by Savasere et al. in [40]
generates all the frequent itemsets and rules in at most 2 scans of the
database. In the first scan, it divides the database into a number of non-overlapping partitions and computes, for each partition, the locally frequent itemsets. The union of these partition frequent itemsets is a superset of all frequent itemsets, so it may contain itemsets that are not globally frequent. A
second scan of the database is employed to compute the actual support for
all candidate itemsets (and remove those that are not globally frequent).
We mentioned, in the previous section, the TreeProjection algorithm. It was
introduced in 2000 by Agarwal, in [41], and uses a lexicographic tree to
represent the itemsets. Transactions are projected onto the tree nodes for
counting the support of frequent itemsets.
A different approach to rules mining is to discover the closed itemsets, a
small representative subset that captures frequent itemsets without loss of
information. This idea was introduced in 1999 by Pasquier et al. in [42]. An
algorithm to detect closed itemsets called CLOSE was introduced in the
same paper. After finding the frequent k-itemsets, Close compares the
support of each set with its subsets at the previous level. If the support of
an itemset matches the support of any of its subsets, the itemset is pruned.
The second step in Close is to compute the closure of all the itemsets found
in the first step. An improved version, A-CLOSE, was introduced in [43], which generates a reduced set of association rules without having to produce all frequent itemsets, thus reducing the computational cost.
Charm is another algorithm for generating closed frequent itemsets for
association rules, introduced in [44]. Charm explores simultaneously the
itemset space as well as the transaction space and uses a very efficient
search method to identify the frequent closed itemsets (instead of enumerating many possible subsets).
A 2001 study compared the performance of some of the commonly used
rules or frequent itemset detection algorithms [45]. Apriori, FP-Growth and
TreeProjection were included among the tested algorithms. The study used
three real-world datasets as well as one artificial dataset, T10I4D100K from IBM Almaden. (The original URL indicated for the data generator, http://www.almaden.ibm.com/software/quest/Resources/index.shtml, seems unavailable now (June 2011), but the test datasets can be downloaded from http://fimi.ua.ac.be/data/.) The algorithm performances
claimed by their respective authors were confirmed on artificial datasets,
but some of these gains did not seem to carry to the real datasets. As
reported in [45], a very quick growth in the number of rules is associated
with very small changes in the minimum support threshold, suggesting that
the choice of algorithm only matters at support levels that generate more
rules than would be useful in practice.
3.1.4 Problems raised by Minimum Support rule itemset
extraction systems
The most commonly used algorithms for rule extraction, apriori and FP-tree,
just like most of the other algorithms mentioned previously, focus on
finding frequent itemsets, i.e. itemsets that exceed a certain minimum
support. All itemsets (and, consequently, rules) that do not meet the
minsup threshold are ignored by these algorithms.
Rules with low support and high confidence, however, may be very
interesting for certain applications, particularly for e-Commerce
applications which aim to yield high profit margins by suggesting items of interest to customers. Customers with exotic tastes may be a small minority, but they share, in their respective clusters, similar interests, and recommendation systems should, at least theoretically, be able to make appropriate suggestions in their case.
Rare rules may be of two forms:
- Both the antecedent and the consequent have small support and fail
the minsup test. In this case, they are never considered by common
algorithms
- The predicates in the antecedent and/or the consequent exceed the minsup criterion, but they only rarely co-occur, and the combination ends up being ignored by the algorithms.
The simple solution of reducing the minsup threshold is not practical. On a theoretical level, the minimum support criterion is what makes both apriori and FP-tree practical for large datasets. The comparative study in [45] (discussed in Section 3.1.3 above) shows that, on certain datasets, small reductions in the minimum support value may lead to an extremely rapid growth in the number of rules.
Some research has been carried out recently in the area of rare rule detection. A collection of the most significant results in this area is available in [46]. A few different approaches have been taken in solving this problem.
One of the approaches consists in using a variable minimum support
threshold. Each itemset may have a different support threshold, which can
be predefined or can be dynamically lowered to allow for rare itemset
inclusion.
Multiple Support Apriori (MSApriori), introduced in [47], allows each database item to have its own minimum support. The minimum support for an n-itemset, n>1, is computed as the minimum over its components. To facilitate the detection of small support itemsets, the items are sorted in ascending order of their minimum support values rather than in the conventional lexicographic order used by apriori. As it is impractical to associate an individual minimum support with each item in a large product catalog, the authors suggest a Lowest Allowable Minimum Support (LS) and a constant β ∈ [0,1] as algorithm parameters. An arbitrary item's minimum support will then be
(3.6)
The algorithm detects certain rare itemsets and rules, but the criterion is the user's value rather than the frequency of the items.
Relative Support Apriori, introduced in [48], is a refinement on top of
MSApriori which avoids the user input (the parameter of the MSApriori
algorithm) and defines a new threshold for itemsets, the relative support,
which measures the confidence of rare items. The relative support
threshold (defined below) imposes a higher support limit for items that are
globally infrequent.
(3.7)
Adaptive Apriori, introduced in [49], proposes the idea of support constraints, a function which produces minimum support for specified
itemsets. Multiple constraints are combined by picking the minimum. The
resulting apriori implementation generates only necessary itemsets, i.e.
itemsets that meet the set of predefined constraints.
LPMiner, introduced in [50], also uses a variable minimum support
threshold. The authors propose a support threshold which decreases with
the length of the itemset. The implementation is based on the FP-tree
algorithm.
A very different approach consists in completely eliminating the minimum
support threshold.
A family of algorithms based on MinHashing is presented in [51]. These
algorithms detect rare itemsets of very highly correlated items. The
algorithms represent transactions, conceptually, as a 0/1 matrix with one
row per transaction and as many columns as distinct items. In this
representation, the confidence of a rule is the number of rows with 1 in both columns divided by the number of rows with 1 in either column. This
representation is not practical, as it would be very large. The authors
suggest computing a hashing signature for each column so that the
probability that two columns have the same signature is proportional to
their similarity.
As an example of such a hash: a random order of rows is selected, and a column's hash is the first row index (under the new order) where the column has a 1. The article shows that the probability that two columns
share a signature is proportional to their similarity.
To reduce the number of false positives and false negatives, multiple signatures are selected (by repeating the process independently). The resulting candidate pairs are generated and checked against the real database (the original matrix). The algorithm is implemented for rules over 2- and 3-itemsets but has not yet been extended beyond this size.
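The sketch below illustrates the MinHash signature idea described above (random row orders are simulated with random permutations); it is a simplified illustration, not the exact algorithm from [51], and the names and data are ours.

import random

def minhash_signatures(columns, n_rows, n_hashes=50, seed=7):
    """columns: dict item -> set of row indices holding a 1.
    Returns, for each item, a list of n_hashes minhash values."""
    rng = random.Random(seed)
    signatures = {item: [] for item in columns}
    for _ in range(n_hashes):
        order = list(range(n_rows))
        rng.shuffle(order)                       # one random row order
        rank = {row: pos for pos, row in enumerate(order)}
        for item, rows in columns.items():
            # hash = first row (under the new order) where the column has a 1
            signatures[item].append(min(rank[r] for r in rows))
    return signatures

def estimated_similarity(sig_a, sig_b):
    """Fraction of shared signature positions, which approximates
    the similarity of the two columns."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

columns = {"Bread": {0, 1, 2, 3}, "Butter": {1, 2, 3}, "Pears": {3}}
sigs = minhash_signatures(columns, n_rows=4)
print(round(estimated_similarity(sigs["Bread"], sigs["Butter"]), 2))  # close to 0.75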
Apriori Inverse, proposed in [52], is also a variation of the apriori algorithm
but it uses maximum support instead of minsup. Candidates of interest are
below maxsup, but still above an absolute minimum support (minabssup,
noise threshold). A rule X is interesting if sup(X)<maxsup AND
sup(X)>minabssup.
Apriori Rare, proposed in [53], splits the problem of detecting rare itemsets
in two tasks. The authors introduced the concepts of:
- Maximal frequent itemset (MFI), an itemset which is frequent, but
all its proper supersets are rare
- Minimal rare itemset (mRI), a rare itemset having all proper subsets frequent
- Generator, an itemset that has no proper subset with the same support
The mRIs can be detected naively, using apriori, or by using a new algorithm
introduced in the paper, called MRG-Exp, which avoids exploring all
itemsets and instead only looks for frequent generators in the itemset
lattice. The second part consists in restoring rare itemsets from mRIs, using
an algorithm called Arima (A Rare Itemset Miner Algorithm).
3.2 An implementation perspective: Support for association
analysis in Microsoft SQL Server 2008
This section describes some of the innovations supporting association
analysis in the Microsoft SQL Server Analysis Services 2008 platform (AS), as a context for some of the work presented in this document. We originally published the core of this material in [2].
Analysis Services distinguishes between data storage objects (mining structures), which are in essence multi-dimensional data stores, and mining models, instantiations of data mining algorithms on top of projections of
mining structures.
In the simplest case, a mining structure may be a table. A mining model
belonging to that structure may use some or all of the table columns.
Mining structure columns can be referenced more than once in the same
mining model. A mining case is a data point used in training a data mining
algorithm or one that needs to be scored by a trained algorithm.
A significant innovation in the AS product is the concept of nested tables.
From a data mining modeling perspective, a nested table is a tabular feature of a mining case.
Figure 3-3 A mining case containing tabular features
Figure 3-3 represents such a mining case (a customer, in this case). The case
contains certain scalar features such as Key (a unique identifier), Gender,
Age or name. Tabular features, such as the list of purchases or ratings
produced by this customer for certain movies, can also be logically associated with the customer.
From a relational database perspective, a customer with the related
information is represented as join relationships between several tables.
Figure 3-4 presents the relational database structure associated with the
mining case represented by the customer plus purchases and movie ratings.
A mining structure can store data from multiple tables and models built
inside that structure can access data from multiple tables as features of the
same mining case. The modeling of nested tables is centered on the key
columns of the nested table. Each individual value of a nested table key is
mapped to one or more modeling attributes.
Figure 3-4 A RDBMS representation of the data supporting mining cases with nested tables
For example, consider a classification model that aims to predict a customer's age based on gender and the lists of purchases as well as movie ratings. Each mining case will have the following attributes:
- Gender, Age from the People table
- Purchases(Milk), Purchases(Bread), Purchases(Apples), all of them with values of Existing/Missing
- Purchases(Bread).Quantity, Purchases(Milk).Quantity, Purchases(Apples).Quantity, either missing or mapped to the Quantity column of the Purchases relational table
The feature space for a mining case is very wide and contains all possible
values for each nested table key (and the related attributes). However, a
mining case is represented sparsely, only those nested attributes having the
Existing state are presented to the mining algorithm. Given that the mining
algorithm has full access to the feature space information (dimensionality,
data types), it can effectively mine the very large feature space.
The abstraction on top of the physical feature set is part of the AS platform
and all the data mining algorithms running on the AS platform must,
therefore, support sparse feature sets.
The nested table concept in AS allows mining complex patterns directly from relational databases, without a need to move the data to an internal representation.
The nested tables are particularly useful in mining association rules, as they
map to the database representation of transactions. Using the equivalence (shown in Section 2.3.5 above) between the transactional and tabular data for the association rules algorithm, the result is an implementation that can
detect association rules between nested table items (transactional items)
and scalar features. The AS implementation of association rules is,
therefore, able to produce associative rules combining multidimensional
predicates, like below:
(3.8)
Models, inside mining structures, use projections of the data in the
structure. Columns from the mining structure may appear once, multiple
times or not at all in a model. Rows of the mining structure may be filtered
out of models as well.
Figure 3-5 Using a structure nested table as source for multiple model nested tables
Figure 3-5 presents an example of complex modeling using filters:
- The mining structure on the left contains a single nested table with 2 columns: product name and a flag indicating whether the product was On Sale when purchased or not
- A model is built inside the mining structure, containing two nested
tables, both linked to the single mining structure nested table, but
with different row filters.
Rules can now be mined to detect how On Sale products drive sales of
other products.
3.3 Rules as expression of patterns detected by other
algorithms
The descriptive power of rules makes them a frequently used tool for
explaining the patterns extracted by various machine learning algorithms.
3.3.1 Rules based on Decision Trees
Decision tree building algorithms are frequently used for rule extraction. Tree induction methods produce patterns that can easily be converted to rule sets. Every node in a classification tree (such as ID3, the Iterative Dichotomiser 3, introduced by Quinlan [54]) or a classification-and-regression tree (CART, introduced by Breiman et al. [55]) can easily be converted to a rule by treating the full path, from the root to the respective node, as the antecedent and the histogram of the node as the consequent.
Collections of trees (forests) can be used to extract association rules, similar to the ones detected by the apriori algorithm. An example of this is implemented in Microsoft's SQL Server data mining product, as we described in [2]. In such an implementation, a tree is built for each item in the item catalog, with the purpose of extracting rules that have that respective item as a consequent. Figure 3-6 shows such a tree, built for the Eyes Wide Shut movie item as a consequent. An example of such a rule is:
R1: (Full Metal Jacket) ⇒ (Eyes Wide Shut)
supp(R1) = (total support for the leaf node) = 56
conf(R1) = (from the histogram of the leaf node) = 11/56 = 0.1964
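A minimal sketch of this path-to-rule conversion is shown below for a toy leaf node; the node structure and the names are illustrative only and do not reflect the SQL Server internal format.

def node_to_rule(path, node, consequent_item):
    """Turn the path from the root to `node` into a rule antecedent => consequent_item,
    with support and confidence taken from the node's histogram."""
    antecedent = " AND ".join(path) if path else "(root)"
    support = sum(node["histogram"].values())
    confidence = node["histogram"].get(consequent_item, 0) / support
    return f"{antecedent} => {consequent_item}", support, confidence

# Leaf reached by following the split "Full Metal Jacket = Existing"
leaf = {"histogram": {"Eyes Wide Shut": 11, "not Eyes Wide Shut": 45}}
print(node_to_rule(["Full Metal Jacket = Existing"], leaf, "Eyes Wide Shut"))
# ('Full Metal Jacket = Existing => Eyes Wide Shut', 56, 0.1964...)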
Figure 3-6 A decision tree built for rules extraction (part of a SQL Server forest)
3.3.2 Rules from Neural Networks
An artificial neural network (ANN) is a mathematical (or computational) model inspired by functional aspects of biological neural networks. An ANN consists of groups of interconnected artificial neurons. A thorough description of artificial neural networks is beyond the scope of this work and can be found in [56]. Some concepts and properties of ANNs that are relevant to this work are summarized from [56] in this section.
Figure 3-7 An artificial neural network
Each artificial neuron is a simplified abstraction of a biological neuron. A
neuron receives one or more inputs and sums them to produce an output. A
neuron typically combines the inputs by means of some weighted sum, and
then the result is passed through a non-linear function called activation or
transfer function for the neuron. The output of a neuron is:
y_k = φ( Σ_{j=1..m} w_kj · x_j )    (3.9)

Where:
- m is the number of inputs for the current neuron
- w_kj is the weight associated with the connection between input j and the current neuron
- x_j is the actual input value
- φ is the activation function for the neuron.
Frequently used activation functions include the step function or a sigmoid
function.
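A tiny sketch of this computation for a single neuron with a sigmoid activation is given below; the weights and inputs are arbitrary example values.

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(weights, inputs, activation=sigmoid):
    """Weighted sum of the inputs passed through the activation function (eq. 3.9)."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return activation(v)

print(round(neuron_output([0.5, -0.3, 0.8], [1.0, 2.0, 0.5]), 3))  # sigmoid(0.3) ~ 0.574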
The ANN in Figure 3-7 has neurons arranged in 3 layers: an input layer, a
hidden one and an output layer. Complex systems may have more hidden
layers. For the purpose of this work, networks can be organized as any
directed acyclic graph (feed forward networks).
An artificial neural network is usually defined by
- The topology of the network (the connections between neurons)
- The learning process for updating the weights of the
interconnections
- The activation functions of the neurons
Neural networks can be used to model complex relationships between
inputs and outputs and are frequently employed in such tasks as
classification or pattern recognition.
More complex neural network types were proposed for modeling complex
biological processes, such as cortical development and reinforcement
learning. The Adaptive Resonance Theory (ART), for example, described in
detail in [57], is a special kind of neural network with sequential learning
ability.
The internal structure of neural networks, specifically the presence of the
hidden layers, makes them capable of solving certain classes of difficult
classification problems (such as the non-linearly separable problems). It is
the same complexity, on the other hand, that makes neural networks less
intuitive and more difficult to interpret. A very large corpus of research has
been produced in the last decades on changing the black-box status of
neural networks and exposing the patterns inside.
Three classes of techniques are often used to describe the patterns learned
by a neural network:
- Visualization of the neural network consists of directly describing
the network topology, the weights associated with the connections
and the activation functions of the neurons
- Sensitivity analysis consists in probing the ANN with different test
inputs then recording the outputs and determining, in the process,
the impact or effect of an input variable on the output.
- Rule extraction consists in producing a set of rules that explain the
classification process
Visualization and sensitivity analysis are beyond the scope of this work. The rest of this section presents some of the methods used in extracting rules from neural networks.
The rules extracted from a network may be crisp or fuzzy. A crisp rule is a
proposition offering crisp Yes and No answers, such as the one below:
(3.10)
A fuzzy rule is a mapping from the X input space to the space of fuzzy class
labels, as described in Section 2.4 above.
While chronologically not the first work in the area of rule extraction from
neural network, a 1995 survey on rule extraction, [58], is of particular
interest, as it introduced a frequently used taxonomy of the methods used
for rule extraction from ANNs, based on the expressive power of the rules,
the translucency of the technique (relationship between rules and ANNs
structure), quality of the rules (accuracy, fidelity to the ANNs
computations, comprehensibility), algorithmic complexity and the
treatment of variables. The taxonomy has been updated in 1998 in [59] to
cover a broader range of ANNs, such as recurrent networks. One of the first
methods for extracting rules from a neural network was proposed by Saito
and Nakano in 1988, in [60]. It is a sensitivity analysis approach, which
observes the effects that changes in the inputs cause on the network
output. The problem raises challenges due to the large number of input
combinations that need to be evaluated. The authors employ a couple of
heuristics to deal with this problem, such as limiting the number of
predicates that may appear in an input.
In 1989, it was shown, in [61], that multilayer feed-forward networks are universal approximators, i.e. they can uniformly approximate any real continuous function on a compact domain. In 1994, the same thing was shown, in [62], for certain fuzzy rule based systems (FRBS), specifically
fuzzy additive systems, i.e. systems based on rules such as:
(3.11)
where p_jk is a linear function of the inputs.
This equivalence led authors to discuss the equivalence of neural nets and
fuzzy expert systems, as shown in [63]. In 1998, Benítez et al. offer a constructive proof in [64] for the equivalence of certain neural networks and certain fuzzy rule based systems (FRBS). They show how to create a fuzzy additive system from a neural network with 3 layers (a single hidden layer) which uses a logistic activation function in hidden neurons and an identity function in output neurons.
particularly interesting in the context of this work as it provides context for
some of the results presented in Chapter 4 below.
More work on the level of equivalence between fuzzy rule-based systems
and neural networks is presented in [65]. The authors provide a survey of
neuro-fuzzy rule generation algorithms. This work is used in 2005 in [66] to extract IF-THEN rules from a fuzzy neural network and explain to drug
designers, in a human-comprehensible form, how the network arrives at a
particular decision.
More recently, in 2011, Chorowski and Zurada introduced a new method in
[67], called LORE (Local Rule Extraction), suited for multilayer networks with
logical or categorical (discrete) inputs. A multilayer perceptron is trained
under standard regime and then converted to an equivalent form that
mimics the original network and allows rule extraction. A new data
structure, the Decision Diagram, is introduced, which allows efficient partial
rule merging. Also, a rule format is introduced which explicitly separates the subsets of inputs for which the answer is known from those with an undetermined answer.
4 Contributions to Rule Generalization
This chapter is organized as follows. The first subsection describes some
concepts related to fuzzy rules generalization and simplification, while the
second section briefly discusses several methods for optimizing and
simplifying the rule sets. The third section focuses on one of these methods
(the Rule Base Simplification based on Similarity Measures).
The fourth section presents a rule generalization algorithm introduced in [1]
for rules extracted from Fuzzy ARTMAP classifiers. The algorithm is then
adapted to rule sets produced by common rule extraction algorithms, such
as apriori. The last section contains some ideas for further research and
some conclusions.
4.1 Fuzzy Rules Generalization
One of the aspects that distinguish fuzzy modeling from other black-box
approaches like neural nets is that fuzzy models are, to a certain degree,
transparent to interpretation and analysis. However, the transparency of a
fuzzy model is not achieved automatically. A system can be described with a
few rules using distinct and interpretable fuzzy sets but also with a large
number of highly overlapping fuzzy sets that hardly allow for any
interpretation.
Description of a system using natural language is an advantage of fuzzy
modeling. A simplified rule base makes it easier to assign qualitatively
meaningful linguistic terms to the fuzzy sets, and it reduces the number of
terms needed. It becomes easier for experts to validate the model and the
users can understand better and more quickly the operation of the system.
A model with fewer fuzzy sets and fewer rules is also better suited for the
design and implementation of a nonlinear (model-based) controller, or for
simulation purposes, and it has lower computational demands. Several
methods have been proposed for optimizing the size of the rule base
obtained with automated modeling techniques, and some of them are
discussed in this chapter. One of them, discussed in detail in Section 4.3 on
Similarity Measures and Rule Base Simplification, consists in measuring the
similarity of fuzzy rules and sets and merging them in order to simplify the
model. We build on the concepts introduced by this work and propose a
new method of simplifying the rule set by generalizing the rules in the
model, using data mining rule concepts such as support and accuracy.
4.1.1 Redundancy
Fuzzy models, especially if acquired from data, may contain redundant
information in the form of similarity between fuzzy sets. Three unwanted
effects that can be recognized are
1) Similarity between fuzzy sets in the model;
2) Similarity of a fuzzy set to the universal set;
3) Similarity of a fuzzy set to a singleton set.
As similar fuzzy sets represent compatible concepts in the rule base, a
model with many similar fuzzy sets becomes redundant, unnecessarily
complex and computationally demanding.
Some of the fuzzy sets extracted from data may be similar to the universal
set. Such fuzzy sets are irrelevant. The opposite effect is similarity to a
singleton set. During adaptation, membership functions may get narrow,
resulting in fuzzy sets almost like singletons (spikes). If a rule has one or
more such fuzzy sets in its premise, it will practically never fire, and thus the
rule does not contribute to the output of the model. However, it should be
noted that such rules may represent exceptions from the overall model behavior.
4.1.2 Similarity
Different measures have been proposed for the similarity of fuzzy sets. In general, they can be divided into:
- Geometric similarity measures (e.g. the Minkowski class of distance functions)

  D(A, B) = ( Σ_j |μ_A(x_j) − μ_B(x_j)|^p )^(1/p)    (4.1)

- Set-theoretic similarity measures (e.g. the consistency index):

  S(A, B) = max_x [ μ_A(x) ∧ μ_B(x) ]    (4.2)

  where ∧ is the minimum operator
Setnes et al., in [68], describe some of the problems associated with using
these measures. The paper defines a set of criteria for such a measure and
introduces one such measure, which is discussed in detail in Section 4.3 below.
4.1.3 Interpolation based rule generalization techniques
Takagi-Sugeno and Mamdani models perform inferences under the
assumption that the rule set completely covers the inference space (i.e. it is
dense). Interpolative reasoning methods address the problem of sparse rule
sets, which do not cover the whole inference space.
Mizumoto and Zimmermann, in [69], analyze the properties of rule
models and the possibility to interpolate new rules in the generalized
modus tollens. A modus tollens rule may be written, in logical operator
notation, as
((A ⇒ B) ∧ ¬B) ⇒ ¬A    (4.3)
In 1993, in [70], Kóczy and Hirota propose a method (KH-rule interpolation) for interpolations where results are inferred based on the computation of each α-cut level, and the resulting points are connected by linear pieces to yield an approximate conclusion.
4.2 Rule Model Simplification Techniques
Extensive research is available for rule model simplification techniques.
Such techniques may target the feature set considered for rule inference,
the definition of the fuzzy sets participating in the rules or the structure of
the rules models.
4.2.1 Feature set alterations
Feature set alteration techniques share the goal of reducing the number of
features that participate in the inference process. A direct consequence of
applying such alteration techniques is that they result in simplified rule
systems, because a reduction in the number of features implies a smaller
number of predicates in the rules' premises. Such alterations can be classified
as Feature Extraction or Feature Selection techniques.
Feature Extraction techniques allow synthesizing a new, lower-dimensional feature set which encompasses all or most of the variance of the original feature set (i.e. the original information is preserved or the loss is minimal). Such techniques include Principal Component Analysis (aka the Karhunen-Loève transform), described in [71], which consists in identifying
the eigenvectors of the covariance matrix of the training data and
projecting the data on these eigenvectors. The eigenvalues associated with
these eigenvectors provide a measure of the variance of the whole system
along these vectors and consequently allow sorting the new coordinates
(the eigenvectors) in the order of variance. Frequently, for real data sets, a
low number of eigenvectors can account for 95% or more of the variance in
data.
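A minimal sketch of this procedure (covariance eigendecomposition, sorting by explained variance, projection) is shown below using NumPy; the data and the 95% threshold are arbitrary examples.

import numpy as np

def pca_project(data, variance_to_keep=0.95):
    """Project `data` (rows = samples) onto the top eigenvectors of its covariance
    matrix, keeping enough of them to explain the requested variance."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)     # ascending order
    order = np.argsort(eigenvalues)[::-1]               # sort by variance, descending
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained = np.cumsum(eigenvalues) / eigenvalues.sum()
    k = int(np.searchsorted(explained, variance_to_keep)) + 1
    return centered @ eigenvectors[:, :k], explained[k - 1]

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
data[:, 4] = data[:, 0] * 2 + 0.01 * rng.normal(size=100)   # one redundant column
projected, kept = pca_project(data)
print(projected.shape, round(float(kept), 3))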
A similar feature extraction technique is Sammon's non-linear projection [72]. In this approach, a set of high-dimensional vectors is projected in a low-dimension space (2 or 3 dimensions) and a gradient descent technique is used to adjust the projections so that the distance between projections is as close as possible to the distance between the original pairs of vectors. As the preservation of the semantic meaning is a major advantage of fuzzy rule models, techniques for feature transformation (which inherently alter the model's semantics) are not treated in depth in this work.
Feature Selection techniques do not create new features, but rather
identify the top most significant features to be used in building a model. On
real data sets, this approach often provides very good results because of
redundancy, co-linearity or irrelevance of certain data dimensions. Dash
and Liu, in [73], provide an extensive overview of the feature selection
techniques commonly used in classification systems. A very popular
technique for feature selection is the information gain method, introduced
in [54]. The information gain feature selection method sorts the input
features by the amount of entropy they reduce from the whole system and
can be used to determine which features should be retained, by keeping
those whose information gains are greater than a predetermined threshold.
Feature selection does not affect the semantic meaning of the rule model
and is used for rule simplification techniques.
4.2.2 Changes of the Fuzzy sets definition
Song et al., in [74], suggest using supervised learning to adapt the parameters of the fuzzy membership functions defining the components of the rules. With the assumption that the inference surface is
relatively smooth, over-fitting of the fuzzy system can be detected in two
ways. Two membership functions coming sufficiently close to each other
can be fused into a single membership function, and membership functions
becoming too narrow can be deleted. In both cases, this adaptive pruning
improves the interpretability of the fuzzy system. This approach is related to our proposed method for rule generalization and the methods will be compared in Section 4.4 below.
4.2.3 Merging and Removal Based Reduction
Automatically generated rule systems often produce redundant, similar,
inconsistent or inactive rules. Handling of similar rules is detailed in the next
section, covering Similarity Measures and Rule Base Simplification.
Inconsistent rules destroy the logical consistency of the models. Xiong and Litz, in [75], propose a consistency index, a numerical assessment which helps measure the level of consistency/inconsistency of a rule base. They
use this index in the fitness function of a genetic algorithm which searches a
set of optimal rules under two criteria: good accuracy and minimal
inconsistency.
4.3 Similarity Measures and Rule Base Simplification
Setnes et al., in [68], propose a similarity measure for rules in a model.
Based on this measure, similar fuzzy sets are merged to create a common
fuzzy set to replace them in the rule base, with the goal of creating a more
efficient and more linguistically tractable model.
A similarity measure for two fuzzy sets, A and B, is defined as a function
S(A, B) ∈ [0, 1]    (4.4)
A set of 4 criteria for a similarity measure is first introduced in [68]:
- Non-overlapping fuzzy sets should be totally non-equal. That is,

  μ_A(x) ∧ μ_B(x) = 0, ∀x ∈ X ⇒ S(A, B) = 0    (4.5)

- Overlapping fuzzy sets should have a similarity value greater than 0:

  ∃x ∈ X: μ_A(x) ∧ μ_B(x) ≠ 0 ⇒ S(A, B) > 0    (4.6)

- Only equal fuzzy sets should have a similarity value of 1:

  S(A, B) = 1 ⟺ μ_A(x) = μ_B(x), ∀x ∈ X    (4.7)
- Similarity between two fuzzy sets should not be
influenced by scaling or shifting the domain on which
they are defined
With these criteria, [68] proposes a new similarity measure, based on set theory, defined as:

S(A, B) = |A ∩ B| / |A ∪ B|    (4.8)

This measure is, therefore, the ratio between the cardinality of the intersection and the union of the sets. When the equation is rewritten using the membership functions, in a discrete space X = (x_1, x_2, …, x_n), it becomes:

S(A, B) = ( Σ_{j=1..n} [μ_A(x_j) ∧ μ_B(x_j)] ) / ( Σ_{j=1..n} [μ_A(x_j) ∨ μ_B(x_j)] )    (4.9)

The operators are, respectively, minimum (∧) and maximum (∨). This similarity measure complies with the four criteria above and reflects the idea of a gradual transition from equal to completely non-equal fuzzy sets.
With this measure defined, [68] proceeds to simplifying the rule base. Rules that are similar to the universal fuzzy set (S(A, U) ≈ 1) can, for example, be removed.
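A short sketch of the similarity measure in equation (4.9), for two fuzzy sets sampled on a discrete domain, is shown below; the membership values are arbitrary examples.

def fuzzy_similarity(mu_a, mu_b):
    """Set-theoretic similarity (eq. 4.9): sum of pointwise minima
    divided by the sum of pointwise maxima of the membership values."""
    numerator = sum(min(a, b) for a, b in zip(mu_a, mu_b))
    denominator = sum(max(a, b) for a, b in zip(mu_a, mu_b))
    return numerator / denominator if denominator else 1.0

# Two overlapping sets sampled on the same discrete domain
mu_a = [0.0, 0.3, 0.7, 1.0, 0.7, 0.3, 0.0]
mu_b = [0.0, 0.0, 0.3, 0.7, 1.0, 0.7, 0.3]
print(round(fuzzy_similarity(mu_a, mu_b), 3))   # 0.5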
The paper also provides a solution for merging similar fuzzy sets. For this, it uses a parametric trapezoidal representation of the fuzzy sets, each set being described by four parameters:

μ(x; a_1, a_2, a_3, a_4)    (4.10)
The merging of two similar fuzzy sets, A and B, defined by μ_A(x; a_1, a_2, a_3, a_4) and μ_B(x; b_1, b_2, b_3, b_4), is defined as a new fuzzy set, C, defined by μ_C(x; c_1, c_2, c_3, c_4), where:

c_1 = min(a_1, b_1)
c_4 = max(a_4, b_4)
c_2 = λ_2 · a_2 + (1 − λ_2) · b_2
c_3 = λ_3 · a_3 + (1 − λ_3) · b_3    (4.11)

In the definition of the C fuzzy set, λ_2 and λ_3 are between 0 and 1 and determine which fuzzy set, A or B, has more influence on the newly generated set C, with a default value of 0.5 for both.
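A small sketch of this merging rule for two trapezoidal sets, using the default weight of 0.5 for both inner parameters, is given below; the function name and the example sets are ours.

def merge_trapezoids(a, b, lam2=0.5, lam3=0.5):
    """Merge two trapezoidal fuzzy sets (a1..a4) and (b1..b4) into one set (eq. 4.11)."""
    a1, a2, a3, a4 = a
    b1, b2, b3, b4 = b
    c1 = min(a1, b1)
    c4 = max(a4, b4)
    c2 = lam2 * a2 + (1 - lam2) * b2
    c3 = lam3 * a3 + (1 - lam3) * b3
    return (c1, c2, c3, c4)

# Two similar "moderate temperature" sets, merged into a single replacement set
print(merge_trapezoids((10, 15, 20, 25), (12, 16, 22, 28)))  # (10, 15.5, 21.0, 28)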
Figure 4-1 - Creating a fuzzy set C to replace two similar sets A and B (from [68])
With the merging solution described above, the authors propose an
algorithm for simplifying the rules in the model. The algorithm performs the
following steps:
- Select the most similar pair of fuzzy sets
- If the similarity score exceeds a given merging threshold, then merge the two fuzzy sets and update the rule set
- Repeat until no pair of fuzzy sets exceeds the threshold
- For each rule in the system, compute the similarity with the universal set U (μ_U(x) = 1, ∀x ∈ X). If the similarity with the universal set exceeds a given threshold, then remove the rule from the set (too universal)
- Merge the rules with identical premise parts
Figure 4-2 Merging of similar rules (from [68])
Further work in [76] refines the method in [68] by the following steps:
- Reduce the feature set by feature selection
- Apply the method in [68]
- Apply a Genetic Algorithm to improve the accuracy of the
rules. To maintain the interpretability of the rule set, the
genetic algorithm step is restricted to the neighborhood
of the initial rule set
4.4 Rule Generalization
In [1], four molecular descriptors are used (molecular weight, number of H-bond donors and acceptors, and ClogP) to predict biological activity (IC50). In the paper, we introduced a novel rule generalization algorithm and a rule inference procedure able to improve the rules extracted from a neural
network. This section describes the rule generalization algorithm, discusses
the results and proposes some directions for further research.
4.4.1 Problem and context
In [1], the IC50 prediction task uses a FAM-type prediction technique called Fuzzy ARTMAP with Relevance (FAMR).
The Adaptive Resonance Theory (ART), described in detail in [57], is a
special kind of neural network with sequential learning ability. ART's pattern
recognition features are enhanced with fuzzy logic in the Fuzzy ART model,
introduced in [77].
The FAMR is an incremental, neural network-based learning system used for
classification, probability estimation, and function approximation,
introduced in [78]. The FAMR architecture is able to sequentially
accommodate input-output sample pairs. Each such pair may be assigned a
relevance factor, proportional to the importance of that pair during the
learning phase.
FAM networks have the capability to easily expose the learned knowledge
in the form of fuzzy IF/THEN rules; several authors have addressed this issue
for classification tasks, such as [79], [80]. The final goal in generating such
rules would be to explain, in human-comprehensible form, how the
network arrives at a particular decision, and to provide insight into the
influence of the input features on the target. To the best of our knowledge,
no author has discussed FAM rule extraction for function approximation
tasks, such as IC50 prediction.
Carpenter and Tan, in [79] and [81], were the first to introduce a FAM rule extraction procedure. To reduce the complexity of the fuzzy ARTMAP, a pruning procedure was also introduced. In [1] we adapt Carpenter and Tan's rule extraction method for function approximation tasks with the FAMR.
4.4.2 The rule generalization algorithm
Let O be the set of rules extracted from the FAMR model. In this section,
the quality of the rules in O is analyzed from the perspective of the
confidence (conf) and support (supp) properties described in Section 2.3.1
above.
The rules in O have support between 0.0% and 16.47%, and confidence between 0.00% and 100.00%. To ensure the quality of the final rule set, we use a minimum confidence and a minimum support criterion for the output rules and prune from the extracted set the rules which do not meet these minimum support and confidence criteria.
The set of rules extracted this way has the following characteristics:
- All rules are complete with regard to the input descriptors (the
antecedent of each rule contains, therefore, one predicate for each
descriptor), a consequence of the rule extraction algorithm.
- Certain descriptor fuzzy categories do not appear in any rule.
To further analyze this rule set, we introduce two new measures for the rule set (a computational sketch of both follows the list below):
- Coverage: the percentage of training data points with the following property: there exists at least one rule for which the molecule's descriptors fall within the range of the antecedent (i.e. the percentage of points for which at least one rule is triggered).
- Accuracy: the percentage of training data points with the following property: there exists at least one rule for which the molecule's descriptors fall within the range of all antecedents and, in addition, the output falls within the range of the consequent (i.e. the percentage of points for which a correct rule is triggered).
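The sketch below illustrates how coverage and accuracy could be computed over binned training data. The rule and data point representations (dictionaries of descriptor buckets) are illustrative assumptions, not the structures used in [1].

# Minimal sketch: coverage and accuracy of a rule set over binned training data.
# A rule is modeled as (antecedent, consequent), where the antecedent maps each
# descriptor name to the bucket it requires and the consequent is the required
# output bucket. Data points map descriptor names (and "Y") to buckets.

def antecedent_matches(rule, point):
    antecedent, _ = rule
    return all(point[descriptor] == bucket for descriptor, bucket in antecedent.items())

def rule_is_correct(rule, point):
    _, consequent = rule
    return antecedent_matches(rule, point) and point["Y"] == consequent

def coverage(rules, points):
    """Fraction of points for which at least one rule is triggered."""
    covered = sum(1 for p in points if any(antecedent_matches(r, p) for r in rules))
    return covered / len(points)

def accuracy(rules, points):
    """Fraction of points for which at least one triggered rule also predicts the output."""
    correct = sum(1 for p in points if any(rule_is_correct(r, p) for r in rules))
    return correct / len(points)

# Example usage with two toy rules and three toy points:
rules = [({"X1": "B1", "X2": "B2"}, "Excellent"),
         ({"X3": "B4"}, "Poor")]
points = [{"X1": "B1", "X2": "B2", "X3": "B1", "Y": "Excellent"},
          {"X1": "B1", "X2": "B2", "X3": "B1", "Y": "Poor"},
          {"X1": "B5", "X2": "B5", "X3": "B5", "Y": "Poor"}]
print(coverage(rules, points), accuracy(rules, points))  # 0.66..., 0.33...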
Assuming that some rules are too specific to the training set (overfitting), we attempt to generalize them by applying a greedy Rule Generalization Algorithm (RGA). The RGA is applied to each rule in the set.
Rule Generalization Algorithm (RGA). Let a rule R be represented as
R: (X1 = x1, X2 = x2, . . . , Xn = xn) → (Y = y)    (4.12)
Relax R by replacing one predicate Xi = xi with a wild card value, representing any possible state and designated by the (Xi = *) notation. By definition, the newly formed rule has the same or better support, as its antecedent is less restrictive. If the newly formed rule's confidence meets the minimum confidence criterion, then keep it in a pool of candidates. This procedure is applied to all the predicates in the rule, resulting in at most n generalized rules (where n is the number of predicates in the original rule), each with support better than or equal to that of the original rule. If the candidate pool is not empty, replace the original rule with the candidate which maximizes the confidence. The algorithm is applied recursively to the best generalization and stops when the candidate pool is empty (no better generalization can be found).
The RGA's goal is to relax the rules by trying to improve, at each step, the rule support, without sacrificing accuracy beyond the minimum acceptable confidence level.
Figure 4-3 A visual representation of the RGA
Figure 4-3 provides a visual representation of the way the RGA works. Consider a rule R: (X=High, Y=High) → (Target = t). If, after relaxing the Y=High condition, the new rule R': (X=High, Y=*) → (Target = t) has sufficient accuracy (the support is already guaranteed), then R' becomes a candidate for replacing R.
In the worst case, the number of predicate replacements for each rule is in O(n^2). Any relaxation of a rule increases (or does not change) the support of that rule; relaxation therefore never degrades support, while the minimum confidence criterion ensures that the accuracy of the retained generalization remains acceptable.
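The sketch below illustrates one way the greedy RGA could be implemented. It assumes rules are stored as a tuple of per-descriptor bucket values (with None standing for the * wild card) together with a consequent, and that support and confidence are estimated by counting over binned training data; the representation and names are illustrative assumptions, not the implementation from [1].

# Greedy Rule Generalization Algorithm (RGA) - illustrative sketch.
# A rule is (antecedent, consequent): the antecedent is a tuple of bucket labels,
# one per descriptor, where None plays the role of the "*" wild card.
# Each data point is (descriptor_buckets_tuple, output_bucket).

def matches(antecedent, descriptors):
    return all(a is None or a == d for a, d in zip(antecedent, descriptors))

def support_and_confidence(rule, data):
    antecedent, consequent = rule
    triggered = [out for desc, out in data if matches(antecedent, desc)]
    if not triggered:
        return 0.0, 0.0
    correct = sum(1 for out in triggered if out == consequent)
    return len(triggered) / len(data), correct / len(triggered)

def generalize_rule(rule, data, min_confidence):
    """Repeatedly relax one predicate at a time, keeping the best candidate."""
    while True:
        antecedent, consequent = rule
        candidates = []
        for i, value in enumerate(antecedent):
            if value is None:
                continue                      # this predicate is already relaxed
            relaxed = (antecedent[:i] + (None,) + antecedent[i + 1:], consequent)
            supp, conf = support_and_confidence(relaxed, data)
            if conf >= min_confidence:
                candidates.append((conf, supp, relaxed))
        if not candidates:
            return rule                       # no better generalization can be found
        # keep the candidate which maximizes the confidence (support as tie-breaker)
        rule = max(candidates, key=lambda c: (c[0], c[1]))[2]

# Example usage (hypothetical binned data):
# data = [(("B1", "B2", "B2", "B3"), "Excellent"), ...]
# generalize_rule((("B1", "B2", "B2", "B3"), "Excellent"), data, min_confidence=0.9)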
Example of iteratively applying the RGA: this example is extracted from the original experimental results presented in [1]. Let R be a complete rule in the original O set. As mentioned previously, all rules contain one predicate for each of the four inputs.
The values for each of the descriptors are binned in 5 buckets (B1-B5); see Chapter 6 below, presenting experimental results, for details.
R: (X1 = B1, X2 = B2, X3 = B2, X4 = B3) → (Y = Excellent), with sup(R) = 6.25%, conf(R) = 90.9%    (4.13)
Upon relaxing all the predicates associated with R and evaluating the
confidence and support for the relaxed derivatives, the best derivative is
selected:
R': (X1 = *, X2 = B2, X3 = B2, X4 = B3) → (Y = Excellent), with sup(R') = 8.52%, conf(R') = 93.3%    (4.14)
After applying the algorithm one more time to the generalized rule R', we obtain a better generalization:
R'': (X1 = *, X2 = B2, X3 = *, X4 = B3) → (Y = Excellent), with sup(R'') = 13.06%, conf(R'') = 95.65%    (4.15)
4.4.3 Applying the RGA to an apriori-derived set of rules
As described in Section 3.1.1, the most commonly used rule extraction
algorithm, apriori, produces a set of variable-length rules, having the
predicates in the antecedent sorted, usually lexicographically. Certain
apriori derivatives, such as Multiple Support apriori (discussed in 3.1.4) may
use a different sort order, but this order is preserved for all the rules that
are extracted by the algorithm.
This common sort order of the predicates, shared among all the rules in the rule set, allows for a fast way of applying the Rule Generalization Algorithm (introduced in the previous section) to apriori-produced rule sets. The following property justifies the application of the RGA to sets of rules characterized by a shared sort order of the antecedent predicates.
Property 4.1: Consider two rules in a rule set having the same consequent, C, each rule defined by a set of predicates Pi in its antecedent: R1: ({P1} → C), R2: ({P2} → C). If P1 ⊂ P2 then R1 is a generalization of R2, similar to the candidate wildcard rules introduced for the RGA.
Rationale: if P1 is a proper subset of P2, then P2 contains at least one predicate Ci: Xi = xi with Ci ∉ P1. Each such predicate Ci in the definition of P2 can be relaxed, resulting in P2' = {P1, Xi = *}. By repeating this for each Ci ∈ P2, Ci ∉ P1, a relaxation of P2 is obtained which is identical to P1.
Based on property 4.1, we propose an algorithm for simplifying apriori-like
rule sets. The algorithm traverses the set of lexicographically sorted rules
maintaining a stack of rule antecedents encountered during the scan. If a
rule matches one of the stacked prefixes, we check if the rule can be
generalized by one of the previous rules.
The algorithm is presented below:
Parameters:
    T - a set of rules sharing the predicate order in the antecedent
Output:
    T' - a set of generalized rules
Initialization:
    Sort the rules by consequent (resulting in subgroups Gi ⊂ T, where all rules
    in one such Gi share the consequent)
For each group Gi
    Reset the prefix stack S
    For each rule R ∈ Gi (as all rules in Gi share the consequent, R can be
    considered to be the antecedent)
        While S ≠ ∅ (traverse the stack)
            If S.top ⊆ R then
                If the confidence of S.top is satisfactory then
                    S.top is a generalization of R (and R can be dismissed)
                End if
                Exit the while loop
            Else Pop(S) // the stacked prefix does not match, remove it
        End while // stack traversal is complete
        If R has not been dismissed then
            Copy R to T'
            Push R onto stack S
        End if
    End for each
End for each
For a simple example, consider a trivial rule set consisting of three rules, as
below:
R1: (X1 = a) → (Y = Excellent)
R2: (X1 = a AND X2 = b) → (Y = Excellent)
R3: (X1 = c) → (Y = Excellent)
(4.16)
Rule R1 is the first rule being read. The stack is empty, so R1 will not be dismissed by a previous generalization. After processing R1, it is added both to the stack and to the output set T'.
When rule R2 is being read, the top of the stack contains the antecedent of R1, X1 = a, which is included in the antecedent of R2. R1 is, therefore, a candidate generalization of R2 and R2 may be dismissed.
When rule R3 is being read, the content of the stack does not share the prefix of the rule, so the stack will be emptied.
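A minimal Python sketch of this stack-based simplification is shown below. It assumes rules are stored as (antecedent tuple, consequent, confidence), with the antecedent predicates kept in the shared sort order, and a hypothetical minimum confidence threshold; it is an illustrative rendering, not the implementation evaluated in Chapter 6.

# Sketch of the stack-based simplification of an apriori-like rule set.
# Rules are (antecedent, consequent, confidence), with the antecedent kept as a
# tuple of predicates in the shared (e.g. lexicographic) sort order.

from collections import defaultdict

def simplify_rules(rules, min_confidence):
    by_consequent = defaultdict(list)
    for rule in rules:
        by_consequent[rule[1]].append(rule)

    output = []
    for consequent, group in by_consequent.items():
        group.sort(key=lambda r: r[0])          # shared lexicographic order of antecedents
        stack = []                              # previously kept rules (potential generalizations)
        for antecedent, _, confidence in group:
            dismissed = False
            while stack:
                top_antecedent, top_confidence = stack[-1]
                if set(top_antecedent) <= set(antecedent):   # stacked prefix is included
                    if top_confidence >= min_confidence:
                        dismissed = True        # the stacked rule generalizes this one
                    break
                stack.pop()                     # the stacked prefix does not match, remove it
            if not dismissed:
                output.append((antecedent, consequent, confidence))
                stack.append((antecedent, confidence))
    return output

# Example usage with the trivial rule set (4.16):
rules = [(("X1=a",), "Excellent", 0.95),
         (("X1=a", "X2=b"), "Excellent", 0.97),
         (("X1=c",), "Excellent", 0.90)]
print(simplify_rules(rules, min_confidence=0.9))  # R2 is dismissed, R1 and R3 are kept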
As shown in Chapter 6 below, presenting experimental results, this algorithm produces very significant rule set simplifications. The experiments suggest that this algorithm reduces the number of rules in a system to 10%-20% of the original count. The complexity of the calculations is relatively small, at most O(n^2) in-memory operations (using the stack), where n is the cardinality of the rule set.
The RGA presented in the previous section also needs to estimate the support and confidence for each generalized rule. This is typically done by scanning the data set (or by using additional memory in an index structure, such as an FP-tree). The apriori flavor of the RGA does not require any additional scans of the data.
Some weaknesses of the algorithm are easy to point out. For example, it is easy to show that the greedy nature of the algorithm prevents detection of all possible generalizations of the rule set. Consider a system containing rules like:
R1: (X1 = a AND X2 = b) → (Y = Excellent)
R2: (X2 = b) → (Y = Excellent)
(4.17)
Although R2 is a generalization of R1, it will not be detected by the algorithm because it appears in lexicographic order after R1.
4.5 Conclusion
We presented some of the recent research work regarding the generalization and simplification of rule systems. Much of this work is related to the space of fuzzy rules.
The rule generalization algorithm introduced in this chapter produced very
promising experimental results, as shown in Chapter 6 below. Some known
weaknesses of the proposed algorithm suggest directions for further
research.
4.5.1 Future directions for the basic rule generalization algorithm
The RGA discussed above currently works by eliminating entire slices of the premise space from the rule antecedents. While this approach produced good results in our experiments, it is probably too coarse. A better solution, although more computationally intensive, may be to check the neighborhood of the initial antecedent and merge those areas which, when added to the antecedent, keep the rule's accuracy above the minimum confidence criterion.
Figure 4-4 A finer grain approach to rule generalization
Figure 4-4 describes such a possible implementation. Consider a rule R: (X=High, Y=High) → (Target = t). The current algorithm relaxes, say, the Y=High condition and produces a new rule R': (X=High, Y=*) → (Target = t), which may not have sufficient accuracy to replace R. Rather than removing Y=High, the algorithm could investigate the vicinities of the original antecedent cell (such as Y=Medium or Y=Very High). The generalization would then result in rules such as:
R': (X=High, Y ∈ {High, Medium, Very High}) → (Target = t)    (4.18)
This consists, in essence, of merging the antecedent parts of two rules as long as they are adjacent, they share the consequent and the resulting rule does not fall below the minimum confidence threshold; a rough sketch of such a neighbor-merging step is given below.
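The sketch below only illustrates this direction for further research; it is not an implementation from [1] or from the cited works. It assumes an ordered set of buckets, an antecedent represented as a tuple of sets of admissible buckets, and data points of the form (descriptor buckets, output bucket); all names are hypothetical.

# Hypothetical sketch of the finer-grain generalization: instead of replacing a
# predicate with "*", grow it to adjacent buckets while confidence stays acceptable.
# BUCKETS defines an assumed ordering of the bins.

BUCKETS = ["Very Low", "Low", "Medium", "High", "Very High"]

def confidence(antecedent, consequent, data):
    triggered = [out for desc, out in data
                 if all(d in allowed for d, allowed in zip(desc, antecedent))]
    return (sum(1 for out in triggered if out == consequent) / len(triggered)) if triggered else 0.0

def grow_predicate(antecedent, index, consequent, data, min_confidence):
    """Try to add the buckets adjacent to the current range of descriptor `index`."""
    for step in (-1, +1):
        allowed = antecedent[index]
        positions = sorted(BUCKETS.index(b) for b in allowed)
        neighbor = positions[0] + step if step < 0 else positions[-1] + step
        if 0 <= neighbor < len(BUCKETS):
            candidate = list(antecedent)
            candidate[index] = allowed | {BUCKETS[neighbor]}
            candidate = tuple(candidate)
            if confidence(candidate, consequent, data) >= min_confidence:
                antecedent = candidate        # keep the extension that stays accurate
    return antecedent

# Example: starting from (X=High, Y=High), attempt to grow the Y predicate.
# antecedent = ({"High"}, {"High"}); grow_predicate(antecedent, 1, "t", data, 0.9)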
In the data problem we treated in [1], as well as in many real applications of rule systems, the predicates in the antecedent, as well as in the consequent, represent binning ranges of continuous variables. In this case, for a rule R: (Xi = xi) → (Y = yi) we can define a function p: (Xi = xi) → [0, 1] which describes the probability density of the Y = yi predicate over the Xi = xi area of the space. The accuracy of the rule R can then be thought of as the ratio between the integral of this probability density function p and the integral of the constant function u = 1 defined on the same area, (Xi = xi):
acc(R) = ∫_{Xi = xi} p dx / ∫_{Xi = xi} u dx    (4.19)
Figure 4-5 Accuracy of a fuzzy rule as a measure of similarity with the universal set
Let us now consider p, the probability density function, as the membership function of a fuzzy set. In this interpretation, and using the similarity measure introduced by Setnes in [68] and discussed in Section 4.3 above, the confidence of the rule becomes the similarity measure between the fuzzy set defined by (p, X = xi) and the universal set.
It may be interesting to investigate whether this idea could be carried over to the space of fuzzy rules, as a way of merging adjacent fuzzy sets that serve as premises for Takagi-Sugeno rules with similar consequents, as suggested in Figure 4-5.
From an implementation perspective, it is interesting to notice that the algorithm allows block evaluation of multiple measurements. In a typical relational database, all the neighbors of the premise space could be evaluated in a single pass over the data, using GROUP BY relational algebra constructs. This would likely produce significant performance gains. Recent developments in the space of in-memory database systems (such as [82], [83]) may be useful in addressing the cost of computing the accuracy and support while relaxing predicates.
4.5.2 Further work for the apriori specialization of the RGA
The reduction in the number of rules, as presented by the experimental results, is significant. This reduction makes the rule set more accessible and easier to interpret. Additional work is required, though, to estimate the predictive power of the reduced rule set and to measure the accuracy tradeoff introduced by this rule set simplification technique.
As mentioned in Section 4.4.3, the greedy nature of the algorithm prevents detection of all possible generalizations of the rule set. A different direction for further work is investigating whether a more complex data structure, possibly combined with a new sort order which takes into account the antecedent's length before the lexicographic order, may address this issue.
More work is also needed to study the possibility of applying the rule
generalization algorithm to the area of multiple-level association rules
described in [84] (and also in Section 2.3.2 above).
5 Measuring the Usage Prediction Accuracy of
Recommendation Systems
Recommendation systems are some of the most popular applications for
data mining technologies. They are generally employed to use opinions of a
community of users in order to identify content of interest for other users.
Commercial implementations, such as Amazon's, described in [85], are helping users choose from an overwhelming set of products. The importance of recommendation systems for industry is emphasized by the Netflix prize [86], which attracted 51051 contestants, on 41305 teams from 186 countries (as of June 2011), in trying to build a movie recommendation system that exceeds the performance of Netflix's in-house developed system, Cinematch.
In this chapter, we focus on metrics used for usage prediction accuracy on
offline datasets. The remaining content is structured as follows:
- An introduction to the usage of Association Rules for
recommendation systems
- An overview of the most commonly used instruments and metrics
for evaluating usage prediction accuracy
- A new instrument (diagram) proposed for evaluating usage prediction accuracy and comparing different recommendation systems.
- Implementation observations for the aforementioned instrument
5.1 Association Rules as Recommender Systems
Developers of one of the first recommender systems, [87], coined the term Collaborative Filtering (CF) to describe a system which entails people collaborating to help each other perform filtering by recording their reactions to documents they read. The reactions are called annotations; they can be accessed by other people's filters. The term ended up being used interchangeably with the term recommender system.
This area generated a great deal of scientific interest, and some recent surveys, such as [88], present in detail the algorithms and techniques employed in recommender systems. Item-based collaborative filtering recommendation algorithms were introduced in [89], where the authors compare such systems against user-based recommender systems. In [90], the authors show that the Apriori algorithm offers a large improvement in stability and robustness and can achieve recommendation accuracy comparable to other commonly employed methods, such as k-Nearest Neighbor systems.
5.2 Evaluating Recommendation Systems
Recommendation systems may be employed to annotate entities in their context (such as filtering through structured discussion postings to discover which may be of interest to a reader, [87]) or to find good items, as in the Netflix prize [86] or the Amazon recommendation engine [85].
From an implementation perspective, some of these systems may predict item ratings (such as Netflix, [86]), while others predict the probability of usage (e.g. of purchase), such as Amazon's [85]. More complex systems may serve as intelligent advisors, comprehensive tools which use behavioral science techniques to guide a customer through a purchase decision process and learn while doing this, as described in [91].
These differences in usage make comparing and evaluating the accuracy of such systems a difficult task, as they are often tuned for specific problems or datasets. A very thorough analysis of the problem of evaluating and comparing recommendation systems is presented by J. Herlocker et al. in [92] and, more recently, by A. Gunawardana in [93]. Both surveys present the tasks that are commonly accomplished by recommendation systems, the types of analysis and datasets that can be used, and the ways in which prediction quality can be measured.
Most of the research on evaluating recommendation systems focuses on the problem of accuracy, under the assumption that a system which provides more accurate predictions will be preferred by a user or will yield better results for the commercial system that deploys it. Accuracy measurements are very different when a system predicts user opinions (such as ratings) or probabilities of usage (e.g. purchase).
Accuracy evaluations can be completed using offline analysis, controlled live user experiments [94], or a combination of the two. In offline evaluation, the algorithm is used to predict certain withheld values from a dataset, and the results are analyzed using one or more of the metrics discussed in the following section. Offline evaluations are inexpensive and quick to conduct, even on multiple datasets or recommendation systems at the same time. Datasets including timestamps may be used to replay usage (ratings and recommendations) scenarios: every time a new rating or usage decision is made by a user, it is compared with the prediction based on the prior data about that user.
5.3 Instruments for offline measuring the accuracy of usage
predictions
During offline evaluation, a dataset is typically available consisting of the items used by each user. A typical test consists of selecting a test user, then hiding some of the selections and asking the recommendation system to predict a set of items that the user will use, based on the remaining selections. The recommended and hidden items may produce 4 different outcomes, as shown in Table 5-1.
             Recommended              Not Recommended
Used         True Positives (TP)      False Negatives (FN)
Not used     False Positives (FP)     True Negatives (TN)
Table 5-1 Classification of the possible results of a recommendation of an item to a user
The test may be more sophisticated when the items selected by a user are qualified by timestamps, as is the case for retailers tracking recurrent visits from customers (e.g. Amazon.com). In that case, a user's items can be revealed to the recommendation system in the actual chronological order.
5.3.1 Accuracy measurements for a single user
Upon counting the number of items in each cell of Table 5-1, the following quantities can be computed:
Precision = TP / (TP + FP)    (7.1)
Recall = TP / (TP + FN)    (7.2)
Precision and Recall were introduced in [95] as key metrics. These metrics started being used for the evaluation of recommendation systems in 1988 in [96] and later in [97]. Precision represents the probability that a selected item is relevant, while Recall represents the probability that a relevant item will be selected. Relevance is, in the case of recommender systems, a subjective concept, as the test user is the only person who can decide whether a recommendation meets their requirements, and the transaction record is the only information available about that user's decisions.
Precision and Recall are inversely related, as shown in [95]: while allowing longer recommendation lists typically improves recall, it is likely to reduce precision. Several approaches have been taken to combine precision and recall into a single metric. One approach is the F1 metric, introduced in [98], then used as a classifier metric in [99] and used for recommendation systems in [97], defined as below:
F1 = 2 · Precision · Recall / (Precision + Recall)    (7.3)
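As an illustration of metrics (7.1)-(7.3), the short sketch below computes precision, recall and F1 for a single test user from the set of recommended items and the set of hidden (withheld) items; the function and variable names are illustrative.

# Precision, recall and F1 for one test user, computed from the recommended
# item set and the hidden (withheld) item set of that user.

def precision_recall_f1(recommended, hidden):
    recommended, hidden = set(recommended), set(hidden)
    tp = len(recommended & hidden)                      # recommended and used
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(hidden) if hidden else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 2 of the 5 recommendations were actually used, out of 4 hidden items.
print(precision_recall_f1(["i1", "i2", "i3", "i4", "i5"], ["i2", "i5", "i7", "i9"]))
# (0.4, 0.5, 0.444...)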
In certain applications, the number of recommendations that can be presented to a user is predefined. For such applications, the measures of interest are Precision and Recall at N, where N is the number of presented recommendations. For other applications, the number is not predefined, or an optimal value needs to be determined. For the latter, curves can be computed for these metrics over various numbers of recommendations. Such curves may compare precision to recall, or true positive to false positive rates.
The true positive/false positive curves, also known as ROC curves, are more commonly used. ROC curves were introduced in 1969 in [100] under the name of Relative Operating Characteristics, but are more commonly known under the name Receiver Operating Characteristics, which evolved from their use in signal detection theory (see [101]). An example of an ROC curve, plotting True Positives against False Positives, is shown in Figure 5-1. The curve is obtained, for a test user, by sorting the ranked recommendations in descending order of confidence. Then, for each predicted item, starting at the origin of the diagram, one of the following actions is executed (a computational sketch follows the list):
a) If the item is indeed relevant (e.g. used by the user, part of the hidden user items), then draw the curve one step vertically up.
b) If the item is not relevant (not part of the hidden items), draw the curve one step horizontally to the right.
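The construction described in steps a) and b) can be sketched as follows; recommendations are assumed to be already ranked by descending confidence, and the coordinates are normalized so that both axes end at 1. The names are illustrative.

# Construction of the ROC curve points for one test user: each relevant item
# moves the curve up, each non-relevant item moves it to the right.

def roc_points(ranked_items, hidden_items):
    hidden = set(hidden_items)
    relevant_total = len(hidden)
    non_relevant_total = len([i for i in ranked_items if i not in hidden])
    x = y = 0
    points = [(0.0, 0.0)]
    for item in ranked_items:                      # descending order of confidence
        if item in hidden:
            y += 1                                 # relevant: one step vertically
        else:
            x += 1                                 # not relevant: one step to the right
        points.append((x / max(non_relevant_total, 1),
                       y / max(relevant_total, 1)))
    return points

# Example: 5 ranked recommendations, 2 of which were actually used by the user.
print(roc_points(["i3", "i8", "i1", "i5", "i9"], ["i8", "i9"]))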
Figure 5-1 Example of ROC curve, plotting the percentage of relevant items (ordinate) against the percentage of non-relevant items (abscissa)
A perfect predictive system will generate an ROC curve that goes straight up until 100% of the relevant items have been encountered, then straight to the right for the remaining items. For multiple recommender systems, multiple ROC curves can be plotted, one for each algorithm. If one curve completely dominates the others, it is easy to pick the best system. When the curves intersect, the decision depends on the application requirements. For example, an application that can only expose a small number of recommendations may choose the curve that is dominant in the left side of the ROC chart. Hanley
and McNeil, in [101], propose Area under Curve as a measure for comparing
implementations independently of application.
5.3.2 Accuracy Measurements for Multiple Users
We presented, in the previous section, some of the metrics used to
measure the accuracy of usage predictions for individual test users in offline
experiments. A number of strategies have been developed to aggregate the
results across test populations.
For applications that expose fixed length N recommendation lists, the
average precision and recall can be computed across the test population (at
length N), as shown in [97].
This aggregation approach is used in [102] to introduce an aggregated ROC
curve, computed over multiple users, using the same fixed number of
recommendations, called Customer ROC (CROC).
A special class of applications consists of those where the recommendation process is more interactive, and users are allowed to obtain more and more recommendations. Such applications can be modeled, in offline experiments, when a timestamp is associated with each item ever used by any test user. An ROC curve can be computed, in such a test, for each user. The number of recommendations requested for each user depends on the number of items used, in the test dataset, by that user. Certain competitions, such as TREC's (Text Retrieval Conference) [103], compute ROC or precision/recall curves in this manner.
5.4 The Itemized Accuracy Curve
The accuracy measurements for recommendation systems described in the previous section are commonly used in academic competitions or to evaluate new systems. However, they are not commonly used in data mining products. While lift charts, classification ROC diagrams and scatter plots are common for classification and regression algorithms, most products do not offer a built-in tool for comparing recommendation systems, such as association rules models.
We propose a new instrument, introduced in [16], for evaluating the quality
of usage prediction on offline datasets. This instrument consists of a family
of curves that can be used to compare the recall for each individual item in
an item catalog for a family of recommendation systems.
The itemized accuracy curve was developed from a product need to present users with an easy-to-understand diagram which allows comparing recommendation systems as easily as cumulative gain charts allow comparing classification models.
A top-N recommender is a recommendation system configured to return the top N most likely items for each input. In an industry setting, such a recommender takes as input information about one user and outputs the N items that the system predicts will be preferred by that user. A simple top-N recommender is the Most-Frequent N-Item Recommender: it simply returns the N items that appear most frequently in a transaction database. A more sophisticated top-N recommender may be an association rules engine, which looks at all the user properties specified as input, extracts the rules whose antecedents match the input, and sorts their consequents by a certain rule measure, such as probability, importance, lift etc.
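A minimal sketch of the Most-Frequent N-Item Recommender is shown below; the transaction layout and names are illustrative assumptions.

# Most-Frequent N-Item Recommender (MFnR): returns the N items that appear most
# frequently in the transaction database.

from collections import Counter

def most_frequent_recommender(transactions, n):
    counts = Counter(item for transaction in transactions for item in transaction)
    top_n = [item for item, _ in counts.most_common(n)]
    def recommend(basket):
        # the MFnR ignores the basket content; a practical variant might filter
        # out items already present in the basket
        return top_n
    return recommend

# Example usage:
transactions = [{"Milk", "Bread"}, {"Bread", "Butter"},
                {"Milk", "Bread", "Butter"}, {"Apples", "Butter", "Bread", "Pears"}]
recommend = most_frequent_recommender(transactions, n=2)
print(recommend({"Milk"}))  # ['Bread', 'Butter']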
Let M denote the aggregated accuracy measure yielded by the top-N recommender being evaluated, M_min a minimum acceptable baseline measure and M_max the maximum theoretical baseline measure, each obtained by aggregating an itemized (per-item) accuracy measure over the item catalog I.
The Lift describes the performance of the current top-N recommender as compared against the minimum acceptable baseline measure (and the improvement on top of M_min). The Area Under Curve describes the performance of the current top-N recommender as compared against the maximum theoretical baseline measure (and the improvement on top of M_max). If the minimum baseline measure is associated with a baseline recommender, then the lift of that recommender is by definition 1, regardless of the value of n.
Similarly, the Area Under Curve metric is less than or equal to 1 (as it represents the ratio to the theoretical maximum value of the accuracy measure) and, if the maximum baseline measure is associated with a recommender, then that recommender's Area Under Curve is by definition 1, regardless of the value of n.
Note that the Area Under Curve aggregation is not the homonymous metric associated with ROC curves, although it shares some of its properties, such as being upper-bounded by 1 or associating a value of 1 with an ideal model.
For practical purposes, the minimum baseline measure need not be worse than the measure yielded by the Most-Frequent n-Item Recommender (MFnR). A few reasons for using the MFnR as a minimum baseline include:
- It is practically a zero-cost recommendation system, in terms of implementation costs
- It is commonly used in industry when a more sophisticated recommendation system is not available (e.g. "Would you like fries with that?" in any fast-food restaurant)
For the aforementioned reasons, we will use MFnR as the minimum
recommender in the rest of this chapter. An interesting property of the
MFnR recommender is that its accuracy grows with n.
Lemma 5.1 The number of True Positives of the MFnR recommender grows, and the number of False Negatives decreases, with the value of the n parameter, until n reaches the cardinality of the itemset.
Rationale:
Let X = |I| be the cardinality of the item catalog. For any given n < X, the following properties derive from the definitions of True Positives and False Negatives:
TP(n) ≤ TP(n + 1)    (7.4)
FN(n) ≥ FN(n + 1)    (7.5)
In fact, when n reaches X, the number of False Negatives becomes 0, every used item becomes a True Positive, and the MFnR becomes an optimal recommender with regard to the True Positive and False Negative measures.
The itemized accuracy curve is obtained by plotting the accuracy measure computed for each individual item in the item catalog I (as ordinate). The sort order of the items on the abscissa improves the clarity of the diagram. For example, sorting the items in I, the item catalog, in descending order of the itemized metric computed for the maximum theoretical baseline measure may give a good intuitive perspective on the performance of the recommender being analyzed.
5.4.1 A visual interpretation of the itemized accuracy curve
Figure 5-2 Itemized Accuracy Curve for a top-N recommender
Figure 5-2 presents such an itemized accuracy curve. The upper line represents the itemized metric computed for the maximum theoretical baseline measure for each item, while the lower line is the same metric computed for the top-N recommender being evaluated.
The aggregations of the itemized metrics are equivalent to integrating the measure over the item catalog I. Therefore, the aggregated measures of Lift and Area Under Curve can be defined as
Lift = M / M_min,    Area Under Curve = M / M_max
Both aggregations become, therefore, ratios between areas under curve for the graphs defined by the itemized metrics of different recommenders.
5.4.2 Impact of the N parameter on the Lift and Area Under Curve
measures
An interesting aspect of the Lift and Area Under Curve metrics is that they allow comparing different values of N, the number of recommendations produced by the recommendation system. In an e-commerce implementation of the recommendation system, the number of recommendations presented on the screen must be a trade-off between the potential value of the recommendations and that of other page elements (such as advertisements) which may compete for the same page real estate as the recommendations. It is, therefore, useful to analyze the value (in terms of Lift and Area Under Curve) of various values for N, the number of recommendations being presented.
Figure 5-3 Evolution of Lift and Area Under Curve for different values of N
Figure 5-3 presents the evolution of the Lift and Area Under Curve measures for a top-N recommender as the value of N changes from 1 to 100.
The horizontal line at ordinate 1 is the minimum baseline Lift, associated with the MFnR minimum baseline. The upper line (on top of the baseline lift) presents the lift yielded by the top-N recommender. As shown previously, in Lemma 5.1, the MFnR recommender's accuracy grows with N, so the lift of the top-N recommender decreases as N grows.
The lines in the lower part of the diagram represent the evolution of the Area Under Curve measure with the growth of the N parameter. The Area Under Curve of an ideal recommender is by definition 1, while the Area Under Curve values associated with the MFnR recommender, as well as with the top-N recommender being evaluated, grow toward 1, reaching it, in the worst case, when N reaches the cardinality of the itemset.
5.5 An Implementation for the Itemized Accuracy Curve
5.5.1 Accuracy measures
We found the number of True Positives (and certain derivatives) to be a convenient choice for the accuracy measure of each individual item in the item catalog I. It is a simple additive measure, which can be summed up across the transaction space as well as across the item space.
As exemplified previously, we consider an ideal predictor as the source for the M_max aggregation, therefore a predictor that produces zero False Negatives. The difference between M and M_max is, therefore, the number of False Negatives produced by the recommendation system M being assessed.
A consequence of this choice is that the Area Under Curve aggregated measure is exactly the recall associated with the recommendation system.
A related additive measure that can be used is the catalog value associated with an item, defined for each item i as Value(i)·TP(i). This allows for a more flexible estimation of the value proposed by the recommender.
5.5.2 Real data test strategies
The test dataset D consists of transactions t ∈ D defined as tuples t = (C_t, I_t), where C_t are the transaction-specific properties, while I_t is the set of items known to be included in the transaction and which should be tested against the actual recommendations. Testing for an item i ∈ I_t consists of presenting the recommender with a transaction t_i as input; t_i derives from t but does not include the item i. Two different ways to construct t_i from t are described below.
The simplest strategy is to treat each transaction as a bag of items. In that case, the test for that transaction is performed by successively leaving each item i ∈ I_t out and requesting a recommendation from the target system.
A more elaborate strategy may take into account a timestamp associated with the moment when an item was added to a transaction. In this case, a possible strategy is to create the test input t_i by including, besides the characteristic transaction properties, only the items that appeared in the transaction before item i (chronologically). This approach may be more realistic for certain e-commerce scenarios; a sketch of both constructions is given below.
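The following sketch illustrates the two ways of constructing the test inputs t_i from a transaction t = (C_t, I_t); the tuple layout and the optional timestamps are illustrative assumptions.

# Constructing test inputs t_i from a transaction t = (C_t, I_t).
# Bag-of-items strategy: leave each item out in turn.
# Chronological strategy: reveal only the items added before item i.

def leave_one_out_inputs(properties, items):
    """Yield (test_input, hidden_item) pairs, treating the transaction as a bag of items."""
    items = list(items)
    for i, hidden in enumerate(items):
        revealed = items[:i] + items[i + 1:]
        yield (properties, revealed), hidden

def chronological_inputs(properties, timestamped_items):
    """Yield (test_input, hidden_item) pairs, revealing only items added before the hidden one."""
    ordered = [item for _, item in sorted(timestamped_items)]
    for i, hidden in enumerate(ordered):
        yield (properties, ordered[:i]), hidden

# Example usage:
for test_input, hidden in chronological_inputs({"customer": "SO51178"},
                                                [(1, "Milk"), (2, "Bread"), (3, "Butter")]):
    print(test_input, "->", hidden)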
5.5.3 The algorithm for constructing the Itemized Accuracy Curve
The algorithm, presented below, uses a test population to compute counts of true positive and false negative recommendations. The number of recommendations to be presented to a user is an algorithm parameter.
The algorithm collects the number of occurrences and the number of True Positive recommendations for each item in the catalog in two item-indexed structures, GlobalCounts and TruePositives.
When the iteration is complete, the metrics of interest can be computed as:
- M: the sum of the TruePositives counts
- M_max: the sum of the GlobalCounts values
- M_min: the sum of those GlobalCounts values whose indices are in the top N most popular items
Note that a frequency table for the most popular items can be computed in the same iteration. This algorithm does not compute the frequency table, as real-world database systems may have more efficient ways of returning the top N most popular values in a table column.
Parameters:
    n - number of recommendations to be presented
    D - test set of transactions
Initialization:
    Initialize GlobalCounts, TruePositives - item-indexed vectors of counts,
    initialized to 0
For each transaction T_x = (C_x, I_x) in the test dataset D
    For each item i in I_x
        Increment GlobalCounts[i]
        Let T_xi = (C_x, I_x \ {i})
        Let R_n^i = TopRecommendations(n, T_xi)
        If i ∈ R_n^i then
            Increment TruePositives[i]
IterationEnd: compute the aggregated metrics
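A compact Python rendering of this iteration, together with the aggregated Lift and Area Under Curve ratios defined in Section 5.4, is sketched below; the recommender is passed in as a function, transactions are assumed to carry their items as a set, and all names are illustrative.

# Itemized accuracy counts: for each item in the catalog, count its occurrences
# (GlobalCounts) and the number of times it was correctly recommended
# (TruePositives) when hidden from its transaction.

from collections import Counter

def itemized_accuracy_counts(transactions, recommend, n):
    global_counts, true_positives = Counter(), Counter()
    for properties, items in transactions:            # each transaction T_x = (C_x, I_x)
        for i in items:
            global_counts[i] += 1
            test_input = (properties, items - {i})    # T_xi = (C_x, I_x \ {i})
            if i in recommend(test_input, n):         # top-n recommendations for T_xi
                true_positives[i] += 1
    return global_counts, true_positives

def aggregate(global_counts, true_positives, n):
    # M is the sum of TruePositives, M_max the sum of GlobalCounts, and M_min the
    # sum of GlobalCounts restricted to the top N most popular items.
    m = sum(true_positives.values())
    m_max = sum(global_counts.values())
    m_min = sum(count for _, count in global_counts.most_common(n))
    lift = m / m_min if m_min else 0.0
    area_under_curve = m / m_max if m_max else 0.0
    return lift, area_under_curve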
The algorithm traverses the space of test transactions and executes one
recommendation request for each item to be tested. The complexity of the
algorithm is, therefore