

Transilvania University of Brașov


Faculty of Electrical Engineering and Computer
Science


Applications of computational
intelligence in data mining
By
Ioan Bogdan CRIVAT

A thesis submitted in partial
fulfillment of the requirements for
the degree of
PhD

Advisor: Prof. Univ. Dr.
Razvan Andonie
Brasov, 2011

Abstract
The objective of this work is a synthesis of some of the recent efforts in
the domain of predictive and associative rules extraction and processing
as well as a presentation of certain original contributions to the area.
The first two chapters of the thesis present data mining and some
recent results in the area of rule extraction. The second chapter,
Rules in the Data Mining Context introduces data mining with a focus
on rule extraction. We discuss association rules and their properties as
well as some notions of fuzzy modeling and fuzzy rules. The third
chapter, Methods for Rules Extraction, presents the most commonly
used methods for extracting rules. A special section describes the
specifics of rules analysis in Microsoft SQL Server. The following chapters
contain some original contributions in their context. The fourth chapter,
Contributions to Rules Generalization, reviews some of the existing
methods for simplifying rule models, and focuses on measures for
detecting rules similarity. Similar rules can be merged, resulting in
simpler rule systems. The fifth chapter, Measuring the Usage
Prediction Accuracy of Recommendation Systems, presents the area of
accuracy measurements for recommendation systems, one of the most
common applications of association rules. A new instrument for
assessing the accuracy of a recommender is presented, together with
some experimental results. The sixth chapter presents some
experimental results for the techniques introduced in the fourth and
fifth chapters. The results are detailed for datasets used in presenting
the methods or compared against results from other authors. The last
chapter contains conclusions of this thesis as well as certain directions
for further research.

Contents
Contents .......................................................................................................... iii
List of figures ................................................................................................... 1
Acknowledgments........................................................................................... 2
Publications, Patents and Patent Applications by the Author ........................ 3
Books ......................................................................................................... 3
Articles ...................................................................................................... 3
Issued Patents (United States Patents and Trademark Office) ................ 3
Pending patent applications (United States Patents and Trademark
Office) ........................................................................................................ 4
1 Introduction .............................................................................................. 5
1.1 Objectives...................................................................................... 5
1.2 Contributions ................................................................................ 5
1.3 The Structure of this Thesis .......................................................... 8
2 Rules in the Data Mining context............................................................ 10
2.1 Data mining in industry: an overview ........................................ 10
2.2 Data Mining Problems, Tasks and Processes .............................. 11
2.2.1 Business Problems .................................................................................... 11
2.2.2 Implementation Tasks ............................................................................... 13
2.2.3 Data Mining Project Cycle ......................................................................... 14
2.3 Rules in Data Mining ................................................................... 17
2.3.1 Association Rules ...................................................................................... 18
2.3.2 Classifications of association rules ............................................................ 20
2.3.3 The Market Basket Analysis problem ....................................................... 21
2.3.4 Itemsets and Rules in dense representation ............................................ 23
2.3.5 Equivalence of dense and sparse representations ................................... 24
2.4 Fuzzy Rules .................................................................................. 27
2.4.1 Conceptualizing in Fuzzy Terms ................................................................ 27
2.4.2 Fuzzy Modeling ......................................................................................... 28
3 Methods for Rule Extraction ................................................................... 31
3.1 Extraction of Association Rules ................................................... 31
3.1.1 The Apriori algorithm ................................................................................ 31
3.1.2 The FP-Growth algorithm ......................................................................... 35
3.1.3 Other algorithms and a performance comparison ................................... 38
3.1.4 Problems raised by Minimum Support itemset extraction
systems ................................................................................................................. 40
3.2 An implementation perspective: Support for association
analysis in Microsoft SQL Server 2008 ................................................. 45
3.3 Rules as expression of patterns detected by other algorithms .. 50
3.3.1 Rules based on Decision Trees .................................................................. 51
3.3.2 Rules from Neural Networks ..................................................................... 52
4 Contributions to Rule Generalization ..................................................... 59
4.1 Fuzzy Rules Generalization ......................................................... 59
4.1.1 Redundancy .............................................................................................. 60
4.1.2 Similarity ................................................................................................... 61
4.1.3 Interpolation based rule generalization techniques ................................ 62
4.2 Rule Model Simplification Techniques ....................................... 63
4.2.1 Feature set alterations .............................................................................. 63
4.2.2 Changes of the Fuzzy sets definition ........................................................ 65
4.2.3 Merging and Removal Based Reduction ................................................... 65
4.3 Similarity Measures and Rule Base Simplification ...................... 66
4.4 Rule Generalization ..................................................................... 70
4.4.1 Problem and context................................................................................. 71
4.4.2 The rule generalization algorithm ............................................................ 72
4.4.3 Applying the RGA to an apriori-derived set of rules ................................. 76
4.5 Conclusion ................................................................................... 79
4.5.1 Future directions for the basic rule generalization
algorithm ............................................................................................................... 80
4.5.2 Further work for the apriori specialization of the RGA ............................ 84
5 Measuring the Usage Prediction Accuracy of Recommendation
Systems ......................................................................................................... 85
5.1 Association Rules as Recommender Systems ............................. 86
5.2 Evaluating Recommendation Systems ........................................ 86
5.3 Instruments for offline measuring the accuracy of usage
predictions .............................................................................................. 88
5.3.1 Accuracy measurements for a single user ................................................ 89
5.3.2 Accuracy Measurements for Multiple Users ............................................ 92
5.4 The Itemized Accuracy Curve ...................................................... 93
5.4.1 A visual interpretation of the itemized accuracy curve ............................ 98
5.4.2 Impact of the N parameter on the Lift and Area Under
Curve measures .................................................................................................... 99
5.5 An Implementation for the Itemized Accuracy Curve .............. 101
5.5.1 Accuracy measures ................................................................................. 101
5.5.2 Real data test strategies ......................................................................... 102
5.5.3 The algorithm for constructing Itemized Accuracy Curve ...................... 103
5.6 Conclusions and further work ................................................... 104
6 Experimental Results ............................................................................ 107
6.1 Datasets used in this material................................................... 107
6.1.1 IC50 prediction dataset ............................................................................ 107
6.1.2 Movies Recommendation ....................................................................... 108
6.1.3 Movie Lens .............................................................................................. 109
6.1.4 Iris ............................................................................................................ 109
6.2 Experimental results for the Rule Generalization algorithm .... 110
6.2.1 Rule set and results used in Section 4.4 on generalization ................... 110
6.2.2 Results for the apriori post-processing algorithm .................................. 112
6.3 Experimental results for the Itemized Accuracy Curve ............ 113
6.3.1 Movie Recommendation Results ............................................................ 115
6.3.2 Movie Lens Results ................................................................................. 116
7 Conclusions and directions for further research .................................. 118
7.1 Conclusions ............................................................................... 118
7.2 Further Work ............................................................................. 119
Appendix A: Key Algorithms ....................................................................... 122
Apriori ................................................................................................... 122
FP-Growth ............................................................................................. 124
Bibliography ................................................................................................ 126


List of figures
Figure 2-1 The CRISP-DM process .................................................................................... 17
Figure 2-2 Standard types of membership functions (from (20) ) .................................... 28
Figure 3-1: Finding frequent itemsets .............................................................................. 34
Figure 3-2 An FP-Tree structure ........................................................................................ 36
Figure 3-3 A mining case containing tabular features ...................................................... 46
Figure 3-4 A RDBMS representation of the data supporting mining cases
with nested tables ................................................................................................. 47
Figure 3-5 Using a structure nested table as source for multiple model
nested tables ......................................................................................................... 50
Figure 3-6 A decision tree built for rules extraction (part of a SQL Server
forest) .................................................................................................................... 52
Figure 3-7 An artificial neural network ............................................................................. 53
Figure 4-1 - Creating a fuzzy set C to replace two similar sets A and B ............................ 69
Figure 4-2 Merging of similar rules ................................................................................... 70
Figure 4-3 A visual representation of the RGA ................................................................. 74
Figure 4-4 A finer grain approach to rule generalization ................................................. 80
Figure 4-5 Accuracy of a fuzzy rule as a measure of similarity with the
universal set .......................................................................................................... 83
Figure 5-1 Example of ROC Curve ..................................................................................... 91
Figure 5-2 Itemized Accuracy Curve for a top-N recommender ....................................... 98
Figure 5-3 Evolution of Lift and Area Under Curve for different values of N ................. 100
Figure 5-4 Aggregated Itemized Accuracy Curve based on the Movie
Recommendations dataset (for N=5 recommendations) ................................... 105
Figure 6-1 Itemized Accuracy Chart for n=3 (Movie recommendations) ....................... 114
Figure 6-2 Evolution of Lift for various values of N for test models (Movie
Recommendations dataset) ................................................................................ 116
Figure 6-3 Evolution of Lift for various values of N for test models (Movie
Lens dataset) ....................................................................................................... 117



Acknowledgments
I would like to express my deepest gratitude to Prof. Dr. Răzvan Andonie for
his guidance, patience and encouragements. Above all, I would like to thank
him for rekindling my passion for academic research after years of industrial
experience.
Deep thanks also go to the Faculty of Electrical Engineering and Computer
Science at the Transilvania University for their help and advice with the
intermediate steps of the doctoral research as well as to Dr. Daniela Drăgoi,
always a tremendous help for the doctoral program procedures.
I am also grateful to the amazing people that I met in my academic life,
particularly to Prof. Petru Moroșanu and Prof. Dr. Tudor Bălănescu, and to
the wonderful colleagues at Microsoft Corporation and Predixion Software,
for their friendship, knowledge and experience.
At last, but certainly not least, my heartfelt thanks go to my family, Irinel
and Cosmin, for their most consistent help and support.



Publications, Patents and Patent Applications by the
Author
Books
1. MacLennan Jamie, Crivat Bogdan and Tang ZhaoHui Data Mining
with Microsoft SQL Server 2008 [Book]. - Indianapolis, Indiana,
United States of America : Wiley Publishing, Inc., 2009. - 978-0-470-
27774-4.
2. Crivat Bogdan, Grewal Jasjit Singh, Kumar Pranish and Lee Eric ATL
Server: High Performance C++ on .Net [Book]. Berkeley, CA, United
States of America : APress, Inc., 2003. - 1-59059-128-3.

Articles
3. Andonie Razvan, Crivat B [et al.] Fuzzy ARTMAP rule extraction in
computational chemistry [Conference] // IJCNN. - 2009. - pp. 157-
163. - DOI: 10.1109/IJCNN.2009.5179007.
4. Crivat, Ioan Bogdan SQL Server Data Mining Programmability [Online]
March 2005 [Cited: 6 22, 2011.] http://msdn.microsoft.com/en-
US/library/ms345148(v=SQL.90).aspx.

Issued Patents (United States Patents and Trademark Office)

5. Crivat Ioan B, Petculescu Cristian and Netz Amir Explaining changes
in measures thru data mining [Patent] : 7899776. - United States of
America, 2011.
6. Crivat Ioan B, Petculescu Cristian and Netz Amir Random access in
run-length encoded structures [Patent] : 7952499. - United States of
America, 2011.
7. Crivat Ioan B, Iyer Raman and MacLennan C James Detecting and
displaying exceptions in tabular data [Patent] : 7797264. - United
States of America, 2010.

8. Crivat Ioan B, Iyer Raman and MacLennan C. James Dynamically
detecting exceptions based on data changes [Patent] : 7797356. -
United States of America, 2010.
9. Crivat Ioan B, Iyer Raman and MacLennan James Partitioning of a
data mining training set [Patent] : 7756881. - United States of
America, 2010.
10. Crivat Ioan B, Petculescu Cristian and Netz Amir Efficient Column
Based Data Encoding for Large Scale Data Storage [Patent] :
20100030796 . - United States of America, 2010.
11. Crivat Ioan B. [et al.] Extensible data mining framework [Patent] :
7383234. - United States of America, 2008.
12. Crivat Ioan Bogdan [et al.] Systems and methods that facilitate data
mining [Patent] : 7398268. - United States of America, 2008.
13. Crivat Ioan, B [et al.] Using a rowset as a query parameter [Patent] :
7451137. - United States of America, 2008.
14. Crivat Ioan, B, MacLennan C, James and Iyer Raman Goal seeking
using predictive analytics [Patent] : 7788200. - United States of
America, 2010.
15. Crivat Ioan, Bogdan [et al.] Unstructured data in a mining model
language [Patent] : 7593927. - United States of America, 2009.
16. Crivat Ioan, Bogdan, Cristofor Elena, D. and MacLennan C. James
Analyzing mining pattern evolutions by comparing labels,
algorithms, or data patterns chosen by a reasoning component
[Patent] : 7636698. - United States of America, 2009.
17. Crivat Bogdan [et al.] Systems and methods of utilizing and
expanding standard protocol [Patent] : 7689703. - United States of
America, 2010.

Pending patent applications (United States Patents and
Trademark Office)
18. Crivat Ioan Bogdan [et al.] Techniques for Evaluating
Recommendation Systems [Patent Application] : 20090319330 -
United States of America, 2009.


1 Introduction
1.1 Objectives
The objective of this work is a synthesis of some of the recent efforts in the
domain of predictive and associative rules extraction and processing as well
as a presentation of certain original contributions to the area.
As used in this work, data mining is the process of analyzing data in order to
find hidden patterns using automatic methodologies. Due in part to major
computational advances in the last decades, extensive research in the area
of data mining led to development of many classes of pattern extraction
algorithms. These algorithms are often employed in systems that yield high
accuracy predictions but the patterns detected by such algorithms are,
more often than not, difficult to interpret.
A direct consequence of this difficulty is the high barrier data mining
encounters on its way to acceptance in the common information worker's toolset.
The author spent most of the last decade as one of the principal
designers and implementers of the Microsoft SQL Server Data Mining
platform, a product with the goal of making data mining more accessible to
information workers. This work is strongly influenced by this industrial
perspective.
1.2 Contributions
This work synthesizes the original contributions of the author over a period
of time longer than the actual doctoral studies, as illustrated by the author's
patents and publications: [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11],
[12], [13], [14], [15]. This thesis does not include all of these works, but it
certainly builds on them.
Rule systems are collections of easily understandable patterns which often
can be translated to plain language statements. Significant research starting
around 1990 aimed to employ rule systems in data mining. This research
produced multiple algorithms for rule extraction as well as many techniques
for converting other patterns to rule sets.
Chapters 2 and 3 describe the area of rules mining. The author had an
extensive activity in this area, leading to the Data Mining with Microsoft
SQL Server [2] volume as well as to the materialization of some of the
author's patents, such as [8], [13], [14], [7] and [9]. The volume has been
translated to Russian and Chinese, becoming the reference work for the
users of SQL Server Data Mining. Sections of chapters 2 and 3 are based
on this volume.

In Chapter 4 we investigate some of the efforts for simplifying rule sets and
generalizing rules. The original contributions were initially published in the
Proceedings of the 2009 edition of the International Joint Conference on
Neural Networks, an IEEE conference, and received the Best Poster Award Runner-
up distinction. Contributions, presented in Section 4.4, include:

- Subsection 4.4.2: A novel method for post-processing a set of rules
in order to improve its generalization capability. The method is
developed specifically for rules extracted from a fuzzy ARTMAP
incremental learning system used for classification, hence for rule
generated indirectly (as Fuzzy ARTMAP does not directly produce
rules).
- Subsection 4.4.3: An extension of the aforementioned method to
rule sets from common rule extraction algorithms, such as apriori.
Experimental results suggest a 5 to 10 times reduction in size for the
rule set, essentially by removing redundant rules.
- Property 4.1: a theoretical result, introduced and proven in
Subsection 4.4.3, which describes the equivalence of generalizations
for predictive and associative rules and justifies the previously
mentioned extension.

Chapter 5 treats the area of measuring the accuracy of recommendation
systems. The original contributions were introduced, as a patent, in [16]
(Techniques for Evaluating Recommendation Systems). Contributions
include:
- Section 5.4: The itemized accuracy curve, a novel instrument for
evaluating the quality of recommendation systems. This diagram
offers some advantages over the existing accuracy curves,
advantages presented in Section 5.6

- Lemma 5.1 (in the same section) is an original theoretical result
which justifies using a certain recommendation system (MFnR, the
Most Frequent n-Items Recommender) as a baseline for comparing
other recommendation systems.
- Section 5.5: Implementation details and optimization suggestions
for the algorithm which computes the Itemized Accuracy Curve,
intended for large data sets.

1.3 The Structure of this Thesis
The second chapter, Rules in the Data Mining context, introduces
data mining with a focus on rule extraction. The CRISP-DM standard for the
life cycle of a data mining project is described, together with some business
problems commonly approachable by data mining techniques. We focus,
then, on rules in data mining. We discuss association rules and their
properties as well as some notions of fuzzy modeling and fuzzy rules.
The third chapter, Methods for Rule Extraction, presents the most
commonly used methods for extracting rules. We start by presenting some
algorithms designed specifically for rule extraction, such as apriori and FP-
Growth. We discuss some of the problems raised by these algorithms as
well as solutions identified for those problems. Next, we present some
techniques for extracting rules from patterns detected by other algorithms
and focus on rule extractions from neural networks, a topic of significant

interest in the next chapter. A special section describes the specifics of rules
analysis in Microsoft SQL Server.
The fourth chapter, Contributions to Rule Generalization, reviews
some of the existing methods for simplifying rule models, and focuses on
measures for detecting rules similarity. Similar rules can be merged,
resulting in simpler rule systems. By interpreting one of these similarity
measures from the data mining rules analysis perspective, a novel
generalization method is proposed, which reduces the complexity of certain
rule sets and improves the interpretability of the model. The method,
introduced for very specific rules extracted from a Fuzzy ARTMAP predictor,
is extended for rule sets discovered by algorithms such as apriori.
The fifth chapter, Measuring the Usage Prediction Accuracy of
Recommendation Systems, presents the area of accuracy measurements
for recommendation systems, one of the most common applications of
association rules. A new instrument for assessing the accuracy of a
recommender is presented, together with some experimental results.
The sixth chapter, Experimental Results, presents some experimental
results for the techniques introduced in the 4th and 5th chapters. The results
are detailed for datasets used in presenting the methods or compared
against results from other authors.
The last chapter contains conclusions of this thesis as well as certain
directions for further research.

2 Rules in the Data Mining context
The term data mining has been used, in various publications, to mean
anything from ad-hoc queries [17] or pivot chart analysis [18] to
government domestic spying programs [19]. As used in this work, data
mining is the process of analyzing data to find hidden patterns using
automatic methodologies. Parts of this process may be referred to, in cited
works, as machine learning, predictive analytics or knowledge discovery in
databases (KDD).
In this chapter we describe the main problems addressed by data
mining as well as the CRISP-DM standard for the life cycle of a data mining
project. We focus, then, in Section 2.3, on rules in data mining. We discuss
association rules and their properties. Section 2.4 discusses rules from a
different perspective, that of fuzzy modeling.
2.1 Data mining in industry: an overview
Over the past several decades, computing power, growing according to
Moore's law [20], and significant advances in storage technology (following
the so-called Kryder's Law) [21] led to the production of unprecedented data
volumes. Various industry and academic studies [22,23] estimate the
amount of data produced or consumed in the world to be of the order of
zettabytes (1 ZB = 10³ EB = 10⁹ TB = 10²¹ bytes). A large chunk of this data is
produced by business software, such as enterprise resource planning (ERP)
systems, customer relationship management (CRM) systems or database
servers. The result of this tremendous data production is that organizations
are rich in data and poor in knowledge. The sheer vastness of the data

collections makes practical use of the data stores limited. The main purpose
of data mining is to extract non-trivial, previously unknown and potentially
useful knowledge from the data [24].
2.2 Data Mining Problems, Tasks and Processes
Data Mining can be used in virtually all business applications, answering
various types of questions. Data mining technology was introduced initially
in highly specialized, dedicated software packages targeting highly skilled
specialists. More recently, though, data mining technology is perceived as a
commodity of any serious business intelligence platform. With a large offering
of data mining software integrated in major spreadsheet products, such as
11Ants Model Builder [25], Predixion Insight [26] or the SQL Server Data Mining
add-ins [27], all one needs is the motivation and a bit of know-how to apply
data mining to one's business data.
Unless otherwise specified, the material in this chapter is based on our
previously published volume [2].
2.2.1 Business Problems
Examples of common business data mining problems include:
- Recommendation generation -- What products or services should
one offer customers? Generating recommendations is an important
business challenge for retailers and service providers. Customers
provided with accurate and timely recommendations are likely more
valuable (because they purchase more) and more loyal (they feel a
stronger relationship to the vendor). Such recommendations are

derived from using data mining to analyze the purchase behavior of
the retailer's other customers.
- Anomaly detection -- How to detect whether data is good or not?
Data Mining can be employed to analyze data, detect patterns that
govern most of the data then pick out those items that do not fit the
rest by not matching the common patterns. Credit card companies
frequently use data mining to determine whether a particular
transaction is valid (typical) or likely to be a fraud. Insurance
companies use anomaly detection to determine whether claims are
likely to be fraudulent.
- Churn analysis -- The telecommunication, banking and insurance
industries face severe competition and every business would like to
retain as many customers as possible. Churn analysis may help
marketing managers identify the customers that are likely to leave
and why. As a result, these managers can improve customer
satisfaction and consequently retain those customers.
- Risk management -- Data mining techniques are used to determine
the risk of a loan application, helping loan officers make appropriate
decisions regarding the cost and validity of each loan application.
- Customer segmentation -- This kind of analysis determines the
behavioral and descriptive profiles for customers. These profiles can
then be used to provide personalized marketing programs and
strategies that are appropriate for each group.
- Targeted ads -- Web retailers or portal sites personalize their content
for their web customers. Using navigation patterns or online

purchase patterns, these sites can use data mining solutions to
display targeted advertisements that are most likely to trigger the
interest of these customers.
- Forecasting -- Data mining techniques can be used to identify trends,
periodicities and noise levels in various numeric series and then
extrapolate, based on these parameters, the future evolution of
those series. These techniques are frequently used for future sales
or consumption estimations and consequently for inventory
planning.
- Quality control for manufacturing processes -- Data mining can
provide tremendous help in identifying the root cause of various
manufacturing processes failures and defects.
2.2.2 Implementation Tasks
From the perspective of modeling data and choosing a data mining
technique or algorithm to use, the business problems described above, as
well as many others, can be translated to one or more general data mining
tasks such as those described below:
- Classification is the most commonly used data mining task. It
consists in learning, from a set of data points, the patterns that lead
to correctly assigning a category to a new data point. The machine
learning algorithms employed in classification tasks depend on the
user to indicate which of the features of the training data points
contains the target category. They then detect patterns linking the
other data features to the target. Because a user needs to indicate

the learning target, the classification algorithms are said to perform
supervised learning.
- Clustering (or segmentation) consists in identifying natural groups
of data points based on a set of features of these data points. Points
within the same group have more or less similar features. Clustering
is an unsupervised data mining task, as none of the features is of
special importance to the machine learning process.
- Association detects common occurrences of specific features over
typically large data point populations. The next section of this
chapter discusses the association process in more detail.
- Regression -- This task is very similar to classification, but rather than
assigning a category to a new data point, regression's goal is to
determine a numerical value. Simple linear fitting techniques,
such as the Least Squares Method, are an example of regression.
More advanced regression algorithms support various non-numeric
features as inputs.
- Time Series Forecasting is the use of a model to forecast future
events based on known past events, in order to predict data points
before they are measured.

2.2.3 Data Mining Project Cycle
Starting in 1996, a cross-industry effort emerged to describe and formalize
the most commonly used approaches that expert data miners employ for
tackling various problems. This effort materialized in the first Cross Industry

Standard Process for Data Mining (CRISP-DM), an industry-neutral, tool-
neutral standard for approaching data mining problems. This standard is
now hosted and published by IBM and can be accessed here: [28] . The
CRISP-DM standard breaks down the data mining process in six major
phases. These phases are presented below, as described by CRISP-DM (from
[28]):
- Business understanding -- This initial phase focuses on
understanding the project objectives and requirements from a
business perspective, then converting this knowledge into a data
mining problem definition and a preliminary plan designed to
achieve the objectives.
- Data understanding -- The data understanding phase starts with
initial data collection and proceeds with activities that enable you to
become familiar with the data, identify data quality problems,
discover first insights into the data, and/or detect interesting
subsets to form hypotheses regarding hidden information.
- Data preparation. The initial raw data is seldom ready to be
consumed by modeling tools. The data preparation phase covers all
activities needed to construct the final dataset, ready for modeling
tools, from the initial raw data. Data preparation tasks are likely to
be performed multiple times and not in any prescribed order. Tasks
include table, record, and attribute selection, as well as
transformation and cleaning of data for modeling tools

- Modeling -- In this phase, various modeling techniques are selected
and applied, and their parameters are calibrated to optimal values.
Typically, there are several techniques for the same data mining
problem type. Some techniques have specific requirements on the
form of data. Therefore, going back to the data preparation phase is
often necessary.
- Evaluation -- At this stage in the project, you have built a model (or
models) that appear to have high quality from a data analysis
perspective. Before proceeding to final deployment of the model, it
is important to thoroughly evaluate it and review the steps executed
to create it, to be certain the model properly achieves the business
objectives. A key objective is to determine if there is some important
business issue that has not been sufficiently considered. At the end
of this phase, a decision on the use of the data mining results should
be reached.
- Deployment -- Creation of the model is generally not the end of the
project. Even if the purpose of the model is to increase knowledge of
the data, the knowledge gained needs to be organized and presented
in a way that the customer can use it. Depending on the requirements,
the deployment phase can be as simple as generating a report or as
complex as implementing a repeatable data mining process across the
enterprise. In many cases it is the customer, not the data analyst, who
carries out the deployment steps.

The CRISP phases are iterative, as presented in Figure 2-1 below.

Figure 2-1 The CRISP-DM process (from [28])
2.3 Rules in Data Mining
Implementations of data mining processes (as described above) produce
comprehensible patterns of the data, patterns that can be used to describe
the data or make inferences about similar data pieces. These
comprehensible patterns can take multiple forms such as decision trees,
regression equations, distributions etc. Rules are a natural style of
representing patterns extracted from data. A rule consists of an antecedent,
or precondition (also known as left hand side, or LHS), and the consequent
or conclusion (right hand side, RHS). The antecedent consists of one or

more logical predicates (or tests). The conclusion consists of one or more
classes that apply to data points (instances) that satisfy the antecedent
conditions. For the scope of this work, the antecedent is a conjunction of
predicates. In this section, we will present a formal definition of rules and
some properties associated with rules. The rules are introduced from a
transactional database perspective (association rules). We will show how
this transactional perspective can be applied to classification rules (non-
transactional data).

2.3.1 Association Rules
Association Rules were introduced by Agrawal in [29] with the purpose of
analyzing vast transaction databases from large retail companies.
Let I = {i₁, i₂, …, iₘ} be a set of items. Let D be a set of data points
(database transactions). Each transaction T is a set of items such that T ⊆ I.
Each transaction typically has a unique identifier, called Transaction Id
(TxId). For a set of items A, T is said to contain A if and only if A ⊆ T.
An association rule is a logical statement of the form A ⇒ B, where A ∩ B = ∅
and A, B ⊂ I. A is the antecedent, while B is the consequent of the rule,
and both are itemsets (i.e. sets of items from the catalog I).
An itemset containing k items is said to be a k-itemset. The number of
transactions containing an itemset is the count or support count of the
itemset. Note that the support supp(A) for an itemset may be expressed as
an integer (the absolute number of transactions containing the itemset) or
as a percentage (of transactions, out of the total transactions in the
database, that contain the itemset). For the purpose of this work, supp is
always a percentage. The absolute support is always denoted as supp count.
It is important to notice that a rule is itself an itemset, consisting of all
the items appearing in the rule (in the antecedent and the consequent).
Rules are characterized by certain properties. Among those referred to in this
work:
- Confidence, defined as:

  conf(A ⇒ B) = supp(A ∪ B) / supp(A)        (2.1)

Confidence can be interpreted as an estimation of the conditional
probability of finding the right hand side of the rule among transactions
containing the left hand side of the rule, P(B|A).
- Lift, defined as:

  lift(A ⇒ B) = supp(A ∪ B) / (supp(A) · supp(B))        (2.2)

the ratio between the observed support and the expected support (if A
and B were independent)

- Importance, defined in [2] as:



(2.3)

acts as a measure of interestingness for a rule.
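
As an illustration of these measures, the short Python sketch below computes supp, conf and lift directly from the definitions in (2.1) and (2.2); the function names and the toy baskets are purely illustrative and are not taken from the cited works.

```python
from typing import List, Set

def support(itemset: Set[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs: Set[str], rhs: Set[str], transactions: List[Set[str]]) -> float:
    """Estimate of P(rhs | lhs): supp(lhs U rhs) / supp(lhs), as in (2.1)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def lift(lhs: Set[str], rhs: Set[str], transactions: List[Set[str]]) -> float:
    """Observed support divided by the support expected under independence, as in (2.2)."""
    return support(lhs | rhs, transactions) / (
        support(lhs, transactions) * support(rhs, transactions))

# The four baskets of Table 2-1 (introduced later, in Section 2.3.3)
baskets = [{"Milk", "Bread"}, {"Bread", "Butter"},
           {"Milk", "Bread", "Butter"}, {"Apples", "Butter", "Bread", "Pears"}]
print(confidence({"Milk"}, {"Bread"}, baskets))  # 1.0
print(lift({"Milk"}, {"Bread"}, baskets))        # 1.0 (Bread appears in every basket)
```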

2.3.2 Classifications of association rules
As discussed in detail in [30], association rules can be classified based on
different criteria. The most common rule classification systems are
presented below and used throughout this thesis.
Rules can be classified:
- Based on the type of values handled in the rule. If a rule associates
the presence of certain items, it is a Boolean rule. If a rule describes
associations between quantitative items or attributes, it is a
quantitative association rule.

- Based on the dimensions of data involved in the rule: if the LHS
component has a single predicate, referring to only one dimension,
then the rule is called single-dimensional. Conversely, a rule
referring to multiple dimensions is a multi-dimensional association rule.


- Based on the level of abstraction involved in the rule set. Items
used in the various predicates composing the rule may appear at
different levels of abstraction, for example:
(Age = 20-30) ⇒ Computer Games
(Age = 20-30) ⇒ Computer Software
Computer Software, in this example, is a higher level of abstraction
than Computer Games. Rule sets mined at different abstraction
levels consist of multilevel association rules. If all rules refer to the
same abstraction level, then the set is said to contain single-level
association rules.

2.3.3 The Market Basket Analysis problem
Association rules mining finds interesting associations among a large set of
items. A typical example is the market basket analysis. This process analyzes
customer buying habits by finding associations between items that are
frequently purchased together by customers (i.e. appear frequently
together in the same transaction, are frequently placed together in the
same shopping basket).
Such data is typically represented in a database as a transaction table,
similar to Table 2-1.
Order Number    Model
SO51176         Milk
SO51176         Bread
SO51177         Bread
SO51177         Butter
SO51178         Milk
SO51178         Bread
SO51178         Butter
SO51179         Apples
SO51179         Butter
SO51179         Bread
SO51179         Pears
Table 2-1 Representation of shopping basket data
For a data set organized like Table 2-1 the association rules concepts are
mapped like below:
- The space of items (I) is the set of all distinct values in the Model
column
- A transaction identifier (TxId) is a distinct value in the Order Number
column.
- A transaction, identified by a transaction id (e.g. SO51176) is the
set of distinct Model values that are associated with all occurrences
of the specified transaction identifier.
- An itemset is a non-empty collection of distinct values in the Model
column
- A rule is, therefore, a logical statement like

  {M₁, M₂, …, Mᵢ} ⇒ {Mⱼ, …, Mₙ}        (2.4)

where each Mₖ is an item

Learning a set of association rules from market basket analysis serves
multiple purposes, such as to describe the frequent itemsets or to generate
recommendations based on the shopping basket content. Generating
recommendations for a given shopping basket is generally a two-step
process:
- Identify the rules whose precondition matches the current shopping
basket content
- Sort these rules based on some rule property (confidence, lift or
importance being the most frequently used such properties), then
recommend those consequents at the top of the sorted list that
are not already part of the shopping basket, as sketched below
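
A minimal sketch of this two-step procedure, under the assumption that each discovered rule is available as an (antecedent, consequent, confidence) triple; all names and the toy rules are illustrative only.

```python
from typing import FrozenSet, List, Tuple

# A rule is assumed to be a triple: (antecedent, consequent, confidence)
Rule = Tuple[FrozenSet[str], FrozenSet[str], float]

def recommend(basket: FrozenSet[str], rules: List[Rule], top_n: int = 3) -> List[str]:
    # Step 1: keep only the rules whose precondition matches the basket content
    applicable = [r for r in rules if r[0] <= basket]
    # Step 2: sort by the chosen rule property (confidence here), then recommend
    # consequent items that are not already part of the shopping basket
    applicable.sort(key=lambda r: r[2], reverse=True)
    recommendations: List[str] = []
    for _, consequent, _ in applicable:
        for item in consequent:
            if item not in basket and item not in recommendations:
                recommendations.append(item)
            if len(recommendations) == top_n:
                return recommendations
    return recommendations

rules = [(frozenset({"Milk"}), frozenset({"Bread"}), 1.0),
         (frozenset({"Bread"}), frozenset({"Butter"}), 0.75)]
print(recommend(frozenset({"Milk"}), rules))  # ['Bread']
```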

2.3.4 Itemsets and Rules in dense representation
The data in Table 2-1 can be thought of as a normalized representation of a
(very wide) table, organized in attribute/value pairs, like below:
Tx Id Milk Bread Butter Apples Pears
SO51176 1 1 0 0 0
SO51177 0 1 1 0 0
SO51178 1 1 1 0 0
SO51179 0 1 1 1 1
Table 2-2 Shopping basket data as attribute/value pairs
For most attributes in Table 2-2, a value of 0 signifies the absence (and a
value of 1, the presence) of an item in a transaction.

The representation in Table 2-2 is not efficient for a large catalog and is
typically impossible in most RDBMS systems, which handle only up to around
1000 columns, as described in [31], [32]. However, this representation allows
adding new attributes to a transaction, attributes that are not necessarily
related to the items that are present in the shopping basket. For example,
demographic information about the customer or geographical information
about the store where the transaction has been recorded may be added to
the table. This information typically describes a different dimension of a
transaction. (A discussion of multi-dimensional data warehouses is beyond
the scope of this work, but [2] as well as [30] contain thorough
discussions of the concepts.)
A representation such as the one in Table 2-2 is said to be dense, as all the
features are explicitly present in data, with specific values indicating the
presence (1) or absence (0) of an item in a transaction. By contrast, a
representation such as the one in Table 2-1 is said to be sparse, as features
are implied from the presence (or absence) of an item in a transaction.
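
The two representations can be converted into one another mechanically. The short sketch below (variable names are illustrative only) pivots the sparse pairs of Table 2-1 into the dense 0/1 rows of Table 2-2.

```python
from collections import defaultdict

# Sparse representation: one (transaction id, item) pair per row, as in Table 2-1
sparse = [("SO51176", "Milk"), ("SO51176", "Bread"),
          ("SO51177", "Bread"), ("SO51177", "Butter"),
          ("SO51178", "Milk"), ("SO51178", "Bread"), ("SO51178", "Butter"),
          ("SO51179", "Apples"), ("SO51179", "Butter"),
          ("SO51179", "Bread"), ("SO51179", "Pears")]

catalog = sorted({item for _, item in sparse})   # the space of items I
baskets = defaultdict(set)
for tx_id, item in sparse:
    baskets[tx_id].add(item)

# Dense representation: one row per transaction, one 0/1 column per catalog item
print("TxId", catalog)
for tx_id in sorted(baskets):
    print(tx_id, [1 if item in baskets[tx_id] else 0 for item in catalog])
```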
2.3.5 Equivalence of dense and sparse representations
The dense and sparse representations of transactions are equivalent with
regard to rules and itemsets.
Let now A = {Aᵢ} be the set of all the attributes that can be associated with a
densely represented transaction, attributes which may span multiple
dimensions (for the dense dataset, each attribute Aᵢ is a column in the
dataset).
Let Vᵢ = {vᵢⱼ} be the set of all possible values of the Aᵢ attribute.
Let an item be a certain state of an attribute, Aᵢ = vᵢⱼ.
Under this definition of an item, the association rule concepts can be defined
over the dense transaction space by mapping them like below:
- The space of items (I) is the set of all distinct attribute/value pairs
- A transaction identifier (TxId) is a distinct value in the Order Number
column.
- A transaction, identified by a transaction id (e.g. SO51176) is the
set of attribute/value pairs defining the transaction row.
- An itemset is a non-empty collection of attribute/value pairs.
With these concepts, a rule, as defined in equation (2.4), becomes a logical
statement like:

  {A₁=v₁, …, Aᵢ=vᵢ} ⇒ {Aⱼ=vⱼ, …, Aₙ=vₙ}        (2.5)

It is interesting to notice that, in the particular case when the number of
items in the consequent is exactly 1, an association rule becomes a
predictive rule, as it can be used to predict, with a certain confidence, the
value of a single attribute. As we will show in Section 3.1.4 below,
association rules may be employed, in commercial software packages, to
produce predictive rules, by mapping dense datasets using this
representation.


Note that range-type columns (attributes) may have a very large number of
states, so for such attributes the corresponding set of values Vᵢ may have
very high cardinality, leading to a very large number of 1-itemsets. Binning
(discretization) is often used to reduce the number of states of an attribute.
Rules which apply to range intervals are called quantitative association rules
(as opposed to the Boolean association rules which deal with qualitative
statements). Srikant and Agrawal, in [33], introduce a method of fine-
partitioning the values of an attribute and then combining the adjacent
partitions as necessary. This work also introduces a modified version of the
apriori rule detection algorithm (described in detail below), a version which
detects quantitative association rules.
In a typical industrial system, the transaction table used to store this
information is likely to be significantly more complex. The item catalog may
contain millions of distinct items, a fact that raises significant challenges in
finding significant rules (more in the next chapter, Methods for Rule
Extraction). Also, in an industrial implementation, the transactions are
likely to be stored for analysis in a data warehouse, together with additional
related information, supporting multidimensional analysis of the data.
Dimensions associated with a transaction may include customer
information, time or geo-location information etc.


2.4 Fuzzy Rules
Fuzzy modeling is one of the techniques being used for modeling of
nonlinear, uncertain, and complex systems. An important characteristic of
fuzzy models is the partitioning of the space of system variables into fuzzy
regions using fuzzy sets [34]. In each region, the characteristics of the
system can be simply described using a rule. A fuzzy model typically consists
of a rule base with a rule for each particular region. Fuzzy transitions
between these rules allow for the modeling of complex nonlinear systems
with a good global accuracy. One of the aspects that distinguish fuzzy
modeling from other black-box approaches like neural nets is that fuzzy
models are transparent to interpretation and analysis (to a certain degree).
However, the transparency of a fuzzy model is not achieved automatically.
A system can be described with a few rules using distinct and interpretable
fuzzy sets but also with a large number of highly overlapping fuzzy sets that
hardly allow for any interpretation.
2.4.1 Conceptualizing in Fuzzy Terms
Supposing that a particular concept is not well defined, a function can
be used to measure the grade to which an event is a member of that
concept. E.g.: "today is a rainy day" may have a very low value for sunny
days, a higher value for an autumn day, and a very high value for a
torrential rain day.
This membership function is typically defined to have values in the [0,1]
space, with 0 meaning that the event does not belong at all to a concept,
and 1 meaning that an event completely belongs to a certain concept. Such
a membership function may look like a Gaussian bell, a triangle, a
trapezoid or, in general, may take any shape over the [0,1] interval (see Figure 2-2)

Figure 2-2 Standard types of membership functions: crisp, trapezoidal, triangular, sigmoid, Z-function and Gaussian (from [34])
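
As a small illustration (with purely illustrative parameter choices, not taken from [34]), two of these standard membership shapes can be written as simple functions mapping a crisp value to a grade in [0,1].

```python
import math

def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership: 0 outside (a, c), rising linearly to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def gaussian(x: float, center: float, sigma: float) -> float:
    """Gaussian bell membership centered at 'center'."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

# Grade to which a 14 mm/h rainfall belongs to the concept "rainy day" (made-up numbers)
print(triangular(14.0, a=5.0, b=20.0, c=40.0))    # 0.6
print(gaussian(14.0, center=20.0, sigma=10.0))    # ~0.835
```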
2.4.2 Fuzzy Modeling
Fuzzy modeling is a technique for modeling based on data. The result of this
modeling is a set of IF-THEN rules, with fuzzy predicates which establish
relations between relevant system variables. The fuzzy predicates are
associated with linguistic labels, so the model is in fact a qualitative
description of a system, with rules like:
IF temperature is moderate and volume is small THEN pressure is low
The meanings of the linguistic terms moderate, small and low are defined
by fuzzy sets in the domain of the respective system variables. Such models
are often called linguistic models.
Different types of linguistic models exist:
- The Mamdani model [35] uses linguistic rules with a fuzzy premise
part and a fuzzy consequent part
- The Takagi Sugeno (TS) model [36] uses rules that differ from
Mamdani models in that their consequents are mathematical
functions instead of fuzzy sets.
In a Mamdani model, the inference is the result of the rules that apply
at a certain point. The rule base represents a static mapping between the
antecedent and the consequent.
The TS model is based on the idea that the rules in the model will have
the following structure:

  Rᵢ: wᵢ (IF x₁ is Aᵢ₁ AND … AND xₙ is Aᵢₙ THEN Yᵢ = fᵢ(·))        (2.6)

Where:
- wᵢ is the rule weight (typically 1, but it can be adjusted)
- fᵢ is usually a linear function of the premise variables, x₁ … xₙ
The inference (prediction) of a TS model is computed as

  Y = ( Σᵢ βᵢ · Yᵢ ) / ( Σᵢ βᵢ ),   i = 1, …, N        (2.7)

i.e. the weighted average of the consequents of all the rules, where N
is the number of rules, Yᵢ is the contribution of a certain rule and βᵢ is
the degree of activation of the i-th rule's premise. Given the input
X = (x₁, x₂, …, xₙ), βᵢ is computed like below (the product of the
membership functions for all the predicates of the current rule):

  βᵢ = ∏ⱼ μAᵢⱼ(xⱼ),   j = 1, …, n        (2.8)

Because of the linear structure of the rule consequents, well known
parameter estimation techniques (e.g. least squares) can be used to
estimate the consequent parameters.
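
A minimal sketch of the inference in equations (2.7) and (2.8), assuming the caller supplies the membership functions of each premise and the consequent functions fᵢ; all names and the toy rules are illustrative.

```python
from typing import Callable, List, Sequence

Membership = Callable[[float], float]              # mu_Aij, one per premise predicate
Consequent = Callable[[Sequence[float]], float]    # f_i(x_1, ..., x_n)

def ts_inference(x: Sequence[float],
                 premises: List[List[Membership]],
                 consequents: List[Consequent],
                 weights: List[float]) -> float:
    """Weighted average of the rule consequents (2.7); each rule is activated by
    the product of its premise memberships (2.8), scaled by the rule weight w_i."""
    numerator, denominator = 0.0, 0.0
    for mu_list, f_i, w_i in zip(premises, consequents, weights):
        beta_i = w_i
        for mu, x_j in zip(mu_list, x):
            beta_i *= mu(x_j)                      # degree of activation of the premise
        numerator += beta_i * f_i(x)               # beta_i * Y_i
        denominator += beta_i
    return numerator / denominator if denominator else 0.0

# Two toy rules over a single input variable, with triangular memberships
rules_mu = [[lambda v: max(0.0, 1.0 - abs(v - 1.0) / 2.0)],
            [lambda v: max(0.0, 1.0 - abs(v - 3.0) / 2.0)]]
rules_f = [lambda xs: 2.0 * xs[0], lambda xs: 10.0 - xs[0]]
print(ts_inference([2.0], rules_mu, rules_f, weights=[1.0, 1.0]))  # 6.0
```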


3 Methods for Rule Extraction
In this chapter, we present the most commonly used methods for
extracting rules. Section 3.1 below presents some algorithms designed
specifically for rule extraction, such as apriori and FP-Growth. We discuss
some of the problems raised by these algorithms as well as solutions
identified for those problems. Next, in Section 3.3 we present some
techniques for extracting rules from patterns detected by other algorithms
and focus on rule extractions from neural networks, a topic of significant
interest in the next chapter. A special section, 3.2, describes the specifics of
rules analysis in Microsoft SQL Server.
3.1 Extraction of Association Rules
In this section we present some of the algorithms designed specifically for
the extraction of association rules as well as some results comparing the
real-life performance of various rules extraction algorithms.
3.1.1 The Apriori algorithm
Apriori is an influential algorithm for mining frequent itemsets for Boolean
association rules, introduced by Agrawal in [37]. The algorithm uses prior
knowledge of frequent itemset properties. Its purpose is to avoid counting
the support of every possible itemset derivable from I. Apriori exploits the
downward closure property of itemsets: if any n-itemset is frequent, then
all its subsets must also be frequent. Frequent, in this context, means that
the support (supp) of an itemset exceeds a minsup minimum support
parameter of the algorithm. Itemsets that appear less frequently than the
specified minimum support are considered infrequent and ignored by the
algorithm. An itemset generation and test algorithm that was not using the
apriori property was introduced also by Agrawal in [29].
The apriori algorithm is initialized by counting the occurrences of each
individual item, therefore finding the frequencies for all itemsets of size 1.
The algorithm does this by scanning the data set and counting the support
of each item. The 1-itemsets with a frequency lower than minsup are
removed. The remaining 1-itemsets constitute L₁, the set of frequent
1-itemsets that are interesting for the algorithm.
Once initialized, the algorithm performs iteratively the following steps:
1. The join step: a set of candidate n-itemsets, Cₙ, is generated by
joining Lₙ₋₁ with itself. (By convention, apriori assumes that items
within a transaction are sorted lexicographically.) The join is
performed on the compound key represented by the first n-2 items
in an itemset. Consider the (n-1)-itemsets A and B defined as below:

A = {a₁, a₂, …, aₙ₋₂, aₙ₋₁}
B = {b₁, b₂, …, bₙ₋₂, bₙ₋₁}        (3.1)

A and B are joined if they share the join key, i.e. if

a₁ = b₁, a₂ = b₂, …, aₙ₋₂ = bₙ₋₂ and aₙ₋₁ < bₙ₋₁        (3.2)

As a result of joining A and B on the compound key, a new candidate
n-itemset is produced and inserted in the Cₙ set of candidate
n-itemsets:

C = {a₁, a₂, …, aₙ₋₂, aₙ₋₁, bₙ₋₁}        (3.3)

The aₙ₋₁ < bₙ₋₁ predicate in the join condition is not actually
part of the join key, but ensures that no duplicate itemsets are
generated as a result of the join.

2. The pruning step: Not all itemsets in the Cₙ set of candidate
n-itemsets meet the minsup requirement. Determining those itemsets
that meet the minsup requirement can be done with a scan of the
database. However, this is not always possible or practical, as Cₙ can
be huge. The downward closure property is now used to prune
some of the Cₙ items. If any (n-1)-subset of an n-itemset
candidate is not frequent, then the candidate cannot be frequent.
This test can be done quickly by maintaining a quick lookup
structure (e.g. tree, hash table) of all the frequent itemsets
discovered so far. For this step, actually, only the frequent
(n-1)-itemsets Lₙ₋₁ need to be stored in memory.
Upon completion of the pruning step, the remainder of the set Cₙ of
candidate n-itemsets becomes the set of frequent n-itemsets, Lₙ. The
iteration stops when either Lₙ is empty or the length n+1 of the frequent
itemsets to be detected next exceeds a user-defined threshold.
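
A compact Python sketch of the join and pruning steps described above (illustrative only; the full apriori pseudo-code is listed in Appendix A: Key Algorithms):

```python
from itertools import combinations
from typing import FrozenSet, Set

def apriori_gen(prev_frequent: Set[FrozenSet[str]], n: int) -> Set[FrozenSet[str]]:
    """Generate candidate n-itemsets C_n from the frequent (n-1)-itemsets L_{n-1}."""
    candidates = set()
    prev = [sorted(itemset) for itemset in prev_frequent]
    for a in prev:
        for b in prev:
            # Join step: same first n-2 items, and a[n-2] < b[n-2] to avoid duplicates
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidate = frozenset(a + [b[-1]])
                # Pruning step: every (n-1)-subset must already be frequent
                if all(frozenset(s) in prev_frequent
                       for s in combinations(candidate, n - 1)):
                    candidates.add(candidate)
    return candidates

# L_2 -> candidate 3-itemsets, loosely mirroring the example discussed for Figure 3-1
L2 = {frozenset({"beer", "bread"}), frozenset({"beer", "diaper"}),
      frozenset({"bread", "diaper"}), frozenset({"bread", "milk"})}
print(apriori_gen(L2, 3))  # only {beer, bread, diaper} survives the pruning
```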


Figure 3-1: Finding frequent itemsets
Figure 3-1 illustrates the process of identifying frequent itemsets. The
minsup is set to 2.5% for a population of 10000 transactions, i.e. an
absolute support count of 250.
At the first iteration, cheese and cake are filtered out. At the second
iteration, the candidate {diaper, milk} is disqualified. At the third iteration,
the candidate {beer, bread, diaper} has enough support, whereas the
candidate {bread, diaper, milk} is filtered out, because it contains the {diaper,
milk} subset which has already been discounted as infrequent.
Once the frequent itemsets have been detected, the association rules can
be extracted easily. Typically, only rules exceeding a certain confidence
threshold are interesting. Let minconf be the minimum confidence
threshold, an algorithm input parameter. As mentioned in Section 2.3.1
above, the confidence of a rule A ⇒ B is defined as

  conf(A ⇒ B) = supp(A ∪ B) / supp(A)        (3.4)

For each frequent itemset I, association rules can be generated like below:
- Generate all non-empty strict subsets {Sᵢ ⊂ I} of the itemset
- For every non-empty subset Sᵢ, determine the confidence of the rule
Rᵢ: Sᵢ ⇒ {I - Sᵢ}:

  conf(Rᵢ) = supp(I) / supp(Sᵢ)        (3.5)

- If conf(Rᵢ) > minconf then add Rᵢ to the set of rules
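
Once the frequent itemsets and their supports are known, the rule generation step can be sketched as below (assuming a dictionary mapping every frequent itemset to its support; names and data are illustrative).

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

def generate_rules(frequent: Dict[FrozenSet[str], float],
                   minconf: float) -> List[Tuple[FrozenSet[str], FrozenSet[str], float]]:
    """For every frequent itemset I, emit rules S -> (I - S) whose confidence exceeds minconf."""
    rules = []
    for itemset, supp_i in frequent.items():
        for size in range(1, len(itemset)):
            for subset in combinations(itemset, size):     # non-empty strict subsets S_i
                lhs = frozenset(subset)
                conf = supp_i / frequent[lhs]               # supp(I) / supp(S_i), eq. (3.5)
                if conf > minconf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

supports = {frozenset({"beer"}): 0.30, frozenset({"bread"}): 0.60,
            frozenset({"beer", "bread"}): 0.24}
print(generate_rules(supports, minconf=0.5))
# The only rule clearing minconf is beer -> bread (confidence 0.24 / 0.30 = 0.8)
```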
The apriori method of detecting frequent itemsets may need to generate a
huge number of candidate sets. For example, if there are 10,000 frequent
items, the algorithm will need to generate more than 10⁷ candidate
2-itemsets and then scan the database in order to test their occurrence
frequencies. Some other issues raised by the apriori algorithm (and, in
general, by any algorithm driven by a minsup parameter) are discussed in
Section 3.1.4, which treats the problem of rare rules.
3.1.2 The FP-Growth algorithm
The Frequent Pattern Growth (FP-Growth) algorithm was introduced by
Jiawei Han in [38] and refined in [39], with the purpose of extracting the
complete set of frequent itemsets, without candidate generation.



Figure 3-2 An FP-Tree structure
The algorithm uses a novel data structure, called a Frequent pattern Tree
(FP-tree). A FP-tree is an extended prefix tree which stores information
about frequent patterns. Only the frequent 1-items appear in the tree and
the nodes are arranged in such a way that the frequently occurring items
have better chances of node sharing than the less frequently occurring
ones. An item header table can be built to facilitate the trees traversal.
Figure 3-2 presents such a tree, together with the associated item header
table. Once an FP tree is built, mining frequent patterns in a database us
transformed to that of mining the FP-Tree. Experiments [38] show that such
a tree may be orders of magnitude smaller than the dataset it represents.
The full algorithm for building the tree is presented in Appendix A: Key
Algorithms. While building the tree, the item header table is updated to
contain a node link (pointer) to the first occurrence of each item in the tree.
Any new occurrence of the item in the tree (as part of a different sub-tree)
ends up being linked to the previous occurrence, so that from the item
header table one can traverse all the tree occurrences of each individual
item.
Each transaction in the database is represented on one of the paths from the FP-tree root to a tree leaf. Consequently, for each itemset α, any larger itemset suffixed by α may only appear on a path containing α.
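The prefix sharing and the node-link chaining described above can be illustrated with a small sketch (simplified Python, not the full construction algorithm given in Appendix A; the class and field names are chosen only for this illustration):

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}     # item -> child FPNode
        self.link = None       # next node in the tree that carries the same item

def insert_transaction(root, items, header):
    # items: the frequent items of one transaction, sorted by descending global frequency
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            child.link = header.get(item)   # chain the new occurrence to the previous one
            header[item] = child            # the header entry now reaches the latest node
        child.count += 1
        node = child

In this sketch the header entry points at the most recent occurrence of an item and the link fields chain back to the earlier ones; the direction of the chaining is immaterial, as long as every occurrence of an item remains reachable from the header table.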
The second step of the algorithm consists in mining the tree to extract the
frequent itemsets. Each 1-itemset, in reverse order of the frequency, is
considered as an initial suffix pattern. By traversing the linked list of
occurrences of the initial suffix pattern in the tree, a conditional pattern
base is created, consisting of full prefix paths in the FP tree that co-occur
with the current suffix pattern. The conditional pattern base is used to
create a conditional FP-tree. This conditional tree is then mined recursively.
All the detected patterns are concatenated with the original suffix pattern
used to create the conditional FP-tree.
As opposed to Apriori, which performs candidate itemset generation and testing, the frequent pattern growth algorithm performs only restricted testing, without candidate generation. Also, the mining of the FP-tree is based on partitioning, which dramatically reduces the size of the conditional pattern bases.
Refinements to the original FP-tree mining algorithm are proposed in [39], including a method to scale FP-tree mining by using database projections. For an item a_i and a database DB, the a_i-projected database is based on DB and contains all transactions which contain a_i, after eliminating from them the infrequent items (all items that appear after a_i in the list of frequent items).
Also included in [39] is a comparative analysis of the FP-growth algorithm
and an alternative database projection-based algorithm, TreeProjection
(described in Section 3.1.3). The FP-Growth algorithm is determined to be
more efficient both in terms of memory consumption and computational
complexity.


3.1.3 Other algorithms and a performance comparison
Partition, an algorithm introduced in 1995 by Savasere et al. in [40]
generates all the frequent itemsets and rules in at most 2 scans of the
database. In the first scan, it divides the database in a number of non-
overlapping partitions and computes, for each partition, the frequent
itemsets. The union of these per-partition frequent itemsets is a superset of all frequent itemsets, so it may contain itemsets that are not globally frequent. A
second scan of the database is employed to compute the actual support for
all candidate itemsets (and remove those that are not globally frequent).
We mentioned, in the previous section, the TreeProjection algorithm. It was
introduced in 2000 by Agarwal, in [41], and uses a lexicographic tree to
represent the itemsets. Transactions are projected onto the tree nodes for
counting the support of frequent itemsets.
A different approach to rules mining is to discover the closed itemsets, a
small representative subset that captures frequent itemsets without loss of
information. This idea was introduced in 1999 by Pasquier et al. in [42]. An
algorithm to detect closed itemsets called CLOSE was introduced in the
same paper. After finding the frequent k-itemsets, Close compares the
support of each set with its subsets at the previous level. If the support of
an itemset matches the support of any of its subsets, the itemset is pruned.
The second step in Close is to compute the closure of all the itemsets found
in the first step. An improved version, A-CLOSE, was introduced in [43],
which generates a reduced set of association rules without having to
produce all frequent itemsets, thus reducing the computational cost.
Charm is another algorithm for generating closed frequent itemsets for
association rules, introduced in [44]. Charm explores simultaneously the
itemset space as well as the transaction space and uses a very efficient
search method to identify the frequent closed itemsets (instead of enumerating many possible subsets).

A 2001 study compared the performance of some of the commonly used
rules or frequent itemset detection algorithms [45]. Apriori, FP-Growth and
TreeProjection were included among the tested algorithms. The study used
three real-world datasets as well as one artificial dataset, T10I4D100K from
IBM Almaden. The original URL indicated for the data generator,
http://www.almaden.ibm.com/software/quest/Resources/index.shtml,
seems unavailable now (June 2011), but the test datasets can be downloaded from http://fimi.ua.ac.be/data/).
claimed by their respective authors were confirmed on artificial datasets,
but some of these gains did not seem to carry over to the real datasets. As
reported in [45], a very quick growth in the number of rules is associated
with very small changes in the minimum support threshold, suggesting that
the choice of algorithm only matters at support levels that generate more
rules than would be useful in practice.


3.1.4 Problems raised by Minimum Support rule itemset
extraction systems
The most commonly used algorithms for rule extraction, apriori and FP-tree,
just like most of the other algorithms mentioned previously, focus on
finding frequent itemsets, i.e. itemsets that exceed a certain minimum
support. All itemsets (and, consequently, rules) that do not meet the
minsup threshold are ignored by these algorithms.
Rules with low support and high confidence, however, may be very
interesting for certain applications, particularly for e-Commerce
applications which aim to yield high profit margins by suggesting items of interest to customers. Customers with exotic tastes may be a small minority, but they share, within their respective clusters, similar interests, and recommendation systems should, at least theoretically, be able to make appropriate suggestions in their case.
Rare rules may be of two forms:
- Both the antecedent and the consequent have small support and fail
the minsup test. In this case, they are never considered by common algorithms.
- The predicates in antecedent and/or the consequent exceed the
minsup criterion, but they only rarely co-occur, and the combination
ends up being ignored by the algorithms.
The simple solution of reducing the minsup threshold does not work in practice. On a theoretical level, the minimum support criterion is what makes both apriori and FP-tree practical for large datasets. The comparative study in [45] (discussed in Section 3.1.3 above) shows that, on certain datasets, small reductions of the minimum support value may lead to an extremely rapid growth in the number of rules.
Some research has been carried out recently in the area of rare rule detection. A collection of the most significant results in this area is available in [46]. A few
different approaches have been taken in solving this problem.
One of the approaches consists in using a variable minimum support
threshold. Each itemset may have a different support threshold, which can
be predefined or can be dynamically lowered to allow for rare itemset
inclusion.
Multiple Support Apriori (MSApriori), introduced in [47], allows each
database item to have its own minimum support. The minimum support for
an n-itemset, n>1, is computed as the minimum per components. To
facilitate the detection of small support itemsets, the items are sorted in
ascending order of their minimum support values rather than in the
conventional lexicographic order used by apriori. As it is impractical to associate an individual minimum support with each item in a large product catalog, the authors suggest a Lowest Allowable Minimum Support (LS) and a constant β ∈ [0,1] as algorithm parameters. An arbitrary item's minimum support will then be

MIS(i) = max(β · f(i), LS)    (3.6)

where f(i) is the frequency of item i.
The algorithm detects certain rare itemsets and rules, but the criterion is a user-supplied value rather than the actual frequency of the items.
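Assuming the form of the minimum item support given in (3.6), the per-item threshold and the MSApriori item ordering could be obtained as below (an illustrative Python sketch; the names beta, ls and frequencies are not taken from [47]):

def minimum_item_support(frequency, beta, ls):
    # MIS(i) = beta * f(i), but never below the lowest allowable minimum support LS
    return max(beta * frequency, ls)

# Items are then sorted in ascending order of their minimum supports, e.g.:
# mis = {item: minimum_item_support(f, beta=0.5, ls=0.001) for item, f in frequencies.items()}
# ordered_items = sorted(mis, key=mis.get)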

Relative Support Apriori, introduced in [48], is a refinement on top of
MSApriori which avoids the user input (the parameter of the MSApriori
algorithm) and defines a new threshold for itemsets, the relative support,
which measures the confidence of rare items. The relative support
threshold (defined below) imposes a higher support limit for items that are
globally infrequent.



(3.7)


Adaptive Apriori, introduced in [49], introduces the idea of support
constraints, a function which produces minimum support for specified
itemsets. Multiple constraints are combined by picking the minimum. The
resulting apriori implementation generates only necessary itemsets, i.e.
itemsets that meet the set of predefined constraints.

LPMiner, introduced in [50], also uses a variable minimum support
threshold. The authors propose a support threshold which decreases with
the length of the itemset. The implementation is based on the FP-tree
algorithm.

A very different approach consists in completely eliminating the minimum
support threshold.
A family of algorithms based on MinHashing is presented in [51]. These
algorithms detect rare itemsets of very highly correlated items. The
algorithms represent transactions, conceptually, as a 0/1 matrix with one
row per transaction and as many columns as distinct items. In this
representation, the confidence of a rule is the number of rows with 1 in both columns divided by the number of rows with 1 in either column. This representation is not practical, as it would be very large. The authors suggest computing a hashing signature for each column so that the probability that two columns have the same signature is proportional to their similarity.
As an example of such a hash, a random order of the rows is selected and a column's hash is the first row index (under the new order) where the column has a 1. The article shows that the probability that two columns share a signature is proportional to their similarity.
To reduce the number of false positives and false negatives, multiple signatures are selected (by repeating the process independently). The resulting candidate pairs are generated and checked against the real database (the original matrix). The algorithm is implemented for rules with 2 or 3 itemsets but has not yet been extended beyond this size.
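A single MinHash signature of the kind described above can be computed as in the sketch below (illustrative Python; in the actual algorithm the computation is repeated with several independent permutations and the resulting candidate pairs are verified against the original matrix):

import random

def minhash_signature(columns, n_rows, seed=0):
    # columns: {item: set of row indices where the item's 0/1 column contains a 1}
    # Every item is assumed to appear in at least one transaction.
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)                                   # one random order of the rows
    rank = {row: position for position, row in enumerate(order)}
    return {item: min(rank[r] for r in rows)             # first row (under the new order) with a 1
            for item, rows in columns.items()}

Items whose columns have a large overlap will agree on the signature with a probability proportional to their similarity, which is what makes the collision test usable as a cheap similarity filter.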

Apriori Inverse, proposed in [52], is also a variation of the apriori algorithm
but it uses maximum support instead of minsup. Candidates of interest are
below maxsup, but still above an absolute minimum support (minabssup,
noise threshold). An itemset X is interesting if sup(X) < maxsup AND sup(X) > minabssup.

Apriori Rare, proposed in [53], splits the problem of detecting rare itemsets
in two tasks. The authors introduced the concepts of:
- Maximal frequent itemset (MFI), an itemset which is frequent, but
all its proper supersets are rare
- Minimal rare itemset (mRI) , a rare itemset having all proper subsets
frequent
- Generator, an itemset G that has no proper subset with the same support (i.e. there is no S ⊂ G with supp(S) = supp(G))
The mRIs can be detected naively, using apriori, or by using a new algorithm
introduced in the paper, called MRG-Exp, which avoids exploring all
itemsets and instead only looks for frequent generators in the itemset
lattice. The second part consists in restoring rare itemsets from mRIs, using
an algorithm called Arima (A Rare Itemset Miner Algorithm).

3.2 An implementation perspective: Support for association
analysis in Microsoft SQL Server 2008

This section describes some of the innovations supporting association analysis in the Microsoft SQL Server Analysis Services 2008 platform (AS), as a context for some of the work presented in this document. We originally published the core of this material in our previously published volume [2].
Analysis Services separates data storage objects (mining structures), which are in essence multi-dimensional data stores, from mining models, which are instantiations of data mining algorithms on top of projections of mining structures.
In the simplest case, a mining structure may be a table. A mining model
belonging to that structure may use some or all of the table columns.
Mining structure columns can be referred to more than once in the same
mining model. A mining case is a data point used in training a data mining
algorithm or one that needs to be scored by a trained algorithm.

A significant innovation in the AS product is the concept of nested tables.
From a data mining modeling perspective, a nested table is a tabular feature
of a mining case.

Figure 3-3 A mining case containing tabular features
Figure 3-3 represents such a mining case (a customer, in this case). The case
contains certain scalar features such as Key (a unique identifier), Gender,
Age or Name. Tabular features, such as the list of purchases or ratings produced by this customer for certain movies, can also be logically associated with the customer.
From a relational database perspective, a customer with the related
information is represented as join relationships between several tables.
Figure 3-4 presents the relational database structure associated with the
mining case represented by the customer plus purchases and movie ratings.
A mining structure can store data from multiple tables and models built
inside that structure can access data from multiple tables as features of the
same mining case. The modeling of nested tables is centered on the key
columns of the nested table. Each individual value of a nested table key is
mapped to one or more modeling attributes.

Figure 3-4 A RDBMS representation of the data supporting mining cases with nested tables
For example, consider a classification model that aims to predict a customer's age based on gender and the lists of purchases as well as movie ratings. Each mining case will have the following attributes:
- Gender, Age from the People table
- Purchases(Milk), Purchases(Bread), Purchases(Apples), all of them with values of Existing/Missing
- Purchases(Bread).Quantity, Purchases(Milk).Quantity, Purchases(Apples).Quantity, either missing or mapped to the Quantity column of the Purchases relational table
The feature space for a mining case is very wide and contains all possible values for each nested table key (and the related attributes). However, a mining case is represented sparsely: only those nested attributes having the Existing state are presented to the mining algorithm. Given that the mining
algorithm has full access to the feature space information (dimensionality,
data types), it can effectively mine the very large feature space.
The abstraction on top of the physical feature set is part of the AS platform
and all the data mining algorithms running on the AS platform must,
therefore, support sparse feature sets.
The nested table concept in AS allows mining complex patterns directly
from relational database, without a need to move the data to an internal
representation.

The nested tables are particularly useful in mining association rules, as they
map to the database representation of transactions. Using the equivalence (shown in Section 2.3.4 above) between the transactional and tabular data for the association rules algorithm, the result is an implementation that can
detect association rules between nested table items (transactional items)
and scalar features. The AS implementation of association rules is,
therefore, able to produce associative rules combining multidimensional
predicates, such as the one below:

Gender = Male AND Purchases(Bread) = Existing → Purchases(Milk) = Existing    (3.8)

Models, inside mining structures, use projections of the data in the
structure. Columns from the mining structure may appear once, multiple
times or not at all in a model. Rows of the mining structure may be filtered
out of models as well.

Figure 3-5 Using a structure nested table as source for multiple model nested tables
Figure 3-5 presents an example of complex modeling using filters:
- The mining structure on the left contains a single nested table with 2
columns: product name and a flag indicating whether the product
was On Sale when purchased or not
- A model is built inside the mining structure, containing two nested
tables, both linked to the single mining structure nested table, but
with different row filters.
Rules can now be mined to detect how On Sale products drive sales of other products.
3.3 Rules as expression of patterns detected by other
algorithms
The descriptive power of rules makes them a frequently used tool for
explaining the patterns extracted by various machine learning algorithms.
3.3.1 Rules based on Decision Trees
Decision tree building algorithms are frequently used for rule extraction. Tree induction methods produce patterns that can easily be converted to rule sets. Every node in a classification tree (such as ID3, Iterative Dichotomiser 3, introduced by Quinlan [54]) or classification-and-regression tree (CART, introduced by Breiman et al., [55]) can be easily converted to a rule by treating the full path, from the root to the respective node, as the antecedent and the histogram of the node as the consequent.

Collections of trees (forests) can be used to extract association rules, similar
to the ones detected by the apriori algorithm. An example for this is
implemented in Microsoft's SQL Server data mining product, as we described in [2]. In such an implementation, a tree is built for each item in the item catalog, with the purpose of extracting rules that have that respective item as a consequent. Figure 3-6 shows such a tree, built for the Eyes Wide Shut movie item as a consequent. An example of such a rule is:
R1: (Full Metal Jacket) → (Eyes Wide Shut)
supp(R1) = (total support for the leaf node) = 56
conf(R1) = (from the histogram of the leaf node) = 11/56 = 0.1964
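The conversion of a tree path into such a rule can be sketched as below (illustrative Python; the parent, condition and histogram fields are assumptions made for this example, not the data structures of the SQL Server product):

def leaf_to_rule(leaf, consequent_item):
    # Walk from the leaf up to the root, collecting the split conditions as the
    # antecedent; the leaf histogram yields the support and the confidence.
    antecedent, node = [], leaf
    while node.parent is not None:
        antecedent.append(node.condition)          # e.g. "Full Metal Jacket = Existing"
        node = node.parent
    support = sum(leaf.histogram.values())         # total cases reaching the leaf
    confidence = leaf.histogram.get(consequent_item, 0) / support
    return list(reversed(antecedent)), consequent_item, support, confidence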

Figure 3-6 A decision tree built for rules extraction (part of a SQL Server forest)

3.3.2 Rules from Neural Networks
An artificial neural network (ANN) is a mathematical (or computational)
model inspired from functional aspects of biological neural networks. An
ANN consists of groups of interconnected artificial neurons. A thorough description of artificial neural networks is beyond the scope of this work and can be found in [56]. Some concepts and properties of ANNs that are relevant to this work are summarized from [56] in this section.

Figure 3-7 An artificial neural network
Each artificial neuron is a simplified abstraction of a biological neuron. A
neuron receives one or more inputs and sums them to produce an output. A
neuron typically combines the inputs by means of some weighted sum, and
then the result is passed through a non-linear function called activation or
transfer function for the neuron. The output of neuron k is:

y_k = φ( Σ_{j=1}^{m} w_kj · x_j )    (3.9)

Where:
- m is the number of inputs for the current neuron
- w_kj is the weight associated with the connection between input j and the current neuron k
- x_j is the actual input value
- φ is the activation function of the neuron.
Frequently used activation functions include the step function or a sigmoid
function.
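Equation (3.9) corresponds to the short computation below (a minimal Python sketch; the logistic sigmoid is used only as an example of activation function):

import math

def neuron_output(weights, inputs, activation=lambda v: 1.0 / (1.0 + math.exp(-v))):
    # weights: w_k1 ... w_km, inputs: x_1 ... x_m; returns phi(sum_j w_kj * x_j)
    return activation(sum(w * x for w, x in zip(weights, inputs)))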
The ANN in Figure 3-7 has neurons arranged in 3 layers: an input layer, a
hidden one and an output layer. Complex systems may have more hidden
layers. For the purpose of this work, networks can be organized as any
directed acyclic graph (feed forward networks).
An artificial neural network is usually defined by
- The topology of the network (the connections between neurons)
- The learning process for updating the weights of the
interconnections
- The activation functions of the neurons

Neural networks can be used to model complex relationships between
inputs and outputs and are frequently employed in such tasks as
classification or pattern recognition.
More complex neural network types were proposed for modeling complex
biological processes, such as cortical development and reinforcement
learning. The Adaptive Resonance Theory (ART), for example, described in
detail in [57], is a special kind of neural network with sequential learning
ability.
The internal structure of neural networks, specifically the presence of the
hidden layers, makes them capable of solving certain classes of difficult
classification problems (such as the non-linearly separable problems). It is
the same complexity, on the other hand, that makes neural networks less
intuitive and more difficult to interpret. A very large corpus of research has
been produced in the last decades on changing the black-box status of
neural networks and exposing the patterns inside.
Three classes of techniques are often used to describe the patterns learned
by a neural network:
- Visualization of the neural network consists of directly describing
the network topology, the weights associated with the connections
and the activation functions of the neurons
- Sensitivity analysis consists in probing the ANN with different test
inputs then recording the outputs and determining, in the process,
the impact or effect of an input variable on the output.
- Rule extraction consists in producing a set of rules that explain the
classification process
Visualization and sensitivity analysis are beyond the scope of this work. The rest of this section presents some of the methods used in extracting rules from neural networks.
The rules extracted from a network may be crisp or fuzzy. A crisp rule is a proposition offering crisp Yes and No answers, such as the one below:


(3.10)

A fuzzy rule is a mapping from the X input space to the space of fuzzy class
labels, as described in Section 2.4 above.
While chronologically not the first work in the area of rule extraction from neural networks, a 1995 survey on rule extraction, [58], is of particular interest, as it introduced a frequently used taxonomy of the methods used for rule extraction from ANNs, based on the expressive power of the rules, the translucency of the technique (the relationship between the rules and the ANN's structure), the quality of the rules (accuracy, fidelity to the ANN's computations, comprehensibility), algorithmic complexity and the treatment of variables. The taxonomy was updated in 1998 in [59] to
cover a broader range of ANNs, such as recurrent networks. One of the first
methods for extracting rules from a neural network was proposed by Saito
and Nakano in 1988, in [60]. It is a sensitivity analysis approach, which
observes the effects that changes in the inputs cause on the network
output. The problem raises challenges due to the large number of input
combinations that need to be evaluated. The authors employ a couple of
heuristics to deal with this problem, such as limiting the number of
predicates that may appear in an input.
In 1999, it is shown in [61] that multilayer feed-forward networks are universal approximators, i.e. they can uniformly approximate any real continuous function on a compact domain. In 1994, a similar result is shown in [62] for certain fuzzy rule-based systems (FRBS), specifically fuzzy additive systems, i.e. systems based on rules such as:

R_j: IF x_1 is A_j1 AND ... AND x_n is A_jn THEN y_k = p_jk(x_1, ..., x_n)    (3.11)

where p_jk is a linear function of the inputs.
This equivalence led authors to discuss the equivalence of neural nets and fuzzy expert systems, as shown in [63]. In 1998, Benitez et al. offer a constructive proof in [64] for the equivalence of certain neural networks and certain fuzzy rule-based systems (FRBS). They show how to create a fuzzy additive system from a neural network with 3 layers (a single hidden layer) which uses a logistic activation function in the hidden neurons and an identity function in the output neurons. The area of neuro-fuzzy systems is particularly interesting in the context of this work, as it provides context for some of the results presented in Chapter 4 below.
More work on the level of equivalence between fuzzy rule-based systems
and neural networks is presented in [65]. The authors provide a survey of
neuro-fuzzy rule generation algorithms. This work is used in 2005 in [66] to extract IF-THEN rules from a fuzzy neural network and explain to drug designers, in a human-comprehensible form, how the network arrives at a particular decision.
More recently, in 2011, Chorowski and Zurada introduced a new method in
[67], called LORE (Local Rule Extraction), suited for multilayer networks with
logical or categorical (discrete) inputs. A multilayer perceptron is trained
under standard regime and then converted to an equivalent form that
mimics the original network and allows rule extraction. A new data
structure, the Decision Diagram, is introduced, which allows efficient partial
rule merging. Also, a rule format is introduced which explicitly separates the subsets of inputs for which the answer is known from those with an undetermined answer.


4 Contributions to Rule Generalization
This chapter is organized as follows. The first subsection describes some
concepts related to fuzzy rules generalization and simplification, while the
second section briefly discusses several methods for optimizing and
simplifying the rule sets. The third section focuses on one of these methods
(the Rule Base Simplification based on Similarity Measures).
The fourth section presents a rule generalization algorithm introduced in [1]
for rules extracted from Fuzzy ARTMAP classifiers. The algorithm is then
adapted to rule sets produced by common rule extraction algorithms, such
as apriori. The last section contains some ideas for further research and
some conclusions.
4.1 Fuzzy Rules Generalization
One of the aspects that distinguishes fuzzy modeling from black-box approaches like neural nets is that fuzzy models are, to a certain degree,
transparent to interpretation and analysis. However, the transparency of a
fuzzy model is not achieved automatically. A system can be described with a
few rules using distinct and interpretable fuzzy sets but also with a large
number of highly overlapping fuzzy sets that hardly allow for any
interpretation.
Description of a system using natural language is an advantage of fuzzy
modeling. A simplified rule base makes it easier to assign qualitatively
meaningful linguistic terms to the fuzzy sets, and it reduces the number of
terms needed. It becomes easier for experts to validate the model and the
users can understand better and more quickly the operation of the system.
A model with fewer fuzzy sets and fewer rules is also better suited for the
design and implementation of a nonlinear (model-based) controller, or for
simulation purposes, and it has lower computational demands. Several
methods have been proposed for optimizing the size of the rule base
obtained with automated modeling techniques, and some of them are
discussed in this chapter. One of them, discussed in detail in Section 4.3 on
Similarity Measures and Rule Base Simplification, consists in measuring the
similarity of fuzzy rules and sets and merging them in order to simplify the
model. We build on the concepts introduced by this work and propose a
new method of simplifying the rule set by generalizing the rules in the
model, using data mining rule concepts such as support and accuracy.

4.1.1 Redundancy
Fuzzy models, especially if acquired from data, may contain redundant
information in the form of similarity between fuzzy sets. Three unwanted
effects that can be recognized are
1) Similarity between fuzzy sets in the model;
2) Similarity of a fuzzy set to the universal set;
3) Similarity of a fuzzy set to a singleton set.
As similar fuzzy sets represent compatible concepts in the rule base, a
model with many similar fuzzy sets becomes redundant, unnecessarily
complex and computationally demanding.
Some of the fuzzy sets extracted from data may be similar to the universal
set. Such fuzzy sets are irrelevant. The opposite effect is similarity to a
singleton set. During adaptation, membership functions may get narrow,
resulting in fuzzy sets almost like singletons (spikes). If a rule has one or
more such fuzzy sets in its premise, it will practically never fire, and thus the
rule does not contribute to the output of the model. However, it should be noted that such rules may represent exceptions from the overall model behavior.
4.1.2 Similarity
Different measures have been proposed for the similarity of fuzzy sets. In general, they can be divided into:
- Geometric similarity measures (e.g. the Minkowski class of distance functions):

D(A, B) = ( Σ_{j=1}^{n} |μ_A(x_j) − μ_B(x_j)|^r )^{1/r}, r ≥ 1    (4.1)

- Set-theoretic similarity measures (e.g. the consistency index):

S(A, B) = max_{x ∈ X} [ μ_A(x) ∧ μ_B(x) ]    (4.2)

where ∧ is the minimum operator.
Setnes et al., in [68], describe some of the problems associated with using
these measures. The paper defines a set of criteria for such a measure and
introduces such a measure, which will be discussed in detail in Subsection
4.3 below.
4.1.3 Interpolation based rule generalization techniques
Takagi-Sugeno and Mamdani models perform inferences under the
assumption that the rule set completely covers the inference space (i.e. it is
dense). Interpolative reasoning methods address the problem of sparse rule
sets, which do not cover the whole inference space.
Mizumoto and Zimmermann, in [69], analyze the properties of rule models and the possibility to interpolate new rules in the generalized modus tollens. A modus tollens rule may be written, in logical operator notation, as

((A → B) ∧ ¬B) → ¬A    (4.3)

In 1993, in [70], Kóczy and Hirota propose a method (KH rule interpolation) for interpolations where results are inferred based on the computation of each α-cut level, and the resulting points are connected by linear pieces to yield an approximate conclusion.

63

4.2 Rule Model Simplification Techniques
Extensive research is available for rule model simplification techniques.
Such techniques may target the feature set considered for rule inference,
the definition of the fuzzy sets participating in the rules or the structure of
the rules models.
4.2.1 Feature set alterations
Feature set alteration techniques share the goal of reducing the number of
features that participate in the inference process. A direct consequence of
applying such alteration techniques is that they result in simplified rule
systems, because a reduction in the number of features implies a smaller
number of predicates in the rules' premises. Such alterations can be classified
as Feature Extraction or Feature Selection techniques.
Feature Extraction techniques allow the synthesis of a new, lower-dimensional feature set which encompasses all or most of the variance of the original feature set (i.e. the original information is preserved or the loss is minimal). Such techniques include Principal Component Analysis (aka the Karhunen-Loève transform), described in [71], which consists in identifying
the eigenvectors of the covariance matrix of the training data and
projecting the data on these eigenvectors. The eigenvalues associated with
these eigenvectors provide a measure of the variance of the whole system
along these vectors and consequently allow sorting the new coordinates
(the eigenvectors) in the order of variance. Frequently, for real data sets, a
low number of eigenvectors can account for 95% or more of the variance in
data.
A similar feature extraction technique is Sammon's non-linear projection [72]. In this approach, a set of high-dimensional vectors is projected into a low-dimensional space (2 or 3 dimensions) and a gradient descent technique is used to adjust the projections so that the distance between projections is as close as possible to the distance between the original pairs of vectors. As the preservation of the semantic meaning is a major advantage of fuzzy rule models, techniques for feature transformation (which inherently alter the model's semantics) are not treated in depth in this work.
Feature Selection techniques do not create new features, but rather
identify the top most significant features to be used in building a model. On
real data sets, this approach often provides very good results because of
redundancy, co-linearity or irrelevance of certain data dimensions. Dash
and Liu, in [73], provide an extensive overview of the feature selection
techniques commonly used in classification systems. A very popular
technique for feature selection is the information gain method, introduced
in [54]. The information gain feature selection method sorts the input
features by the amount of entropy they reduce from the whole system and
can be used to determine which features should be retained, by keeping
those whose information gains are greater than a predetermined threshold.
Feature selection does not affect the semantic meaning of the rule model
and is used for rule simplification techniques.
4.2.2 Changes of the Fuzzy sets definition
Song et al., in [74], suggest using supervised learning to adapt the parameters of the fuzzy membership functions defining the components of the rules. With the assumption that the inference surface is
relatively smooth, over-fitting of the fuzzy system can be detected in two
ways. Two membership functions coming sufficiently close to each other
can be fused into a single membership function, and membership functions
becoming too narrow can be deleted. In both cases, this adaptive pruning
improves the interpretability of the fuzzy system. This approach is related to
our proposed method for rules generalization and the methods will be
compared in Subsection 4.4 below.
4.2.3 Merging and Removal Based Reduction
Automatically generated rule systems often produce redundant, similar,
inconsistent or inactive rules. Handling of similar rules is detailed in the next
section, covering Similarity Measures and Rule Base Simplification.
Inconsistent rules destroy the logical consistency of the models. Xiong and Litz, in [75], propose a consistency index, a numerical assessment which helps measure the level of consistency/inconsistency of a rule base. They
use this index in the fitness function of a genetic algorithm which searches a
set of optimal rules under two criteria: good accuracy and minimal
inconsistency.
4.3 Similarity Measures and Rule Base Simplification
Setnes et al., in [68], propose a similarity measure for rules in a model.
Based on this measure, similar fuzzy sets are merged to create a common
fuzzy set to replace them in the rule base, with the goal of creating a more
efficient and more linguistically tractable model.
A similarity measure for two fuzzy sets, A and B, is defined as a function

S : F(X) × F(X) → [0, 1]    (4.4)

where F(X) denotes the family of fuzzy sets defined on the domain X.

A set of 4 criteria for a similarity measure is first introduced in [68]:
- Non-overlapping fuzzy sets should be totally non-equal. That is,

S(A, B) = 0 ⟺ μ_A(x) · μ_B(x) = 0, ∀ x ∈ X    (4.5)

- Overlapping fuzzy sets should have a similarity value greater than 0:

S(A, B) > 0 ⟺ ∃ x ∈ X such that μ_A(x) · μ_B(x) ≠ 0    (4.6)

- Only equal fuzzy sets should have a similarity value of 1:

S(A, B) = 1 ⟺ μ_A(x) = μ_B(x), ∀ x ∈ X    (4.7)

- Similarity between two fuzzy sets should not be
influenced by scaling or shifting the domain on which
they are defined

With these criteria, [68] proposes a new similarity measure, based on set theory, defined as:

S(A, B) = |A ∩ B| / |A ∪ B|    (4.8)

This measure is, therefore, the ratio between the cardinalities of the intersection and the union of the sets. When the equation is rewritten using the
membership functions, in a discrete space X = (x_1, x_2, ..., x_n), it becomes:

S(A, B) = [ Σ_{j=1}^{n} ( μ_A(x_j) ∧ μ_B(x_j) ) ] / [ Σ_{j=1}^{n} ( μ_A(x_j) ∨ μ_B(x_j) ) ]    (4.9)

The operators ∧ and ∨ are, respectively, the minimum and the maximum. This similarity measure complies with the four criteria above and reflects the idea of a gradual transition from equal to completely non-equal fuzzy sets.
With this measure defined, [68] proceeds to simplify the rule base. Fuzzy sets that are similar to the universal fuzzy set (S(A, U) ≈ 1) can, for example, be removed.
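The measure (4.9) can be evaluated directly on two sampled membership functions, as in the sketch below (illustrative Python, assuming both fuzzy sets are discretized on the same points x_1, ..., x_n):

def fuzzy_similarity(mu_a, mu_b):
    # mu_a, mu_b: membership values of A and B on the same discrete domain
    intersection = sum(min(a, b) for a, b in zip(mu_a, mu_b))   # cardinality of A ∩ B (min operator)
    union = sum(max(a, b) for a, b in zip(mu_a, mu_b))          # cardinality of A ∪ B (max operator)
    return intersection / union if union > 0 else 0.0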
The paper also provides a solution for merging similar fuzzy sets. For this, it uses a parametric trapezoidal representation of the fuzzy sets, each set being described by four parameters:

μ(x; a_1, a_2, a_3, a_4) = max( 0, min( (x − a_1)/(a_2 − a_1), 1, (a_4 − x)/(a_4 − a_3) ) )    (4.10)

The merging of two similar fuzzy sets, A and B, defined by μ_A(x; a_1, a_2, a_3, a_4) and μ_B(x; b_1, b_2, b_3, b_4), is defined as a new fuzzy set, C, described by μ_C(x; c_1, c_2, c_3, c_4), where:

c_1 = min(a_1, b_1)
c_2 = λ_2 · a_2 + (1 − λ_2) · b_2
c_3 = λ_3 · a_3 + (1 − λ_3) · b_3
c_4 = max(a_4, b_4)
(4.11)

In the definition of the C fuzzy set, λ_2 and λ_3 are between 0 and 1 and determine which fuzzy set, A or B, has more influence on the newly generated set C, with a default value of 0.5 for both.
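The merging rule (4.11) translates directly into code (a minimal Python sketch, using the default λ_2 = λ_3 = 0.5):

def merge_trapezoids(a, b, lam2=0.5, lam3=0.5):
    # a, b: (a1, a2, a3, a4) and (b1, b2, b3, b4) of two similar trapezoidal fuzzy sets
    return (min(a[0], b[0]),
            lam2 * a[1] + (1 - lam2) * b[1],
            lam3 * a[2] + (1 - lam3) * b[2],
            max(a[3], b[3]))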

Figure 4-1 - Creating a fuzzy set C to replace two similar sets A and B (from [68])

With the merging solution described above, the authors propose an
algorithm for simplifying the rules in the model. The algorithm performs the
following steps:
- Select the most similar pair of fuzzy sets
- If the similarity score exceeds a certain threshold, then merge the two fuzzy sets and update the rule set
- Repeat until no pair of fuzzy sets exceeds the threshold
- For each rule in the system, compute the similarity with the universal set (U, with μ_U(x) = 1 for all x in X). If the similarity with the universal set exceeds a certain threshold, then remove the rule from the set (it is too close to the universal set)
- Merge the rules with an identical premise part

Figure 4-2 Merging of similar rules (from [68])
Further work in [76] refines the method in [68] by the following steps:
- Reduce the feature set by feature selection
- Apply the method in [68]
- Apply a Genetic Algorithm to improve the accuracy of the
rules. To maintain the interpretability of the rule set, the
genetic algorithm step is restricted to the neighborhood
of the initial rule set

4.4 Rule Generalization
In [1], four molecular descriptors are used (molecular weight, number of H-bond donors and acceptors, and ClogP) to predict biological activity (IC_50). In the paper, we introduced a novel rule generalization algorithm and a rule inference procedure able to improve the rules extracted from a neural network. This section describes the rule generalization algorithm, discusses the results and proposes some directions for further research.
4.4.1 Problem and context
In [1], the IC_50 prediction task uses a FAM-type prediction technique called Fuzzy ARTMAP with Relevance (FAMR).
The Adaptive Resonance Theory (ART), described in detail in [57], is a
special kind of neural network with sequential learning ability. ART's pattern
recognition features are enhanced with fuzzy logic in the Fuzzy ART model,
introduced in [77].
The FAMR is an incremental, neural network-based learning system used for
classification, probability estimation, and function approximation,
introduced in [78]. The FAMR architecture is able to sequentially
accommodate input-output sample pairs. Each such pair may be assigned a
relevance factor, proportional to the importance of that pair during the
learning phase.
FAM networks have the capability to easily expose the learned knowledge
in the form of fuzzy IF/THEN rules; several authors have addressed this issue
for classification tasks, such as [79] , [80]. The final goal in generating such
rules would be to explain, in human-comprehensible form, how the
network arrives at a particular decision, and to provide insight into the
influence of the input features on the target. To the best of our knowledge,
no author has discussed FAM rule extraction for function approximation tasks, such as IC_50 prediction.
Carpenter and Tan, in [79] and [81], were the first to introduce a FAM rule extraction procedure. To reduce the complexity of the fuzzy ARTMAP, a pruning procedure was also introduced. In [1] we adapt Carpenter and Tan's rule extraction method for function approximation tasks with the FAMR.
4.4.2 The rule generalization algorithm
Let O be the set of rules extracted from the FAMR model. In this section,
the quality of the rules in O is analyzed from the perspective of the
confidence (conf) and support (supp) properties described in Section 2.3.1
above.
The rules in O have support between 0.0% and 16.47%, and confidence
between 0.00% and 100.00%. To ensure the quality of the final rule set, we
use a minimum confidence and a minimum support criterion for the output
rules and prune the rules, from the extracted set, which do not meet these
minimum support and confidence criteria.
The set of rules extracted this way has the following characteristics:
- All rules are complete with regard to the input descriptors (the
antecedent of each rule contains, therefore, one predicate for each
descriptor), a consequence of the rule extraction algorithm.

- Certain descriptor fuzzy categories do not appear in any rule.


To further analyze this rule set, we introduce two new measures for the rule
set:
- Coverage: The percentage of training data points which have the following property: there exists at least one rule for which the molecule's descriptors fall within the range of the antecedent (i.e. the percentage of points for which at least one rule is triggered).

- Accuracy: The percentage of training data points which have the following property: there exists at least one rule for which the molecule's descriptors fall within the range of all antecedent predicates and, in addition, the output falls within the range of the consequent (i.e. the percentage of points for which a correct rule is triggered).

Assuming that some rules are too specific to the training set (over fitting),
we attempt to generalize them, by applying a greedy Rule Generalization
Algorithm (RGA). The RGA is applied to each rule in the set.
Rule Generalization Algorithm (RGA). Let a rule R be represented as

R: (X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) → (Y = y)    (4.12)

Relax R by replacing one predicate X_i = x_i with a wild card value, representing any possible state and designated by the (X_i = *) notation. By definition, the newly formed rule has the same or better support, as its antecedent is less restrictive. If the newly formed rule's confidence meets the minimum confidence criterion, then keep it in a pool of candidates. This procedure is applied for all the predicates in the rule, resulting in at most n generalized rules (where n is the number of predicates in the original rule) which have support better than or equal to that of the original rule. If the candidate pool is not empty, replace the original rule with the candidate which maximizes the confidence. The algorithm is applied recursively to the best generalization and it stops when the candidate pool is empty (no better generalization can be found).
The RGA's goal is to relax the rules by trying to improve, at each step, the rule support, without sacrificing accuracy beyond the minimum acceptable confidence level.
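A minimal sketch of the greedy RGA loop is given below (illustrative Python, not the implementation used for the experiments in Chapter 6; a rule is assumed to be a dictionary from predicate name to value, with '*' marking a relaxed predicate, and conf is assumed to be a function that evaluates the confidence of a candidate rule against the training data):

def generalize_rule(rule, consequent, conf, minconf):
    # Greedy RGA: repeatedly relax one predicate to '*' as long as some relaxation
    # keeps the confidence above minconf; among acceptable candidates, keep the best.
    while True:
        candidates = []
        for name, value in rule.items():
            if value == '*':
                continue
            relaxed = dict(rule, **{name: '*'})      # replace one predicate with a wildcard
            c = conf(relaxed, consequent)
            if c >= minconf:
                candidates.append((c, relaxed))
        if not candidates:
            return rule                              # no acceptable generalization is left
        best_c, rule = max(candidates, key=lambda pair: pair[0])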

Figure 4-3 A visual representation of the RGA
Figure 4-3 provides a visual representation of the way the RGA works.
Consider a rule R: (X=High, Y=High) → (Target = t). If, after relaxing the Y=High condition, the new rule R': (X=High, Y=*) → (Target = t) has sufficient accuracy (the support is already guaranteed), then R' becomes a candidate for replacing R.
In the worst case, the number of predicate replacements for each rule is in O(n^2). Any relaxation of a rule increases (or does not change) the support of that rule, while the minimum confidence criterion keeps the confidence of an accepted generalization above the acceptable level.

Example of iteratively applying the RGA: This example is extracted from the original experimental results presented in [1]. Let R be a complete rule in the original O set. As mentioned previously, all rules contain one predicate for each of the four inputs.
The values of each of the descriptors are binned into 5 buckets (B_1-B_5); see Chapter 6 below, presenting the experimental results, for details.

R: (X_1 = B_1, X_2 = B_2, X_3 = B_2, X_4 = B_3) → (Y = Excellent),
with sup(R) = 6.25%, conf(R) = 90.9%    (4.13)

Upon relaxing all the predicates associated with R and evaluating the
confidence and support for the relaxed derivatives, the best derivative is
selected:

R': (X_1 = *, X_2 = B_2, X_3 = B_2, X_4 = B_3) → (Y = Excellent),
with sup(R') = 8.52%, conf(R') = 93.3%    (4.14)

After applying the algorithm one more time to the generalized rule R', we obtain a better generalization:

R'': (X_1 = *, X_2 = B_2, X_3 = *, X_4 = B_3) → (Y = Excellent),
with sup(R'') = 13.06%, conf(R'') = 95.65%    (4.15)




4.4.3 Applying the RGA to an apriori-derived set of rules
As described in Section 3.1.1, the most commonly used rule extraction
algorithm, apriori, produces a set of variable-length rules, having the
predicates in the antecedent sorted, usually lexicographically. Certain
apriori derivatives, such as Multiple Support apriori (discussed in 3.1.4) may
use a different sort order, but this order is preserved for all the rules that
are extracted by the algorithm.
This common sort order of the predicates, shared among all the rules in the
rule set, allows for a fast way of applying the Rules Generalization Algorithm
(introduced in the previous section) to apriori-produced rule sets. The
following property justifies the application of RGA to sets of rules
characterized by a shared sort order of the antecedent predicates.
Property 4.1: Consider two rules in a rule set having the same consequent, C, each rule defined by a set of predicates P_i in its antecedent: R_1: (P_1 → C), R_2: (P_2 → C). If P_1 ⊂ P_2 then R_1 is a generalization of R_2, similar to the candidate wildcard rules introduced in the RGA.
Rationale: if P_1 is a proper subset of P_2, then P_2 contains at least one predicate C_i: X_i = x_i with C_i ∉ P_1. Each such predicate C_i in the definition of P_2 can be relaxed, resulting in P_2' = {P_1, X_i = *}. By repeating this for each C_i ∈ P_2 with C_i ∉ P_1, a relaxation of P_2 is obtained which is identical with P_1.
Based on Property 4.1, we propose an algorithm for simplifying apriori-like rule sets. The algorithm traverses the set of lexicographically sorted rules, maintaining a stack of rule antecedents encountered during the scan. If a rule matches one of the stacked prefixes, we check whether the rule can be generalized by one of the previous rules.
The algorithm is presented below:

Parameters:
    T - a set of rules sharing the predicate order in the antecedent
Output:
    T' - a set of generalized rules
Initialization:
    Sort the rules by consequent (resulting in subgroups G_i ⊆ T, where
    all rules in one such G_i share the consequent)

For each group G_i
    Reset the prefix stack S
    For each rule R ∈ G_i (as all rules in G_i share the consequent, R
    can be identified with its antecedent)
        While S ≠ ∅ (traverse the stack)
            If S.top ⊆ R then
                If S.top's confidence is satisfactory then
                    S.top is a generalization of R (and R can be dismissed)
                End if
                Exit the while loop // a matching prefix was found
            Else Pop(S) // the stacked prefix does not match, remove it
        End while // stack traversal is complete

        If R has not been dismissed then
            Copy R to T'
            Push R onto stack S
        End if
    End for each
End for each

For a simple example, consider a trivial rule set consisting of three rules, as
below:

R_1: X_1 = a → Y = Excellent
R_2: X_1 = a AND X_2 = b → Y = Excellent
R_3: X_1 = c → Y = Excellent
(4.16)

Rule R_1 is the first rule being read. The stack is empty, so the rule cannot be dismissed by a previous generalization. After processing R_1, it is added both to the stack and to the output set T'.
When rule R_2 is being read, the top of the stack contains the antecedent of R_1, X_1 = a, which is included in the antecedent of R_2. R_1 is, therefore, a candidate generalization of R_2 and R_2 may be dismissed.
When rule R_3 is being read, the content of the stack does not share the prefix of the rule, so the stack will be emptied.
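A runnable version of the whole procedure is sketched below (illustrative Python; rules are assumed to be given as (antecedent, consequent, confidence) triples, with the antecedent a tuple of predicates already in the shared sort order, and the confidence test reduced to a fixed threshold):

from collections import defaultdict

def simplify_rules(rules, minconf):
    groups = defaultdict(list)                          # group the rules by consequent
    for antecedent, consequent, confidence in rules:
        groups[consequent].append((tuple(antecedent), confidence))

    output = []
    for consequent, group in groups.items():
        stack = []                                      # antecedents kept so far (prefixes)
        for antecedent, confidence in sorted(group):    # lexicographic order of the antecedents
            dismissed = False
            while stack:
                top_antecedent, top_confidence = stack[-1]
                if antecedent[:len(top_antecedent)] == top_antecedent:
                    if top_confidence >= minconf:
                        dismissed = True                # a previous rule generalizes this one
                    break
                stack.pop()                             # the stacked prefix does not match
            if not dismissed:
                output.append((antecedent, consequent, confidence))
                stack.append((antecedent, confidence))
    return output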

As shown in Chapter 6 below, presenting the experimental results, this algorithm produces very significant rule set simplifications. The experiments suggest that the number of rules in a system is reduced, by this algorithm, to 10%-20% of the original size. The complexity of the calculations is relatively small, at most O(n^2) in-memory operations (using the stack), where n is the cardinality of the rule set.
Also, the RGA presented in the previous section needs to estimate the
support and confidence for each generalized rule. This is typically done by
scanning the data set (or by using additional memory in an index structure,
such as an FP-tree). The apriori flavor of the RGA does not require any
additional scans of the data.

Some weaknesses of the algorithm are easy to point out. For example, it is
easy to show that the greedy nature of the algorithm prevents the detection of all possible generalizations of the rule set. Consider a system containing rules like:

R_1: X_1 = a AND X_2 = b → Y = Excellent
R_2: X_2 = b → Y = Excellent
(4.17)

Although R_2 is a generalization of R_1, it will not be detected by the algorithm because it appears in lexicographic order after R_1.

4.5 Conclusion
We presented some of the recent research work regarding rule system generalization and simplification. Much of this work is related to the space of fuzzy rules.
The rule generalization algorithm introduced in this chapter produced very
promising experimental results, as shown in Chapter 6 below. Some known
weaknesses of the proposed algorithm suggest directions for further
research.
4.5.1 Future directions for the basic rule generalization algorithm
The RGA discussed above currently works by eliminating entire slices of the premise space from the rule antecedents. While this approach produced good results in our experiments, it is probably too coarse. A better solution, although more computationally intensive, may be to check the neighborhood of the initial antecedent and merge those areas which, when added to the antecedent, keep the rule's accuracy above the minimum confidence criterion.

Figure 4-4 A finer grain approach to rule generalization

Figure 4-4 describes such a possible implementation. Consider a rule R: (X=High, Y=High) → (Target = t). The current algorithm relaxes, say, the Y=High condition and produces a new rule R': (X=High, Y=*) → (Target = t), which may not have sufficient accuracy to replace R. Rather than removing Y=High, the algorithm could investigate the vicinities of the original antecedent cell (such as Y=Medium or Y=Very High). The generalization would then result in rules such as:

R': (X=High, Y ∈ {High, Medium, Very High}) → (Target = t)    (4.18)


This consists, in essence, in merging the antecedent part of two rules as
long as they are adjacent, they share the consequent and the resulting rule
does not fall below the minimum confidence threshold.
In the data problem we treated in [1], as well as in many real applications of rule systems, the predicates in the antecedent, as well as in the consequent, represent binning ranges of continuous variables. In this case, for a rule R: (X_i = x_i) → (Y = y_i) we can define a function p: (X_i = x_i) → [0, 1] which describes the probability density of the Y = y_i predicate over the X_i = x_i area of the space. The rule accuracy can then be thought of as a ratio between the integral of this probability density function p and the integral of a constant function u = 1 defined on the same area, (X_i = x_i). The accuracy of the rule R can be thought of as:

acc(R) = ∫_{X_i = x_i} p(x) dx / ∫_{X_i = x_i} 1 dx    (4.19)



Figure 4-5 Accuracy of a fuzzy rule as a measure of similarity with the universal set
Let us consider that p, the probability density function, is the membership function of a fuzzy set. In this interpretation, and using the similarity measure introduced by Setnes in [68] and discussed in Section 4.3 above, the confidence of the rule becomes the similarity measure between the fuzzy set defined by (p, X = x_i) and the universal set.
It may be interesting to investigate whether this idea might be converted in
the space of fuzzy rules, as a way of merging adjacent fuzzy sets that serve
as premises for Takagi-Sugeno rules with similar consequents, as suggested
in Figure 4-5.

From an implementation perspective, it is interesting to notice that the
algorithm allows block evaluation of multiple measurements. In a typical
relational database, all the neighbors of the premise space could be
evaluated in a single pass over data using GROUP BY relational algebra
constructs. This will likely produce significant performance gains. Recent
developments in the space of in-memory database systems (such as [82],
[83] ) may be useful in addressing the cost of computing the accuracy and
support while relaxing predicates.
4.5.2 Further work for the apriori specialization of the RGA
The reduction in the number of rules, as presented by the experimental
results, is significant. This reduction makes the rule set more accessible
and easier to interpret. Additional work is required, though, to estimate the
predictive power of the reduced rule set and to measure the accuracy
tradeoff that is being introduced by this rule set simplification technique.
As mentioned in Section 4.4.3, the greedy nature of the algorithm prevents the detection of all possible generalizations of the rule set. A different direction for further work is investigating whether a more complex data structure, possibly combined with a new sort order which takes into account the antecedent's length before the lexicographic order, may address this issue.
More work is also needed to study the possibility of applying the rule
generalization algorithm to the area of multiple-level association rules
described in [84] (and also in Section 2.3.2 above).
5 Measuring the Usage Prediction Accuracy of
Recommendation Systems

Recommendation systems are some of the most popular applications for
data mining technologies. They are generally employed to use opinions of a
community of users in order to identify content of interest for other users.
Commercial implementations, such as Amazon's, described in [85], are helping users choose from an overwhelming set of products. The importance of recommendation systems for industry is emphasized by the Netflix prize [86], which attracted 51051 contestants, on 41305 teams from 186 countries (as of June 2011), in trying to build a movie recommendation system that exceeds the performance of Netflix's in-house developed system, Cinematch.
In this chapter, we focus on metrics used for usage prediction accuracy on
offline datasets. The remaining content is structured as follows:
- An introduction to the usage of Association Rules for
recommendation systems
- An overview of the most commonly used instruments and metrics
for evaluating usage prediction accuracy
- A new instrument (diagram) proposed for evaluating usage
prediction accuracy and comparing different recommendation
systems.
- Implementation observations for the aforementioned instrument

5.1 Association Rules as Recommender Systems
Developers of one of the first recommender systems, [87], coined the term Collaborative Filtering (CF) to describe a system which entails people collaborating to help each other perform filtering by recording their reactions to the documents they read. The reactions are called annotations; they can be accessed by other people's filters. The term ended up being used interchangeably with the term recommender system.
This area generated lots of scientific interest and some recent surveys, such
as [88], present in detail the algorithms and techniques being employed in
recommender systems. Item-based collaborative filtering recommendation
algorithms were introduced in [89], where the authors compare such a
system vs. user-based recommender systems. In [90], the authors show that
the Apriori algorithm offers a large improvement in stability and robustness and can achieve recommendation accuracy comparable to other commonly employed methods, such as k-Nearest Neighbor systems.
5.2 Evaluating Recommendation Systems
Recommendation systems may be employed to annotate entities in their
context (such as filtering through structured discussion postings to discover
which may be of interest to a reader [87]) or to find good items, such as
the Netflix prize [86] or the Amazon recommendation engine [85].
From an implementation perspective, some of these systems may predict
item ratings (such as Netflix, [86]), while others are predicting the
probability of usage (e.g. of purchase), such as Amazon's [85]. More complex systems may serve as intelligent advisors, comprehensive tools
which use behavioral science techniques to guide a customer through a
purchase decision process and learn while doing this, as described in [91].
These differences in usage make comparing and evaluating accuracy
systems a difficult task, as such systems are often tuned for specific
problems or datasets. A very thorough analysis of the problem of
evaluating and comparing recommendation systems is presented by J.
Herlocker et al. in [92] and, more recently, by A. Gunawardana in [93]. Both
surveys present the tasks that are commonly accomplished by
recommendation systems, the types of analysis and datasets that can be
used and the ways in which prediction quality can be measured.

Most of the research on evaluating recommendation systems focuses on
the problem of accuracy, under the assumption that a system that provides
more accurate predictions will be preferred by a user or will yield better
results for the commercial system that deploys it. Accuracy measurements
are very different when a system predicts user opinions (such as ratings) or
probabilities of usage (e.g. purchase).
Accuracy evaluations can be completed using offline analysis, controlled live
user experiments [94], or a combination of the two. In offline evaluation,
the algorithm is used to predict certain withheld values from a dataset, and
the results are analyzed using one or more of the metrics that we'll discuss
in the following section. Offline evaluations are inexpensive and quick to
conduct, even on multiple datasets or recommendation systems at the

same time. Datasets including timestamps may be used to replay usage
(ratings and recommendations) scenarios: every time a new rating or usage
decision is made by a user, it is compared with the prediction based on the
prior data about that user.

5.3 Instruments for offline measuring the accuracy of usage
predictions
During offline evaluation, a dataset is typically available, consisting of items
used by each user. A typical test consists in selecting a test user, then hiding
some of the selections and asking the recommendation system to predict a
set of items that the user will use, based on the remaining selections. The
recommended and hidden items may produce 4 different outcomes, as
shown in Table 5-1.

              Recommended              Not Recommended
Used          True Positives (TP)      False Negatives (FN)
Not used      False Positives (FP)     True Negatives (TN)

Table 5-1 Classification of the possible result of a recommendation of an item to a user
The test may be more sophisticated when the items selected by a user are
qualified by time stamps, as is the case for retailers tracking recurrent visits
from customers (e.g. Amazon.com). In that case, a user's items can be
revealed to the recommendation system in the actual chronological order.


5.3.1 Accuracy measurements for a single user
Upon counting the number of items in each cell of Table 5-1, the following quantities can be computed:

Precision = TP / (TP + FP)    (7.1)

Recall = TP / (TP + FN)    (7.2)
Precision and Recall were introduced in [95] as key metrics. These metrics
started being used for evaluation of recommendation systems in 1998 in [96] and later in [97]. Precision represents the probability that a selected item is relevant, while Recall represents the probability that a relevant item will be selected. Relevance is, in the case of recommender systems, a subjective concept, as the test user is the only person who can decide whether a recommendation meets their requirements, and the transaction record is the only information about that user's decision.
Precision and Recall are inversely related, as shown in [95]: while allowing longer recommendation lists typically improves recall, it is likely to reduce the precision. Several approaches have been taken to combine precision and recall into a single metric. One approach is the F1 metric, introduced in [98], then used as a classifier metric in [99] and used for recommendation systems in [97], defined as below:

F1 = 2 · Precision · Recall / (Precision + Recall)    (7.3)


In certain applications, the number of recommendations that can be
presented to a user is predefined. For such applications, the measures of
interest are Precision and Recall at N, where N is the number of presented
recommendations. For other applications, the number is not predefined, or
an optimal value needs to be determined. For the latter, curves can be
computed for metrics for various numbers of recommendations. Such
curves may compare precision to recall, or true positive to false positive rates.
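As a minimal illustration only (a sketch, not tied to any particular product; the function and variable names below are hypothetical), these per-user quantities can be computed directly from a recommendation list and the set of withheld items:

def usage_prediction_metrics(recommended, hidden):
    """Compute Precision, Recall and F1 for a single test user.

    recommended -- items returned by the recommender (e.g. the top N list)
    hidden      -- items withheld from the user's transaction (the relevant items)
    """
    recommended, hidden = set(recommended), set(hidden)
    tp = len(recommended & hidden)      # recommended and actually used
    fp = len(recommended - hidden)      # recommended but not used
    fn = len(hidden - recommended)      # used but not recommended
    precision = tp / (tp + fp) if recommended else 0.0
    recall = tp / (tp + fn) if hidden else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1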
The true positive/false positive curves, also known as ROC curves, are more
commonly used. ROC curves were introduced in 1969 in [100] under the
name of Relative Operating Characteristics but are more commonly
known under the name Receiver Operating Characteristics, which evolved
from their use in signal detection theory (see [101]). An example of an ROC
curve, plotting True Positives against False Positives, is shown in Figure 5-1.
The curve is obtained, for a test user, by sorting the ranked
recommendations in descending order of confidence. Then, for each
predicted item, starting at the origin of the diagram, one of the following
actions is executed:
a) If it is indeed relevant (e.g. used by the user, part of the hidden user
items) then draw the curve one step vertically

b) If the item is not relevant (not part of the hidden items) draw the curve
one step horizontally to the right.

Figure 5-1 Example of ROC Curve (Percent of Relevant Items plotted against Percent of Non-Relevant Items)
A perfect predictive system will generate a ROC curve that goes straight up
until 100% of the relevant items have been encountered, then straight right for the
remaining items. For multiple recommender systems, multiple ROC curves
can be plotted, one for each algorithm. If one curve completely dominates
the others, it is easy to pick the best system. When the curves intersect, the
decision depends on the application requirements. For example, an
application that can only expose a small number of recommendations may
choose the curve that is dominant in the left side of the ROC chart. Hanley
and McNeil, in [101], propose Area under Curve as a measure for comparing
implementations independently of application.
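A minimal sketch of this construction (hypothetical helper functions, assuming the ranked recommendations and the withheld items are available for one test user; the trapezoid summation is only one common way to approximate the area) could be:

def roc_points(ranked_items, hidden):
    """Trace the ROC steps for one test user.

    ranked_items -- recommendations sorted in descending order of confidence
    hidden       -- the withheld (relevant) items for this user
    """
    hidden = set(hidden)
    n_pos = max(len(hidden), 1)
    n_neg = max(len([i for i in ranked_items if i not in hidden]), 1)
    x = y = 0
    points = [(0.0, 0.0)]
    for item in ranked_items:
        if item in hidden:
            y += 1                      # relevant item: one step up
        else:
            x += 1                      # non-relevant item: one step to the right
        points.append((x / n_neg, y / n_pos))
    return points

def area_under_curve(points):
    """Approximate the area under the ROC curve by trapezoid summation."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))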
5.3.2 Accuracy Measurements for Multiple Users
We presented, in the previous section, some of the metrics used to
measure the accuracy of usage predictions for individual test users in offline
experiments. A number of strategies have been developed to aggregate the
results across test populations.
For applications that expose fixed length N recommendation lists, the
average precision and recall can be computed across the test population (at
length N), as shown in [97].
This aggregation approach is used in [102] to introduce an aggregated ROC
curve, computed over multiple users, using the same fixed number of
recommendations, called Customer ROC (CROC).
A special class of applications consists of those where the recommendation
process is more interactive, and users are allowed to obtain more and more
recommendations. Such applications can be modeled, in offline
experiments, when a timestamp is associated with each item ever used by
any test user. A ROC curve can be computed, in such a test, for each user.
The number of recommendations requested for each user depends on the
number of items used, in the test dataset, by each user. Certain
competitions, such as TREC (Text Retrieval Conference) [103], compute
ROC or precision/recall curves in this manner.

5.4 The Itemized Accuracy Curve
The accuracy measurements for recommendation systems described in the
previous section are commonly used in academic competitions or to
evaluate new systems. However, they are not commonly used in data
mining products. While lift charts, classification ROC diagrams and scatter
plots are common for classification and regression algorithms, most
products do not offer a built-in tool for comparing recommendation
systems, such as association rules models.
We propose a new instrument, introduced in [16], for evaluating the quality
of usage prediction on offline datasets. This instrument consists of a family
of curves that can be used to compare the recall for each individual item in
an item catalog for a family of recommendation systems.
The itemized accuracy curve was developed from a product need to present
users with an easy to understand diagram which allows comparing
recommendation systems as easily as the cumulative gain charts allow
comparing classification models.

A top-N recommender is a recommendation system configured to return the
top N most likely items for each input. In an industry setting such a
recommender takes as input information about one user and outputs N
items that the system predicts will be preferred by the user. A simple top-N
recommender is the Most-Frequent N-Item Recommender. It simply returns
the top N items that appear most frequently in a transaction database. A

more sophisticated top-N recommender may be an association rules
engine, which looks at all the user properties specified as input then
extracts those rules that match, in the antecedent, the input and sorts their
consequents by a certain rule measure, such as probability, importance, lift
etc.
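As an illustration only (a minimal sketch, not tied to any particular product), a Most-Frequent N-Item Recommender can be derived directly from the transaction database:

from collections import Counter

class MostFrequentNRecommender:
    """Baseline top-N recommender: always returns the N most frequent items."""

    def __init__(self, transactions, n):
        # Count how often each item appears across all transactions.
        counts = Counter(item for transaction in transactions for item in transaction)
        self.top_n = [item for item, _ in counts.most_common(n)]

    def recommend(self, basket):
        # The baseline ignores the input entirely and returns the same list.
        return self.top_n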
Let I be a set of items and D a set of data points (database transactions). Each transaction t ∈ D is defined as a tuple t = (C_t, I_t), where:
- C_t is an optional set of characteristic properties of the transaction. These characteristic properties may be transaction attributes from a separate data dimension, such as customer demographic attributes or store geographic attributes.
- I_t ⊆ I is the set of items, in the transaction, to be used in testing the top-N recommender.
The process of testing a top-N recommender using the D data set consists in evaluating how well the recommendation system predicts the items that appear in each transaction t ∈ D. Testing for an item i ∈ I_t consists in presenting the recommender with a t_i transaction as input (t_i derives from t but does not include the item i) and then verifying the relationship between the left-out item i and the recommendation (presence, possibly rank etc.). Ways to construct t_i from t are described in Section 5.5.2 below.
By definition, the top-N recommender will produce n recommendations
based on the specified input. Upon analysis of the n recommendations:
- A True Positive prediction is defined as the presence of item i in the
recommendation set. In this case, the recommender correctly
identified the item which was part of the test set
- A False Negative prediction is defined as the absence of item i in the
top N recommendations.


Let μ be a positive metric which describes the usage prediction accuracy and that can be computed for each individual item that is part of the item catalog I. Examples of such metrics include the number of true positives, recall, precision, recall value (defined as recall multiplied by item value) etc. Concepts such as True Positive or False Negative, which may appear in the definition of such a metric, need to be adjusted for the particular case of a top-N recommender (as shown above).

The rationale of the itemized accuracy curve for a top-N recommender is as
follows:
- Compute (over the test set) the accuracy measure μ for each individual item in the item catalog I.
- Aggregate the system's accuracy measure M over the item catalog. Note that the aggregation may be any additive measure, not necessarily a sum.
- Compare the aggregated M measure with minimum and maximum theoretical baseline measures, M_min and M_max.
- Compute two new quantities:
  o Lift = M / M_min
  o AreaUnderCurve = M / M_max

The Lift describes the performance of the current top-N recommender as compared against the minimum acceptable baseline measure (and the improvement on top of M_min). The Area Under Curve describes the performance of the current top-N recommender as compared against the maximum theoretical baseline measure (and the improvement on top of M_max). If the minimum baseline measure is associated with a baseline recommender, then the lift of that recommender is by definition 1, regardless of the value of n.
Similarly, the Area Under Curve metric is less than or equal to 1 (as it represents the ratio to the theoretical maximum value of the accuracy measure) and, if the maximum baseline measure is associated with a recommender, then that recommender's Area Under Curve is by definition 1, regardless of the value of n.
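As a purely hypothetical worked example: if the evaluated recommender accumulates M = 600 true positives over a test set on which the MFnR baseline reaches M_min = 400 and the ideal recommender would reach M_max = 1000, then Lift = 600 / 400 = 1.5 and Area Under Curve = 600 / 1000 = 0.6.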
Note that the Area Under Curve aggregation is not the homonymous metric
associated with ROC curves, although it shares some of its properties, such as being upper-bounded by 1 or associating a value of 1 with an ideal model.

For practical purposes, the minimum theoretical baseline measure need not be worse than the measure yielded by the Most-Frequent n-Item
Recommender (MFnR). A few reasons for using the MFnR as a minimum
include:
- It is practically a zero-cost recommendation system, in terms of
implementation costs

- It is commonly used in industry if a more sophisticated
recommendation system is not available (e.g. "Would you like fries with that?" in any fast food restaurant)
For the aforementioned reasons, we will use MFnR as the minimum
recommender in the rest of this chapter. An interesting property of the
MFnR recommender is that its accuracy grows with n.

Lemma 5.1 The number of True Positives of the MFnR recommender grows,
and the number of False Negatives decreases, with the value of the n
parameter, until n reaches the cardinality of the itemset.
Rationale:
Let X = |I| be the cardinality of the items catalog. For any given n < X, the following properties derive from the definition of True Positives and False Negatives:

TP_MFnR(n) ≤ TP_MFnR(n + 1)    (7.4)
FN_MFnR(n) ≥ FN_MFnR(n + 1)    (7.5)

In fact, when n reaches X, False Negatives becomes 0, True Positives reaches its maximum and the MFnR becomes an optimal recommender with regard to the True Positive and False Negative measures.


The itemized accuracy curve is obtained by plotting the accuracy measure μ for each individual item in the item catalog I (as ordinate). The sort order of the items on the abscissa improves the clarity of the diagram. For example, sorting the items in I, the item catalog, in descending order of the μ_max metric (as computed for the maximum theoretical baseline measure) may give a good intuitive perspective on the performance of the recommender being analyzed.

5.4.1 A visual interpretation of the itemized accuracy curve


Figure 5-2 Itemized Accuracy Curve for a top-N recommender

Figure 5-2 presents such an itemized accuracy curve. The upper line represents the μ_max metric (as computed for the maximum theoretical baseline measure) for each item, while the lower line is the μ metric, computed for the top-N recommender being evaluated.
The aggregations of the μ metrics are equivalent to integrating the measure over the item catalog I. Therefore, the aggregated measures of Lift and Area Under Curve can be defined as below:

Lift = ∫_I μ / ∫_I μ_min ,    AreaUnderCurve = ∫_I μ / ∫_I μ_max

Both aggregations become, therefore, ratios between areas under curve for the graphs defined by the μ metrics for different recommenders.
5.4.2 Impact of the N parameter on the Lift and Area Under Curve
measures
An interesting aspect of the Lift and Area Under Curve metrics is that they
allow comparing different values of N, the number of recommendations
being produced by the recommendations system. In an e-commerce
implementation of the recommendation system, the number of
recommendations presented on the screen must be a trade-off between
the potential value of the recommendations and that of other page
elements (such as advertisements) which may compete for the same page
real estate as the recommendations. It is, therefore, useful to analyze the
value (in terms of Lift and Area Under Curve) of various values for N, the
number of recommendations being presented.


Figure 5-3 Evolution of Lift and Area Under Curve for different values of N
Figure 5-3 presents the evolution of the Lift and Area Under Curve
measures for a top-N recommender as the value of N changes from 1 to
100.
The horizontal line at the ordinate 1 is the minimum baseline Lift,
associated with the MFnR minimum baseline. The upper line (on top of the
baseline lift) presents the lift yielded by the top-N recommender. As shown

previously, in Lemma 5.1, the MFnR recommender's accuracy grows, so the
lift of the top-N recommender decreases with the growth of N.
The lines in the lower part of the diagram represent the evolution of the
Area Under Curve measure with the growth of the N parameter. The Area
Under Curve of an ideal recommender is by definition 1, while the Area
Under Curve values associated with the MFnR recommender as well as the
top-N recommender being evaluated are growing to reach 1, in the worst
case when N reaches the cardinality of the itemset.


5.5 An Implementation for the Itemized Accuracy Curve
5.5.1 Accuracy measures
We found the number of True Positives (and certain derivatives) to be a
convenient measure for the accuracy measure for each individual item in
the item catalog I. It is a simple additive measure, which can be summed up
across the transaction space as well as across the item space.
As exemplified previously, we consider an ideal predictor as the source for the M_max aggregation, therefore a predictor that produces zero False Negatives. The difference between M and M_max is, therefore, the number of False Negatives produced by the recommendation system being assessed.

A consequence of this choice is that the Area under Curve aggregated measure is exactly the recall associated with the recommendation system:

AreaUnderCurve = M / M_max = Σ TP(i) / Σ (TP(i) + FN(i)) = Recall

A related additive measure that can be used for μ is the catalog value associated with an item, μ(i) = Value(i) · TP(i). This allows for a more flexible estimation of the value proposed by the recommender.
5.5.2 Real data test strategies
The test dataset D consists of transactions t ∈ D defined as tuples t = (C_t, I_t), where C_t are the transaction-specific properties while I_t is the set of items known to be included in the transaction and which should be tested against the real recommendations. Testing for an item i ∈ I_t consists in presenting the recommender with a t_i transaction as input. t_i derives from t but does not include the item i. Two different ways to construct t_i from t are described below.

The simplest strategy is to treat each transaction as a bag of items. In that case, the test for that respective transaction is performed by successively leaving each item i ∈ I_t out, and requesting a recommendation from the target system.
A more elaborate strategy may take into account a timestamp associated
with the moment when an item has been added to a transaction. In this case, a possible strategy is to create the test input t_i by including, besides the characteristic transaction properties, only the items that appeared, in the transaction, before item i (chronologically). This approach may be more realistic for certain e-commerce scenarios.
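A small sketch of this chronological replay (hypothetical names; it assumes each item of a transaction carries a timestamp) might look as follows:

def chronological_test_inputs(characteristics, timed_items):
    """Yield (test_input, held_out_item) pairs for one time-stamped transaction.

    timed_items -- list of (timestamp, item) pairs belonging to the transaction
    """
    ordered = [item for _, item in sorted(timed_items)]
    for k, held_out in enumerate(ordered):
        # Only the items that appeared before the held-out item are revealed.
        yield (characteristics, ordered[:k]), held_out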


5.5.3 The algorithm for constructing Itemized Accuracy Curve

The algorithm, presented below, uses a test population to compute counts
of true positives and false negative recommendations. The number of
recommendations to be presented to a user is an algorithm parameter.
The algorithm collects the number of occurrences and True Positive
recommendations for each item in the catalog in two item-indexed
structures, GlobalCounts and TruePositives.
When the iteration is complete, the metrics of interest can be computed as:
- M: the sum of the True Positives counts
- M_max: the sum of the GlobalCounts values
- M_min: the sum of those GlobalCounts values with indices in the top N most popular items
Note that a frequency table for the most popular items can be computed in
the same iteration. This algorithm does not compute the frequency table as

real world database systems may have more efficient ways of returning the
top N most popular values in a table column.
Parameters:
    n - number of recommendations to be presented
    D - test set of transactions

Initialization:
    Initialize GlobalCounts, TruePositives - item-indexed
    vectors of counts, initialized to 0

for each transaction T_x = (C_x, I_x) in the test dataset D
    for each item i in I_x
        increment GlobalCounts[i]
        Let T_xi = (C_x, I_x \ {i})
        Let R_n = TopRecommendations(n, T_xi)
        if i ∈ R_n then
            increment TruePositives[i]

IterationEnd: compute the aggregated metrics

The algorithm traverses the space of test transactions and executes one recommendation request for each item to be tested. The complexity of the algorithm is, therefore, O(|D| · Avg(|I_t|)), where |D| is the number of transactions in the test dataset and Avg(|I_t|) is the average number of test items in a transaction. Naturally, the execution time depends on the recommendation system's implementation.
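A compact Python transcription of the procedure above (a sketch only; the top_recommendations callable and the data layout are assumptions, not part of the original implementation) could be:

from collections import defaultdict

def itemized_accuracy_counts(test_transactions, top_recommendations, n):
    """Collect per-item occurrence and True Positive counts over a test set.

    test_transactions   -- iterable of (characteristics, items) tuples
    top_recommendations -- callable returning the top-n items for a partial transaction
    n                   -- number of recommendations requested per test case
    """
    global_counts = defaultdict(int)
    true_positives = defaultdict(int)
    for characteristics, items in test_transactions:
        for i in items:
            global_counts[i] += 1
            # Leave item i out and ask the recommender to predict it back.
            partial = (characteristics, [x for x in items if x != i])
            if i in top_recommendations(n, partial):
                true_positives[i] += 1
    return global_counts, true_positives

def aggregated_measures(global_counts, true_positives, top_n_popular):
    """Derive M, M_min, M_max and return (Lift, Area Under Curve)."""
    m = sum(true_positives.values())
    m_max = sum(global_counts.values())
    m_min = sum(global_counts[i] for i in top_n_popular)
    return m / m_min, m / m_max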

5.6 Conclusions and further work

The Itemized Accuracy Curve provides an intuitive way to compare
recommendation systems. It can be used with count or profit oriented

measures and it can provide very specific information about the behavior of
a recommendation system for each specific item in a product catalog.
Combined with a taxonomy of the items, such as an OLAP product
dimension and a hierarchy on top of that dimension, the Itemized Accuracy
Curve can be used to select specific recommendation systems for areas of
the product catalog.

Figure 5-4 Aggregated Itemized Accuracy Curve based on the Movie Recommendations dataset (for N=5 recommendations), comparing the Ideal model, the MFnR baseline, MA-apriori_p40 and MA_Trees_2048

Figure 5-4 presents aggregated accuracy results across the Category
attribute of the Movies Recommendations dataset.

The Itemized Accuracy Curve, however, does not take into account the
ranking of an item in the recommendation list. Investigating accuracy
measures that can be used with the Itemized Accuracy Curve in conjunction
with the ranking of items may provide more value.
Another direction of further research is integrating in the algorithm for
computing the itemized accuracy diagram the evaluation of other
performance characteristics of recommendation systems, such as:
- the degree to which a recommendation system covers the entire set
of items (see [104]),
- the computing time,
- the novelty of recommendations
- the robustness of recommendations (as defined in [105])

6 Experimental Results
We present here some of the experimental results discussed in the previous
chapters. The first section describes the datasets being used for each
experiment.
6.1 Datasets used in this material
6.1.1 IC50 prediction dataset
The IC50 prediction dataset contains most of the data and experimental results mentioned in Section 4.4 on rule generalization. The data as well as some results are available at:
http://www.bogdancrivat.net/FAMR/FAMR_Rules.zip .
The package consists of multiple files:
- trainMols.txt - 176 molecule descriptors and their associated IC50, used as training set.
- trainMols_Discretized.csv - the above mentioned molecules, discretized
- testMols.txt - 20 test molecules and their associated IC50
- testMols_Discretized.csv - the test set, discretized
- LatestRules.txt - the result of the FAM rules extraction process (used as input in the Rule Generalization algorithm)

Following are the discretization ranges used in our experiments.
The descriptor range (B_i should be read as bin i):
- B_1 (Low): [0, 0.125)
- B_2 (Low-Medium): [0.125, 0.375)
- B_3 (Medium): [0.375, 0.625)
- B_4 (Medium-High): [0.625, 0.875)
- B_5 (High): [0.875, 1.0]
The IC50 value range:
- Excellent: [0, 20)
- Good: [20, 50)
- OK: [50, 100)
- Mediocre: [100, 500)
- Terrible: [500, MaxValue]
It should be noted that low IC50 is optimal.
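For reference, a small helper reflecting these ranges (a sketch only, assuming descriptor values are already normalized to [0, 1]) could be:

def descriptor_bin(value):
    """Map a normalized molecular descriptor value in [0, 1] to its bin label."""
    thresholds = [(0.125, "Low"), (0.375, "Low-Medium"),
                  (0.625, "Medium"), (0.875, "Medium-High")]
    for upper, label in thresholds:
        if value < upper:
            return label
    return "High"          # [0.875, 1.0]

def ic50_bin(value):
    """Map an IC50 value to its qualitative label (lower IC50 is better)."""
    thresholds = [(20, "Excellent"), (50, "Good"), (100, "OK"), (500, "Mediocre")]
    for upper, label in thresholds:
        if value < upper:
            return label
    return "Terrible"      # [500, MaxValue]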

6.1.2 Movies Recommendation
The Movie Recommendation dataset was collected for [2] and is available
with the book or as Chapter 11 downloads at
http://www.wiley.com/WileyCDA/WileyTitle/productCd-
0470277742,descCd-DOWNLOAD.html
The Movies Recommendation dataset consists of 3200 responses to a
survey collecting movie (2707 movies), director (508 directors) and actor

(1192 actors) preferences. On average, each response contains 15 movies,
with a minimum of 1, a maximum of 106 and a standard deviation of ~20.
The dataset also contains demographic data about customers participating
in the survey.
6.1.3 Movie Lens
The Movie Lens dataset is an older recommendation data set, used first in
[106]. This dataset is publicly available at
http://www.grouplens.org/node/73 . The subset containing 1 million ratings
for 3900 movies by 6040 users has been used for experiments. On average,
each user has rated 165 movies, with a minimum of 20, a maximum of 2314
and a standard deviation of ~192.
6.1.4 Iris
A commonly used dataset, Iris is available as part of the Weka suite or
accessible on the web at http://archive.ics.uci.edu/ml/datasets/Iris.
Four continuous attributes (petal length, petal width, sepal length, sepal
width) are used to predict one of 3 classes of flowers (setosa, virginica or
versicolor). There are 50 data points in each class.


6.2 Experimental results for the Rule Generalization
algorithm
6.2.1 Rule set and results used in Section 4.4 (on rules
generalization)
The IC50 prediction dataset described above has been used for this experiment. All rules are described in the form:
(Molecular Weight, hDonors, hAcceptors, ClogP) → IC50.
From the trained FAMR we obtain the following set of rules:
- O1: (Low-Medium, Low, Low, Low-Medium) → Terrible
- O2: (Low-Medium, Low, Low, Medium) → Mediocre
- O3: (Low-Medium, Low, Low-Medium, Medium-High) → Excellent
- O4: (Low-Medium, Low-Medium, Low-Medium, Low-Medium) → OK
- O5: (Low-Medium, Low-Medium, Low-Medium, Medium) → Excellent
- O6: (Low-Medium, Medium, Low-Medium, Medium) → Excellent
- O7: (Medium, Low-Medium, Medium, Medium) → Excellent
- O8: (Medium, Low-Medium, Medium, Medium-High) → Excellent
- O9: (Medium, Low-Medium, Medium, High) → Excellent
- O10: (Medium-High, Low-Medium, Medium, High) → Excellent
- O11: (Medium-High, Medium, Medium-High, Medium-High) → Mediocre
- O12: (Medium-High, Medium-High, Medium-High, Medium) → Terrible
- O13: (Medium-High, High, Medium-High, Medium) → Mediocre
Rules {O1, . . . , O13} have support between 0.0% and 16.47%, and confidence between 0.00% and 100.00%. In order to remove irrelevant rules (pruning), we introduce a minimum confidence criterion of 25% and a minimum support criterion of 2.5%. Rule O3 does not meet these criteria and was removed from the set.

After applying the algorithm described in 4.4.2 above to the pruned rule set {O1, . . . , O13} \ {O3}, the following generalized rules are obtained:
- G1: (*, Low-Medium, *, *) → Excellent
- G2: (*, Medium, Low-Medium, *) → Excellent
- G3: (Medium, *, Medium, *) → Excellent
- G4: (*, Low, Low, Medium) → Mediocre
- G5: (*, Medium-High, Medium-High, *) → Terrible
As certain descriptor values do not appear in any rule, simple one-predicate rules were produced to cover those slices of the descriptor space. Only one such rule is produced for this dataset (after pruning those which do not meet the minimum confidence and support criteria):
- I1: (Low, *, *, *) → Terrible
The combined rule set {G1, . . . , G5} ∪ {I1} is our end result. Finally, we
compared our FAMR rule extractor to the FNN [6][8] and to the following
standard decision tree implementations:
- CART (WEKA implementation - simpleCart) trees [107]
- Microsoft SQL Server 2008 Decision Trees [2]

For the decision trees, rules were extracted from each non-root node.
Naturally, the decision-tree derived rules have 100% coverage. The
complete comparison results are presented in Table 6-1.

Method/rule set                 Training Set  Training Set  Test Set  Test Set
                                Coverage      Accuracy      Coverage  Accuracy
FAMR: {O1, . . . , O13}         57.39%        36.93%        20%       20%
FAMR: {G1, . . . , G5}          86.36%        65.34%        90%       75%
FAMR: {G1, . . . , G5} ∪ {I1}   88.64%        67.61%        90%       75%
CART                            100%          64.20%        100%      75%
Microsoft Decision Trees        100%          69.32%        100%      80%
Table 6-1 Rules set comparison

The FAMR {G1, . . . , G5} ∪ {I1} rule set has very good coverage and accuracy. For the test set, the {G1, . . . , G5} ∪ {I1} rules have almost the same accuracy as the rule sets derived from classic decision tree systems (the test set consists of 20 molecules, so a difference of 5% translates to one incorrect prediction). This is rather surprising, considering the fact that decision trees are a dedicated tool for rule generation, whereas the FAMR was essentially designed as a primary prediction/classification model.

6.2.2 Results for the apriori post-processing algorithm
We present here some experimental results obtained after applying the
Rules Generalization Algorithm on various datasets:
Dataset                                Apriori params            Initial Rules  Rules after generalization
IC50                                   minconf=60%               135            31
Movie Recommendations                  minconf=60%, minsup=3     18436          1788
  (Demographics, predicting
  Home Ownership)
Movie Recommendations (associative)    minconf=60%, minsup=10    25058          5677
Iris (discretized)                     minconf=60%               208            38

For the movie recommendation dataset, the demographic table has been
used. The apriori algorithm was employed to extract rules predicting home
ownership status from other demographic attributes.
6.3 Experimental results for the Itemized Accuracy Curve
A Windows application has been developed to illustrate and test the
Itemized Accuracy Curve concepts. The application functions as a client for
the Microsoft SQL Server Analysis Services platform, which allows
instantiation of multiple data mining algorithms on the same datasets.
Multiple association rules models were investigated using the IAC client
application. The application uses DMX [108] statements for executing the
recommendation queries.
The UI of the respective application is presented in Figure 6-1. The
application uses the True Positive count as accuracy metric and sorts the
items in the product catalog in descending order of their popularity on the
abscissa. The dominant curve (red line) is associated with an ideal
recommendation system which produces zero False Negatives (and, hence,
the curve is identical to the popularity curve). The green curve, present in
the left part of the diagram, is associated with the Most Frequent n-Items

Recommender. The other lines are associated with different
recommendation systems. Clicking at any point on the chart surface
presents the item rendered at the specified location on the abscissa
together with the number of True Positives yielded by each of the
recommenders, as in Table 6-2.
Model Correct Recommendations
(Ideal Model) 233
(MFnR) 0
MA_apriori_p20 120
MA_Trees_2048 165
Table 6-2 True Positive counts for the selected item


Figure 6-1 Itemized Accuracy Chart for n=3 (Movie recommendations)


6.3.1 Movie Recommendation Results
We have built four recommendation models, using Microsoft SQL Server:
MA_apriori_p20 and MA_apriori_p40 use the Microsoft Association Rules algorithm, an optimized implementation of the Apriori algorithm, with a minimum rule probability threshold of 0.2 and 0.4, respectively. They both use a minimum support of 10 (meaning approximately 0.3% for this dataset).
MA_Trees_256 and MA_Trees_2048 use the Microsoft Decision Trees
algorithm to build a forest of trees to be used for recommendations. They
build, respectively, 256 (default) and 2048 trees.
Figure 6-2 presents the lift of the 4 models as a function of n, the number of
recommendations:


Figure 6-2 Evolution of Lift for various values of N for test models (Movie Recommendations
dataset)

6.3.2 Movie Lens Results
We have built four recommendation models, using Microsoft SQL Server:
apriori, apriori_min_supp_10 and apriori_min_supp_100 use the Microsoft Association Rules algorithm, with a minimum rule probability threshold of 0.2 and minimum support thresholds of 1000, 10 and 100, respectively.
DecisionTrees uses the Microsoft Decision Trees algorithm to build a forest
of trees to be used for recommendations. It contains 2048 trees.

Figure 6-3 presents the lift of the 4 models as a function of n, the number of
recommendations

Figure 6-3 Evolution of Lift for various values of N for test models (Movie Lens dataset)
It is interesting to notice that the decision tree outperforms the apriori
models and that some of the apriori models actually perform worse than
the Most Frequent n-Item Recommender.



7 Conclusions and directions for further research
The thesis presents a synthesis of recent research in the area of associative
and predictive rules and the post-processing of these rules. The original
contributions are focused on practical improvements of rule systems.
7.1 Conclusions
In Chapter 4, we introduced a novel method for post-processing a set of
rules in order to improve its generalization capability. The method is
developed specifically for rules extracted from a fuzzy ARTMAP incremental
learning system used for classification, hence for rule generated indirectly
(as Fuzzy ARTMAP does not directly produce rules).
We also proposed an algorithm for generalizing rule sets produced by
common rule extraction algorithms, such as apriori. The experimental
results for this algorithm look very promising as they reduce the size of rule
sets by 5-10 times. More work is necessary to fully determine the properties
of this generalization algorithm, as shown in the next section.

In the second part of the thesis, in Chapter 5, we proposed a novel
instrument for evaluating the quality of recommendation systems, in the
context of recent research regarding the accuracy of recommendation
systems. The instrument has been introduced, as a patent, in [16]. The
Itemized Accuracy Curve has certain interesting properties. Among them:

- It provides an intuitive way of comparing different recommendation
systems
- It allows aggregations of the accuracy metrics across item dimensions
7.2 Further Work

The Rules Generalization Algorithm introduced in Chapter 4 works by
eliminating entire slices of the premise space from the rule antecedents.
While this approach produced good results in our experiments, it is
probably too coarse. A better solution, although more computationally
intensive, may be to check the neighborhood of the initial antecedent and
merge those areas which, when added to the antecedent, keep the rule's accuracy above the minimum confidence criteria. Section 4.5.1 suggests a refinement of the algorithm which would result in rules such as:

R: (X = High, Y ∈ {High, Medium, Very High}) → (Target = t)    (4.20)

This consists, in essence, in merging the antecedent part of two rules as
long as they are adjacent, they share the consequent and the resulting rule
does not fall below the minimum confidence threshold.
Section 4.5.1 also describes a direction for research: whether the refinement
may be applied to fuzzy rule sets, as a way of merging adjacent fuzzy sets
that serve as premises for Takagi-Sugeno rules with similar consequents.


From an implementation perspective, it is interesting to notice that the
algorithm allows block evaluation of multiple measurements. In a typical
relational database, all the neighbors of the premise space could be
evaluated in a single pass over data using GROUP BY relational algebra
constructs. This will likely produce significant performance gains. Recent
developments in the space of in-memory database systems (see [82], [83] )
may be useful in addressing the cost of computing the accuracy and support
while relaxing predicates.

In Section 4.4.3 we proposed an algorithm for generalizing the rule sets
produced by algorithms such as apriori, with significant reduction in the
number of rules, as presented by the experimental results. This reduction
makes the rule set more accessible and easier to interpret. Additional work is required, though, to estimate the predictive power of the reduced rule set and to measure the accuracy tradeoff that is being introduced by this rule set simplification technique. The greedy nature of the algorithm prevents detection of all possible generalizations of the rule set. A different direction for further work is investigating whether a more complex data structure, possibly combined with a new sort order which takes into account the antecedent length before the lexicographic order, may address this issue.

More work is also needed to study the possibility of applying the rule
generalization algorithm to the area of multiple-level association rules
described in [84] (and also in section 2.3.2 above).

Chapter 5 introduced the Itemized Accuracy Curve as an intuitive way to
compare recommendation systems. The Itemized Accuracy Curve, however,
does not take into account the ranking of an item in the recommendation
list. Investigating accuracy measures that can be used with the Itemized
Accuracy Curve in conjunction with the ranking of items may provide more
value.
Another direction of further research is integrating the evaluation of other
performance characteristics of recommendation systems, such as the
degree to which a recommendation system covers the entire set of items
(see [104]), the computing time, the novelty of recommendations or its
robustness [105] in the algorithm for computing the itemized accuracy
diagram.



Appendix A: Key Algorithms
Apriori
The following pseudo-code is the main procedure for generating frequent
itemsets (from [2]):
F: result set of all frequent itemsets
F[k]: set of frequent itemsets of size k
C[k]: set of candidate itemsets of size k

SetOfItemsets generateFrequentItemsets(Int minimumSupport) {
    F[1] = {frequent items};
    for (k = 1; |F[k]| > 0; k++) {
        C[k+1] = generateCandidates(k, F[k]);
        for each transaction t in database {
            for each candidate c in C[k+1] {
                if t contains c then c.count++
            }
        } //Scan the dataset.
        for each candidate c in C[k+1] {
            //Select the qualified candidates
            if c.count >= minimumSupport then F[k+1] = F[k+1] U {c}
        }
    }
    //Union all frequent itemsets of different sizes
    while k >= 1 do {
        F = F U F[k];
        k--;
    }
    return F;
}

To generate candidate itemsets C[k+1] from frequent itemsets F[k], you use the following SQL join statement:

Insert into Ck+1
Select x1.a1, x1.a2, ..., x1.ak, x2.ak
From Fk as x1, Fk as x2
Where
    //match the itemset prefixes of size k-1
    x1.a1 = x2.a1 And
    x1.a2 = x2.a2 And
    ...
    x1.ak-1 = x2.ak-1 And
    //avoid duplicates
    x1.ak <> x2.ak
This SQL statement generates candidate itemsets of size k+1 from frequent itemsets of size k sharing a prefix of size k-1. However, it doesn't guarantee that all the subsets of a candidate itemset are frequent itemsets. So, you must prune the candidates containing infrequent subsets by using the following procedure:
Boolean hasInfrequentSubset(Itemset c, SetOfItemsets F) {
    For each (k-1) subset s of c {
        If s not in F then return true;
    }
    return false;
}

The following procedure generates all of the qualified association rules:
For each frequent itemset f,
    generate all the subsets x and their complementary sets y = f - x
    If Support(f)/Support(x) > Minimum_Probability, then
        x => y is a qualified association rule with
        probability = Support(f)/Support(x)
    End If




FP-Growth
As discussed in Section 3.1.2 above, the FP-growth algorithm extracts the
frequent items into a frequent pattern tree (FP-tree), retaining the itemset
association information, then divides the database into a set of conditional
databases, each associated with one frequent item, and mines each such database separately.
An FP-tree is populated in the following steps [38]. A procedure called BuildFrequentItemsList is supposed to exist and scan the transaction space, creating a sorted list of items, in descending order of support. The procedure is also supposed to eliminate infrequent items. The procedure is not part of the implementation as it can often be optimized in a database or platform (e.g. SQL Server Analysis Services). Another procedure, Sort, is
supposed to sort items in a transaction in the order specified in the list
argument.

Procedure FP_Create(TransactionSpace)
    Let Tree = new node
    Tree.item-name = null

    Let L = BuildFrequentItemsList(TransactionSpace)

    Foreach Trans in TransactionSpace
        Let SortedTrans = Sort(Trans, L)
        FP_Insert(Tree, SortedTrans)
    Next
End Procedure


Procedure FP_Insert(Tree, Trans)
    If Trans is empty Then Return
    Let p = First item in Trans
    Let q = Remainder of Trans (excluding p)
    If Tree has a child node N such that N.item-name = p.item-name
    Then
        N.count++
    Else
        Create new node N, child of Tree
        N.item-name = p.item-name
        N.count = 1
    End If
    FP_Insert(N, q)
End Procedure

Mining of an FP-tree is performed by calling FP_Growth(FP_tree, null), implemented as below (as described in [38]):

Procedure FP_Growth(Tree, x)
    If Tree contains a single path P then
        For each combination β of the nodes in the path P
            Generate pattern β ∪ x with
                supp = minimum support of nodes in β
    Else for each a_i in the header of Tree
        Generate pattern β = a_i ∪ x with supp = supp(a_i)
        Construct β's conditional pattern base
        Construct β's conditional FP-tree, Tree_β
        If Tree_β ≠ ∅ then
            call FP_Growth(Tree_β, β)
        End If
    End If
End Procedure



Bibliography
[1] Razvan Andonie, Levente Fabry-Asztalos, Ioan Bogdan Crivat, Sarah
Abdul-Wahid, and Badi Abdul-Wahid, "Fuzzy ARTMAP rule
extraction in computational chemistry," in Proceedings of the
International Joint Conference on Neural Networks (IJCNN),
Atlanta, GA, 2009, pp. 157-163.
[2] Jamie MacLennan, Ioan Bogdan Crivat, and ZhaoHui Tang, Data
Mining with Microsoft SQL Server 2008. Indianapolis, Indiana,
United States of America: Wiley Publishing, Inc., 2009.
[3] Ioan Bogdan Crivat, Paul Sanders, Mosha Pasumansky, Marius
Dumitru, Adrian Dumitrascu, Cristian Petculescu, Akshai
Mirchandani, T.K Anand, Richard Tkachuk, Raman Iyer, Thomas
Conlon, Alexander Berger, Sergei Gringauze, James MacLennan,
and Rong Guan, "Systems and methods of utilizing and expanding
standard protocol," USPTO Patent/Application Nbr. 7689703,
2010.
[4] Ioan B Crivat, Raman Iyer, and C James MacLennan, "Detecting and
displaying exceptions in tabular data," USPTO Patent/Application
Nbr. 7797264, 2010.
[5] Ioan B Crivat, Raman Iyer, and C. James MacLennan, "Dynamically
detecting exceptions based on data changes," USPTO
Patent/Application Nbr. 7797356, 2010.
[6] Ioan B Crivat, Raman Iyer, and James MacLennan, "Partitioning of a
data mining training set," USPTO Patent/Application Nbr.
7756881, 2010.
[7] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Efficient Column
Based Data Encoding for Large Scale Data Storage," USPTO
Patent/Application Nbr. 20100030796 , 2010.
[8] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Explaining changes
in measures thru data mining," USPTO Patent/Application Nbr.
7899776, 2011.
[9] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Random access in
run-length encoded structures," USPTO Patent/Application Nbr.
7952499, 2011.

[10] Ioan B. Crivat, Raman Iyer, C. James MacLennan, Scott Oveson, Rong
Guan, Zhaohui Tang, Pyungchul Kim, and Irina Gorbach,
"Extensible data mining framework ," USPTO Patent/Application
Nbr. 7383234, 2008.
[11] Ioan Bogdan Crivat, Pyungchul Kim, ZhaoHui Tang, James
MacLennan, Raman Iyer, and Irina Gorbach, "Systems and
methods that facilitate data mining," USPTO Patent/Application
Nbr. 7398268, 2008.
[12] Ioan Bogdan Crivat, C. James MacLennan, Yue Liu, and Michael
Moore, "Techniques for Evaluating Recommendation Systems,"
Application USPTO Patent/Application Nbr. 20090319330, 2009.
[13] Ioan, B Crivat, C, James MacLennan, and Raman Iyer, "Goal seeking
using predictive analytics," USPTO Patent/Application Nbr.
7788200, 2010.
[14] Ioan, Bogdan Crivat, Elena, D. Cristofor, and C. James MacLennan,
"Analyzing mining pattern evolutions by comparing labels,
algorithms, or data patterns chosen by a reasoning component ,"
USPTO Patent/Application Nbr. 7636698, 2009.
[15] Ioan, Bogdan Crivat, C., James MacLennan, ZhaoHui Tang, and Raman
S. Iyer, "Unstructured data in a mining model language," Patent
(USPTO) USPTO Patent/Application Nbr. 7593927, 2009.
[16] Ioan Bogdan Crivat, C. James MacLennan, Yue Liu, and Michael
Moore, "Techniques for Evaluating Recommendation Systems,"
Patent Application (USPTO) USPTO Patent/Application Nbr.
20090319330, 2009.
[17] Jeff Davis. (2002, July)Data Mining with Access Queries [Online].
http://www.techrepublic.com/article/data-mining-with-access-
queries/1043734
[18] devexpress. Pivot Table Style Data Mining Control for ASP.NET AJAX
[Online].
http://www.devexpress.com/Products/NET/Controls/ASP/Pivot_
Grid/
[19] Laura W. Murphy. (2010)Testimony Regarding Civil Liberties and
National Security: Stopping the Flow of Power to the Executive
Branch [Online].

http://judiciary.house.gov/hearings/pdf/Murphy101209.pdf
[20] Intel Corporation. (2005)Excerpts from A Conversation with Gordon
Moore: Moore's Law [Online].
ftp://download.intel.com/museum/Moores_Law/Video-
Transcripts/Excepts_A_Conversation_with_Gordon_Moore.pdf
[21] Chip Walter. (2005, July)Kryder's Law [Online].
http://www.scientificamerican.com/article.cfm?id=kryders-law
[22] John Gantz and David Reinsel. (2010, May)The Digital Universe
Decade Are You Ready? [Online]. http://idcdocserv.com/925
[23] Roger, E. Bohn and James, E. Short. (2010, January)How Much
Information? 2009 [Online].
http://hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_200
9.pdf
[24] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth,
"Knowledge Discovery and Data Mining: Towards a Unifying
Framework," in KDD, 1996.
[25] 11 Ants Analytics. www.11antsanalytics.com. [Online].
http://www.11antsanalytics.com/products/default.aspx
[26] Predixion Software. (2011) PredixionSoftware.com. [Online].
https://www.predixionsoftware.com/predixion/Products.aspx
[27] Microsoft Corp. (2008) www.microsoft.com. [Online].
http://www.microsoft.com/sqlserver/2008/en/us/data-mining-
addins.aspx
[28] IBM. ftp://public.dhe.ibm.com. [Online].
ftp://public.dhe.ibm.com/common/ssi/ecm/en/ytw03084usen/Y
TW03084USEN.PDF
[29] Rakesh Agrawal, Tomasz Imielinski, and Arun N Swami, "Mining
association rules between sets of items in large databases," vol.
22, pp. 207-216, 1993, p207-agrawal.pdf.
[30] Jiawei Han and Micheline Kamber, Data Mining Concepts and
Techniques. San Diego, CA, USA: Academic Press, 2001.
[31] Microsoft Corporation. Maximum Capacity Specifications for SQL
Server [Online]. http://msdn.microsoft.com/en-
us/library/ms143432.aspx

[32] Oracle. Logical Database Limits [Online].
http://download.oracle.com/docs/cd/B19306_01/server.102/b1
4237/limits003.htm
[33] Ramakrishnan Srikant and Rakesh Agrawal, "Mining quantitative
association rules in large relational tables," in International
Conference on Management of Data - SIGMOD, vol. 25, 1996, pp.
1-12, srikant96.pdf.
[34] Nikola K. Kasabov, Foundations of Neural Networks, Fuzzy Systems,
and Knowledge Engineering.: Massachusetts Institute of
Technology, 1998.
[35] E.H. Mamdani, "Application of Fuzzy Logic to Approximate Reasoning
Using Linguistic Synthesis," IEEE Transactions on Computers - TC,
vol. 26, no. 12, pp. 1182-1191.
[36] T. Takagi and M Sugeno, "Fuzzy identification of systems and its
applications to modelling and control," IEEE Transactions on
Systems, Man and Cybernetics, no. 15, pp. 116-132, 1985,
http://pisis.unalmed.edu.co/vieja/cursos/s4405/Lecturas/Takagi
%20Sugeno%20Modelling.pdf.
[37] Rakesh Agrawal and Ramakrishnan Srikant, "Fast Algorithms for
Mining Association Rules," in Very Large Databases VLDB, 1994,
http://www.eecs.umich.edu/~jag/eecs584/papers/apriori.pdf.
[38] Jiawei Han, Jian Pei, and Yiwen Yin, "Mining frequent patterns
without candidate generation," in International Conference on
Management of Data - SIGMOD, vol. 29, 2000, pp. 1-12,
dami04_fptree.pdf.
[39] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao, "Mining Frequent
Patterns without Candidate: A Frequent-Pattern Tree Approach,"
Data Mining and Knowledge Discovery, vol. 8, pp. 53-87, 2004,
dami04_fptree.pdf.
[40] Ashok Savasere, Edward Omiecinski, and Shamkant B. Navathe, "An
Efficient Algorithm for Mining Association Rules in Large
Databases," in Very large Databases VLDB, 1995, pp. 432-444,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103.5
437&rep=rep1&type=pdf.
[41] Ramesh, C. Agarwal, Charu C. Aggarwal, and V.V.V. Prasad, "A Tree

Projection Algorithm For Generation of Frequent Itemsets,"
Journal of Parallel and Distributed Computing , 1999.
[42] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal,
"Efficient Mining of Association Rules Using Closed Itemset
Lattices," Information Systems - IS, vol. 24, no. 1, pp. 25-46, 1999,
http://cchen1.csie.ntust.edu.tw:8080/students/2009/Efficient%2
0mining%20of%20association%20rules%20using%20closed%20it
emset%20lattices.pdf.
[43] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal,
"Discovering Frequent Closed Itemsets for Association Rules,"
International Conference on Database Theory - ICDT, pp. 398-416,
1999,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.11
02&rep=rep1&type=pdf.
[44] Mohammed Javeed Zaki and Ching-jiu Hsiao, "CHARM: An Efficient
Algorithm for Closed Itemset Mining," in SIAM International
Conference on Data Mining - SDM, 2002, CHARM.pdf.
[45] Zijian Zheng, Ron Kohavi, and Llew Mason, "Real world performance
of association rule algorithms," in Knowledge Discovery and Data
Mining - KDD, 2001, pp. 401-406, RealWorldPerf01.pdf.
[46] Yun Sing Koh and Nathan Rountree, Rare Association Rule Mining
And Knowledge Discovery - Technologies for Infrequent and
Critical Event Detection. Hershey, PA: Information Science
Reference, 2010.
[47] Bing Liu, Wynne Hsu, and Yiming Ma, "Mining association rules with
multiple minimum supports," in Knowledge Discovery and Data
Mining - KDD, 1999, pp. 337-341.
[48] Hyunyoon Yun, Danshim Ha, Buhyun Hwang, and Keun Ho Ryu,
"Mining association rules on significant rare data using relative
support," Journal of Systems and Software - JSS, vol. 67, no. 3, pp.
181-191, 2003.
[49] Ke Wang, Yu He, and Jiawei Han, "Pushing Support Constraints Into
Association Rules Mining," IEEE Transactions on Knowledge and
Data Engineering : TKDE, pp. 642-658, 2003.
[50] Masakazu Seno and George Karypis, "LPMiner: An Algorithm for

Finding Frequent Itemsets Using Length-Decreasing Support," in
IEEE: International Conference on Data Mining ICDM, 2001.
[51] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D.
Ullman, and C. Yang, "Finding interesting associations without
support pruning," IEEE Transactions on Knowledge and Data
Engineering - TKDE, vol. 13, no. 1, pp. 64-78, 2001,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.72
94&rep=rep1&type=pdf.
[52] Yun Sing Koh and Nathan Rountree, "Finding Sporadic Rules Using
Apriori-Inverse," Lecture Notes in Computer Science, vol.
3518/2005, pp. 153-168, 2005.
[53] L. Szathmary, A. Napoli, P. Valtchev, and Vandceuvre-les-Nancy
LORIA, "Towards Rare Itemset Mining," in IEEE International
Conference on Tools with Artificial Intelligence - ICTAI 2007, 2007,
pp. 305-312, http://hal.archives-
ouvertes.fr/docs/00/18/94/24/PDF/szathmary-ictai07.pdf.
[54] J. R. Quinlan, "Induction of Decision Trees," Machine Learning - ML,
vol. 1, no. 1, pp. 81-106, 1986, InductionOfDT.pdf.
[55] Leo Breiman, Jerome Friedman, Charles J Stone, and R A Olshen,
Classification and Regression Trees.: Chapman & Hall, 1984.
[56] Cristopher M. Bishop, Neural Networks for Pattern Recognition. New
York: Oxford University Press, Inc, 1995.
[57] G.A. Carpenter and S Grossberg, The Handbook of Brain Theory and
Neural Networks, Michael A. Arbib, Ed. Cambridge, MA: MIT
Press, 2003,
http://cns.bu.edu/Profiles/Grossberg/CarGro2003HBTNN2.pdf.
[58] Robert Andrews, Joachim Diederich, and Alan B. Tickle, "Survey and
critique of techniques for extracting rules from trained artificial
neural networks," Knowledge Based Systems - KBS, vol. 8, no. 6,
pp. 373-389, 1995.
[59] Alan B. Tickle, Robert Andrews, Mostefa Golea, and Joachim
Diederich, "The Truth Will Come to Light: Directions and
Challenges in Extracting the Knowledge Embedded Within
Trained Artificial Neural Networks," IEEE TRANSACTIONS ON
NEURAL NETWORKS, vol. 9, no. 6, 1998,

TruthWillComeToLight.pdf.
[60] K Saito and R. Nakano, "Medical diagnosis expert system based on
PDP model," in IEEE International Conference on Neural
Networks, New York, 1988, pp. 1255-1262.
[61] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White, "Multilayer
feedforward networks are universal approximators," Neural
Networks, vol. 2, no. 5, pp. 359-366, 1989.
[62] Bart Kosko, "Fuzzy Systems as Universal Approximators," IEEE
Transactions on Computers - TC, vol. 43, no. 11, pp. 1329-1333,
1994, http://sipi.usc.edu/~kosko/FuzzyUniversalApprox.pdf.
[63] J. J. Buckley, Y. Hayashi, and E. Czogala, "On the equivalence of
neural nets and fuzzy expert systems," Fuzzy Sets and Systems,
vol. 53, no. 2, pp. 129-134, 1993.
[64] J.M. Benitez, J.L. Castro, and I. Requena, "Are artificial neural
networks black boxes?," IEEE Transactions on neural Networks,
pp. 1156 - 1164 , 1997,
http://www.imamu.edu.sa/Scientific_selections/abstracts/Math/
Are%20Artificial%20Neural%20Networks%20Black%20Boxes.pdf.
[65] S. Mitra and Y. Hayashi, "Neuro-fuzzy rule generation: survey in soft
computing framework," IEEE Transactions on Neural Networks,
vol. 11, no. 3, pp. 748-768, 2000.
[66] Razvan Andonie, Levente Fabry-asztalos, Catharine Collar, Sarah
Abdul-wahid, and Nicholas Salim, "Neuro-fuzzy Prediction of
Biological Activity and Rule Extraction for HIV-1 Protease
Inhibitors," in Symposium on Computational Intelligence in
Bioinformatics and Computational Biology - CIBCB, 2005, pp. 113-
120.
[67] J. Chorowski and J. M. Zurada, "Extracting Rules from Neural
Networks as Decision Diagrams," IEEE Transactions on Neural
Networks, vol. PP, no. 99, pp. 1 - 12, 2011,
ExtRulesNNDecisionDiagrams.pdf.
[68] Magne Setnes, Robert Babuska, Uzay Kaymak, and Hans R. van Nauta
Lemke, "Similarity Measures in Fuzzy Rule Base Simplification,"
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, vol.
28, no. 3, June 1998.

[69] M. Mizumoto and H. J. Zimmermann, "Comparison of fuzzy reasoning
methods," Fuzzy Sets and Systems - FSS, vol. 8, no. 3, pp. 253-283,
1982.
[70] László T. Kóczy and Kaoru Hirota, "Approximate reasoning by linear
rule interpolation and general approximation," International
Journal of Approximate Reasoning - IJAR , vol. 9, no. 3, pp. 197-
225, 1993.
[71] I.T. Jolliffe, Principal Component Analysis.: Springer, 2002.
[72] J.W. Sammon, "A Nonlinear Mapping for Data Structure Analysis,"
IEEE Transactions on Computers - TC, vol. C-18, no. 5, pp. 401-
409, 1969,
http://www.mec.ita.br/~rodrigo/Disciplinas/MB213/Sammon196
9.pdf.
[73] Manoranjan Dash and Huan Liu, "Feature Selection for
Classification," Intelligent Data Analysis - IDA, vol. 1, no. 1-4, pp.
131-156, 1997,
http://reference.kfupm.edu.sa/content/f/e/feature_selection_fo
r_classification__39093.pdf.
[74] B.G. Song, R.J., II Marks, S. Oh, P. Arabshahi, T.P. Caudell, and J.J.
Choi, "Adaptive membership function fusion and annihilation in
fuzzy if-then rules," in Second IEEE International Conference on
Fuzzy Systems, vol. 2, 1993, pp. 961 - 967.
[75] N. Xiong and Lothar Litz, "Reduction of fuzzy control rules by means
of premise learning - method and case study," Fuzzy Sets and
Systems - FSS, vol. 132, no. 2, pp. 217-231, 2002,
http://www.sciencedirect.com/science/article/pii/S01650114020
01124.
[76] Johannes A. Roubos, Magne Setnes, and János Abonyi, "Learning
fuzzy classification rules from labeled data," Information Sciences
- ISCI, vol. 150, no. 1-2, pp. 77-93, 2003,
http://sci2s.ugr.es/keel/pdf/specific/articulo/15-E.pdf.
[77] Gail A. Carpenter, Stephen Grossberg, and David B. Rosen, "Fuzzy
ART: Fast stable learning and categorization of analog patterns by
an adaptive resonance system," Neural Networks, vol. 4, no. 6,
pp. 759-771, 1991,

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.23
79&rep=rep1&type=pdf.
[78] R Andonie and L. Sasu, "Fuzzy ARTMAP with input relevances," IEEE
Transactions on Neural Networks, vol. 17, pp. 929-941, 2006.
[79] Gail Carpenter and H. A. Tan, "Rule Extraction: From Neural
Architecture to Symbolic Representation," Connection Science,
vol. 7, no. 1, pp. 3-27, 1995.
[80] S. C. Tan, Chee Peng Lim, and M. V. C. Rao, "A hybrid neural network
model for rule generation and its application to process fault
detection and diagnosis," Engineering Applications of Artificial
Intelligence - EAAI, vol. 20, no. 2, pp. 203-213, 2007.
[81] G. A Carpenter and A.-H. Tan, "Rule Extraction, Fuzzy ARTMAP and
medical databases," in Proceedings of the World Congress on
Neural Networks, Portland, Oregon; Hillsdale, NJ, 1993, pp. 501-
506,
http://digilib.bu.edu/journals/ojs/index.php/trs/article/view/430.
[82] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Efficient Column
Based Data Encoding for Large Scale Data Storage," Patent
Application (USPTO) USPTO Patent/Application Nbr.
20100030796, 2010.
[83] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Random access in
run-length encoded structures," Patent (USPTO) USPTO
Patent/Application Nbr. 7952499, 2011.
[84] Jiawei Han and Yongjian Fu, "Discovery of Multiple-Level Association
Rules from Large Databases," in Very Large Databases - VLDB,
1995, pp. 420-431,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.32
14&rep=rep1&type=pdf.
[85] Greg Linden, B. Smith, and J. York, "Amazon.com recommendations:
item-to-item collaborative filtering," Internet Computing, IEEE ,
vol. 7, no. 1, pp. 76 - 80, January 2003.
[86] Netflix. Netflix Prize. [Online]. http://www.netflixprize.com
[87] David Goldberg, David A. Nichols, Brian M. Oki, and Douglas Terry,
"Using collaborative filtering to weave an information tapestry,"
Communications of the ACM - CACM, vol. 35, no. 12, pp. 61-70,
1992,
http://www.ischool.utexas.edu/~i385d/readings/Goldberg_Using
Collaborative_92.pdf.
[88] Xiaoyuan Su and Taghi M. Khoshgoftaar, "A Survey of Collaborative
Filtering Techniques," Advances in Artificial Intelligence, 2009,
http://www.hindawi.com/journals/aai/2009/421425/.
[89] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl,
"Item-based collaborative filtering recommendation algorithms,"
in World Wide Web Conference Series - WWW, 2001, pp. 285-
295,
http://glaros.dtc.umn.edu/gkhome/fetch/papers/www10_sarwar
.pdf.
[90] Jeff J. Sandvig, Bamshad Mobasher, and Robin D. Burke, "Robustness
of collaborative recommendation based on association rule
mining," in Conference on Recommender Systems - RecSys, 2007,
pp. 105-112,
http://maya.cs.depaul.edu/~mobasher/papers/smb-
recsys07.pdf.
[91] R. Andonie, J.E. Russo, and R. Dean, "Crossing the Rubicon: A Generic
Intelligent Advisor," International Journal of Computers,
Communications & Control, vol. 2, pp. 5-16, 2007,
http://www.cwu.edu/~andonie/MyPapers/Advisor%202005.pdf.
[92] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John
T. Riedl, "Evaluating collaborative filtering recommender
systems," ACM Transactions on Information Systems - TOIS, vol.
22, no. 1, pp. 5-53, 2004,
http://web.engr.oregonstate.edu/~herlock/papers/tois2004.pdf.
[93] Asela Gunawardana and Guy Shani, "A Survey of Accuracy Evaluation
Metrics of Recommendation Tasks," Journal of Machine Learning
Research - JMLR, vol. 10, pp. 2935-2962, 2009,
http://research.microsoft.com/pubs/118124/gunawardana09a.p
df.
[94] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M.
Henne, "Controlled experiments on the web: survey and practical
guide," Data Mining and Knowledge Discovery, vol. 18, no. 1, pp.
140-181, 2009,
http://www.springerlink.com/content/r28m75k77u145115/fullte
xt.pdf.
[95] Cyril W. Cleverdon and Michael Keen, "Aslib Cranfield research
project - Factors determining the performance of indexing
systems; Volume 2, Test results," 1966.
[96] Daniel Billsus and Michael J. Pazzani, "Learning Collaborative
Information Filters," in International Conference on Machine
Learning - ICML, 1998, pp. 46-54,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.47
81&rep=rep1&type=pdf.
[97] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl,
"Analysis of recommendation algorithms for e-commerce," in
ACM Conference on Electronic Commerce - EC, 2000, pp. 158-167.
[98] C. J. Van Rijsbergen, Information Retrieval. Butterworth-Heinemann,
1979.
[99] Yiming Yang and Xin Liu, "A re-examination of text categorization
methods," in Research and Development in Information Retrieval
- SIGIR, 1999,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.95
19&rep=rep1&type=pdf.
[100] John A. Swets, "Effectiveness of Information Retrieval Methods,"
1969.
[101] James A. Hanley and Barbara J. McNeil, "The Meaning and Use of the
Area under a Receiver Operating Characteristic (ROC) Curve,"
Radiology, vol. 143, no. 1, pp. 29-36, April 1982,
http://www.medicine.mcgill.ca/epidemiology/hanley/software/H
anley_McNeil_Radiology_82.pdf.
[102] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, David M.
Pennock, and David Ungar, "Methods and metrics for cold-start
recommendations," in Research and Development in Information
Retrieval - SIGIR, 2002, MethodMetricsColdStart.pdf.
[103] Ellen M. Voorhees, "Overview of the TREC 2002 Question Answering
Track," in Text Retrieval Conference - TREC, 2002,
http://trec.nist.gov/pubs/trec11/papers/QA11.pdf.
[104] Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa,
"Effective personalization based on association rule discovery
from web usage data," in Web Information and Data
Management - WIDM, 2001, pp. 9-15.
[105] François Fouss and Marco Saerens, "Evaluating Performance of
Recommender Systems: An Experimental Comparison," in Web
Intelligence - WI, 2008, pp. 735-738.
[106] B. J. Dahlen, J. A. Konstan, J. L. Herlocker, N. Good, A. Borchers, and
J. Riedl, "Jump-starting MovieLens: user benefits of starting a
collaborative filtering system with 'dead data'," 1998.
[107] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques. San Francisco, CA, USA: Morgan
Kaufmann, 2005.
[108] Microsoft Corp. Data Mining Extensions (DMX) Reference. [Online].
http://msdn.microsoft.com/en-us/library/ms132058.aspx
[109] Usama Fayyad, Georges G. Grinstein, and Andreas Wierse,
Information Visualization in Data Mining and Knowledge
Discovery. San Diego, CA, USA: Academic Press, 2002.
[110] D. Bamber, "The area above the ordinal dominance graph and the
area below the receiver operating characteristic graph," Journal
of Mathematical Psychology, vol. 12, pp. 387-415, 1975.
[111] CiteSeerX. (2011). [Online]. http://citeseerx.ist.psu.edu/
[112] Microsoft Academic Search. [Online].
http://academic.research.microsoft.com/
[113] Google Scholar. [Online]. http://scholar.google.com/
[114] Ioan B. Crivat, C. James MacLennan, Raman Iyer, and Dumitru
Marius, "Using a rowset as a query parameter," Patent (USPTO)
USPTO Patent/Application Nbr. 7451137, 2008.
