
Proceedings of the National Conference on Recent Trends in Engineering Sciences 2k11

OLEX: RULE BASED LEARNING FOR TEXT CATEGORIZATION

1 N. Poongavanam, II Year, M.E. Computer Science and Engineering
2 R. Giritharan, M.C.A., M.E., Asst. Prof./CSE
Department of Computer Science & Engineering,
Thiruvalluvar College of Engineering & Technology,
Vandavasi.

ABSTRACT—This paper describes Olex, a novel method for the automatic induction of rule-based text classifiers. Olex supports a hypothesis language of the form "if T1 or ... or Tn occurs in document d, and none of Tn+1, ..., Tn+m occurs in d, then classify d under category c," where each Ti is a conjunction of terms. The proposed method is simple and elegant. Despite this, the results of a systematic experimentation performed on the REUTERS-21578, the OHSUMED, and the ODP data collections show that Olex provides classifiers that are accurate, compact, and comprehensible.

Index Terms—Data mining, text mining, clustering, classification, and association rules, mining methods and
algorithms.

1. INTRODUCTION

A text classifier (or simply "classifier") is a program capable of assigning natural language texts to one or more thematic categories on the basis of their contents. A number of machine learning methods have been applied to this task, including k-nearest neighbors (k-NN), probabilistic Bayesian classifiers, neural networks, and SVMs. Rule-based classifiers provide the desirable property of being interpretable and, thus, easily modifiable based on the user's a priori knowledge. In this paper, we propose Olex, a novel method for the automatic induction of rule-based text classifiers. Here, a classifier is a set of propositional rules, each characterized by one positive literal and (zero or) more negative literals. A positive (respectively, negative) literal is of the form T ∈ d (respectively, ¬(T ∈ d)), where T is a conjunction of terms t1 ∧ ... ∧ tn (a term ti being an n-gram) and d a document. Rule induction is based on a greedy optimization heuristics whereby a set of high-quality rules is generated for the category being learned. Unlike other (either direct or indirect) rule induction algorithms, e.g., Ripper and C4.5, Olex is a one-step process, i.e., it directly mines the final rule set, without the need of any postinduction optimization.

Olex achieves high performance on all data sets and induces classifiers that are compact and comprehensible. The comparative analysis performed against some state-of-the-art learning methods, notably, Naive Bayes (NB), Ripper, C4.5, polynomial SVM, and Linear Logistic Regression (LLR), demonstrates that Olex is more than competitive in terms of both predictive accuracy and efficiency. A previous version of Olex, in which only rules with "simple" terms (i.e., terms of the form t ∈ d and ¬(t ∈ d)) are learned, is described in [8]. In the remainder of this paper, after providing an overview of Olex and giving some preliminary definitions and notation, we state the optimization problem of selecting a best set of discriminating terms (which is the heart of our method) and prove that this task is computationally difficult. We then propose a heuristic approach to solve it and give a description of the whole learning process.

2. OLEX OVERVIEW

Olex is an inductive rule learning method for Text Categorization (TC). Informally, the TC induction problem can be stated as follows. Given:
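To make the hypothesis language concrete, here is a minimal sketch (our illustration, with a hypothetical rule encoding; not the authors' implementation) of how such a rule set classifies a document:

```python
def occurs(coterm, doc_terms):
    # A coterm t1 ^ ... ^ tn occurs in d if every ti occurs in d.
    return all(t in doc_terms for t in coterm)

def classify(rules, doc_terms):
    # rules: list of (positive_coterm, negative_coterms) pairs.
    # A document is assigned to the category if some rule fires.
    for pos, negs in rules:
        if occurs(pos, doc_terms) and not any(occurs(n, doc_terms) for n in negs):
            return True
    return False

# Hypothetical rule: "if 'interest' ^ 'rate' occurs in d, and neither
# 'mortgage' nor 'car' ^ 'loan' occurs in d, classify d under c."
rules = [({"interest", "rate"}, [{"mortgage"}, {"car", "loan"}])]
print(classify(rules, {"interest", "rate", "bank"}))      # True
print(classify(rules, {"interest", "rate", "mortgage"}))  # False
```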


• a background knowledge B as a set of ground logical facts of the form t ∈ d, meaning that term t occurs in document d (other ground predicates may occur in B as well), and
• a set P of positive examples consisting of ground logical facts of the form d ∈ c, meaning that document d belongs to category c (ideal classification); given P, the set N of negative examples consists of the facts d ∈ c that are not in P;
construct a hypothesis Hc (the classifier of c) that, combined with the background knowledge B, is (possibly) consistent with all positive and negative examples, i.e., B ∧ Hc ⊨ P and B ∧ Hc ⊭ N. The induced rules will allow prediction about the belonging of a document to a category on the basis of the presence or absence of some terms in that document.

The above induction problem is essentially an instance of Inductive Logic Programming (ILP), which deals with the general problem of inducing logic programs from examples in the presence of background knowledge. It is well known that ILP problems are computationally intractable, so that a main topic is that of identifying classes of programs that are efficiently learnable. The theory of PAC-learnability [9] provides a model of approximated polynomial learning where the polynomially bound amount of resources (both number of examples and computational time) is traded off against the accuracy of the induced hypothesis. It is well known that, if a learning problem is PAC-learnable, the related consistency problem is in the randomized polynomial complexity class RP [10]. In [9], Valiant identifies a subset of propositional logic which is PAC-learnable. More recently, several authors investigated the identification of PAC-learnable subclasses of first-order Horn clauses, providing a number of both positive and negative results. For instance, in [11], Dzeroski et al. show that a restricted class of function-free clauses, namely k-discriminative nonrecursive ij-determinate predicate definitions, is PAC-learnable. In [19], Kietz and Dzeroski prove that relaxing some syntactic restrictions on the above class of programs implies the loss of PAC-learnability (unless RP = NP). Finally, in [13], Gottlob et al. show that learning a function-free Horn clause with no more than k literals from a set of both positive and negative examples expressed as function-free ground Horn clauses is NP-hard (notably, the respective ILP-consistency problem is Σ2P-complete) and, thus, not PAC-learnable (unless RP = NP). An overview of some important techniques for deriving complexity results has been given by Cohen and Page in [8]. However, while in ILP it is assumed that the input sample is consistent with some hypothesis in the hypothesis space, this is not necessarily true in TC; indeed, the relationship between terms and categories is nondeterministic, i.e., it is not possible, in general, to correctly categorize a document under a category only on the basis of the terms occurring in it. For this reason, the expected induced hypothesis is in general one which maximally satisfies (both positive and negative) examples.

3. NOTATION AND PRELIMINARY DEFINITIONS

Throughout this paper, we assume the existence of 1) a finite set C of categories, called classification scheme; 2) a finite set D of documents (i.e., sequences of words), called corpus; D is partitioned into a training set TS, a validation set, and a test set; the training set along with the validation set represent the so-called seen data, used to induce the model, while the test set represents the unseen data, used to assess the performance of the induced model; and 3) a relation I ⊆ C × D (ideal classification) which assigns each document d ∈ D to a number of categories in C. We denote by TSc the subset of the training set TS whose documents belong to category c according to I (the training set of c). In the following, we will concentrate on a single category c ∈ C. Once a classifier for category c has been constructed, its capability to take the right categorization decision is tested by applying it to the documents of the test set and then comparing the resulting classification to the ideal one. The effectiveness of the predicted classification is measured in terms of the classical notions of Precision, Recall, and F-measure.
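For reference, these measures derive from the per-category contingency counts; a minimal sketch (variable names are ours; the Fα form shown is the standard weighted harmonic mean used in TC, which reduces to F1 at α = 0.5):

```python
def precision_recall_f(tp, fp, fn, alpha=0.5):
    # Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    # F_alpha = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 yields F1.
    f = (p * r) / (alpha * r + (1 - alpha) * p) if p and r else 0.0
    return p, r, f

print(precision_recall_f(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)
```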
4. SELECTION OF DISCRIMINATING TERMS: PROBLEM DEFINITION AND COMPLEXITY

In this section, we provide a description of the optimization problem aimed at generating a best set of discriminating terms (d-terms, for short) for category c ∈ C. In particular, we give a formal statement of the problem and show its complexity. To this end, some preliminary definitions are needed (Table 1).


A term (or n-gram) is a sequence of one or more words, or variants obtained by using word stems, consecutively occurring within a document. A scoring function φ (or feature selection function, often simply "function" hereafter), such as Information Gain and Chi Square (see, e.g., [12] and [33]), assigns to a term t a value φ(t, c) expressing the "goodness" of t w.r.t. category c. Scoring functions are used in TC for dimensionality reduction: noninformative words are removed from documents in order to improve both learning effectiveness and time efficiency.

Definition 1 (vocabulary). Given a scoring function φ and a nonnegative integer v, let Vc(φ,v) denote the set consisting of the v terms t, occurring in the documents of TSc, having the highest value of φ(t,c). The vocabulary V(φ,v), for the given φ and v, is the set consisting of the best v terms, according to φ, of each category c ∈ C.
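Assuming the scores φ(t, c) have already been computed, Definition 1 amounts to a per-category top-v selection; a minimal sketch (function names are ours):

```python
def category_vocabulary(scores_c, v):
    # V_c(phi, v): the v terms with the highest value of phi(t, c).
    return set(sorted(scores_c, key=scores_c.get, reverse=True)[:v])

def vocabulary(scores, v):
    # V(phi, v): the union, over all categories, of the per-category top-v terms.
    return set().union(*(category_vocabulary(s, v) for s in scores.values()))

scores = {"wheat":    {"grain": 9.1, "crop": 7.4, "bank": 0.2},
          "money-fx": {"dollar": 8.8, "rate": 6.5, "grain": 0.1}}
print(vocabulary(scores, v=2))  # {'grain', 'crop', 'dollar', 'rate'}
```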
Definition 2 (coterms). Let us fix a vocabulary V(φ,v). A coterm (conjunctive term) T over V(φ,v) of degree k is a conjunction of terms t1 ∧ ... ∧ tk, with ti ∈ V(φ,v), 1 ≤ i ≤ k. A coterm of degree 1 is a simple term.

Definition 3 (d-terms). Let us fix a vocabulary V(φ,v). A d-term (discriminating term) for c over V(φ,v) is a pair ⟨T, s⟩, where T is a coterm over V(φ,v) and s ∈ {+, −} the sign of T. We will represent ⟨T, s⟩ as T^s. A d-term with sign "+" (respectively, "−") is called a positive (respectively, negative) d-term. We say that 1) T^s occurs in a document d if T occurs in d, 2) T^s has degree k if T has degree k, and 3) T1^s1 and T2^s2 are independent if neither of the coterms T1 and T2 subsumes the other.

TABLE 1
List of the Main Symbols Used in This Paper

Definition 4 (eligible documents). Given a set of (independent) d-terms Xc = {T1+, ..., Tn+, Tn+1−, ..., Tn+m−} for c, we say that a document d ∈ TS is eligible for classification under c according to Xc if any of the positive d-terms T1+, ..., Tn+ occurs in d and none of the negative d-terms Tn+1−, ..., Tn+m− occurs in d.

PROBLEM DT-GEN (d-term generation). Given a vocabulary V(φ,v), find a set Xc of (independent) d-terms over V(φ,v) such that, by classifying under c the set E(Xc) of the documents eligible for classification under c according to Xc, the resulting Fα(Xc) is maximum. Intuitively, a solution of problem DT-GEN is a set of d-terms Xc such that E(Xc) best "fits" the training set TSc. Next, we show that DT-GEN is a computationally difficult problem. To this end, we need the following preliminary result.

Lemma 1. Let A be a set of independent coterms over a given vocabulary V(φ,v). Then, the size of A has no polynomial bound in the size of V(φ,v).

Proposition 1. Problem DT-GEN requires exponential time in the size of the vocabulary V(φ,v).

Proof. Even restricting problem DT-GEN to admit solutions Xc consisting solely of independent d-terms, Xc has no polynomial bound on the size of V(φ,v) (this immediately follows from Lemma 1). Thus, an exponential amount of time is needed to generate a solution. □

Next, we show that DT-GEN remains difficult even if we restrict Xc to consist of d-terms of degree 1.
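The combinatorial blow-up behind Lemma 1 is easy to check numerically: already the number of candidate coterms of degree at most k over v terms, Σ(i=1..k) C(v, i), is exponential when k is allowed to grow with v (a back-of-the-envelope illustration, not the paper's proof):

```python
from math import comb

def num_coterms(v, k):
    # Number of coterms of degree 1..k over a vocabulary of v terms.
    return sum(comb(v, i) for i in range(1, k + 1))

for v in (10, 20, 40):
    print(v, num_coterms(v, 3), num_coterms(v, v // 2))
# v=10: 175 vs 637; v=20: 1350 vs 616665; v=40: 10700 vs ~6.2e11
```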

5. A HEURISTICS FOR DEALING WITH PROBLEM DT-GEN

To deal with the complexity of problem DT-GEN, we devised the greedy heuristics sketched in Fig. 1. The input consists of the vocabulary V(φ,v), along with the parameter α. The do-while loop of lines 4-9 generates the sequence of sets of d-terms X0 = ∅, ..., Xi = GNS(Xi−1), ..., Xm = GNS(Xm−1) by invoking, at each round, function Generate-New-Set (lines p1-p11). The ith call to this function "extends" the current solution Xi−1 by picking (lines p3-p10) the term t ∈ V(φ,v) which, joined to a d-term T^s in the current solution (line p5), yields the greatest increase of the objective function Fα (notice that the current maximum value of Fα is stored into Fmax, a global variable which is modified by Generate-New-Set at line p8). The symbol ⊤ at line p4 represents the logical constant true (taken both positively and negatively). This formal device is needed in order for the algorithm to capture also d-terms of degree 1. The evaluation of Fα is based on (2). Once the best term t has been selected, it is returned to the main program through the global variable topt (which is updated at line p8) and then removed from V(φ,v) (line 8). The do-while loop iterates as long as the vocabulary V(φ,v) is not empty and a new d-term is generated. It is straightforward to realize that the size of Xc is bounded by v, as each round removes one term from V(φ,v).
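Fig. 1 is not reproduced in this copy; the following is our reconstruction of the greedy loop from the prose above (helper names and data layout are hypothetical, and we read "joining t to T^s" as replacing T^s by (T ∧ t)^s):

```python
def greedy_olex(vocab, f_alpha):
    # Reconstruction of Greedy-Olex: vocab is the set V(phi, v) (consumed);
    # f_alpha(X) evaluates the objective F_alpha of a d-term set X.
    TRUE = frozenset()   # the constant "true": joining t to TRUE (with either
                         # sign) yields the degree-1 d-terms t+ and t-.
    x = set()            # current solution X_i: pairs (coterm, sign)
    f_max = f_alpha(x)
    while vocab:
        best = None
        for t in vocab:
            # try joining t to every d-term of x, and to "true" with both signs
            for coterm, sign in x | {(TRUE, "+"), (TRUE, "-")}:
                cand = (x - {(coterm, sign)}) | {(coterm | {t}, sign)}
                f = f_alpha(cand)
                if f > f_max:
                    f_max, best = f, (t, cand)
        if best is None:     # no extension increases F_alpha: stop
            break
        t_opt, x = best
        vocab.remove(t_opt)  # the selected term leaves V(phi, v)
    return x
```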


6. THE LEARNING PROCESS

While algorithm Greedy-Olex is the search of a "best" classifier over the training set, for given values of the input parameters, the learning process is the search of a "best" classifier over the validation set, for all input parameter values. Let Hc(φ,v,α) denote the classifier of c associated to a solution Xc of problem DT-GEN generated by algorithm Greedy-Olex, for given values of φ, v, and α. Now, to learn a "best" classifier Hc of c, Olex essentially induces Hc(φ,v,α) repeatedly, for different input vocabularies, each time validating it over the validation set. Here, the input consists of a number of vocabularies (obtained by setting different values of φ and v), along with a value for α (which depends on the needs of the application at hand; usually, α = 0.5). Whenever a classifier Hc(φ,v,α) is generated by the induction step (lines 4-5), it is validated over the validation set (line 7). Finally, the classifier Hc with the maximum value of the F-measure over the validation set is output (line 9). Hc is assumed to be the "best classifier" of c. Notice that, of the three model parameters φ, v, and α, only the first two are actually used for "driving" the search of Hc.
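Schematically, the learning process is a sweep over input vocabularies with model selection on the validation set (our sketch; `induce` stands for a run of Greedy-Olex with the given parameters):

```python
def learn_best_classifier(phis, vs, alpha, induce, f_on_validation):
    # Induce H_c(phi, v, alpha) for every vocabulary V(phi, v), validate it,
    # and output the classifier with the best validation F-measure.
    best_h, best_f = None, -1.0
    for phi in phis:
        for v in vs:
            h = induce(phi, v, alpha)
            f = f_on_validation(h)
            if f > best_f:
                best_h, best_f = h, f
    return best_h
```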

6.1 Benchmark Corpora


The OHSUMED test collection is a subset of the MEDLINE database, which is a bibliographic database of important, peer-reviewed medical literature maintained by the National Library of Medicine. The subset we considered is the collection consisting of some documents from the research abstracts. The ODP data set consists of millions of Web sites classified according to a set of hierarchically organized categories. For our experimentation, we decided to use a subset of the documents classified under the categories of the Top/Science subtree, which has a first level with 25 categories (hereafter, we will refer to this subset as ODP-S25). We performed our experiments using a flat structure, where all documents below level one were considered as belonging to the respective root category. As for documents, we extracted, from the ODP data set stored in Resource Description Framework (RDF) format, the title and the description (we left out the URL) of each Web page in the catalog. Thus, an ODP document consists of the title of the Web page plus its description.
6.2 Document Preprocessing

Preliminarily, all corpora were subjected to the following preprocessing steps. First, we removed from documents all words occurring in a list of common stop words, as well as punctuation marks and numbers. Then, we generated the stem of each of the remaining words, so that documents were represented as sets of word stems.


Second, we proceeded to the partitioning of the training corpora. As far as REUTERS-21578 and OHSUMED are concerned, we segmented each corpus into five equal-sized partitions for cross validation. During each run, four partitions are used for training and one for validation (note that validation and test sets coincide in this case). Each of the five combinations of one training set and one validation set is a fold. Concerning ODP-S25 (for which the holdout method was used), we segmented the corpus into two partitions: the seen data (70 percent of the corpus documents) and the unseen data (the remaining 30 percent). The former is used to induce the model (according to the learning process of Fig. 2), while the latter is used for testing. The seen data were then randomly split into a training set (70 percent), on which to run algorithm Greedy-Olex, and a validation set, on which to tune the model parameters. We performed both the above splits in such a way that each category was proportionally represented in both sets (stratified holdout). Finally, for every corpus and training set, we scored all terms occurring in the documents of the training set TSc of c, for each c ∈ C.

6.3 Performance Metrics

Classification effectiveness was measured in terms of the classical notions of Precision, Recall, and F-measure, as defined in Section 3. To obtain global estimates relative to experiments performed over a set of categories, the standard definition of microaveraged Precision and Recall was used.
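Microaveraging pools the per-category contingency counts before taking the ratios; a minimal sketch of the standard definition:

```python
def micro_precision_recall(counts):
    # counts: per-category (tp, fp, fn) triples, pooled globally.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

print(micro_precision_recall([(40, 10, 20), (5, 0, 15)]))  # (~0.818, 0.5625)
```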
6.4 Results with REUTERS (Cross Validation)

The first data set we considered is the REUTERS-21578, and the task was to assign documents to one or more categories of R90. As already mentioned, performance evaluation was based on fivefold cross validation. At each fold, we conducted a number of experiments for the induction of the best classifier of each category in R90, according to the algorithm sketched in Fig. 2. In particular, the algorithm was executed with input vocabularies.

6.4.1 Performance

The F-measure and BEP obtained at each of the five folds, and the respective means, are reported. Table 3, in turn, provides a picture of the results for the 10 most frequent categories of R90 (hereafter referred to as R10), averaged over the five folds. In particular, besides F-measure and BEP, for each category we report the average characteristics of the respective best classifiers (i.e., number of rules and number of negative literals occurring in each rule; recall that the sum of these two values equals the number of induced d-terms).

6.4.2 Effect of Category Size on Performance

We partitioned the categories in R90 with more than seven documents into four intervals, based on their size. Then, we evaluated the mean F-measure over the categories of each group, averaged over the five folds. Results are summarized in Table 5. As we can see, the F-measure values indicate that performances are substantially constant on the various subsets, i.e., there is no correlation between category size and predictive accuracy (this is not the case of other machine learning techniques, e.g., decision tree induction classifiers, which are biased toward frequent classes [15]).

6.5 Results with OHSUMED (Cross Validation)

The second data set we considered is OHSUMED, and the task was to assign documents to one or more categories of the 23 MeSH diseases. As for REUTERS-21578, we performed a fivefold cross validation by running, at each fold, Algorithm 2 with input vocabularies V(CHI,v), with v ∈ {10, 20, 30, ..., 100} (also in this case, α was set to 0.5). Performance results are reported in Table 6. As we can see, the average μ-F and μ-BEP are equal to 66.08 and 66.31, respectively. In Table 7, we summarize the performance values for the five most frequent MeSH categories, along with the characteristics of the respective best classifiers, averaged over the five folds. Again, we notice the compactness of classifiers (at most 40 rules for category C23).

TABLE 7
OHSUMED—Cross-Validation Results for the Five Most Frequent MeSH Categories


6.6 Results with ODP-S25 (Holdout Method)

ODP-S25 is a data set one order of magnitude larger than both REUTERS-21578 and OHSUMED. In particular, the most frequent category, "biology," owns 28,286 documents, and there are four categories with more than 10,000 documents. These characteristics make ODP-S25 very suitable for testing both the accuracy and the efficiency of the learning procedure of Olex on large categories. Because of the large size of this data set, experiments were conducted based on the holdout method. In particular, according to the learning procedure of Fig. 2, we learned classifiers over the training set and selected the best ones over the validation set. Then, we assessed their performance over the test set. The learning procedure was run with the same input vocabularies used for the previous two data sets (and α = 0.5).
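A stratified holdout split of the kind described in Section 6.2 can be sketched as follows (our illustration; multilabel documents would additionally need a stratification policy, e.g., by primary category):

```python
import random

def stratified_holdout(docs_by_category, frac=0.7, seed=0):
    # Split each category's documents frac : (1 - frac), so that every
    # category is proportionally represented in both parts.
    rng = random.Random(seed)
    seen, unseen = [], []
    for docs in docs_by_category.values():
        docs = list(docs)
        rng.shuffle(docs)
        cut = int(frac * len(docs))
        seen.extend(docs[:cut])
        unseen.extend(docs[cut:])
    return seen, unseen
```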
7. CONCLUSION

We have presented Olex, a novel approach to the automatic induction of rule-based text classifiers. In the first part of this paper, we provided a formal description of the method. In particular, we described the problem of determining a best set of discriminating terms for a category and proved its intractability. Then, we showed how rules are derived from a given set of discriminating terms. Olex's hypothesis language consists of rules with one positive conjunction of terms and (zero or) more negative ones. Thus, Olex predictions require testing the simultaneous presence of several terms (forming the positive conjunction) along with the simultaneous absence of several sets of terms (each forming a negative conjunction). While there is in the literature a wide range of rule learning algorithms, one contribution of this paper is the form of the hypothesis language. We performed experiments on three data collections, namely, REUTERS-21578, OHSUMED, and ODP. We also conducted a thorough comparative study against NB, polynomial SVM, Ripper, C4.5, and LLR. We found that Olex consistently achieves comparatively high performance, significantly outperforming most of the other approaches. In addition, it showed to be very efficient (by far the fastest method, apart from NB). Results on ODP demonstrate that Olex behaves well also on large data sets (from both the predictive accuracy and efficiency viewpoints).

REFERENCES

[1] D.D. Lewis and P.J. Hayes, "Guest Editors' Introduction to the Special Issue on Text Categorization," ACM Trans. Information Systems.
[2] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys.
[3] W.W. Cohen and Y. Singer, "Context-Sensitive Learning Methods for Text Categorization," ACM Trans. Information Systems.
[4] D.E. Johnson, F.J. Oles, T. Zhang, and T. Goetz, "A Decision-Tree-Based Symbolic Rule Induction System for Text Categorization."
[5] W. Li, J. Han, and J. Pei, "CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules," Proc. First IEEE Int'l Conf. Data Mining (ICDM), 2001.
[6] J.R. Quinlan, "Generating Production Rules from Decision Trees," Proc. 10th Int'l Joint Conf. Artificial Intelligence (IJCAI '87).
[7] C. Apté, F.J. Damerau, and S.M. Weiss, "Automated Learning of Decision Rules for Text Categorization," ACM Trans. Information Systems.
[8] P. Rullo, C. Cumbo, and V.L. Policicchio, "Learning Rules with Negation for Text Categorization," Proc. 22nd Ann. ACM Symp. Applied Computing (SAC '07).
[9] L.G. Valiant, "A Theory of the Learnable," Proc. 16th Ann. ACM Symp. Theory of Computing (STOC '84).
[10] M. Anthony and N. Biggs, Computational Learning Theory. Cambridge Univ. Press.
