
978-1-4799-0661-1/13/$31.00 ©2013 IEEE
Web Page Classification Using Firefly Optimization

Esra Saraç
Çukurova University, Faculty of Engineering and
Architecture, Department of Computer Engineering
Balcalı, Sarıçam, Adana, 01330 Turkey
Selma Ayşe Özel
Çukurova University, Faculty of Engineering and
Architecture, Department of Computer Engineering
Balcalı, Sarıçam, Adana, 01330 Turkey


Abstract- The increase in the amount of information on the Web has
created a need for accurate automated classifiers for Web pages,
both to maintain Web directories and to improve search engine
performance. As every (HTML/XML) tag and every term on
each Web page can be considered a feature, efficient methods
are needed to select the best features and reduce the feature
space of the Web page classification problem. In this study, our
aim is to apply a recent optimization technique, namely the
firefly algorithm (FA), to select the best features for the Web
page classification problem. The FA is a metaheuristic algorithm
inspired by the flashing behavior of fireflies.
We use the FA to select a subset of features, and the J48
classifier of the Weka data mining tool is employed to evaluate
the fitness of the selected features. The WebKB and Conference
datasets were used to evaluate the effectiveness of the proposed
feature selection system. We observed that when a subset of
features is selected by the FA, the WebKB and Conference
datasets are classified without loss of accuracy; moreover, the
time needed to classify new Web pages drops sharply as the
number of features decreases.
Keywords- Firefly Algorithm; Web page classification;
Classification; Feature selection
I. INTRODUCTION
As the popularity of the Web increases, the amount of
information on the Web also increases. This growth has created
a need for accurate and fast classification of Web pages to
improve search engine performance. Automatic Web page
classification is a supervised learning problem in which a set of
labeled Web documents is used to train a classifier, and the
classifier is then employed to assign one or more predefined
category labels to future Web pages [1]. Automatic Web page
classification is used not only to improve search engine
performance; it is also essential to the development of Web
directories, topic-specific Web link analysis, contextual
advertising, analysis of the topical structure of the Web, and
improving the quality of Web search [1]. Several classification
methods, such as decision trees, Bayesian classifiers, support
vector machines, and k-nearest neighbors, have been developed
[2]. Among these methods, decision trees and support vector
machines are suitable for classification problems in which the
number of features is small [2]. The Web page classification
problem, on the other hand, is high dimensional, since each
term in each HTML or XML tag of each Web page can be taken
as a feature.
In this study, we propose a firefly algorithm (FA) based
wrapper technique which finds the best features for Web
pages, to make fast and accurate classification. Firefly
algorithm (FA) is a recent search and optimization technique,
which was first introduced by Xin-She Yang in 2008 [3]. The
primary purpose for a firefly's flash is to act as a signal system
to attract other fireflies. The main rules of the algorithm are as
follows [3]:

• All fireflies are unisexual.
• Attractiveness is proportional to brightness: for any two
  fireflies, the less bright one is attracted by the brighter
  one; however, the brightness can decrease as the distance
  between them increases.
• If no firefly is brighter than a given firefly, it moves
  randomly.

In this study, our aim is to choose the best n features
among the hundreds of features extracted from Web pages, in
order to reduce the feature space of the Web page classification
problem [4, 5] and thereby the time needed to classify new
(unseen) Web pages. To our knowledge, no previous study uses
the firefly algorithm for feature selection from Web pages.
This paper is organized as follows: in the next section, we
give more detail about Web page classification, and
summarize related work on the FA applications. The third
section describes our FA-based feature selection system. The
data sets used in this study and the experimental results are
presented in the fourth section. Finally, the fifth section
concludes the study.
II. RELATED WORK
The Web page classification problem is defined as the problem
of assigning a Web page to one or more predefined category
labels [1]. In this study, our aim is to determine the role of a
Web page, i.e., to decide whether it is a "student home page",
a "course page", or a "department home page". To do so, we
give a single class label (e.g. "course page") to each Web page
and perform binary classification, in which instances are
categorized into exactly one of two classes (e.g. "course page"
or "not course page"). This kind of classification problem
arises especially in the focused crawling systems of vertical
search engines. The solution technique developed in this study
can also be extended to other binary classification problems.
In [6], the FA is used for clustering on benchmark
problems, and its performance is compared with two other
nature-inspired techniques, Artificial Bee Colony (ABC) and
Particle Swarm Optimization (PSO). According to that study,
the average classification error rates are 11.36%, 13.13%, and
15.99% for the FA, ABC, and PSO, respectively.
In [7], a new feature selection approach that combines
Rough Set Theory (RST) with the nature-inspired firefly
algorithm is presented. The algorithm simulates the attraction
system of real fireflies to guide the feature selection
procedure. The reported experimental results indicate that the
proposed algorithm outperforms other feature selection methods
in terms of time and optimality. For the Cleveland Heart
dataset, the number of features was reduced from 13 to 6-7
with PSO and to 3 with the FA.
III. FEATURE SELECTION USING THE FIREFLY ALGORITHM
The Firefly Algorithm (FA) is an optimization technique
developed recently by Xin-She Yang at Cambridge University
[3]. It is inspired by the social behavior of fireflies and the
phenomenon of bioluminescent communication. Fireflies
generate light inside their bodies through a type of chemical
reaction. It is thought that light in adult fireflies was
originally used for warning purposes, but evolved for use in
mate or sexual selection through a variety of ways of
communicating with mates during courtship. Although fireflies
have many mechanisms, the interesting issues are how they
communicate to find food, to protect themselves from
predators, and to reproduce successfully.
In general, the pattern of flashes is unique for a particular
species of fireflies. The flashing light is generated by a
chemical process of bioluminescence. Two fundamental
functions of such flashes are i) to attract mating partners
(communication), and ii) to attract potential prey. Flashing
may also serve as a protective warning mechanism. The light
intensity at a particular distance from the light source follows
the inverse square law; that is, the light intensity decreases as
the distance increases. Furthermore, the air absorbs light,
which becomes weaker and weaker as the distance increases.
These two combined factors make most fireflies visible only to
a limited distance, which is usually good enough for fireflies
to communicate with each other. The flashing light can be
formulated in such a way that it is associated with the
objective function to be optimized, which makes it possible to
formulate new metaheuristic algorithms. The main steps of the
FA are described in Figure 1.








1. Generate initial population of fireflies (x_i)
2. Determine light intensity I_i for each firefly x_i by f(x_i),
   where f(x_i) is the objective function value of x_i
3. Define light absorption coefficient γ
4. While (t < MaxGeneration) do
   4.1 for each firefly x_i do
         for each firefly x_j do
           if (I_j > I_i) then
             Move firefly i towards j in d-dimension
           endif
           Update attractiveness with distance r via exp(-γr)
           Evaluate new solution and light intensity
         end for
       end for
   4.2 Rank the fireflies and find the current best one
   4.3 Increment t
   end while
5. Display the best firefly
Fig 1. Firefly Algorithm
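The steps of Fig. 1 can be sketched in Python for a continuous maximization problem. This is only an illustrative sketch, not the implementation used in this study: the function name `firefly_maximize`, the parameters `beta0` (base attractiveness) and `alpha` (random step size), the search bounds, and the default values are all assumptions introduced here. Attractiveness decays as exp(-γr), as written in Fig. 1; the formulation in [3] is usually stated with the squared distance.

```python
import math
import random


def firefly_maximize(f, dim, n_fireflies=15, max_generation=50,
                     gamma=1.0, beta0=1.0, alpha=0.1,
                     lower=-5.0, upper=5.0, seed=0):
    """Maximize f over [lower, upper]^dim with a basic firefly search.

    beta0, alpha, the bounds, and all defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Step 1: generate the initial population of fireflies.
    xs = [[rng.uniform(lower, upper) for _ in range(dim)]
          for _ in range(n_fireflies)]
    # Step 2: light intensity is the objective function value.
    intensity = [f(x) for x in xs]
    best = max(range(n_fireflies), key=lambda k: intensity[k])
    best_x, best_val = list(xs[best]), intensity[best]

    # Step 4: main loop over generations.
    for _ in range(max_generation):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if intensity[j] > intensity[i]:
                    # Attractiveness decays with the distance r via exp(-gamma * r).
                    r = math.dist(xs[i], xs[j])
                    beta = beta0 * math.exp(-gamma * r)
                    # Move firefly i towards j, add a small random step,
                    # and clip the result to the search bounds.
                    xs[i] = [
                        min(upper, max(lower,
                            xi + beta * (xj - xi) + alpha * (rng.random() - 0.5)))
                        for xi, xj in zip(xs[i], xs[j])
                    ]
                    # Evaluate the new solution and its light intensity.
                    intensity[i] = f(xs[i])
                    if intensity[i] > best_val:
                        best_x, best_val = list(xs[i]), intensity[i]
    # Step 5: report the best firefly seen so far.
    return best_x, best_val
```

For example, `firefly_maximize(lambda x: -sum(v * v for v in x), dim=2)` searches for the maximum of a negated sphere function, whose optimum lies at the origin. The sketch keeps track of the best solution seen so far instead of ranking the whole population at every generation, which is a simplification of step 4.2.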

In this study, we developed an FA based [3] algorithm to
select the best features for Web pages to provide accurate and
fast classification. For this purpose, we first extracted
features, which consist of all of the stemmed words that are not
stopwords, from each of the positive documents in the training
dataset. Feature extraction is performed only once for all
datasets as a pre-processing step. After extracting features,
document vectors for the Web pages are created by counting
the occurrences of each feature in each Web page. The
document vectors are then normalized, and the FA is used for
feature selection.
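The construction of a normalized document vector can be illustrated with a short Python sketch. The text does not state which normalization is applied, so the unit-length (L2) normalization below is an assumption, as are the function name `document_vector` and the example terms.

```python
import math
from collections import Counter


def document_vector(tokens, features):
    """Count occurrences of each feature in a page and scale to unit length.

    tokens   : stemmed, stopword-free terms of one Web page
    features : ordered list of extracted features
    L2 normalization is an assumption; the paper only says "normalized".
    """
    counts = Counter(tokens)
    raw = [counts[feature] for feature in features]
    norm = math.sqrt(sum(value * value for value in raw))
    if norm == 0:
        # The page contains none of the features.
        return [0.0] * len(raw)
    return [value / norm for value in raw]


# Hypothetical example: three features and one tokenized page.
features = ["cours", "homework", "syllabu"]
page_tokens = ["cours", "cours", "syllabu", "exam"]
vector = document_vector(page_tokens, features)
```

Terms that are not in the feature list, such as "exam" here, are simply ignored when the vector is built.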
In the proposed feature selection method, each feature
represents a node, and all nodes are independent. Nodes (i.e.
features) are selected according to their selection probability
P_k(i), which is the document frequency (df) value of each
term. After the probability evaluation, a roulette wheel
selection algorithm is used to select the next feature [8]. The
F-measure values of the selected subsets are used as the
objective function f(x_i). The main steps of our FA based
feature selection algorithm are described in Figure 2.
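Roulette wheel selection over df values can be sketched as follows. The function name `roulette_select`, the `excluded` argument for already chosen terms, and the example df values are assumptions made for illustration; the sketch only shows the general technique, in which a term is drawn with probability proportional to its df value.

```python
import random


def roulette_select(df, excluded, rng):
    """Pick one term with probability proportional to its df value.

    df       : dict mapping term -> document frequency (selection weight)
    excluded : terms that must not be picked (e.g. already selected ones)
    rng      : random.Random instance
    """
    candidates = [term for term in df if term not in excluded]
    if not candidates:
        raise ValueError("no candidate terms left")
    total = sum(df[term] for term in candidates)
    pick = rng.uniform(0, total)
    running = 0.0
    for term in candidates:
        running += df[term]
        if pick < running:
            return term
    # Fallback for floating-point edge cases (or when all weights are zero).
    return candidates[-1]


# Hypothetical df values for three stemmed terms.
df_values = {"cours": 50, "syllabu": 30, "homework": 20}
rng = random.Random(1)
chosen = roulette_select(df_values, excluded={"cours"}, rng=rng)
```

Because the weights are the df values themselves, frequent terms are favored but rare terms still have a chance of being selected.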
First, we generate an initial population with a
pre-determined number of fireflies. The initial light intensities
of the features are then defined by their df values. We chose
df values as the light intensities of features because df is an
important metric for classification accuracy and feature
attractiveness.
In the first step, each firefly randomly chooses three unique
features. When all fireflies have completed their subset
selection, two arff files (i.e., train and test files) are
generated for each firefly. A decision tree classifier is learned
from the train dataset by using the J48 classifier of the Weka
data mining tool. The test dataset is then classified, and the
F-measure value of the classification is computed. These steps
are repeated for all fireflies; then the best of these k fireflies
is found, and the other fireflies are forced to resemble the
best firefly. The best firefly is updated as the most attractive
one, and the light intensities of this firefly's features are
updated by using the F-measure value of the best firefly's
solution. The light intensity update is computed as follows:

$$df(i) = df(i) \cdot \exp(-\gamma \cdot F\text{-measure})$$

where γ = -1, i ranges over the features in the best firefly's
subset, and the F-measure value belongs to the best firefly.
Because our purpose is to increase the F-measure value, the
F-measure value is used in the light intensity update.
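The fitness computation and the light intensity update can be illustrated with a small Python sketch. It assumes that "F-measure" denotes the balanced F1 score (the harmonic mean of precision and recall); the function names and the example numbers are hypothetical. With γ = -1 the factor exp(-γ · F-measure) is greater than 1, so the df values of the best firefly's features grow.

```python
import math


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1) from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def update_light_intensity(df, best_subset, f_value, gamma=-1.0):
    """Apply df(i) = df(i) * exp(-gamma * F-measure) to the best firefly's features."""
    for term in best_subset:
        df[term] *= math.exp(-gamma * f_value)
    return df


# Hypothetical example: 8 true positives, 2 false positives, 2 false negatives.
score = f_measure(tp=8, fp=2, fn=2)
intensities = {"cours": 10.0, "syllabu": 5.0}
update_light_intensity(intensities, best_subset={"cours"}, f_value=score)
```

Only the features in the best subset are rescaled; all other df values stay unchanged, so features that repeatedly appear in good subsets become more likely to be drawn by the roulette wheel.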
The proposed algorithm includes two feature inclusion
functions for the second step. If a firefly is the local best
firefly, it randomly chooses a new term from the unselected
term list. Otherwise, the firefly randomly chooses a term from
the best firefly's selected term list. Two arff files (train and
test) are again generated for each firefly x_i. The F-measure
values of all fireflies are computed, and the local best is
found. These steps are repeated until t equals the
MaxGeneration value. The algorithm then terminates, and each
firefly has n features.
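The two inclusion rules can be sketched as a single Python function. The name `include_feature`, the use of sets for subsets, and the handling of the case where no candidate term is left are assumptions; in particular, the sketch takes the "unselected term list" to mean the terms that the best firefly has not selected yet, and draws terms uniformly at random.

```python
import random


def include_feature(subset, is_best, best_subset, all_features, rng):
    """Add one term to a firefly's subset following the two inclusion rules.

    subset       : set of terms currently selected by this firefly
    is_best      : True if this firefly is the local best
    best_subset  : set of terms selected by the local best firefly
    all_features : list of all extracted terms
    """
    if is_best:
        # The local best explores: pick a term it has not selected yet.
        pool = [term for term in all_features if term not in subset]
    else:
        # Other fireflies move towards the best: copy one of its terms.
        pool = sorted(term for term in best_subset if term not in subset)
    if pool:
        subset.add(rng.choice(pool))
    # If the pool is empty, the subset is left unchanged (an assumption).
    return subset


# Hypothetical example with five terms.
all_terms = ["a", "b", "c", "d", "e"]
best = {"a", "b", "c"}
other = {"a", "d", "e"}
rng = random.Random(0)
other_after = include_feature(set(other), False, best, all_terms, rng)
best_after = include_feature(set(best), True, best, all_terms, rng)
```

Copying terms from the best firefly plays the role of the "move firefly i towards j" step of the generic algorithm in this discrete setting.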

1. Generate initial population of fireflies (x_i)
2. Determine light intensity I_i for each feature by their df
   values
3. While (t < MaxGeneration) do
   3.1 for each firefly x_i do
         for each firefly x_j do
           if (I_j > I_i) then
             Move firefly i towards j in d-dimension
           endif
           Update attractiveness with respect to F-measure
           Evaluate new solution and light intensity
         end for
       end for
   3.2 Rank the fireflies and find the current best one
   3.3 Increment t
   end while
4. Display the best firefly
Fig 2. Firefly Feature Selection Algorithm
IV. EXPERIMENTAL EVALUATION AND RESULTS
All experiments were implemented in the Java programming
language under the Eclipse environment [9]. The proposed
method was tested under the Microsoft Windows XP SP3
operating system. The hardware used in the experiments had
1 GB of RAM and an Intel Core2Duo 1.60 GHz processor.
A. FA Parameters
In the proposed FA based feature selection method, there
are several parameters, such as the number of fireflies, the
number of features, and so on.
The number of fireflies is set to 30: in our experiments with
the ACO and IWD algorithms, 30 was the optimum number of
ants and intelligent water drops, and we used 30 fireflies in
order to compare this study with our previous studies
[4, 5, 10]. The parameter γ is defined as -1. We set
MaxGeneration to 30 experimentally, because we observed that
fireflies can find the best feature subset for classification
after the first ten steps.
B. Datasets
Two datasets, namely Conference and WebKB [11], were
used in the experiments. The Conference dataset consists of
Computer Science related conference homepages that were
obtained from the DBLP web site [12]. We labeled the
conference Web pages as positive documents in the dataset. To
complete the dataset, the short names of the conferences were
queried using the Google search engine, and the irrelevant
pages in the result set were taken as negative documents. Then,
all the positive and the negative documents were randomly
distributed among the train and the test datasets. The
Conference dataset contains 2369 Web pages in total and 824
of them are conference homepages. We used a 75%/25% split
for training and testing. The numbers of pages in the train and
test datasets are presented in Table I. For this dataset, our aim
was to determine whether a Web page is a conference
homepage or not.
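The random 75%/25% split can be sketched as below. The sketch assumes that positive and negative pages were split separately, which is consistent with the Conference counts in Table I but is not stated explicitly; the function name and the placeholder page identifiers are hypothetical.

```python
import random


def split_train_test(pages, train_fraction, rng):
    """Shuffle the pages and split them into train and test parts."""
    shuffled = list(pages)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for the 824 positive and 1545 negative pages.
rng = random.Random(42)
positives = [f"pos_{k}" for k in range(824)]
negatives = [f"neg_{k}" for k in range(2369 - 824)]
train_pos, test_pos = split_train_test(positives, 0.75, rng)
train_neg, test_neg = split_train_test(negatives, 0.75, rng)
```

Under this assumption the split sizes are 618/206 for the positive pages and 1159/386 for the negative pages, the same counts as the Conference row of Table I.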
The WebKB dataset is a well-known dataset that is
obtained from the WebKB project [13]. The WebKB dataset
contains course, department, faculty, project, staff, and student
Web pages gathered from the Computer Science departments
of the Cornell, Texas, Washington, and Wisconsin universities
as well as some irrelevant pages from those four universities.
We used course, faculty, project and student classes in our
experiments since these classes have more instances than
others. The dataset contains 7648 Web pages in total: 883
course, 1028 faculty, 493 project, and 1480 student homepages,
and 3764 negative Web pages (belonging to other classes). The
train and the test datasets were constructed as described on
the WebKB project Web site [13]. For this study, we used pages
from the Cornell, Texas, and Washington universities for
training, and pages from Wisconsin university for testing. We
used the WebKB dataset as a binary class classification
dataset. For example, the Course dataset contains 883 course
and 3764 negative pages, the Faculty dataset has 1028 faculty
and 3764 negative pages, and so on (Table I).
TABLE I. TRAIN/TEST DISTRIBUTION OF WEBKB AND
CONFERENCE DATASETS FOR BINARY CLASS CLASSIFICATION

Dataset      Train (Relevant / Non-relevant)   Test (Relevant / Non-relevant)
Course       846 / 2822                        86 / 942
Project      840 / 2822                        26 / 942
Student      1485 / 2822                       43 / 942
Faculty      1084 / 2822                       42 / 942
Conference   618 / 1159                        206 / 386
C. Feature Extraction and Selection
For each dataset, the features were extracted by taking
stemmed terms that are not stopwords from the <title> tags and
URLs of the positive (i.e., relevant) Web pages in the training
set. As an example, the Course dataset has 305 features
extracted from the <title> tag as shown in Table II. We used
the features extracted from <title> tag and URLs of the Web
pages since in our previous studies [4, 5, 10] we observed that
titles and URL addresses of Web pages contain important
features.
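A minimal Python sketch of this extraction step is given below. The stopword list is a small illustrative sample, `crude_stem` is a deliberately crude suffix stripper that only stands in for a real stemmer (such as Porter's), and treating URL fragments like "www" or "edu" as stopwords is an assumption; none of these details are taken from the study, and the example page is hypothetical.

```python
import re

# Illustrative stopword sample; URL fragments are included by assumption.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "at", "on",
             "http", "https", "www", "html", "htm", "edu", "com"}


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def extract_terms(text):
    """Lowercase, tokenize on letters, drop stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]


def extract_features(positive_pages):
    """Collect title and URL features from the positive training pages."""
    title_features, url_features = set(), set()
    for page in positive_pages:
        title_features.update(extract_terms(page["title"]))
        url_features.update(extract_terms(page["url"]))
    return sorted(title_features), sorted(url_features)


# Hypothetical positive page.
pages = [{"title": "Introduction to Operating Systems",
          "url": "http://www.cs.example.edu/courses/cs537/"}]
title_terms, url_terms = extract_features(pages)
```

Keeping title and URL features in separate lists mirrors the way the two feature sets are counted and evaluated separately in the experiments.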
TABLE II. NUMBER OF FEATURES EXTRACTED FOR ALL
CLASSES

Class        <title> tag   URL
Course       305           479
Project      596           686
Student      1987          1557
Faculty      1502          1208
Conference   890           1115

After the feature extraction step, the FA based feature
selection process was performed for each dataset. After
selecting the best n features with our FA based algorithm, test
(unseen) Web pages were classified with respect to the
selected n features by using the J48 classifier of Weka [14].
The proposed algorithm was run with different n values,
chosen with respect to our previous studies [4, 5, 10]. The
selected n values are 10, 30, 50, 60, and 100.
D. Results
In this study, each firefly chooses a predefined number of
features. Performance of the proposed FA based method with
respect to F-measure and run time can be seen in Table III and
Table IV.
TABLE III. PERFORMANCE OF THE PROPOSED FA BASED
FEATURE SELECTION ALGORITHM FOR FEATURES EXTRACTED
FROM URL

URL TAGS: Average F-Measure Values and Run Times

# of Features   Measure     Course    Project   Faculty   Student   Conference
10              F-measure   0.942     0.995     0.855     0.585     0.978
                Run time    <1 sec    <1 sec    <1 sec    <1 sec    <1 sec
30              F-measure   0.934     0.981     0.757     0.824     0.563
                Run time    <1 sec    <1 sec    <1 sec    <1 sec    <1 sec
50              F-measure   0.939     0.979     0.876     0.824     0.976
                Run time    1 sec     1 sec     1 sec     1 sec     1 sec
60              F-measure   0.948     0.987     0.855     0.753     0.976
                Run time    1 sec     1 sec     1 sec     1 sec     1 sec
100             F-measure   0.934     0.980     0.855     0.585     0.976
                Run time    1.5 sec   1.5 sec   1.5 sec   1.5 sec   1.5 sec
All features    F-measure   1.0       1.0       1.0       1.0       0.972
                Run time    2 sec     4 sec     4 sec     8 sec     10 sec




TABLE IV. PERFORMANCE OF THE PROPOSED FA BASED
FEATURE SELECTION ALGORITHM FOR FEATURES EXTRACTED
FROM TITLE TAGS

Title TAGS: Average F-Measure Values and Run Times

# of Features   Measure     Course    Project   Faculty   Student   Conference
10              F-measure   0.942     0.985     0.943     0.934     0.522
                Run time    <1 sec    <1 sec    <1 sec    <1 sec    <1 sec
30              F-measure   0.934     0.984     0.902     0.934     0.487
                Run time    <1 sec    <1 sec    <1 sec    <1 sec    <1 sec
50              F-measure   0.939     0.98      0.89      0.628     0.508
                Run time    1 sec     1 sec     1 sec     1 sec     1 sec
60              F-measure   0.948     0.985     0.942     0.628     0.487
                Run time    1 sec     1 sec     1 sec     1 sec     1 sec
100             F-measure   0.934     0.985     0.906     0.919     0.504
                Run time    1.5 sec   1.5 sec   1.5 sec   1.5 sec   1.5 sec
All features    F-measure   0.898     0.985     1.0       0.679     0.372
                Run time    5 sec     7 sec     8 sec     23 sec    6 sec

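According to Tables III and IV, reducing the feature space
by using the proposed FA based feature selection algorithm
does not negatively affect the run times of the classification:
with 10 to 100 selected features, classification takes at most
1.5 sec, whereas with all features it takes between 2 and
23 sec.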
Performance of the proposed FA feature selection
algorithm is compared with the Ant Colony Optimization
(ACO) and the Intelligent Water Drops [15] algorithms. The
results of this experiment for the WebKB and Conference
datasets are presented in Table V.
TABLE V. COMPARISON OF THE PROPOSED FA BASED
FEATURE SELECTION ALGORITHM FOR WEBKB AND CONFERENCE
DATASETS WITH ACO AND IWD (30 FEATURES)

TITLE TAGS: F-Measure Values

Method   Course   Project   Faculty   Student   Conference
FA       0.934    0.981     0.757     0.824     0.563
ACO      0.880    0.983     0.917     0.927     0.737
IWD      0.914    0.975     0.914     0.749     0.732

The proposed algorithm was also compared with ACO [16]
with respect to run time. For the Faculty dataset, ACO needed
4 hours 29 min 24 sec to select 100 features in 250
iterations, whereas the firefly algorithm needed only 34 min
53 sec to select the best 100 features in 250 iterations. The
firefly algorithm can reach optimum subsets quickly with fewer
computations, as the FA has fewer parameters than the ACO and
IWD algorithms. The training times of the proposed FA based
feature selection method for 30 iterations are given in
Tables VI and VII. In our experiments, we observed that
fireflies can find the best feature subset for classification
after the first ten steps, so the number of iterations is
defined as 30. All result tables given here belong to the
30-iteration experiments.
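As a rough check of the run-time figures reported above for the Faculty dataset, both durations can be converted to seconds and compared:

```latex
\begin{align*}
t_{\mathrm{ACO}} &= 4 \cdot 3600 + 29 \cdot 60 + 24 = 16164~\text{s} \\
t_{\mathrm{FA}}  &= 34 \cdot 60 + 53 = 2093~\text{s} \\
\frac{t_{\mathrm{ACO}}}{t_{\mathrm{FA}}} &= \frac{16164}{2093} \approx 7.7
\end{align*}
```

In this single reported run, selecting 100 features in 250 iterations was therefore roughly 7.7 times faster with the FA than with ACO.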

TABLE VI. TRAINING TIMES OF THE PROPOSED FA FEATURE
SELECTION ALGORITHM FOR WEBKB AND CONFERENCE
DATASETS WITH URL TAGS

URL TAGS: Training Times

# Features   Course     Project    Faculty    Student    Conference
10           2.27 min   2.16 min   2.32 min   3.01 min   1.18 min
30           2.03 min   2.50 min   2.23 min   3.14 min   1.16 min
50           2.19 min   2.19 min   2.33 min   3.01 min   1.18 min
60           2.13 min   2.23 min   3.11 min   3.01 min   1.08 min
100          1.18 min   2.17 min   2.50 min   2.41 min   1.47 min

TABLE VII. TRAINING TIMES OF THE PROPOSED FA FEATURE
SELECTION ALGORITHM FOR WEBKB AND CONFERENCE
DATASETS WITH TITLE TAGS

Title TAGS: Training Times

# Features   Course     Project    Faculty    Student    Conference
10           1.30 min   1.35 min   1.47 min   2.25 min   49 sec
30           1.32 min   1.27 min   1.39 min   2.28 min   49 sec
50           1.35 min   1.30 min   1.43 min   2.38 min   1.02 min
60           1.30 min   1.38 min   1.52 min   2.34 min   50 sec
100          1.40 min   1.25 min   1.57 min   2.58 min   57 sec

Finally, the performance of the proposed FA feature
selection algorithm is compared with the pure J48 classifier.
For this purpose, the F-measure values of the J48 classifier
with and without the proposed FA based feature selection are
computed and compared.
V. CONCLUSION
In this study, we have developed a firefly algorithm that
finds the best n features for classification, in order to
reduce the time required to classify unseen Web pages. We
observed that by using the selected features, Web pages can be
classified faster without loss in the quality of the
classification. Moreover, in some cases the F-measure of the
classification is improved by feature selection, since it
allows removing unnecessary features that reduce the
classification accuracy. As future work, cross validation of
the experiments can be performed, and the experiments can be
repeated for other datasets and for other HTML tag sets.
REFERENCES

[1] X. Qi and B.D. Davison, "Web page classification: features and
algorithms," ACM Computing Surveys, vol. 41 (2), (2009) Article 12.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques.
Morgan Kaufmann, San Francisco, 2006.
[3] X.-S. Yang, Nature-Inspired Metaheuristic Algorithms. Frome: Luniver
Press (2008). ISBN 1-905986-10-6.
[4] S.A. Özel, "A Web Page Classification System Based on a Genetic
Algorithm Using Tagged-Terms as Features," Expert Systems with
Applications, vol. 38 (4), (2011) 3407-3415.
[5] S.A. Özel, "A Genetic Algorithm Based Optimal Feature Selection for
Web Page Classification," International Symposium on INnovations in
Intelligent SysTems and Applications (INISTA 2011), İstanbul (2011)
282-286.
[6] J. Senthilnath, S.N. Omkar, and V. Mani, "Clustering using firefly
algorithm: Performance study," Swarm and Evolutionary Computation,
vol. 1 (3), September 2011, 164-171.
[7] H. Banati and M. Bajaj, "Fire Fly Based Feature Selection Approach,"
IJCSI International Journal of Computer Science Issues, vol. 8, issue 4,
no. 2, July 2011.
[8] T. Bäck, Evolutionary Algorithms in Theory and Practice. Oxford
Univ. Press (1996), p. 120.
[9] Eclipse environment. Available at:
http://www.oracle.com/technetwork/developer-tools/eclipse/downloads/index.html
[10] E. Saraç and S.A. Özel, "URL Tabanlı Web Sayfası Sınıflandırma,"
Akıllı Sistemlerde Yenilikler ve Uygulamaları Sempozyumu (ASYU 2010),
Kayseri (2010) 13-17.
[11] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K.
Nigam, and S. Slattery, "Learning to extract symbolic knowledge from
the World Wide Web," The 15th National Conference on Artificial
Intelligence, AAAI Press (1998) 509-516.
[12] The DBLP Computer Science Bibliography. Available at:
http://www.informatik.uni-trier.de/~ley/db/
[13] CMU World Wide Knowledge Base (Web->KB) project. Available at:
http://www.cs.cmu.edu/~webkb/
[14] Weka 3: Data Mining Software in Java. Available at:
http://www.cs.waikato.ac.nz/~ml/weka/
[15] S.A. Özel and E. Saraç, "Feature Selection for Web Page Classification
Using the Intelligent Water Drop Algorithm," In: Proceedings of the
2nd World Conference on Information Technology (WCIT 2011),
November 23-26, 2011, Antalya.
[16] E. Saraç, Web Page Classification Using Ant Colony Optimization, M.S.
Thesis, Çukurova University, Institute of Natural and Applied Sciences,
Department of Computer Engineering (2010).
