
Incrementally Optimized Decision Tree for Mining Imperfect Data Streams

Hang Yang and Simon Fong


Department of Computer and Information Science, University of Macau, Taipa, Macau SAR, China {ya97404,ccfong}@umac.mo

Abstract. The Very Fast Decision Tree (VFDT) is one of the most important classification algorithms for real-time data stream mining. However, imperfections in data streams, such as noise and imbalanced class distribution, exist in real-world applications and they jeopardize the performance of VFDT. Traditional sampling techniques and post-pruning may be impractical for a non-stopping data stream. To deal with the adverse effects of imperfect data streams, we have invented an incremental optimization model that can be integrated into the decision tree model for data stream classification. It is called the Incrementally Optimized Very Fast Decision Tree (I-OVFDT) and it balances performance (in relation to prediction accuracy, tree size and learning time) while diminishing error and tree size dynamically. Furthermore, two new Functional Tree Leaf strategies are extended for I-OVFDT that result in superior performance compared to VFDT and its variant algorithms. Our new model works especially well for imperfect data streams. I-OVFDT is an anytime algorithm that can be integrated into existing VFDT-extended algorithms that use the Hoeffding bound for node splitting. The experimental results show that I-OVFDT achieves higher accuracy and a more compact tree size than other existing data stream classification methods.

Keywords: Data stream mining, decision tree classification, optimized very fast decision tree, incremental optimization.

1 Introduction

Decision tree learning is one of the most significant classification techniques in data mining and has been applied in many areas, including business intelligence, healthcare and biomedicine. The traditional approach to building a decision tree is based on greedy search: it loads a complete batch of training data into memory and partitions the data into a hierarchy of nodes and leaves. The tree cannot be changed when new data are acquired, unless the whole model is rebuilt by reloading the complete set of historical data together with the new data. This approach may be unsuitable for unbounded input data, such as data streams in which new data continuously flow in at high speed [1].



One challenge to decision tree induction is associated with the quality of the data streams: noisy data and imbalanced class distribution, which generally render a data stream imperfect in this context. Noisy data that influence decision models and cause over-fitting problems exist in real-world mass data mining. The term imbalanced data refers to irregular class distributions in a data set, i.e., a large percentage of training samples may be biased toward class A, leaving few samples that describe class B. These imperfections significantly impair the accuracy of a decision tree classifier through the confusion and misclassification prompted by the inappropriate data. The size of a decision model will also grow excessively large under noisy data, an undesirable effect known as over-fitting.

In data stream mining, decision tree learning algorithms construct a decision model incrementally over time. The implementation environment is non-stationary and computational resources are limited. VFML [2] is a C-based tool for mining time-changing, high-speed data streams. MOA [3] is Java-based software for massive data analysis. In both platforms, the parameters of VFDT must be pre-configured by users. For different tree induction tasks, the parameter setup differs, and we cannot know what the best configuration is until all possibilities have been tried. This is a barrier to real-time applications because there is not enough time to search for the best setup under non-stopping data environments [4].

The objective of this study can be expressed in the following question: how can incremental optimization help improve the performance of a data stream mining classifier under imperfect data stream inputs? Following this primary research objective, the existing computational techniques used for incremental optimization are extended. An innovative and effective incremental optimization model called the Incrementally Optimized Very Fast Decision Tree (I-OVFDT) is presented here. Extensive experiments are conducted and the results are critically analyzed. For further experiments, the new incremental optimization model is infused into other similar data stream mining algorithms wherever technically possible and included in comparative tests vis-à-vis I-OVFDT. It is anticipated that the new model will contribute valuable theoretical knowledge to the data mining community.

The contributions of I-OVFDT can be summarized briefly as follows. (1) It pioneers the combination of incremental optimization with decision tree models for high-speed data streams. (2) It proposes an optimization algorithm in which the parameters for tree growing are automatically computed instead of taken from fixed values. (3) It proposes an incremental model that balances the accuracy, tree size and learning time of tree models. (4) The optimization algorithm is also suitable for other tree inductions that inherit a similar node-splitting principle using the Hoeffding bound (HB).

The remainder of this paper is structured as follows. We review the original decision tree algorithms and the mechanism of controlling tree growth in Section 2. Assumptions, metric definitions and the optimization mechanism are given in Section 3. The tree-building approach is presented in Section 4. An evaluation and experiments in Section 5 provide evidence that I-OVFDT delivers better performance than VFDT. Conclusions are drawn in Section 6.


2 Decision Tree Learning for Data Streams

2.1 Decision Tree Using Hoeffding Bound

A decision-tree classification problem is defined as follows. S_i is a data stream of examples of the form (X, y), where X is a vector of d attributes and y is the actual discrete class label. Attribute X_i is the i-th attribute in X and is assigned a value X_i1, X_i2, ..., X_ij, where 1 ≤ i, j ≤ d. Suppose that HT is a tree induction using HB, the Hoeffding bound, such as VFDT. The classification goal is to produce a decision tree model from N examples that predicts the classes of future examples with high accuracy. In data stream mining, the example size is unbounded, i.e., N → ∞. The VFDT algorithm [1] constructs a decision tree by using constant memory and constant time per sample. The tree is built by recursively replacing leaves with decision nodes. Sufficient statistics n_ijk of attribute values X_ij are stored in each leaf y_k. A heuristic evaluation function is used to determine split attributes for converting leaves to nodes. Nodes contain the split attributes and leaves contain only the class labels. A leaf represents a class according to the sample labels. The main elements of VFDT include a tree-initializing process that initially contains only a single leaf and a tree-growing process that contains a splitting check using a heuristic evaluation function G(.) and HB. VFDT uses information gain as G(.). The Hoeffding bound is

\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}    (1)

where R is the range of the heuristic, δ is the allowable error and n is the number of examples observed.
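As an illustration of (1), the following minimal sketch shows how the bound is typically evaluated at a splitting check. It is not the authors' implementation; the method names and the use of log2(number of classes) as the range R of information gain are assumptions for this example.

public final class HoeffdingSplitCheck {

    // Hoeffding bound (1): with probability 1 - delta, the true mean of a random variable
    // with range R differs from the mean observed over n samples by at most epsilon.
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt((range * range * Math.log(1.0 / delta)) / (2.0 * n));
    }

    // Split when the gap between the best and second-best attribute, measured by the
    // heuristic G (information gain in VFDT), exceeds epsilon.
    static boolean shouldSplit(double bestG, double secondBestG,
                               int numClasses, double delta, long nSeenAtLeaf) {
        double range = Math.log(numClasses) / Math.log(2.0); // range of information gain
        double epsilon = hoeffdingBound(range, delta, nSeenAtLeaf);
        return (bestG - secondBestG) > epsilon;
    }

    public static void main(String[] args) {
        // Example: 3-class problem, delta = 1e-6, 200 examples seen at the leaf.
        System.out.println(shouldSplit(0.42, 0.30, 3, 1e-6, 200));
    }
}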

The necessary number of samples (sample#) uses HB, shown in (1), to ensure control over errors in the attribute-splitting distribution selection. In the past decade, VFDT-extended variants have been developed by extending the original VFDT based on the principle of using HB. Table 1 provides a comparison of these VFDT-extended studies and records their pros and cons.
Table 1. VFDT-extended Tree Models

Method       | Strength                                                     | Weakness
VFDT [1]     | Pioneer that uses HB for infinite data streams               | Not for imperfect data streams; a fixed tie threshold
CVFDT [6]    | Uses a sliding window to deal with concept drift             | Hard to detect concept drift quickly in cases of abrupt concept drift
VFDTc [7]    | Provides the Functional Tree Leaf                            | Not for real-world applications due to tree size explosion
UFFT [9]     | One-pass algorithm; a forest of trees detects concept drift  | Builds a binary tree for each possible pair of classes, not a single tree induction approach
HOT [8]      | Provides optional sub-trees and high-accuracy post-pruning   | Tree size explosion and slow computation speed
OcVFDT [10]  | Combines a one-class classifier with VFDT                    | Not for concept drift
FlexDT [11]  | Uses a sigmoid function to handle imperfect streams          | Slow computation speed


2.2 Node-Splitting Evaluation

Extending the original desiderata of VFDT, Gama et al. [5] identify three performance dimensions that significantly influence the learning process: space, the available memory, which is fixed; learning time, the rate at which incoming examples are processed; and generalization power, how effective the model is at capturing the true underlying concept. Their focus is the generalization power of the learning algorithm. Although they recognized that the first two factors have a direct impact on the generalization power of the learning model, in this paper these three dimensions correspond to the tree size, the node-splitting time and the prediction accuracy.

The data stream problem is simulated using a large number of instances, as many as one million for both datasets. The mining approach is a one-pass process that differs from the traditional approach of loading the full set of historical data. The accuracy, tree size and time are recorded as the pre-defined values of τ and nmin change. From our previous experimental results [4], we found that:
- In general, a bigger tree size brought higher accuracy and more learning time, despite possibly causing an over-fitting problem.
- A bigger τ produced faster tree-size growth and longer computation time, but because of the limited memory, when τ reached a threshold the tree size did not increase further (0.7 in LED24; 0.4 in Waveform21).
- nmin is proposed to control the learning time; a bigger nmin brought a faster learning speed but a smaller tree size and lower accuracy.

The traditional approach to finding the best parameter configuration for a certain task is to try all of the possibilities. This is impractical, however, for real-time applications. This paper proposes the novel concept of adaptively building an optimal tree model that combines with an incremental optimization mechanism, seeks a compact tree model and balances tree size, prediction accuracy and learning time on the fly. Consequently, the fixed installed parameters are replaced by an adaptive mechanism when new data arrive.

3 Incremental Optimization for Decision Tree Model

3.1 Assumption

Data arrive at the decision tree induction process with very little or no extra time available for refining the tree model, and intermittent pauses are assumed to be undesirable. Post-pruning mechanisms [12] eliminate noisy tree-paths after the tree model has been established, which makes them unsuitable for use with data streams. Hence, this paper makes the following assumption: the implementation of a post-pruning mechanism that stops the tree-building process to refine the model by pruning existing tree-paths is not allowed.


3.2 Optimization Goal

Suppose that the optimization problem is defined as a tuple (X, Φ, f). The set X is a collection of objects and the feasible solutions are subsets of X that collectively achieve a certain optimization goal. The set of all feasible solutions is Φ ⊆ 2^X and f : Φ → ℝ is a cost function over these solutions. A weight w(x) is associated with every object x of X, and the weight of a solution is w(S) = Σ_{x∈S} w(x). The optimal solution OPT(X, Φ, f) exists if X and Φ are known, and it is the subset S ∈ Φ that optimizes f. In decision tree form, the solution S is a decision tree model HT whose induction is based on the Hoeffding tree (HT) using HB in node-splitting control. Therefore, the incremental optimization function can be expressed as a sum of several sub-objective cost functions:

F_{\phi}(S) = \sum_{m=1}^{M} f_m(S)    (2)

where f_m is a continuously differentiable function and M is the number of objectives in the optimization problem. I-OVFDT is a new methodology for building a desirable tree model by combining tree induction with an incremental optimization mechanism, seeking a compact tree model that balances the tree size, prediction accuracy and learning time. Consequently, the fixed installed parameters are replaced by an adaptive mechanism when new data arrive. We consider the optimization as a cost-minimization problem:

S^{*} = \arg\min_{S \in \Phi} F_{\phi}(S)    (3)

The proposed method will find a general optimization function in (2) that simultaneously considers prediction accuracy, tree size and speed, where M = 3.

3.3 Metrics

When a new example arrives, it is sorted from the root to a leaf according to the existing HT model. The data stream S_i contains examples (X, y), where X is a vector of d attributes and y is the actual discrete class label in a supervised learning process. Attribute X_i is the i-th attribute in X and is assigned a value X_i1, X_i2, ..., X_ij, where 1 ≤ i, j ≤ d. The decision tree algorithm uses \hat{y} = HT(X) to predict the class when a new data sample (X, y_k) arrives. The prediction accuracy changes dynamically as the example size n grows in the incremental learning process, and is defined as:

A_n = \frac{1}{n} \sum_{i=1}^{n} \theta_i    (4)

\theta_i = \begin{cases} 1, & \hat{y}_i = y_i \\ 0, & \text{otherwise} \end{cases}    (5)

To measure the utility of the three dimensions via the minimizing function in (3), the prediction accuracy is reflected by the prediction error in (6):

f_{error} = 1 - A_n    (6)
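As a concrete reading of (4)-(6), the small sketch below keeps a running prediction accuracy and error as examples arrive; the class and field names are illustrative and are not part of the original algorithm.

public final class PredictionErrorTracker {
    private long seen = 0;      // n, the number of examples observed so far
    private long correct = 0;   // running sum of theta_i in (5)

    // theta_i = 1 if the predicted label equals the actual label, 0 otherwise (5).
    public void update(int predictedClass, int actualClass) {
        seen++;
        if (predictedClass == actualClass) correct++;
    }

    public double accuracy() {                       // A_n in (4)
        return seen == 0 ? 0.0 : (double) correct / seen;
    }

    public double error() {                          // f_error in (6)
        return 1.0 - accuracy();
    }
}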


The classification goal is to produce a decision tree model HT from N examples that predicts the class of future examples with accuracy A_n. In data stream mining, the example size is very large, even unlimited, so that n → ∞. A tree path, travelling from the root to a leaf, represents a pattern whose outcome is the class stated in the leaf. When an internal node splits to create a new leaf, the total number of leaves grows. A decision model is a tree-like structure that presents, through its tree-paths, the non-linear mapping between X and the class. The number of leaves in the decision model represents the number of patterns/rules in this model. Therefore, the tree size is defined as the number of leaves in the decision model; whenever a new leaf is generated, the tree size grows. The data flow continuously, with the decision model refreshed incrementally each time a new leaf is created. Therefore, the tree size function is:

f_{size} = |leaves(HT)|    (7)

Section 2.1 illustrated the node-splitting condition of VFDT, which inherits the use of HB in (1), where \Delta G = G(X_a) - G(X_b) is the difference between the two highest-quality attributes. VFDT is a one-pass algorithm that builds a decision model using a single scan over the training data. The sufficient statistics that count the number of examples passed to an internal node are the only elements updated in this one-pass algorithm. The calculation is a plus-one incremental process that consumes little computational resource. Hence, the computation speed of this plus-one operation for each passing example is assumed to be constant in the learning process. The number of examples that pass within an interval of nmin examples in node-splitting control determines the learning time. In VFDT, nmin is a fixed value controlling the interval between node-splitting checks:

f_{time} = n_{min}    (8)

Suppose that n_k is the number of examples seen at a leaf y_k; the condition that triggers a node-splitting check is n_k mod n_{min} = 0. The learning time of each node splitting is the interval period defined in (8), during which a certain number of examples have passed.

Fig. 1. A triangle of three-object utility models


Returning to the incremental optimization problem, the optimum tree model is the structure with the minimum cost. A triangle model is used to illustrate the relationship amongst the three dimensions: the prediction accuracy, the tree size and the learning time. The three dimensions construct the triangle utility model shown in Figure 1. A utility function computes the area of this triangle, reflecting a balance amongst the three objectives in (9):

\phi_{HT} = \frac{1}{2}\sin(120°)\,(f_{error} \cdot f_{size} + f_{size} \cdot f_{time} + f_{time} \cdot f_{error})    (9)
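For illustration only, the sketch below shows one way to realize the triangle-area utility, assuming the three measures have been normalized to comparable scales and are placed on three axes 120 degrees apart; the normalization and the exact form of (9) are assumptions rather than the authors' stated implementation.

public final class TriangleUtility {

    // Area of the triangle whose vertices lie on three axes 120 degrees apart,
    // at distances given by the three (normalized) measures.
    static double area(double error, double size, double time) {
        double sin120 = Math.sin(2.0 * Math.PI / 3.0);
        // Sum of the areas of the three sub-triangles formed by adjacent axes.
        return 0.5 * sin120 * (error * size + size * time + time * error);
    }

    public static void main(String[] args) {
        // Example with the three measures already scaled into [0, 1].
        System.out.println(area(0.26, 0.35, 0.20));
    }
}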

The area of this triangle changes whenever node splitting happens and the HT updates. A min-max constraint on the optimization goal in (3) controls node splitting, ensuring that the new tree model stays within a considerable range. Suppose that Max.φ is the HT model with the maximum utility seen so far and Min.φ is the HT model with the minimum utility. The optimum model should be within this min-max range, near its middle:

Opt.\phi = \frac{Min.\phi + Max.\phi}{2}    (10)

According to the Chernoff bound [15], we know that:

|Opt.\phi - \phi| \le \epsilon    (11)

where the range of φ lies within the min-max model, i.e., Min.φ ≤ Opt.φ ≤ Max.φ. Therefore, if φ goes beyond this constraint, the existing HT is not suitable to embrace the new data input and the tree model should not be updated. The node-splitting condition is thus adaptively re-optimized in I-OVFDT using Opt.φ and the bound in (11), instead of relying on a fixed tie-breaking threshold.

3.4 Embedded with Functional Tree Leaf

The Functional Tree Leaf [7], which further enhances the prediction accuracy via an embedded naïve Bayes classifier, makes the prediction of the HT. I-OVFDT is an incremental optimization over prediction accuracy, tree size and learning time; embedding the Functional Tree Leaf is proposed to further improve the prediction performance of the HT model. When these two extensions, an optimized tree-growing process in the training phase and a refined prediction using the Functional Tree Leaf in the testing phase, are used together, the new decision tree model is able to achieve unprecedentedly good performance in terms of high prediction accuracy and compact tree size, even though the data streams are perturbed by noise and imbalanced class distribution. The proposed model combines I-OVFDT and the Functional Tree Leaf to produce consistently good performance compared to VFDT and its variants. For the actual classification, I-OVFDT uses \hat{y} = HT(X) to predict the class label when a new sample (X, y) arrives. The predictions are made according to the observed class distribution (OCD) in the leaves, called the Functional Tree Leaf. Originally, in VFDT, the prediction used only the majority class Functional Tree Leaf (MC).


The majority class only considers the counts of the class distribution, not decisions based on attribute combinations. The naïve Bayes Functional Tree Leaf (NB) was proposed to compute, by naïve Bayes, the conditional probabilities of the attribute values given a class at the tree leaves. As a result, the prediction at the leaf is refined by considering the probabilities of each attribute. To handle imbalanced class distribution in a data stream, a weighted naïve Bayes Functional Tree Leaf (WNB) and an adaptive Functional Tree Leaf (Adaptive) are proposed in this paper.

The sufficient statistics n_ijk are incremental counts stored in each node of I-OVFDT. Suppose that Node_ij in HT is an internal node labeled with attribute x_ij, and that k is the number of classes distributed in the training data, where k ≥ 2. A vector V_ij is constructed from the sufficient statistics n_ijk in Node_ij, such that V_ij = {n_ij1, n_ij2, ..., n_ijk}. V_ij is the OCD vector of Node_ij. The OCD stores the distributed class counts at each tree node in I-OVFDT, helping to keep track of the occurrences of the instances of each attribute.

Majority Class Functional Tree Leaf. In the OCD vector, the majority class Functional Tree Leaf (MC) chooses the class with the maximum distribution as the predictive class in a leaf: MC: arg max_r {n_{i,j,1}, n_{i,j,2}, ..., n_{i,j,r}, ..., n_{i,j,k}}, where 1 ≤ r ≤ k.

Naïve Bayes Functional Tree Leaf. In the OCD vector V_ij = {n_{i,j,1}, n_{i,j,2}, ..., n_{i,j,r}, ..., n_{i,j,k}}, where k is the number of observed classes and 1 ≤ r ≤ k, the naïve Bayes Functional Tree Leaf (NB) chooses as the predictive class the class with the maximum possibility computed by naïve Bayes. n_{i,j,r} is updated to n'_{i,j,r} by the naïve Bayes function such that n'_{i,j,r} = P(r|X) = P(X|r)P(r)/P(X), where X is the newly arriving instance. Hence, the prediction class is NB: arg max_r {n'_{i,j,1}, n'_{i,j,2}, ..., n'_{i,j,r}, ..., n'_{i,j,k}}.

Weighted Naïve Bayes Functional Tree Leaf. In the OCD vector V_ij = {n_{i,j,1}, n_{i,j,2}, ..., n_{i,j,r}, ..., n_{i,j,k}}, where k is the number of observed classes and 1 ≤ r ≤ k, the weighted naïve Bayes Functional Tree Leaf (WNB) chooses as the predictive class the class with the maximum possibility computed by the weighted naïve Bayes. n_{i,j,r} is updated to n''_{i,j,r} by the weighted naïve Bayes function such that n''_{i,j,r} = w_r · P(X|r)P(r)/P(X), where X is the newly arriving instance and the weight w_r is the probability of class r among all the observed samples, i.e., w_r = n_{i,j,r} / Σ_c n_{i,j,c}, where n_{i,j,r} is the count of class r. Hence, the prediction class is WNB: arg max_r {n''_{i,j,1}, n''_{i,j,2}, ..., n''_{i,j,r}, ..., n''_{i,j,k}}.

Adaptive Functional Tree Leaf. In a leaf, suppose that V_MC is the observed class distribution vector with the majority class Functional Tree Leaf (MC), V_NB is the observed class distribution vector with the naïve Bayes Functional Tree Leaf (NB), and V_WNB is the observed class distribution vector with the weighted naïve Bayes Functional Tree Leaf (WNB). Suppose that y is the true class of a new instance X and E is the prediction error rate of a Functional Tree Leaf, calculated as the average E = error_i / n, where n is the number of examples and error_i is the number of examples mis-predicted using that strategy. The adaptive Functional Tree Leaf (Adaptive) chooses the prediction of the strategy with the minimum error rate among the other three: Adaptive: arg min {E_MC, E_NB, E_WNB}.
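To make the four strategies concrete, the sketch below keeps an OCD vector and per-attribute counts at a leaf and selects a prediction by MC, NB, WNB or the adaptive rule. The class layout, the Laplace smoothing and the way the class-frequency weight is applied in WNB are illustrative assumptions, not the exact I-OVFDT or MOA implementation.

public final class FunctionalLeafSketch {
    private final int numClasses;
    private final long[] classCount;            // OCD vector: class counts seen at this leaf
    private final long[][][] attrValCount;      // n_ijk: attribute i, value j, class k
    private final long[] strategyErrors = new long[3]; // running errors of MC, NB, WNB
    private long seen = 0;

    FunctionalLeafSketch(int numAttrs, int numValues, int numClasses) {
        this.numClasses = numClasses;
        this.classCount = new long[numClasses];
        this.attrValCount = new long[numAttrs][numValues][numClasses];
    }

    // MC: the class with the maximum observed count in the OCD vector.
    int predictMC() {
        int best = 0;
        for (int k = 1; k < numClasses; k++) if (classCount[k] > classCount[best]) best = k;
        return best;
    }

    // NB: argmax over classes of P(class) * prod_i P(x_i | class); Laplace smoothing assumed.
    int predictNB(int[] x) { return argmaxScore(x, false); }

    // WNB: the NB score additionally weighted by the observed class frequency,
    // intended to counter an imbalanced class distribution.
    int predictWNB(int[] x) { return argmaxScore(x, true); }

    private int argmaxScore(int[] x, boolean weighted) {
        double[] score = new double[numClasses];
        for (int k = 0; k < numClasses; k++) {
            double prior = (classCount[k] + 1.0) / (seen + numClasses);
            score[k] = Math.log(prior);
            if (weighted) score[k] += Math.log(prior); // extra class-frequency weight for WNB
            for (int i = 0; i < x.length; i++) {
                double cond = (attrValCount[i][x[i]][k] + 1.0)
                            / (classCount[k] + attrValCount[i].length);
                score[k] += Math.log(cond);
            }
        }
        int best = 0;
        for (int k = 1; k < numClasses; k++) if (score[k] > score[best]) best = k;
        return best;
    }

    // Adaptive: use whichever of MC, NB, WNB currently shows the lowest running error.
    int predictAdaptive(int[] x) {
        int s = 0;
        if (strategyErrors[1] < strategyErrors[s]) s = 1;
        if (strategyErrors[2] < strategyErrors[s]) s = 2;
        return s == 1 ? predictNB(x) : (s == 2 ? predictWNB(x) : predictMC());
    }

    // Update the OCD vector, the per-attribute counts and the running error of each strategy.
    void learn(int[] x, int y) {
        if (predictMC() != y)   strategyErrors[0]++;
        if (predictNB(x) != y)  strategyErrors[1]++;
        if (predictWNB(x) != y) strategyErrors[2]++;
        classCount[y]++;
        seen++;
        for (int i = 0; i < x.length; i++) attrValCount[i][x[i]][y]++;
    }
}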


4 I-OVFDT Tree Building

The I-OVFDT tree-building approach is presented in the pseudo code in this section. Based on the metrics above, the input parameters are given in Figure 2 and the overall I-OVFDT procedure is described in Figure 3. For a new tree model, the tree is initialized with a single root (Figure 4). When a new example arrives, it traverses from the root to a predicted Functional Tree Leaf according to the existing tree model (Figure 5). If the node-splitting check conditions are met, the node-splitting estimation in Figure 6 is carried out.

Fig. 2. Pseudo code of I-OVFDT input variables

Fig. 3. Pseudo code of I-OVFDT overall approach

Fig. 4. Pseudo code of I-OVFDT tree initializing


One of our innovations is the optimized node-splitting condition, which combines the incremental optimization model with the prediction of the Functional Tree Leaf. This innovation suits not only the original VFDT but also extensions such as CVFDT and HOT, by bringing incremental optimization to incremental tree-building induction and addressing the problems of mining high-speed data streams.

Fig. 5. Pseudo code of I-OVFDT tree traversing

Fig. 6. Pseudo code of I-OVFDT tree growing
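Complementing the pseudo code in Figures 3-6, the sketch below gives one possible reading of the optimized node-splitting check of Section 3.3, replacing the fixed tie-breaking threshold with the min-max utility constraint of (10) and (11). The class name, the mid-point choice for Opt.φ and the exact near-tie rule are assumptions; the authors' implementation may differ.

public final class OptimizedSplitCheck {
    private double minPhi = Double.POSITIVE_INFINITY;  // Min.phi observed so far
    private double maxPhi = Double.NEGATIVE_INFINITY;  // Max.phi observed so far

    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt((range * range * Math.log(1.0 / delta)) / (2.0 * n));
    }

    // bestG / secondG : heuristic values of the two highest-quality split attributes
    // phi             : triangle utility of the tree model if this split were made
    // range           : range of the heuristic G, e.g. log2(#classes) for information gain
    boolean shouldSplit(double bestG, double secondG, double phi,
                        double range, double delta, long nSeenAtLeaf) {
        double epsilon = hoeffdingBound(range, delta, nSeenAtLeaf);
        minPhi = Math.min(minPhi, phi);
        maxPhi = Math.max(maxPhi, phi);
        double optPhi = (minPhi + maxPhi) / 2.0;    // assumed mid-point of the min-max range
        if (bestG - secondG > epsilon) {
            return true;                            // clear winner under the Hoeffding bound
        }
        // Near-tie: accept the split only while the new model's utility stays within
        // epsilon of Opt.phi, instead of comparing against a fixed tie threshold tau.
        return Math.abs(optPhi - phi) <= epsilon;
    }
}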

5 Experiment

5.1 Platform and Dataset

An I-OVFDT Java package integrated with the MOA toolkit was constructed as a simulation platform for the experiments. The running environment was a Windows 7 PC with an Intel 2.8 GHz CPU and 8 GB RAM. In all of the experiments, the parameters of the algorithms were δ = 10^-6 and nmin = 200, the default values suggested by MOA,


where δ is the allowable error in split decisions (used in the HB calculation), with values closer to zero taking longer to decide, and nmin is the default number of instances a leaf should observe between split attempts. This section provides evidence of the improvement that I-OVFDT delivers compared to the original VFDT and other methods. The experimental datasets, including pure nominal datasets, pure numeric datasets and mixed datasets, were either synthetic data generated by the MOA generators or extracted from real-world applications publicly available from the UCI repository [9]. Each experimental dataset is described in Table 2. The generated datasets have also been used in previous VFDT-related studies.
Table 2. Description of experimental datasets

Name                        | Nom# | Num# | Cls# | Type      | Size
LED7                        | 7    | 0    | 10   | Synthetic | 10^6
LED24                       | 24   | 0    | 10   | Synthetic | 10^6
Waveform21                  | 0    | 21   | 3    | Synthetic | 10^6
Waveform40                  | 0    | 40   | 3    | Synthetic | 10^6
Random Tree Simple (RTS)    | 10   | 10   | 2    | Synthetic | 10^6
Random Tree Complex (RTC)   | 50   | 50   | 2    | Synthetic | 10^6
Cover Type (COVTYPE)        | 42   | 12   | 7    | UCI       | 581,012

Noise-Included Synthetic Data. The LED data were generated by MOA; we added 10% noisy data to simulate imperfect data streams. The LED7 problem uses 7 binary attributes to classify 10 different classes and LED24 uses 24 binary attributes. Waveform was also generated by the MOA generator; the goal of this task is to differentiate between three different classes of waveform. There are two variants: Wave21 has 21 numeric attributes and Wave40 has 40 numeric attributes, all of which contain noise. Random Tree (RTS and RTC) data were likewise generated by the MOA generator, based on [12]. The generator builds a decision tree by choosing split attributes at random and assigning a random class label to each leaf. Once the tree is constructed, new samples are generated by assigning uniformly distributed random values to the attributes, and those attribute values determine the class label through the tree.

UCI Data. Cover Type is used to predict forest cover types from cartographic variables. It is a typical imbalanced class distribution dataset in which all samples come from real life.

5.2 Compared with VFDT

The first experiment compares I-OVFDT and VFDT with different tie-breaking thresholds, showing the performance along the three dimensions (prediction accuracy, tree size and learning time). In VFDT, the tie-breaking threshold τ is used to control tree growth, which is reflected in the tree size and learning time. This user-configured


parameter also influences prediction accuracy in VFDT. We cannot know what the best setup is until all possibilities have been tried, an impractical scenario for online algorithms. I-OVFDT instead uses an incremental optimization model to balance the three factors. Based on the previous experiment [4], we know that a bigger tree size led to higher accuracy, even in cases of an over-fitting problem, but took more learning time. In Table 3, configurations with similar accuracy have been highlighted. For example, in the LED7 dataset, the VFDT configuration with similar accuracy (τ = 0.6) is selected as the benchmark for comparison, and the tree size is reduced by 27% in I-OVFDT. This phenomenon also appears in the other experimental data. Therefore, we find that I-OVFDT can obtain comparable accuracy with a smaller tree size than the original VFDT.
Table 3. I-OVFDT compared to VFDT (different τ)

Data    | Dimension | I-OVFDT     | VFDT τ=0.2 | VFDT τ=0.4 | VFDT τ=0.6 | VFDT τ=0.8
LED7    | Acc(%)    | 73.82       | 73.15      | 73.67      | 73.82      | 73.85
        | Size(#)   | 2414 (-27%) | 577        | 2189       | 3301       | 3326
        | Time(s)   | 0.008       | 0.014      | 0.007      | 0.007      | 0.005
LED24   | Acc(%)    | 73.81       | 73.10      | 73.57      | 73.77      | 73.80
        | Size(#)   | 3074 (-20%) | 510        | 1918       | 3738       | 3842
        | Time(s)   | 0.007       | 0.032      | 0.010      | 0.006      | 0.006
WAVE21  | Acc(%)    | 80.65       | 80.64      | 80.90      | 80.90      | 80.90
        | Size(#)   | 2240 (-5%)  | 2364       | 3557       | 3557       | 3557
        | Time(s)   | 0.015       | 0.015      | 0.010      | 0.010      | 0.010
WAVE40  | Acc(%)    | 80.49       | 80.39      | 80.89      | 80.89      | 80.89
        | Size(#)   | 2345 (-1%)  | 2369       | 3607       | 3607       | 3607
        | Time(s)   | 0.026       | 0.026      | 0.018      | 0.018      | 0.018
RTS     | Acc(%)    | 93.00       | 91.93      | 93.16      | 93.16      | 93.16
        | Size(#)   | 2322 (-25%) | 3081       | 2683       | 2683       | 2683
        | Time(s)   | 0.012       | 0.009      | 0.010      | 0.010      | 0.010
RTC     | Acc(%)    | 95.66       | 95.52      | 95.55      | 95.55      | 95.55
        | Size(#)   | 975 (-35%)  | 1546       | 1492       | 1492       | 1492
        | Time(s)   | 0.117       | 0.074      | 0.077      | 0.077      | 0.077


5.3 Compared with Functional Tree Leaf for VFDT

The majority class (MC) is the original function used in VFDT to predict the class. The naïve Bayes (NB) Functional Tree Leaf [7] was proposed to improve the prediction accuracy at the leaf. This paper provides a new Functional Tree Leaf strategy called the weighted naïve Bayes (WNB). In addition, WNB has been integrated with a hybrid adaptive functional leaf [3] in MOA that uses the error rate in an adaptive scenario (ADP).

Fig. 7. Accuracy and tree size comparison of functional tree leaves for LED24 data

Fig. 8. Accuracy and tree size comparison of functional tree leaves for RTC data

In the second experiment, I-OVFDT integrated with the functional tree leaves is compared to VFDT. The experimental analysis in Section 5.2 identifies the best accuracy of VFDT (i.e., τ = 0.8 in LED24; τ = 0.4 in RTC). Figures 7 and 8 show that I-OVFDT performs clearly better than VFDT when integrated with the functional tree leaves, producing higher accuracy and a much smaller tree size.


5.4 Compared with VFDT-extended Tree Models

The third experiment shows the performance of I-OVFDT when integrated with existing VFDT-extended algorithms. The experimental data are the Cover Type data downloaded from the UCI machine learning repository, a typical sample of the imbalanced class distribution problem. As benchmarks, we use the following VFDT-extended algorithms in MOA:
Ensemble HT [13] is online bagging with 10 ensemble classifiers.
Adaptive-Size Hoeffding Tree (ASHT) [14] is derived from VFDT but adds a maximum number of split nodes, or size.
Hoeffding Option Tree (HOT) [8] is derived from VFDT by adding optional nodes that allow several tests to be applied. The configuration is maxOptionPaths = 5, δ = 10^-7 and a secondary split confidence of 0.999.
AdaHOT [14] is derived from HOT. Each leaf stores an estimate of the current error, and the weight of each node in the voting process is proportional to the square of the inverse of the error.
The I-OVFDT incremental optimization is added to the above-mentioned VFDT-extended algorithms. As the experimental results in Figure 9 show, I-OVFDT integration produces better accuracy than the VFDT-extended algorithms alone.

Fig. 9. I-OVFDT compared to some VFDT-extended models

6 Conclusion

Decision tree algorithms for mining data streams should be reliable and extendable. The original VFDT advocates a simple method of using the Hoeffding bound to control node splitting in tree construction. It succeeds as an anytime algorithm that requires scanning the data only once and suits high-speed data environments. However, imperfect data cause problems such as over-fitting, tree size explosion and imbalanced class distribution that deteriorate classification accuracy. The original


VFDT uses a fixed tie-breaking threshold to handle tree size explosion; however, a pre-defined threshold cannot fit all applications at once, and we do not know what the best setup is until all possibilities have been tried, which reduces the applicability of VFDT in real-time applications. In this paper, we describe an incremental optimization model combined with decision tree learning for mining data streams, called the Incrementally Optimized Very Fast Decision Tree (I-OVFDT). A three-dimensional balance (prediction accuracy, tree size and learning time) is obtained that reduces the error and the tree size. In addition, two new Functional Tree Leaf classification strategies are integrated into I-OVFDT, resulting in better performance than the existing algorithms, especially for imbalanced class distribution data streams. Furthermore, I-OVFDT is an anytime algorithm that can be integrated with the existing VFDT-extended algorithms using HB. As the experimental results in Section 5 show, the advantages of I-OVFDT are significant.

References
1. Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proc. of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71-80. ACM (2000)
2. Hulten, G., Domingos, P.: VFML: a toolkit for mining high-speed time-changing data streams (2003), http://www.cs.washington.edu/dm/vfml/
3. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: Massive online analysis. Journal of Machine Learning Research 11, 1601-1604 (2010)
4. Yang, H., Fong, S.: Moderated VFDT in Stream Mining Using Adaptive Tie Threshold and Incremental Pruning. In: Cuzzocrea, A., Dayal, U. (eds.) DaWaK 2011. LNCS, vol. 6862, pp. 471-483. Springer, Heidelberg (2011)
5. Gama, J., Sebastião, R., Rodrigues, P.P.: Issues in evaluation of stream learning algorithms. In: Proc. of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), pp. 329-338. ACM, New York (2009)
6. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Proc. of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, pp. 97-106 (2001)
7. Gama, J., Rocha, R.: Accurate decision trees for mining high-speed data streams. In: Proc. of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 523-528. ACM (2003)
8. Pfahringer, B., Holmes, G., Kirkby, R.: New options for Hoeffding trees. In: Proc. of the 20th Australian Joint Conference on Advances in Artificial Intelligence, Gold Coast, Australia, pp. 90-99 (2007)
9. Gama, J., Medas, P., Rodrigues, P.: Learning decision trees from dynamic data streams. In: Proc. of the 2005 ACM Symposium on Applied Computing, Santa Fe, New Mexico, pp. 573-577 (2005)
10. Chen, L., Yang, Z., Xue, L.: OcVFDT: one-class very fast decision tree for one-class classification of data streams. In: Proc. of the Third International Workshop on Knowledge Discovery from Sensor Data, pp. 79-86. ACM (2009)


11. Hashemi, S., Yang, Y.: Flexible decision tree for data stream classification in the presence of concept change, noise and missing values. Data Min. Knowl. Discov. 19(1), 95-131 (2009)
12. Bradford, J., Kunz, C., Kohavi, R., Brunk, C., Brodley, C.: Pruning Decision Trees with Misclassification Costs. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 131-136. Springer, Heidelberg (1998)
13. Oza, N., Russell, S.: Online bagging and boosting. In: Artificial Intelligence and Statistics 2001, pp. 105-112. Morgan Kaufmann (2001)
14. Kirkby, R.: Improving Hoeffding Trees. PhD thesis, University of Waikato, New Zealand (2008)
15. Chernoff, H.: A measure of asymptotic efficiency for tests of a hypothesis based on the sums of observations. Annals of Mathematical Statistics 23, 493-507 (1952)
