
International Journal of Advances in Science and Technology, Vol. 3, No. 4, 2011

Performance of Inductive Learning Algorithms on Medical Datasets


Angeline Christobel Y.1, Usha Rani Sridhar2 and Kalaichelvi Chandrahasan3

1 College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain, angeline_christobel@yahoo.com
2 College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain, ama_usharani@yahoo.com
3 College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain, kalai_hasan@yahoo.com

Abstract
Classification is a data mining technique used to assign objects, based on their features, to predefined categories. Decision tree induction is one of the most popular classification approaches in data mining and machine learning; it represents its results in a tree scheme. In this paper, the performance of the decision tree induction classifiers C4.5, CART, AD Tree and Random Forest is analyzed on several medical datasets. The algorithms are evaluated based on accuracy, error rate and execution time.

Keywords: Data Mining, Classification, Induction

1. Introduction


Data mining is the extraction of hidden predictive information from large databases [1]. Classification is the process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown [2]. Classification is also called supervised learning because the learning of the model is supervised: each training instance is labeled with its class. Classification and prediction are the two forms of data analysis that can be used to extract models describing important data classes or to predict future trends; classification predicts categorical (discrete, unordered) labels, while prediction models continuous-valued functions. The most popular classification and prediction methods are decision trees, rule-based methods, Bayesian methods, support vector machines, artificial neural networks, ensemble methods and lazy learners. In the medical field, data mining with decision trees plays a vital role in diagnosing patients' problems. Decision tree classifiers are widely used for the following reasons:

- The classification rules are simple and easy to understand.
- They are faster than other classification methods.
- Their classification accuracy is comparable with that of other methods.

The main objective of this paper is to compare the C4.5, CART, AD Tree and Random Forest algorithms on three medical datasets, Pima-Diabetes, Heart-statlog and Sick, obtained from the UCI Machine Learning Repository [11], based on accuracy, error rate and execution time.

2. Decision Tree Induction


A decision tree is a classification method that generates a tree and a set of rules representing the model of different classes from a given data set. Domain knowledge is not required for the construction of a decision tree. Decision tree algorithms use various ways of splitting the data into branch-like segments. The tree is a flow-chart-like structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label [2]. Decision trees are simple yet powerful tools for multiple-variable analysis; they can handle high-dimensional data and provide good accuracy. Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy and molecular biology [2]. Decision trees also form the basis of several commercial rule induction systems. The C4.5, CART, AD Tree and Random Forest algorithms are discussed below.

2.1 C4.5 Algorithm


C4.5, developed by Ross Quinlan as the successor of ID3, is a popular and powerful decision tree classification algorithm. C4.5 constructs the decision tree with a divide-and-conquer approach, and it addresses the problems of unavailable values, continuous attribute value ranges, pruning of decision trees and rule derivation. In C4.5, every node in a tree is associated with a set of cases, and these cases are assigned weights to take unknown attribute values into account. At the beginning, only the root is present; it is associated with the whole training set, and all weights are equal to one. At each node the divide-and-conquer algorithm is executed, making the locally best choice with no backtracking allowed. To build a decision tree from training records with unknown attribute values, only the records where the attribute values are available are considered; records with unknown attribute values can then be classified by estimating the probability of the various possible results. C4.5 produces trees with a variable number of branches per node: when a discrete variable is chosen as the splitting attribute, there is one branch for each value of that attribute [6, 10].
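As a minimal illustration, the sketch below trains a C4.5 model using J48, Weka's implementation of C4.5 (Weka is the tool used in the experiments of Section 4). The class name and the file name diabetes.arff are assumptions for a dataset in Weka's ARFF format.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class C45Example {
    public static void main(String[] args) throws Exception {
        // Load a dataset in Weka's ARFF format (file name is assumed)
        Instances data = new Instances(new BufferedReader(new FileReader("diabetes.arff")));
        // Take the last attribute as the class label
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is Weka's implementation of C4.5; these are its default options
        J48 tree = new J48();
        tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});
        tree.buildClassifier(data);

        // Print the induced tree and its rules
        System.out.println(tree);
    }
}
```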

2.2 CART Algorithm


CART (Classification and Regression Trees) is a binary recursive partitioning algorithm that uses the divide-and-conquer methodology to construct classification and regression trees. It was developed by Breiman, Friedman, Olshen and Stone in 1984; the Bayesian model is a precursor to the CART algorithm. The CART algorithm consists of a set of rules for splitting each node in a tree, deciding when a tree is complete, and assigning each terminal node a predicted value for regression. Regression-type CART is a tree-based model for continuous variables that uses the sum of squared errors as its splitting criterion. Classification-type CART is for discrete/categorical variables and uses the Gini, entropy and twoing measures to produce completely pure nodes. When growing the tree, the Gini rule is used to split each parent node, and CART repeats the search for each child node until it is not possible to grow the tree any further. Once the maximal tree is grown with sufficient data, CART determines the best tree by pruning. CART handles missing values in the database by substituting "surrogate splitters," back-up rules that closely mimic the action of the primary splitting rules; a surrogate splitter contains information that is typically similar to what would be found in the primary splitter.
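To make the Gini splitting criterion concrete, here is a small self-contained sketch (class and method names are illustrative, not from the paper) that computes the Gini impurity of a node from its class counts. CART prefers the split that most reduces this impurity, weighted by child-node size.

```java
public class GiniExample {
    // Gini impurity of a node, given the number of records of each class at that node
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double sumSquares = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;  // proportion of this class at the node
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;            // 0 for a pure node, larger for a mixed node
    }

    public static void main(String[] args) {
        // A node with 40 records of one class and 10 of the other
        System.out.println(gini(new int[] {40, 10}));  // ~0.32
        // A completely pure node
        System.out.println(gini(new int[] {50, 0}));   // 0.0
    }
}
```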

2.3 ADTree Algorithm


The AD Tree (alternating decision tree) algorithm is an advanced decision tree learning algorithm that makes use of boosting to gain precision. Boosting is a meta-algorithm that combines many weak classifiers to create one strong classifier. For alternating decision trees, boosting adds three nodes to the tree in each iteration; the algorithm determines a place for the new splitter node by analyzing all prediction nodes created by boosting. The alternating decision tree is a graph that is traversed in order to arrive at predictions: to obtain a prediction value, the algorithm takes the sum of all prediction nodes crossed in the traversal.


The alternating decision tree can make use of all the weak hypotheses in boosting to arrive at a single, easily understood representation.
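A hedged usage sketch of Weka 3.6's ADTree implementation follows; the ten boosting iterations (Weka's -B option) and the file name heart-statlog.arff are arbitrary assumptions, not settings from the paper.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.ADTree;
import weka.core.Instances;

public class ADTreeExample {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("heart-statlog.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        ADTree adtree = new ADTree();
        // Each boosting iteration adds one splitter node and two prediction nodes
        adtree.setOptions(new String[] {"-B", "10"});
        adtree.buildClassifier(data);

        // The printed model lists the prediction-node weights summed along each path
        System.out.println(adtree);
    }
}
```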

2.4 Random Forest


A Random Forest is an ensemble learning method that grows many un-pruned classification or regression trees and aggregates their results. The Random Forest algorithm was developed by Leo Breiman and Adele Cutler. It grows the trees in parallel, independently of one another, and is often used on very large datasets with very large numbers of input variables. A random forest model is typically made up of hundreds of decision trees. It does not require tree pruning, and it handles continuous variables, categorical variables and missing values. Random Forest can also be used to generate tree-based clusters through sample proximity. The algorithm is as follows. Let N be the number of trees to build; for each of the N iterations:

1. Select a new bootstrap sample from the training set.
2. Grow an un-pruned tree on this bootstrap sample. Splits are chosen by purity measures: classification uses Gini or deviance, while regression uses squared error.
3. At each internal node, randomly select mtry predictors (the number of predictors to try at each split) and determine the best split using only these predictors.

The overall prediction is made by majority voting (classification) or averaging (regression) over the predictions of the ensemble. Because the algorithm is parallel, several random forests can be run on many machines and their votes aggregated to obtain the final result. It is user-friendly in that it has only two parameters: (i) the number of variables in the random subset, and (ii) the number of trees in the forest. For each tree grown, roughly 33-36% of the samples are not selected in the bootstrap; these are called "out-of-bootstrap" or "out-of-bag" (OOB) samples [14]. Predictions are made using these OOB samples as input, and an OOB estimate of the error rate is computed by aggregating the OOB predictions. Since this gives an internal, unbiased estimate of the test error, cross validation is not necessary. Trees are built until the error no longer decreases; the number of predictors determines the number of trees necessary for good performance.
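As a hedged sketch, Weka 3.6's RandomForest can be configured and trained as below; the 100 trees and the file name sick.arff are illustrative choices, not values reported in the paper.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

public class RandomForestExample {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("sick.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest forest = new RandomForest();
        forest.setNumTrees(100);   // the number of trees in the forest
        forest.setNumFeatures(0);  // mtry: 0 lets Weka choose a default based on log2 of the attribute count
        forest.buildClassifier(data);

        // The model summary includes Weka's out-of-bag error estimate
        System.out.println(forest);
    }
}
```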

3. Performance Evaluation
The performance of classifiers depends on the characteristics of the data to be classified. The empirical tests that can be performed to compare classifiers are holdout, random subsampling, k-fold cross validation and bootstrap methods. In this study, we have selected k-fold cross validation for evaluating the classifiers. In k-fold cross validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds d1, d2, ..., dk, each approximately equal in size. Training and testing are performed k times. In the first iteration, subsets d2, ..., dk collectively serve as the training set to obtain the first model, which is tested on d1; in the second iteration, the model is trained on subsets d1, d3, ..., dk and tested on d2; and so on [2]. The accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data [2]. Performance of the selected algorithms is measured by the Accuracy and Error rate derived from the confusion matrix; the time taken to build the model is also considered for comparison. Accuracy and Error are calculated as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Error = (FP + FN) / (TP + FP + TN + FN)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.
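A minimal sketch of this evaluation procedure in Weka: crossValidateModel performs stratified k-fold cross validation, and the Evaluation object exposes the accuracy and error rate defined above together with the confusion matrix. The J48 classifier and the file name are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossValidationExample {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("diabetes.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross validation of a C4.5 (J48) model
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println("Accuracy:   " + eval.pctCorrect() + " %");   // (TP+TN)/(TP+FP+TN+FN)
        System.out.println("Error rate: " + eval.pctIncorrect() + " %"); // (FP+FN)/(TP+FP+TN+FN)
        System.out.println(eval.toMatrixString("Confusion matrix:"));
    }
}
```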



4. Experimental Results
In this paper, 10-fold cross validation is applied to evaluate the performance of the classifiers. The classification models were developed using the data mining tool Weka, version 3.6. We used three datasets, Pima-Diabetes, Heart-statlog and Sick, obtained from the UCI Machine Learning Repository [11]. An attribute selection algorithm was applied to each dataset to preprocess the data; a sketch of this preprocessing step follows below. Table 1 shows the characteristics of the datasets: the number of instances, the number of attributes and the number of classes each contains.

Table 1: Characteristics of the datasets

Dataset          No. of Instances   No. of Attributes   No. of Classes
Pima-diabetes    768                9                   2
Heart-statlog    270                14                  2
Sick             3772               30                  2
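The paper does not name the attribute selection algorithm used, so the sketch below is one plausible reading: Weka's supervised AttributeSelection filter with the CfsSubsetEval evaluator and BestFirst search, a common Weka combination. The specific evaluator, search method and file name are assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class PreprocessExample {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("sick.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Attribute selection as a preprocessing filter (the concrete evaluator
        // and search used in the paper are not stated; CFS + BestFirst is assumed)
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval());
        filter.setSearch(new BestFirst());
        filter.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, filter);

        System.out.println("Attributes before: " + data.numAttributes());
        System.out.println("Attributes after:  " + reduced.numAttributes());
    }
}
```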

Table 2 shows the accuracy of the classifiers, and Figure 1 shows the same comparison in graphical form.

Table 2: Accuracy of classifiers

                 Accuracy (%)
Dataset          C4.5    CART    AD Tree   Random Forest
Pima-diabetes    73.83   75.13   72.92     73.83
Heart-statlog    76.66   78.52   78.52     78.15
Sick             98.81   98.88   98.06     98.38

Figure 1: Graphical comparison of classifiers based on Accuracy


The error rate of each classifier is shown in Table 3, and the graphical representation is shown in Figure 2.

Table 3: Error rate of the classifiers

                 Error Rate (%)
Dataset          C4.5    CART    AD Tree   Random Forest
Pima-diabetes    2.62    2.48    2.71      2.62
Heart-statlog    2.33    2.15    2.15      2.18
Sick             1.19    0.111   0.193     1.01

Figure 2: Graphical comparison of classifiers based on error rate

Table 4 shows the execution time each classifier took to build its model, and Figure 3 shows the same comparison in graphical form.

Table 4: Execution time to build the model

                 Execution time (secs)
Dataset          C4.5    CART    AD Tree   Random Forest
Pima-diabetes    0.19    0.91    0.36      0.49
Heart-statlog    0.09    0.2     0.17      0.11
Sick             0.86    13.09   2.75      1.42



Figure 3: Graphical comparison of classifiers based on time to build the models

The experimental analysis shows that the CART algorithm yields better accuracy than C4.5, AD Tree and Random Forest on both the small and the large datasets. Figure 3 shows that the time taken by CART to build a model is greater than that taken by the other classifiers on these medical datasets. For the classification of medical data, accuracy is the most important factor; hence CART is the best of the studied inductive learning algorithms for diagnosing medical data.

5. Conclusions
Data mining is gaining popularity in almost all fields of real-world applications. Classification is one of the most interesting topics in knowledge discovery because it classifies data accurately and efficiently. Decision trees are popular among classification methods since they generate understandable rules and perform classification without much computation. In this paper, decision tree classifiers were studied and experiments were conducted on three datasets, Pima-Diabetes, Heart-statlog and Sick, obtained from the UCI Machine Learning Repository. Accuracy and error rate were validated by the 10-fold cross validation method. The experimental results show that CART is the best of the studied classifiers for these medical data, and that CART also performs well on large datasets.

6. References
[1] Kietikul Jearanaitanakij, "Classifying Continuous Data Set by ID3 Algorithm", Proceedings of the Fifth International Conference on Information, Communications and Signal Processing, 2005.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, USA, 2006.
[3] A. K. Pujari, Data Mining Techniques, University Press, India, 2001.
[4] R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
[5] M. Chen, J. Han and P. S. Yu, "Data Mining: An Overview from Database Perspective", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[6] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[7] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica 31, pp. 249-268, 2007.
[8] J. R. Quinlan, "Improved Use of Continuous Attributes in C4.5", Journal of Artificial Intelligence Research, 4:77-90, 1996.
[9] J. R. Quinlan, "Induction of Decision Trees", in Jude W. Shavlik and Thomas G. Dietterich (Eds.), Readings in Machine Learning, Morgan Kaufmann, 1990. Originally published in Machine Learning, Vol. 1, 1986, pp. 81-106.
[10] Salvatore Ruggieri, "Efficient C4.5", IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444, 2002.
[11] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets
[12] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand and Dan Steinberg, "Top 10 Algorithms in Data Mining", Springer, 2007.
[13] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, Hong Kong.
[14] L. Breiman, "Random Forests", Machine Learning, 45(1), pp. 5-32, 2001.
[15] Mohammad Tari, Behrouz Minaei, Ahmad Farahi and Mohammad Niknam Pirzadeh, "Prediction of Students' Educational Status Using CART Algorithm, Neural Network, and Increase in Prediction Precision Using Combinational Model", IJCSNS International Journal of Computer Science and Network Security, Vol. 11, No. 6, June 2011.
[16] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, CA, USA, 2000, 371 pp.
[17] N. Suneetha, V. M. K. Hari and V. S. Kumar, "Modified Gini Index Classification: A Case Study of Heart Disease Dataset", International Journal on Computer Science and Engineering, Vol. 2, Issue 6, pp. 1959-1965, 2010.
[18] P. Utgoff, N. Berkman and J. Clouse, "Decision Tree Induction Based on Efficient Tree Restructuring", Machine Learning, Vol. 29, Issue 1, pp. 5-44, 1997.
[19] M. Anyanwu and S. Shiva, "Application of Enhanced Decision Tree Algorithm to Churn Analysis", International Conference on Artificial Intelligence and Pattern Recognition (AIPR-09), Orlando, Florida, 2009.
[20] R. J. Lewis, "An Introduction to Classification and Regression Tree (CART) Analysis", Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, California, 2000.

Authors Profile
Ms. Angeline Christobel is working as an Assistant Professor at AMA International University, Bahrain. She is currently pursuing her research at Karpagam University, India. Her research interests are in Data mining, Web mining and Neural networks.

Ms. Usha Rani is working as an Assistant Professor at AMA International University, Bahrain. Her research interests are in Data mining and Software Engineering.

Ms. Kalaichelvi Chandrahasan is working as an Assistant Professor at AMA International University, Bahrain. Her research interests are in Data mining and Web mining.

