International Journal of Advances in Science and Technology, Vol. 3, No.4, 2011
College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain angeline_christobel@yahoo.com
College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain ama_usharani@yahoo.com
College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain kalai_hasan@yahoo.com
Abstract
Classification is a data mining technique that assigns objects to predefined categories based on their features. Decision tree induction is one of the most popular classification algorithms in data mining and machine learning, and it represents its results as a tree. In this research paper, the performance of decision tree induction classifiers such as C4.5, CART, AD Tree and Random Forest is analyzed on various medical data sets. The algorithms are evaluated based on Accuracy, Error rate and Execution time.
October Issue
... segments. A decision tree has a flowchart-like structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label [2]. Decision trees are simple yet powerful for multivariable analysis. They can handle high-dimensional data and provide good accuracy. Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy and molecular biology [2]. Decision trees are also the basis of several commercial rule induction systems. The C4.5, CART, AD Tree and Random Forest algorithms are discussed below.
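The structure described above (internal nodes that test an attribute, branches for the outcomes, leaves that hold a class label) can be sketched in a few lines of Python. This is only an illustration: the attribute names, thresholds and labels are hypothetical, and the binary numeric test is an assumption rather than the splitting rule of any particular algorithm in this paper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class label held by a leaf node


@dataclass
class Node:
    attribute: str             # attribute tested at this internal node
    threshold: float           # hypothetical numeric split point
    low: Union["Node", Leaf]   # branch taken when value <= threshold
    high: Union["Node", Leaf]  # branch taken when value > threshold


def classify(tree, instance):
    """Follow one branch per test until a leaf is reached."""
    while isinstance(tree, Node):
        value = instance[tree.attribute]
        tree = tree.low if value <= tree.threshold else tree.high
    return tree.label


# Hypothetical two-level tree; names and thresholds are illustrative only.
toy_tree = Node(
    "glucose", 127.0,
    low=Leaf("negative"),
    high=Node("bmi", 30.0, low=Leaf("negative"), high=Leaf("positive")),
)

print(classify(toy_tree, {"glucose": 150.0, "bmi": 33.5}))  # positive
```

Each call to `classify` visits exactly one root-to-leaf path, which is why a prediction from a single tree can be read off as a short chain of attribute tests.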
The alternating decision tree can use all the weak hypotheses produced by boosting to arrive at a single, easily understood representation.
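As a rough illustration of how such a representation is evaluated, the sketch below assumes the usual formulation of an alternating decision tree: prediction nodes carry a real-valued contribution, splitter nodes carry a test, and the score of an instance is the sum of the contributions along every path it satisfies, with the sign giving the class. The tests and values are hypothetical and are not taken from the experiments in this paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PredictionNode:
    value: float  # contribution of one weak hypothesis outcome
    splitters: List["SplitterNode"] = field(default_factory=list)


@dataclass
class SplitterNode:
    test: Callable[[dict], bool]  # condition checked on the instance
    if_true: PredictionNode
    if_false: PredictionNode


def score(node, instance):
    """Sum prediction values over all paths the instance satisfies."""
    total = node.value
    for splitter in node.splitters:
        child = splitter.if_true if splitter.test(instance) else splitter.if_false
        total += score(child, instance)
    return total


# Hypothetical tree: a root prediction plus two splitters evaluated side by side.
root = PredictionNode(0.5, [
    SplitterNode(lambda x: x["age"] < 40,
                 PredictionNode(-0.75), PredictionNode(0.25)),
    SplitterNode(lambda x: x["bmi"] > 30,
                 PredictionNode(0.5), PredictionNode(-0.25)),
])

s = score(root, {"age": 35, "bmi": 32})
print(s, "positive" if s > 0 else "negative")  # 0.25 positive
```

Unlike an ordinary decision tree, several splitters under the same prediction node are all evaluated, so one instance can contribute along more than one path.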
3. Performance Evaluation
The performance of a classifier depends on the characteristics of the data to be classified. Empirical methods for comparing classifiers include holdout, random subsampling, k-fold cross validation and bootstrap. In this study, we selected k-fold cross validation to evaluate the classifiers. In k-fold cross validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds d1, d2, ..., dk, each of approximately equal size. Training and testing are performed k times. In the first iteration, subsets d2, ..., dk collectively serve as the training set to obtain a first model, which is tested on d1; in the second iteration, the model is trained on subsets d1, d3, ..., dk and tested on d2; and so on [2]. The accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data [2]. The performance of the selected algorithms is measured by the Accuracy and Error rate computed from the resulting confusion matrix. The time taken to build the model is also considered for comparison. Accuracy and Error are calculated as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Error = (FP + FN) / (TP + FP + TN + FN)

where
TP is the number of True Positives,
TN is the number of True Negatives,
FP is the number of False Positives,
FN is the number of False Negatives.
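The two formulas can be checked with a small worked example. The confusion-matrix counts below are hypothetical and chosen only to make the arithmetic easy to follow; the function names are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of instances classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def error_rate(tp, tn, fp, fn):
    """Fraction of instances classified incorrectly."""
    return (fp + fn) / (tp + fp + tn + fn)


# Hypothetical counts for a two-class problem with 100 test instances.
tp, tn, fp, fn = 40, 45, 5, 10

print(accuracy(tp, tn, fp, fn))    # 0.85
print(error_rate(tp, tn, fp, fn))  # 0.15
```

Because both measures share the same denominator and their numerators cover every cell of the confusion matrix, Accuracy and Error always sum to 1.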
4. Experimental Results
In this paper, 10-fold cross validation is applied to evaluate the performance of the classifiers. The classification models were developed using the data mining tool Weka version 3.6. We used three datasets, Pima-Diabetes, Heart-statlog and Sick, obtained from the UCI Machine Learning Repository [11]. An attribute selection algorithm was applied to the datasets as a preprocessing step. Table 1 shows the characteristics of the datasets: the number of instances, the number of attributes and the number of classes.

Table 1: Characteristics of datasets

Dataset          No. of Instances   No. of Attributes   No. of Classes
Pima-diabetes    768                9                   2
Heart-statlog    270                14                  2
Sick             3772               30                  2
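To make the 10-fold protocol concrete, the following sketch partitions index ranges of the sizes listed in Table 1 into ten mutually exclusive folds and iterates over the train/test splits. It is a plain, unstratified partition written for illustration; whether the folds in the experiments were stratified is not stated here, and the function names and seed are assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Dataset sizes taken from Table 1.
for name, n in [("Pima-diabetes", 768), ("Heart-statlog", 270), ("Sick", 3772)]:
    fold_sizes = sorted({len(test) for _, test in train_test_splits(n, 10)})
    print(name, fold_sizes)
# Pima-diabetes [76, 77]
# Heart-statlog [27]
# Sick [377, 378]
```

Every instance is used for testing exactly once and for training nine times, so the reported accuracy is computed over all instances of a dataset.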
Table 2 shows the accuracy of the classifiers and Figure 1 shows the accuracy in graphical format.

Table 2: Accuracy of classifiers

Accuracy (%)
Dataset          C4.5     CART     AD Tree   Random Forest
Pima-diabetes    73.83    75.13    72.92     73.83
Heart-statlog    76.66    78.52    78.52     78.15
Sick             98.81    98.88    98.06     98.38
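Since Accuracy and Error share the same denominator in Section 3, the error rate implied by a reported accuracy is simply its complement. The short sketch below derives those implied percentages from the Table 2 values, which can serve as a cross-check against separately reported error rates; the function name is illustrative.

```python
# Accuracy (%) from Table 2, ordered as Pima-diabetes, Heart-statlog, Sick.
ACCURACY_PCT = {
    "C4.5": (73.83, 76.66, 98.81),
    "CART": (75.13, 78.52, 98.88),
    "AD Tree": (72.92, 78.52, 98.06),
    "Random Forest": (73.83, 78.15, 98.38),
}


def implied_error_pct(accuracy_pct):
    """Error (%) = 100 - Accuracy (%), rounded to two decimals."""
    return round(100.0 - accuracy_pct, 2)


for name, accuracies in ACCURACY_PCT.items():
    print(name, [implied_error_pct(a) for a in accuracies])
```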
The error rate of the classifiers is shown in Table 3 and the graphical representation is shown in Figure 2.

Table 3: Error rate of the classifiers

Error rate (%)
Dataset          C4.5     CART     AD Tree   Random Forest
Pima-diabetes    2.62     2.48     2.71      2.62
Heart-statlog    2.33     2.15     2.15      2.18
Sick             1.19     0.111    0.193     1.01
Table 4 shows the execution time taken by each classifier to build the model. Figure 3 shows a graphical representation of the time taken to build the classification models.

Table 4: Execution time to build the model

Execution time (secs)
Dataset          C4.5     CART     AD Tree   Random Forest
Pima-diabetes    0.19     0.91     0.36      0.49
Heart-statlog    0.09     0.2      0.17      0.11
Sick             0.86     13.09    2.75      1.42
Figure 3: Graphical comparison of classifiers based on time to build the models

The experimental analysis shows that the CART algorithm yields the highest accuracy among C4.5, CART, AD Tree and Random Forest on both the small and the large datasets, tying with AD Tree on Heart-statlog. Figure 3 shows that CART takes more time to build a model than the other classifiers on these medical datasets. For the classification of medical data, accuracy is the most important factor. Hence CART is the best of the evaluated inductive learning algorithms for diagnosing medical data.
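The comparison above can be reproduced directly from the reported numbers. The sketch below takes the accuracy values from Table 2 and the build times from Table 4, then finds the most accurate classifier per dataset, the classifier with the highest mean accuracy, and the one with the largest total build time. The helper name and the use of a simple mean are illustrative choices, not part of the original analysis.

```python
from statistics import mean

DATASETS = ["Pima-diabetes", "Heart-statlog", "Sick"]

# Accuracy (%) from Table 2 and build time (secs) from Table 4.
ACCURACY_PCT = {
    "C4.5": [73.83, 76.66, 98.81],
    "CART": [75.13, 78.52, 98.88],
    "AD Tree": [72.92, 78.52, 98.06],
    "Random Forest": [73.83, 78.15, 98.38],
}
BUILD_TIME_S = {
    "C4.5": [0.19, 0.09, 0.86],
    "CART": [0.91, 0.2, 13.09],
    "AD Tree": [0.36, 0.17, 2.75],
    "Random Forest": [0.49, 0.11, 1.42],
}


def best_per_dataset(table):
    """Return {dataset: (best value, [classifiers achieving it])}."""
    result = {}
    for i, dataset in enumerate(DATASETS):
        best = max(values[i] for values in table.values())
        winners = [name for name, values in table.items() if values[i] == best]
        result[dataset] = (best, winners)
    return result


best = best_per_dataset(ACCURACY_PCT)
mean_accuracy = {name: mean(values) for name, values in ACCURACY_PCT.items()}
most_accurate = max(mean_accuracy, key=mean_accuracy.get)
total_time = {name: sum(values) for name, values in BUILD_TIME_S.items()}
slowest = max(total_time, key=total_time.get)

for dataset, (value, winners) in best.items():
    print(dataset, value, winners)
print("highest mean accuracy:", most_accurate)
print("largest total build time:", slowest)
```

On these figures CART is both the most accurate classifier on average and the slowest to build, which is the trade-off the paragraph above describes.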
5. Conclusions
In real-world applications, data mining is gaining popularity in almost all fields. Classification is one of the most interesting topics in knowledge discovery because it can classify data accurately and efficiently. Since decision trees generate understandable rules and perform classification without much computation, they are popular among classification methods. In this paper, decision tree classifiers are studied and experiments are conducted on three datasets, Pima-Diabetes, Heart-statlog and Sick, obtained from the UCI Machine Learning Repository. Accuracy and Error rate are validated by 10-fold cross validation. The experimental results show that CART is the best of the evaluated classifiers for medical data, and that CART also performs well on large datasets.
6. References
[1] Kietikul Jearanaitanakij, Classifying Continuous Data Set by ID3 Algorithm, Proceedings of the Fifth International Conference on Information, Communication and Signal Processing, 2005.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, USA, 2006.
[3] A.K. Pujari, Data Mining Techniques, University Press, India, 2001.
[4] Agrawal, R., Imielinski, T., Swami, A., Database Mining: A Performance Perspective, IEEE Transactions on Knowledge and Data Engineering, pp. 914-925, December 1993.
[5] Chen, M., Han, J., Yu, P.S., Data Mining: An Overview from Database Perspective, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[6] Quinlan, J.R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[7] S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica 31, pp. 249-268, 2007.
[8] J.R. Quinlan, Improved Use of Continuous Attributes in C4.5, Journal of Artificial Intelligence Research, 4:77-90, 1996.
[9] J.R. Quinlan, Induction of Decision Trees, in Jude W. Shavlik, Thomas G. Dietterich (Eds.), Readings in Machine Learning, Morgan Kaufmann, 1990. Originally published in Machine Learning, Vol. 1, 1986, pp. 81-106.
[10] Salvatore Ruggieri, Efficient C4.5, IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444, 2002.
[11] UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/datasets]
[12] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg, Top 10 Algorithms in Data Mining, Springer, 2007.
[13] Thair Nu Phyu, Survey of Classification Techniques in Data Mining, International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, Hong Kong.
[14] Breiman, L., Random Forests, Machine Learning, 45(1), pp. 5-32, 2001.
[15] Mohammad Tari, Behrouz Minaei, Ahmad Farahi, Mohammad Niknam Pirzadeh, Prediction of Students' Educational Status Using CART Algorithm, Neural Network, and Increase in Prediction Precision Using Combinational Model, IJCSNS International Journal of Computer Science and Network Security, Vol. 11, No. 6, June 2011.
[16] Witten, I.H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, CA, USA, 2000, 371 pp.
[17] Suneetha N., Hari V.M.K. and Kumar V.S., Modified Gini Index Classification: A Case Study of Heart Disease Dataset, International Journal on Computer Science and Engineering, Vol. 2, Issue 6, pp. 1959-1965, 2010.
[18] Utgoff, P., Berkman, N., Clouse, J., Decision Tree Induction Based on Efficient Tree Restructuring, Machine Learning, Vol. 29, Issue 1, pp. 5-44, 1997.
[19] Anyanwu, M. and Shiva, S., Application of Enhanced Decision Tree Algorithm to Churn Analysis, International Conference on Artificial Intelligence and Pattern Recognition (AIPR-09), Orlando, Florida.
[20] Lewis, R.J., An Introduction to Classification and Regression Tree (CART) Analysis, Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, California, 2000.
Authors Profile
Ms. Angeline Christobel is working as an Asst. Professor at AMA International University, Bahrain. She is currently pursuing her research at Karpagam University, India. Her research interests are data mining, web mining and neural networks.
Ms. Usha Rani is working as an Asst. Professor at AMA International University, Bahrain. Her research interests are data mining and software engineering.
Ms. Kalaichelvi Chandrahasan is working as an Assistant Professor at AMA International University, Bahrain. Her research interests are data mining and web mining.