Professional Documents
Culture Documents
Abstract. This paper discusses data mining techniques to process a medical dataset and identify the relevance of liver disorder and drinking alcohol drink by classification of blood test data. We have used four different classification methods including decision tree, Bayesian algorithms (Naive Bayes and Bayesian Networks), Neural Network classification and Rough Sets methods. To evaluate the methods, we have used the Waikato Environment for Knowledge Analysis (WEKA) open source tool. WEKA is a collection of machine learning algorithms that can be used for different processing tasks such as classification, and clustering. Bayesian algorithms and Neural Network classification methods are implemented with WEKA. However, as WEKA does not support methods based on Rough Sets we have used Rosetta. We have provided an evaluation based on applying these classification methods to our dataset and measuring the accuracy of test results. The evaluation results show that using Neural Networks obtains the best result among the other methods. Keywords: Classification, Decision tree methods, Bayesian algorithms, Neural Network, Rough Sets.
1. Introduction
In this paper we process a blood test dataset and use different classification methods to learn from the test data set and develop a system that is able to identify the existing of a liver disorder by processing the blood test data. The Liver disorder dataset consists of a set of blood test results that are used to describe the liver disorders that might arise from extreme alcohol consumption and find any relationship between alcohol consumption and liver disease. This dataset has some attributes that show values depend on red blood cell volume, hydrolase enzyme, Serum glutamate primitive transaminase; that high level of this kind of enzyme released into the blood may be a sign of liver damage, Aspartate transaminase that it is another enzyme associated with liver parenchyma cells, Gamma-glutamyl transpeptidase An enzyme that catalyzes the transfer of a -glutamyl group from glutathione or -glutamyl peptide to another peptide or amino acid; that high level value is the best single screening assay for detecting latent or chronic liver disease. The Waikato Environment for Knowledge Analysis (WEKA) is a collection of machine learning algorithms that can be used for processing tasks such as classification and clustering. In this paper, we use decision tree methods that contain J48 that is implemented of C.4.5 and LMT type. Bayes algorithms and Neural Network classification methods, Naive Bayes and Bayesian methods (i.e. MLP and RBF) are implemented in WEKA. We used Rosetta for implementing Rough Set methods. We divided our data set to training data (66%) and all of methods tested and compared according 34% of this data.
This section describes the classification methods are used in this paper. We discuss each method and explain how the method has been used in our experiment. Decision tree Decision Trees (DT) tree learning algorithms work based on processing and deciding upon attributes of the data. Attributes in DT are nodes and each leaf node is representing a classification. Two algorithms namely J48 and LMT were used in our experiments [1-2]. LMT is a method based on DT that final nodes are replaced with logistic regression functions. After modelling result, comparison of results analyzes shows the approach that provides a better performance. J48 is an implementation of C4.5 in WEKA. C4.5 uses information entropy concept [3]. The disadvantages of DT are focus on continues attributes, computational efficiently with growing tree size. According to comparison provided for different classification methods in emotion recognition [4], DT is the best classifier method on that group with 15 attributes. Artificial Neural Networks Artificial Neural Networks (ANN) are one of the common classification methods in data mining. To employ Neural Network based classifiers, Multi Layer Perceptron (MLP) and Radial Base Function (RBF) were used in this work. MLP is a feed forward network that makes a model to map input data to output data. Hidden layer in MLP can include various layers between input and output. The structure of MLP is shown in Figure 1. RBF is another type on ANN. The input of NN in RBF is linear and the output is nonlinear. The output of this type of ANN is taken from weighted sum of hidden layers output. The RBF networks are divided in two feed-forward layer. Figure 2 is illustrates the structure of this networks (The figure are adapted from [5]).
2.1.
2.2.
In [5], Avellaneda, et al compared four types of Neural Network classifiers. The experiment data included 700 records with nine attributes. Avellaneda, et al reported that that MLP obtained the best result in their experiment. Rough Sets Rough set method is another classification technique in data mining. This section describes some basic concepts and features of rough set theory that are important to describe the classification method. A suite of methodologies that is useful for characterization of data that is not accurate has been provided by Rough set theory. One of the main aspects in rough set methods is providing a formal framework for automated transformation of data into knowledge for changing the modules [6]. Rough set theory offers mathematical tools to discover hidden patterns in data that can be used in data mining. The main goal of the rough set analysis is providing an estimate of data [7]. The rough set theory has been used for several types of data with various sizes. In [7] a Rough Set method is tested on small, medium and large size of data. The results show that Rough Sets are useful for small and medium size datasets [7]. As the dataset that we have used for liver diseases is small size, we also consider applying Rough Set based classification in our evaluations. Bayesian methods Bayesian methods are also used as one of the classification solutions in data mining. In our work we use two main Bayesian methods namely naive Bayes and Bayesian networks that are implemented in WEKA software for classification [8].
2.3.
2.4.
63
A naive Bayes classifier could be defined as an independent feature model deals with a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. There are several models that make different assumption fitting for Nave Bayes [9] [10].
J48 53.28% 57.03% 54.85% 57.14% 64.49% 60.74% 66.33% 63.23% 79.41%
LMT 55.59% 59.62% 56.54% 69.45% 71.00% 70.37% 59.40% 69.17% 74.47%
RBF 61.84% 62.96% 64.13% 67.98% 65.68% 62.96% 67.32% 67.64% 79.41%
Fig 3. The accuracy is increased fluctuated during raising training set size
We have also considered different attributes and their affected in liver disorder dataset classification. According to the attribute selection methods such as best first and ranker methods, some attributes including SGOT, ALK, MCV and SGPT are removed and then the classification methods are applied. The result of this experiment is shown in Table 2. Table 2. Accuracy of all methods in different feature sizes Number Of features 3 4 5 6 7 J48 76.47% 82.35% 79.41% 79.41% 79.41% LMT 76.47% 59.40% 82.35% 73.52% 74.47% Bayes Net 76.47% 82.35% 76.47% 76.47% 76.47% Nave Bayes 76.47% 82.35% 76.47% 76.47% 76.47% MLP (5 hidden layer) 76.47% 76.47% 88.23% 91.17% 79.41% RBF 73.52% 79.41% 79.41% 82.35% 79.41% Rough Set 58.82% 61.76% 64.70% 73.52% 76.47%
64
As shown in Fig.4 there was not a clear effect in accuracy of some of the methods with increasing feature sizes; however the performance of MLP is raised sharply by using only 6 features. MLP showed high accuracy at 91.17 %; in addition RBF had a slight improvement by using only 6 features. As we can see, there is not significant change in Nave Bayes, Bayes Net, J48 and LMT methods by increasing the number of features.
Figure 4 demonstrate the the accuracy of the evaluations by applying different numver of attributes. This justifies that feature size of data set have important role in accuracy and in particular is significantly affects the performance of the Rough Set method.
5. References
[1] K. Golnabi, et al., "Analysis of firewall policy rules using data mining techniques," 2006, pp. 305-315. [2] H. Li, et al., "Data Mining Techniques for Complex Formation Evaluation in Petroleum Exploration and
Production: A Comparison of Feature Selection and Classification Methods," 2008, pp. 37-43.
[3] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set
theory," International Journal of Uncertainty Fuzziness and Knowledge-Based Systems, vol. 12, pp. 3746, 2004.
[4] T. Justin, et al., "Comparison of different classification methods for emotion recognition," pp. 700-703. [5] D. A. Avellaneda, et al., "Natural Texture Classification: A Neural Network Models Benchmark," 2009,
pp. 325-329.
[6] J. F. Peters and S. Ramanna, "Towards a software change classification system: A rough set approach,"
65
[9] G. Qiang, "An Effective Algorithm for Improving the Performance of Naive Bayes for Text
http://archive.ics.uci.edu/ml/datasets/Liver+Disorders
66