
EC-9560: DATA MINING

INDIVIDUAL CLASS PROJECT

JAYAKODY N. M.
2014/E/011
SEMESTER 7
12/02/2018
1. BRIEF DESCRIPTION ABOUT THE DATASET
The train data set contains 280 attributes, including the class label, with attributes such as age, sex, height and weight. Some of the attributes are defined as numeric while the rest are defined as nominal. The numeric attributes contain numerical values, while most of the nominal attributes take one of two values, zero or one. Numeric attributes include age, height, weight, etc., and nominal attributes include sex, chDI_RRwaveExists, the class label, etc.
The data set contains 16 classes, numbered from 1 to 16. The train data set contains the attribute details of 480 patients, but certain attributes, such as T, P and QRST, contain missing values.
The following table shows the number of instances of each class,
Table 1 Number of instances of each class
Class a b c d e f g h i j k l m n o p
Instances 221 43 14 15 13 24 3 2 8 44 0 0 0 4 4 21
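The attribute types and the class distribution above can also be inspected through Weka's Java API. The following is a minimal sketch, assuming the training data is stored in a file named train.arff (the file name is an assumption) with the class label as the last attribute.

import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectDataset {
    public static void main(String[] args) throws Exception {
        // Load the training data; "train.arff" is an assumed file name.
        Instances data = new DataSource("train.arff").getDataSet();
        // The class label is assumed to be the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Instances : " + data.numInstances());

        // Count numeric vs. nominal attributes.
        int numeric = 0, nominal = 0;
        for (int i = 0; i < data.numAttributes(); i++) {
            if (data.attribute(i).isNumeric()) numeric++;
            else if (data.attribute(i).isNominal()) nominal++;
        }
        System.out.println("Numeric: " + numeric + ", nominal: " + nominal);

        // Number of instances per class (as in Table 1).
        AttributeStats stats = data.attributeStats(data.classIndex());
        for (int i = 0; i < data.numClasses(); i++) {
            System.out.println("Class " + data.classAttribute().value(i)
                    + ": " + stats.nominalCounts[i] + " instances");
        }
    }
}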

2. BRIEF DESCRIPTION ON WHETHER IT IS CLASSIFICATION, REGRESSION OR CLUSTERING
The intention of this project is to predict the class of a given set of data, which makes regression inapplicable, since regression is a data mining technique that predicts a number rather than a class. The training data set used here is a supervised data set with predefined class labels, which in turn makes clustering inapplicable as well: clustering is an unsupervised data mining technique that partitions a set of data into meaningful sub-groups called clusters. Therefore, classification is used as the data mining technique; classification assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each instance, and a classification task is performed with a data set whose class assignments are known.

3. BRIEF DESCRIPTION OF THE PRE-PROCESSING TECHNIQUE


As pre-processing, the ‘ReplaceMissingValues’ filter was chosen. Since certain attributes of the data set contain missing values, this filter was applied; it imputes each missing numeric value with the mean of that attribute’s distribution (and each missing nominal value with the attribute’s mode).
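A minimal sketch of applying this filter with the Weka Java API is given below (equivalent to selecting weka.filters.unsupervised.attribute.ReplaceMissingValues in the Explorer); the train.arff file name is an assumption.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet(); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // ReplaceMissingValues imputes missing numeric values with the attribute mean
        // and missing nominal values with the attribute mode.
        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, filter);

        System.out.println("Instances after filtering: " + cleaned.numInstances());
    }
}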
4. BRIEF DESCRIPTION OF THE MACHINE LEARNING METHOD
SMO (Sequential Minimal Optimization)
SMO implements the sequential minimal optimization algorithm for training a support vector classifier. This technique globally replaces all missing values and transforms the nominal attributes into binary ones. It also normalizes all attributes by default, so the coefficients in the output are based on the normalized data, not the original data.
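A minimal sketch of training the SMO classifier on the training data is given below; default parameters are used, and the file name is again an assumption.

import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainSMO {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet(); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // SMO replaces missing values, binarises nominal attributes and
        // normalises all attributes internally, so the data can be passed as-is.
        SMO smo = new SMO();
        smo.buildClassifier(data);

        // Prints the learned machines; coefficients refer to the normalised data.
        System.out.println(smo);
    }
}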

5. EVALUATION CRITERIA USED TO VALIDATE THE MODEL


The evaluation criterion used to choose the model is the accuracy of prediction. The model with the highest accuracy was selected as the final model. The following table shows the accuracy of a few of the classifiers tried out with cross validation – 10 folds.
Table 2 Accuracy of classifiers with cross validation - 10 folds
Classifier Accuracy
Naïve Bayes 61.2981 %
SMO 69.7115 %
J48 63.4615 %
Random Forest 64.1827 %
Naïve Bayes Updateable 61.2981 %
Therefore, SMO was chosen as the classifier, since it showed higher accuracy than the rest of the classifiers.
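The comparison in Table 2 can be reproduced with a sketch along the following lines, which evaluates each classifier with 10-fold cross-validation and reports the percentage of correctly classified instances (the file name and random seed are assumptions).

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet(); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = {
            new NaiveBayes(), new SMO(), new J48(),
            new RandomForest(), new NaiveBayesUpdateable()
        };
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation with a fixed seed for repeatability.
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-25s %.4f %%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}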

6. PARAMETERS OF THE MODEL AND TUNING PARAMETERS


The following are the parameters of the model (a sketch of setting a few of them programmatically follows the list),
Parameters:
buildLogisticModels: whether to fit logistic models to the outputs.
c: the complexity parameter C.
checksTurnedOff: turns time-consuming checks off - use with caution.
debug: if set to true, the classifier may output additional info to the console.
epsilon: the epsilon for round-off error.
filterType: determines how/if the data will be transformed.
kernel: the kernel to use.
numFolds: the number of folds for cross-validation used to generate training data for
logistic models (-1 means use training data).
randomSeed: the random number seed for the cross-validation.
toleranceParameter: the tolerance parameter.
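A sketch of setting a few of these parameters through the Weka Java API is shown below; the specific values are illustrative assumptions (Weka's defaults), not the tuned values of the final model.

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.SelectedTag;

public class ConfigureSMO {
    public static void main(String[] args) throws Exception {
        SMO smo = new SMO();

        smo.setC(1.0);                      // complexity parameter C (illustrative value)
        smo.setToleranceParameter(1.0e-3);  // tolerance parameter
        smo.setEpsilon(1.0e-12);            // epsilon for round-off error
        smo.setRandomSeed(1);               // seed for the internal cross-validation
        smo.setNumFolds(-1);                // -1: use the training data for the logistic models

        // filterType: normalise the training data (the default transformation).
        smo.setFilterType(new SelectedTag(SMO.FILTER_NORMALIZE, SMO.TAGS_FILTER));

        // Kernel: linear polynomial kernel (exponent 1), Weka's default.
        PolyKernel kernel = new PolyKernel();
        kernel.setExponent(1.0);
        smo.setKernel(kernel);

        System.out.println(String.join(" ", smo.getOptions()));
    }
}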

Tuning of the parameters was done by trial and error, comparing the resulting accuracies. The highest accuracy was obtained with cross validation – 10 folds.
Table 3 Results with different test options
Test Option Accuracy
Cross Validation – 5 folds 69.2308 %
Cross Validation – 10 folds 69.7115 %
Percentage split – 66% 64.539 %
Percentage split – 75% 66.3462 %
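The percentage-split figures in Table 3 can be obtained with a sketch such as the one below, which holds out the remaining portion of a randomised copy of the data as a test set; the seed and file name are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplit {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet(); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        double trainPercent = 66.0;             // e.g. 66% train / 34% test
        Instances randomised = new Instances(data);
        randomised.randomize(new Random(1));    // assumed seed

        int trainSize = (int) Math.round(randomised.numInstances() * trainPercent / 100.0);
        int testSize = randomised.numInstances() - trainSize;
        Instances train = new Instances(randomised, 0, trainSize);
        Instances test = new Instances(randomised, trainSize, testSize);

        SMO smo = new SMO();
        smo.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(smo, test);
        System.out.printf("Percentage split %.0f%%: %.4f %% correct%n",
                trainPercent, eval.pctCorrect());
    }
}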

7. CROSS-VALIDATION
Cross validation is a technique used to evaluate predictive models by partitioning the original sample into a training set and a test set.
In k-fold cross validation, the original sample is randomly partitioned into k equal-sized subsamples. Of these subsamples, one is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The process is repeated k times, with each subsample used exactly once as the validation data. The k results can then be averaged to produce a single estimate.
The purpose of cross validation is to validate the model, not to build it.
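In Weka, the k-fold cross-validation described above can be run programmatically as sketched below for k = 10 (or k = 5); the summary and confusion matrix it prints correspond to the kind of output shown in Figures 1 and 2. The file name and seed are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet(); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        int k = 10; // number of folds; use 5 for 5-fold cross-validation
        Evaluation eval = new Evaluation(data);
        // Each of the k subsamples is used exactly once as the validation fold.
        eval.crossValidateModel(new SMO(), data, k, new Random(1));

        System.out.println(eval.toSummaryString()); // overall accuracy, error rates, etc.
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}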

5-fold cross validation

Figure 1 Summary of the classification with 5-fold cross validation


Figure 2 Confusion Matrix of the classification with 5-fold cross validation

8. PREDICTIONS OF THE TEST DATA SET


The predicted class of each of the 36 test instances is as follows,
Instance         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
Predicted class  1 10  1  1  2  1  1  1  1  1  1 10  1 10 14  1 10  3
Instance        19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
Predicted class  1  1  1  1  1  1  1  1  1 10  1  1  1  1  1  8  1  1

ACCURACY
The number of correct predictions was counted by comparing the predicted values shown above with the given testlabel.arff file.

Number of correct predictions: 29

Total number of instances: 36

Accuracy: (29/36)*100 = 80.5556 %
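A sketch of producing these predictions and computing the accuracy against the given testlabel.arff file is shown below. The names train.arff and test.arff (the unlabelled test set) are assumptions; testlabel.arff is the file mentioned above and is assumed to carry the true class as its last attribute, with the same value ordering as in the training data.

import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PredictTestSet {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getDataSet();  // assumed file name
        Instances test = new DataSource("test.arff").getDataSet();    // assumed file name
        Instances labels = new DataSource("testlabel.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);
        labels.setClassIndex(labels.numAttributes() - 1);

        // Same pre-processing as in the training phase; the test batch reuses the
        // means and modes computed on the training data.
        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(train);
        train = Filter.useFilter(train, filter);
        test = Filter.useFilter(test, filter);

        SMO smo = new SMO();
        smo.buildClassifier(train);

        int correct = 0;
        for (int i = 0; i < test.numInstances(); i++) {
            double pred = smo.classifyInstance(test.instance(i));
            // Assumes testlabel.arff uses the same class value ordering as the training data.
            double actual = labels.instance(i).classValue();
            System.out.println((i + 1) + " predicted: "
                    + test.classAttribute().value((int) pred));
            if (pred == actual) correct++;
        }
        System.out.println("Correct: " + correct + " / " + test.numInstances()
                + " = " + (100.0 * correct / test.numInstances()) + " %");
    }
}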
