International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 5, May - 2015. ISSN 2348 4853

CFS based Feature Subset Selection for Software Maintenance Prediction
Shafali Manchanda, Anuradha Chug
University School of Information and Communication Technology, Guru Gobind Singh Indraprastha
University, Dwarka, New Delhi, India
shafalimanchanda@gmail.com, a_chug@yahoo.co.in
ABSTRACT
Feature subset selection is the process of picking a subset of significant features for use in model
construction. Software engineers use numerous software metrics to analyze the characteristics of
software and use them for categorization and prediction. Not all metrics carry equal
significance, and using all of them for analysis not only strains the budget but also wastes
considerable time and effort. In this study our main concern is to reduce the number of metrics needed to
predict the change required to improve structural quality using feature subset selection techniques.
To analyze the results of feature subset selection techniques and identify the best technique in
various circumstances, several machine learning algorithms are used. The impact of all the
combinations of various feature subset selection techniques and machine learning algorithms is
also compared for best results. This paper will help software engineers to predict the change
requirement to improve the structural quality of the software using a small set of relevant
software metrics.
Index Terms: Software Metrics, Software Maintainability, Structural Quality, Feature Subset
Selection, Machine Learning, Error Metrics.

I. RESEARCH QUESTIONS
1. Identify the set of metrics that can predict change required to improve structural quality of a
software most significantly.
2. Identify the best Feature Subset Selection technique under various circumstances.
3. Identify the best Feature Subset Selection technique and Machine learning algorithm combination
under various circumstances.

II. INTRODUCTION
Software maintenance, which starts after delivery of the project, is needed either because of changed
functional requirements or to improve the structure of the software. In this paper we focus on the change
needed to improve the structural quality of the software. Software metrics are quantitative measures that
describe the functional and structural properties of a software system or process. They are
used to predict various characteristics of the software, such as change requirement. We therefore need
machine learning algorithms, which use general inductive processes to automatically predict the change
needed on the basis of previously identified changes. A large set of software metrics that describe
structural properties can be used as input to these algorithms, but using all of them can overrun
processing time and budget constraints. To shorten this large set of metrics, a feature subset selection
(FSS) technique can be used to find those metrics that can be ignored in the forecasting process without disturbing the
precision of our prediction. FSS is a data preprocessing technique in which redundant, irrelevant,
erroneous and missing data are removed. In this study we use five FSS algorithms and four machine
learning algorithms to identify the smallest but most effective set of metrics.
To perform this study, two consecutive versions of ORDrumbox, an open source medium-sized software
system downloaded from http://sourceforge.net/, are used.
III. RELATED WORK
Machine learning is a field of computer science that is based on the study of computational learning
theory and pattern recognition and explores the creation and study of algorithms. Various empirical
studies have been conducted to establish the importance and application of machine learning
algorithms in various fields. Witten, Frank and Hall [10] have described various machine learning tools
and techniques in their book. Andrieu et al. [11] have presented an introduction to Markov chain
Monte Carlo (MCMC), a large class of sampling algorithms, for machine learning.
Kubat et al. [12] have used machine learning algorithms to identify oil spills in satellite radar images.
Similarly, Freitag [13] has studied how learning can be used to extract information from domains where
linguistic processing can be a problem. For systems whose performance is affected by a large set of
attributes, machine learning algorithms alone might not give significant results within the time
constraints faced by analysts. To overcome this problem, FSS techniques can be combined with machine
learning algorithms to improve performance.
FSS helps human readers to comprehend a learnt model and can greatly reduce the search space for
a learner. Various studies have shown that a learner can ignore many attributes with little or no
loss of accuracy. The strengths and weaknesses of the wrapper methodology are discussed
by Kohavi and John [3], who also present a series of improved designs. The feature selection
problem using a greedy least squares regression algorithm is studied by Zhang [4], who has shown that
under a certain irrepresentable condition on the design matrix, the greedy algorithm can select features
consistently when the sample size tends to infinity. Dash and Liu [5] have performed feature subset
selection on the basis of consistency. They compared the inconsistency measure with other measures and
studied various search techniques such as exhaustive, complete, heuristic and random search.
The Classification and Regression Trees (CART) approach is used by Bittencourt and Clarke [6] for feature
selection. A new method for feature subset selection using the TAR2 treatment learner is presented by
Gunnalan et al. [7]. Yang and Honavar [8] have presented an approach to the multi-criteria optimization
problem of feature subset selection using a genetic algorithm. They demonstrated the feasibility of this
approach for FSS in the automated design of neural networks for pattern classification and knowledge
discovery. Another method for feature subset selection, FSS-EBNA (Feature Subset Selection by
Estimation of Bayesian Network Algorithm), is proposed by Inza et al. [9]. They have used a wrapper
approach over Naive-Bayes and ID3 learning algorithms to estimate the goodness of each obtained
solution.
In this study we have selected software metrics using various feature subset selection techniques, and
the results are then reviewed using various machine learning algorithms.
IV. RESEARCH METHODOLOGY
This section describes the steps followed in this empirical study, as shown in Figure 1.
To carry out this research, two consecutive versions of ORDrumbox, an open source medium-sized software
system (version 0.9.082 and version 0.9.07), downloaded from http://sourceforge.net/, are used and
various software metrics are calculated. Thereafter, five FSS algorithms are applied to the obtained
metrics and their results are recorded. Afterwards, four machine learning algorithms are applied to
evaluate the error while predicting change using the set of metrics obtained by the FSS algorithms. After
collecting all the results, analysis is done to identify the metrics that constitute the smallest possible
set to determine and control the need for new versions to improve the structural quality of the software.
All metrics, FSS algorithms and machine learning algorithms used in this study are described in the
following subsections.

Fig 1. Block Diagram representing various steps followed in this study


A. Software Metrics

Software metrics are measures that provide information about physical and functional properties of a
system, component or process. Software project managers ordinarily use several software metrics to
support the design and implementation of large software systems [1, 2, 14, 15]. The metrics used in this
project are listed in Table 1.
Table 1. Software Metrics

| S. No. | Software Metric | Definition |
|---|---|---|
| 1 | WMC - Weighted Methods per Class | The weighted methods per class (WMC) metric is defined as the sum of the complexities of all the methods of a class. |
| 2 | DIT - Depth of Inheritance Tree | This metric is a measure of the inheritance levels of each class in the object hierarchy. |
| 3 | NOC - Number of Children | This metric is defined as the number of immediate descendants of the class. |
| 4 | CBO - Coupling Between Object Classes | This metric counts the number of classes coupled by various means such as inheritance, method calls, field accesses, return types, arguments, etc. |
| 5 | RFC - Response for a Class | This metric counts the number of different methods that can be invoked by a class. |
| 6 | LCOM - Lack of Cohesion in Methods | A class's LCOM metric counts the sets of methods that are not linked with each other by sharing the class's fields. |
| 7 | Ca - Afferent Couplings | Afferent coupling counts the number of classes that use a specific class. |
| 8 | Ce - Efferent Couplings | Efferent coupling counts the number of classes that are used by a specific class. |
| 9 | NPM - Number of Public Methods | The NPM metric simply measures the number of methods in a class that are declared public. |
| 10 | LCOM3 - Lack of Cohesion in Methods | It gives a measure of cohesion in the range 0-2 and is calculated as $LCOM3 = \frac{\left(\frac{1}{k}\sum_{j=1}^{k} M(K_j)\right) - p}{1 - p}$, where p = number of procedures in a class, k = number of variables in a class, and $M(K_j)$ = total methods that use variable $K_j$. |
| 11 | LOC - Lines of Code | It is calculated as the sum of the total number of fields, methods and instructions in each method as counted in Java code. |
| 12 | DAM - Data Access Metric | This metric is obtained by computing the ratio of the number of private fields to the total number of fields in a class. Its value ranges from 0 to 1. |
| 13 | MOA - Measure of Aggregation | This metric counts the number of data declarations whose types are user defined classes. |
| 14 | MFA - Measure of Functional Abstraction | This metric computes the ratio of the count of methods inherited by a class to the total count of methods accessible. Its value ranges from 0 to 1. |
| 15 | CAM - Cohesion Among Methods of Class | This metric computes the connection among methods of a class on the basis of the parameter lists of the methods. Its value ranges from 0 to 1 and a value close to 1.0 is preferred. |
| 16 | IC - Inheritance Coupling | This metric counts the number of parent classes coupled to a given class. |
| 17 | CBM - Coupling Between Methods | This metric counts the total number of new or redefined methods coupled with all the inherited methods. |
| 18 | AMC - Average Method Complexity | This metric calculates the average method complexity for each class. |
| 19 | MVG - McCabe's Cyclomatic Complexity | This metric gives a measure of a body of code based on analysis of the cyclomatic complexity of the directed graph which represents the flow of control within each function. |
| 20 | COM - Comment Lines | This metric is the extent of commenting within a region of code. It is not very meaningful in isolation, but is sometimes used in ratio with LOC or MVG to ensure that comments are distributed proportionately to the bulk or complexity of a region of code. |
| 21 | NOM - Number of Modules | This metric is the number of modules identified in the project. |
| 22 | REJ - Rejected Lines | This is a measure of the number of non-blank and non-comment lines of code which were not successfully analyzed by the parser. |
| 23 | L_C | This metric is calculated as LOC/COM. |
| 24 | M_C | This metric is calculated as MVG/COM. |
| 25 | ALOC | This metric is the average LOC per method in a class. |
| 26 | AMVG | This metric is the average MVG per method in a class. |
| 27 | ACOM | This metric is the average number of comment lines per method in a class. |
| 28 | IF4 | This metric calculates the information flow measures. |
| 29 | IF4V (visible) | This metric is a variant of IF4 which is calculated using only relationships in the visible part of the module interface. |
| 30 | IF4C (concrete) | This metric is a variant of IF4 which is calculated using only those relationships which imply that the client must be recompiled if the supplier's definition changes. |

Apart from these metrics, two more metrics were used in this project: the numeric change metric,
labelled CHANGE, and the binary change metric, labelled BCHANGE. These metrics measure the change
between two consecutive versions of the software as follows:
CHANGE= No. of lines inserted+ No. of lines deleted+ 2*(No. of lines modified)
BCHANGE= 1 if CHANGE>0; 0 if CHANGE=0.
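
The two change metrics above can be sketched in a few lines of Python. This is only an illustration of the stated formulas; the function names and the example line counts are hypothetical and do not come from the ORDrumbox data.

```python
def change_metric(inserted: int, deleted: int, modified: int) -> int:
    """Numeric CHANGE: inserted + deleted + 2 * modified lines."""
    return inserted + deleted + 2 * modified


def binary_change(change: int) -> int:
    """BCHANGE: 1 if the class changed at all, otherwise 0."""
    return 1 if change > 0 else 0


# Hypothetical class with 3 inserted, 2 deleted and 4 modified lines.
change = change_metric(inserted=3, deleted=2, modified=4)
print(change, binary_change(change))
```

A modified line is weighted twice, which is consistent with treating a modification as one deletion plus one insertion.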
B. Empirical Data Collection

To carry out this study we calculated the metrics mentioned earlier for the ORDrumbox software
using the CKJM tool (footnote 1) and the CCCC tool (footnote 2). After calculating all these metrics, the
numeric change metric and binary change metric were calculated for each class by analyzing two
consecutive versions of ORDrumbox. Various properties of the metrics thus obtained are shown in Table 2.
Table 2. Properties of Metrics calculated for ORDrumbox version 0.9.082

| Software Metric | Mean | Median | Standard Deviation | Variance | Minimum | Maximum | Range |
|---|---|---|---|---|---|---|---|
| WMC | 9.7051 | 4 | 13.60643 | 185.135 | 1 | 87 | 86 |
| DIT | 1.6959 | 1 | 1.8408 | 3.389 | 0 | 6 | 6 |
| NOC | 0.3825 | 0 | 2.36805 | 5.608 | 0 | 23 | 23 |
| CBO | 10.2627 | 5 | 13.97154 | 195.204 | 0 | 75 | 75 |
| RFC | 28.1705 | 15 | 31.59676 | 998.355 | 1 | 226 | 225 |
| LCOM | 111.1705 | 3 | 410.6652 | 168645.9 | 0 | 3705 | 3705 |
| Ca | 6.4931 | 2 | 12.18876 | 148.566 | 0 | 68 | 68 |
| Ce | 4.6129 | 2 | 5.94943 | 35.396 | 0 | 31 | 31 |
| NPM | 7.6866 | 3 | 12.58613 | 158.411 | 0 | 85 | 85 |
| LCOM3 | 1.1103 | 0.9656 | 0.62388 | 0.389 | 0 | 2 | 2 |
| LOC | 231.6452 | 85 | 349.4495 | 122114.9 | 1 | 2558 | 2557 |
| DAM | 0.6132 | 0.6047 | 0.05714 | 0.003 | 0.5 | 1 | 0.5 |
| MOA | 0.7143 | 0 | 2.68939 | 7.233 | 0 | 33 | 33 |
| MFA | 0.2395 | 0 | 0.41434 | 0.172 | 0 | 1 | 1 |
| CAM | 0.5663 | 0.5556 | 0.27528 | 0.076 | 0 | 1 | 1 |
| IC | 0.1705 | 0 | 0.44457 | 0.198 | 0 | 3 | 3 |
| CBM | 0.3226 | 0 | 1.23869 | 1.534 | 0 | 15 | 15 |
| AMC | 19.3956 | 14.625 | 22.16497 | 491.286 | 0 | 175.17 | 175.17 |
| MVG | 9.2212 | 2 | 16.35874 | 267.608 | 0 | 132 | 132 |
| COM | 12.8387 | 4 | 22.76097 | 518.062 | 0 | 148 | 148 |
| NOM | 9.5069 | 4 | 21.55916 | 464.797 | -17 | 196 | 213 |
| L_C | 12.8772 | 7.0275 | 22.88966 | 523.937 | 0 | 196 | 196 |
| M_C | 3.3813 | 0.3 | 26.50979 | 702.769 | 0 | 333 | 333 |
| IF4 | 9.7512 | 0 | 143.6434 | 20633.44 | 0 | 2116 | 2116 |
| IF4V | 9.7512 | 0 | 143.6434 | 20633.44 | 0 | 2116 | 2116 |
| IF4C | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| REJ | 10.4101 | 8 | 8.48214 | 71.947 | 1 | 52 | 51 |
| ALOC | 17.6959 | 12.571 | 22.03114 | 485.371 | 1.33 | 257 | 255.67 |
| AMVG | 1.6348 | 0.375 | 3.59524 | 12.926 | 0 | 39.5 | 39.5 |
| ACOM | 2.3241 | 0.75 | 5.27281 | 27.802 | 0 | 49.33 | 49.33 |
| CHANGE | 14.2829 | 1 | 34.87173 | 1216.037 | 0 | 319 | 319 |
| BCHANGE | 0.5073 | 1 | 0.50117 | 0.251 | 0 | 1 | 1 |

Footnote 1: http://www.spinellis.gr/sw/ckjm/
Footnote 2: http://sourceforge.net/projects/cccc/

C. Feature Subset Selection

Feature Subset Selection (FSS) is the process of choosing a subset of relevant features. There
are various algorithms for applying FSS to a set of features. In this project Weka (Waikato Environment
for Knowledge Analysis; footnote 3) [16, 17] has been used to apply FSS to the set of metrics discussed
in the previous subsection. The algorithms used in this study are described in Table 3.
Table 3. FSS Algorithms and their Description

| S. No. | FSS Algorithm | Description |
|---|---|---|
| 1 | Correlation based Algorithm | The CfsSubsetEval function in Weka provides this algorithm. It calculates the worth of a subset of attributes on the basis of the individual predictive capability of every feature and the degree of redundancy among them. Subsets of features that are highly correlated with the class but have low intercorrelation are chosen. [18] |
| 2 | Gain Ratio | The GainRatioAttributeEval function in Weka is based on this technique. It calculates the worth of an attribute by determining the gain ratio with respect to the class: $\text{GainR}(\text{Class}, \text{Attribute}) = \frac{H(\text{Class}) - H(\text{Class} \mid \text{Attribute})}{H(\text{Attribute})}$. [10] |
| 3 | Information Gain | The InfoGainAttributeEval function in Weka is based on this technique. It evaluates the worth of an attribute by determining the information gain with respect to the class: $\text{InfoGain}(\text{Class}, \text{Attribute}) = H(\text{Class}) - H(\text{Class} \mid \text{Attribute})$. [10] |
| 4 | OneR | The OneRAttributeEval function in Weka evaluates the worth of an attribute by using the OneR classifier. [10] |
| 5 | Symmetrical Uncertainty | The SymmetricalUncertAttributeEval function in Weka is based on this technique. It evaluates the worth of an attribute by determining the symmetrical uncertainty with respect to the class: $\text{SymmU}(\text{Class}, \text{Attribute}) = \frac{2\,(H(\text{Class}) - H(\text{Class} \mid \text{Attribute}))}{H(\text{Class}) + H(\text{Attribute})}$. [10] |
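
The three entropy-based measures in Table 3 can be illustrated with a short Python sketch. It assumes that both the class and the attribute are already discrete (numeric metrics would first have to be discretized), and the function names and toy data are hypothetical; it is not Weka's implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(X) of a list of discrete values, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, attribute):
    """H(Class | Attribute): entropy of the class within each attribute value."""
    n = len(target)
    groups = {}
    for t, a in zip(target, attribute):
        groups.setdefault(a, []).append(t)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def info_gain(target, attribute):
    return entropy(target) - conditional_entropy(target, attribute)


def gain_ratio(target, attribute):
    h_attr = entropy(attribute)
    return info_gain(target, attribute) / h_attr if h_attr > 0 else 0.0


def symmetrical_uncertainty(target, attribute):
    denom = entropy(target) + entropy(attribute)
    return 2 * info_gain(target, attribute) / denom if denom > 0 else 0.0


# Toy data: BCHANGE for four classes and a discretized metric value.
bchange = [1, 1, 0, 0]
informative = ["high", "high", "low", "low"]  # separates the classes fully
uninformative = ["a", "b", "a", "b"]          # carries no class information
print(info_gain(bchange, informative), info_gain(bchange, uninformative))
```

Gain ratio and symmetrical uncertainty both normalize information gain, which reduces the bias of plain information gain toward attributes with many distinct values.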

Out of all these FSS algorithms, the correlation based algorithm was used with the CHANGE metric as
its argument, and the remaining four were used with the BCHANGE metric, as discussed earlier.
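
The idea behind correlation based selection, high correlation with the class and low intercorrelation among features, is captured by the merit heuristic given by Hall [18]: merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below evaluates this heuristic for hypothetical correlation values; it does not reproduce the subset search performed by Weka.

```python
from math import sqrt


def cfs_merit(class_corrs, mean_inter_corr):
    """CFS merit of a feature subset.

    class_corrs: correlation of each selected feature with the class.
    mean_inter_corr: mean pairwise correlation among the selected features.
    """
    k = len(class_corrs)
    if k == 0:
        return 0.0
    r_cf = sum(class_corrs) / k
    return k * r_cf / sqrt(k + k * (k - 1) * mean_inter_corr)


# Hypothetical values: two features, each correlated 0.6 with the class.
print(cfs_merit([0.6, 0.6], mean_inter_corr=0.0))  # independent features
print(cfs_merit([0.6, 0.6], mean_inter_corr=1.0))  # fully redundant features
```

With these illustrative numbers, the redundant pair scores no better than a single feature, which is the behaviour the description in Table 3 refers to.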
D. Machine Learning Algorithms

Machine learning helps in the creation and study of algorithms that can learn from data. Such algorithms
build a model from the input data provided and use that model to make decisions or predictions. We have
used Weka [16, 17] to apply four machine learning algorithms, each one on every subset selected by all
five FSS techniques, thereby giving us twenty (5*4) different combinations and thus twenty outputs. The
learning algorithms used in this study are described in Table 4.

Footnote 3: http://www.cs.waikato.ac.nz/ml/weka/downloading.html
Table 4. Machine Learning Algorithms and their Description

| S. No. | Machine Learning Algorithm | Description |
|---|---|---|
| 1 | MultiLayer Perceptron | This is a classifier that uses backpropagation to classify instances. The network can be constructed by hand, produced by an algorithm, or both. The network can also be observed and changed during training time. [10] |
| 2 | K* | K* or KStar [20] is an instance-based classifier, i.e. the class of a test instance is based upon the class of those training instances similar to it, where similarity is determined by some similarity function. It differs from other instance-based learning algorithms in that it uses an entropy-based distance function. |
| 3 | Nearest Neighbor Algorithm | For this algorithm we have used the IBk function, which is a K-nearest neighbor classifier that can select an appropriate value of K based on cross-validation. It can also do distance weighting. [19] |
| 4 | LWL | The LWL (Locally Weighted Learning) technique uses an instance-based algorithm to assign instance weights, which are then used by a specified WeightedInstancesHandler. [21] |

E. Measures Used to Analyze Error in Prediction

In this study we have considered four different measures to calculate the error in the prediction of the
CHANGE and BCHANGE variables while analyzing the results of all the combinations of FSS and machine
learning techniques. These measures are described in Table 5.

Table 5. Error Measures and their Description

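| S. No. | Measure to Calculate Error | Description |
|---|---|---|
| 1 | Mean Absolute Error (MAE) | It measures the average of the absolute error and is calculated as $MAE = \frac{1}{n}\sum_{i=1}^{n} \lvert P(i) - T(i) \rvert$, where P(i) is the predicted value, T(i) is the true value and n is the number of instances. |
| 2 | Root Mean Square Error (RMS) | It represents the sample standard deviation of the differences between predicted values and observed values and is calculated as $RMS = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (P(i) - T(i))^2}$. |
| 3 | Relative Absolute Error (RAE) | The relative absolute error takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor. It is calculated as $RAE = \frac{\sum_{i=1}^{n} \lvert P(i) - T(i) \rvert}{\sum_{i=1}^{n} \lvert T(i) - \bar{T} \rvert}$, where $\bar{T}$ is the mean of the true values. |
| 4 | Root Relative Square Error (RRS) | It takes the total squared error and normalizes it by dividing by the total squared error of the simple predictor. It is calculated as $RRS = \sqrt{\frac{\sum_{i=1}^{n} (P(i) - T(i))^2}{\sum_{i=1}^{n} (T(i) - \bar{T})^2}}$. |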
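
The four error measures described in Table 5 can be computed as in the following Python sketch. Two points are assumptions rather than statements from the study: the simple predictor is taken to be the mean of the true values, and RAE and RRS are scaled to percentages, as the magnitudes reported in Tables 9 and 10 suggest. The function name and sample values are hypothetical.

```python
from math import sqrt


def error_measures(predicted, actual):
    """Return MAE, RMS, RAE (%) and RRS (%) for paired predictions."""
    n = len(actual)
    mean_t = sum(actual) / n  # assumed "simple predictor": mean of true values
    abs_err = sum(abs(p - t) for p, t in zip(predicted, actual))
    sq_err = sum((p - t) ** 2 for p, t in zip(predicted, actual))
    base_abs = sum(abs(t - mean_t) for t in actual)
    base_sq = sum((t - mean_t) ** 2 for t in actual)
    return {
        "MAE": abs_err / n,
        "RMS": sqrt(sq_err / n),
        "RAE": 100 * abs_err / base_abs,
        "RRS": 100 * sqrt(sq_err / base_sq),
    }


# Hypothetical predicted and true CHANGE values for three classes.
result = error_measures(predicted=[2, 4, 6], actual=[1, 4, 7])
print(result)
```

A relative error above 100% means the model did worse than always predicting the mean, which is how the larger entries for the original dataset in Tables 9 and 10 can be read under this assumption.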

V. RESULTS AND ANALYSIS



In this study all the reported results are based on 10-fold cross-validation for each algorithm.
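
As a reminder of what 10-fold cross-validation involves, the sketch below splits instance indices into ten folds and uses each fold once for testing. It is a simplified illustration with a hypothetical function name: it does not shuffle or stratify the data, so it is not a reproduction of the procedure in Weka.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Hypothetical dataset of 25 classes split into 10 folds.
splits = list(k_fold_indices(25, k=10))
print(len(splits), splits[0][1])
```

The reported error for a learner is then aggregated over the predictions made on the ten held-out folds.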
RQ1: Identify the set of metrics that can predict change required to improve structural quality of
software most significantly.
As shown in Table 6, after applying the correlation based algorithm, RFC, Ce, COM, NOM, REJ and AMVG are
found to be the most significant metrics, whereas after applying gain ratio based FSS, REJ, COM, RFC, L_C,
ACOM, M_C, MVG and AMVG are found to be more significant than the others. Similarly, REJ, NOM, Ce,
NOM, RFC, L_C, ACOM, M_C and MVG are found to be more significant for the information gain
method of FSS; REJ, ACOM, RFC, M_C, LCOM3, LOC, COM, NOM, AMVG, MVG and L_C for the OneR
method of FSS; and REJ, COM, RFC, L_C, ACOM, M_C, Ce, MVG and NOM for the symmetrical
uncertainty method of FSS.
Table 6. Metrics subset generated by all the Feature Subset Selection Algorithms

| FSS Algorithm | Selected Metrics |
|---|---|
| Correlation based Algorithm | RFC, Ce, COM, NOM, REJ, AMVG |
| Gain Ratio | REJ, COM, RFC, L_C, ACOM, M_C, MVG, AMVG |
| Information Gain | REJ, NOM, Ce, NOM, RFC, L_C, ACOM, M_C, MVG |
| OneR | REJ, ACOM, RFC, M_C, LCOM3, LOC, COM, NOM, AMVG, MVG, L_C |
| Symmetrical Uncertainty | REJ, COM, RFC, L_C, ACOM, M_C, Ce, MVG, NOM |

We observed that RFC and REJ were found to be most significant by all FSS techniques and COM, NOM,
L_C, ACOM and MVG were picked up by four methods as significant metrics. Thus we conclude that most
significant metrics after application of FSS are RFC, REJ, COM, NOM, L_C, ACOM and MVG.
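
The tally behind this observation can be reproduced with a short Python sketch that counts, for each metric, how many of the five subsets in Table 6 contain it. The subsets are copied from the table as printed (the repeated NOM entry under information gain collapses to one); the variable names are illustrative.

```python
from collections import Counter

# Metric subsets as listed in Table 6.
subsets = {
    "Correlation based Algorithm": {"RFC", "Ce", "COM", "NOM", "REJ", "AMVG"},
    "Gain Ratio": {"REJ", "COM", "RFC", "L_C", "ACOM", "M_C", "MVG", "AMVG"},
    "Information Gain": {"REJ", "NOM", "Ce", "RFC", "L_C", "ACOM", "M_C", "MVG"},
    "OneR": {"REJ", "ACOM", "RFC", "M_C", "LCOM3", "LOC", "COM", "NOM",
             "AMVG", "MVG", "L_C"},
    "Symmetrical Uncertainty": {"REJ", "COM", "RFC", "L_C", "ACOM", "M_C",
                                "Ce", "MVG", "NOM"},
}

# Count in how many subsets each metric appears.
counts = Counter(metric for subset in subsets.values() for metric in subset)
selected_by_all = sorted(m for m, c in counts.items() if c == len(subsets))
selected_by_four = sorted(m for m, c in counts.items() if c == 4)
print(selected_by_all)
print(selected_by_four)
```

On the table as printed, this tally also places M_C in four of the five subsets, alongside COM, NOM, L_C, ACOM and MVG.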
We will further verify our result by using it in prediction process. In the prediction process we will use
the whole data and the selected data and will analyze whether given metrics can significantly capture the
characteristics of data or not.
RQ2: Identify the best Feature Subset Selection technique under various circumstances.
After applying all FSS techniques, four machine learning algorithms are applied to the data sets
generated by all five FSS algorithms and to the original dataset. The MAE in prediction of
numeric change and binary change is shown in Table 7. Table 8 reports the same results using the RMS
measure, Table 9 using the RAE measure, and Table 10 using the RRS measure. On the basis of Table
7, we can conclude that the OneR based technique for FSS is best suited if MAE is used as the measure of
prediction error. Similarly, from Table 8 we can conclude that the OneR FSS method
is best suited where RMS is used as the error measure. According to Table 9, OneR
FSS is optimal where RAE is used as the error measure. Table 10 indicates that
correlation based FSS is best when RRS is used as the error measure.
Table 7. MAE in prediction of Numeric and Binary Change Attributes

| FSS | Nearest Neighbor Algorithm | K* | LWL | Multilayer Perceptron |
|---|---|---|---|---|
| Original Dataset (CHANGE) | 17.8681 | 16.7613 | 18.964 | 41.2443 |
| Original Dataset (BCHANGE) | 0.2467 | 0.2312 | 0.2434 | 0.2393 |
| Correlation based Algorithm (CHANGE) | 0.044 | 0.397 | 15.1934 | 13.7241 |
| Gain Ratio (BCHANGE) | 0.0453 | 0.0598 | 0.2365 | 0.1639 |
| Information Gain (BCHANGE) | 0.027 | 0.0442 | 0.2362 | 0.1522 |
| OneR (BCHANGE) | 0.0282 | 0.0319 | 0.2379 | 0.1436 |
| Symmetrical Uncertainty (BCHANGE) | 0.027 | 0.0442 | 0.2362 | 0.1647 |

Table 8. RMS in prediction of Numerical Change Attribute

| FSS | Nearest Neighbor Algorithm | K* | LWL | Multilayer Perceptron |
|---|---|---|---|---|
| Original Dataset (CHANGE) | 47.2241 | 42.8919 | 38.2315 | 87.502 |
| Original Dataset (BCHANGE) | 0.4893 | 0.449 | 0.3512 | 0.4484 |
| Correlation based Algorithm (CHANGE) | 0.1984 | 1.1772 | 29.3743 | 27.3932 |
| Gain Ratio (BCHANGE) | 0.141 | 0.1638 | 0.3405 | 0.2996 |
| Information Gain (BCHANGE) | 0.1031 | 0.1366 | 0.3424 | 0.2918 |
| OneR (BCHANGE) | 0.1058 | 0.1191 | 0.3409 | 0.2654 |
| Symmetrical Uncertainty (BCHANGE) | 0.1031 | 0.1366 | 0.3424 | 0.2865 |

Table 9. RAE in prediction of Numerical Change Attribute

| FSS | Nearest Neighbor Algorithm | K* | LWL | Multilayer Perceptron |
|---|---|---|---|---|
| Original Dataset (CHANGE) | 83.3185 | 78.1576 | 88.4288 | 192.3218 |
| Original Dataset (BCHANGE) | 49.4126 | 46.299 | 48.743 | 47.9335 |
| Correlation based Algorithm (CHANGE) | 0.2059 | 1.8587 | 71.1378 | 64.2586 |
| Gain Ratio (BCHANGE) | 9.0656 | 11.9788 | 47.386 | 32.8358 |
| Information Gain (BCHANGE) | 5.4185 | 8.8508 | 47.3143 | 30.4991 |
| OneR (BCHANGE) | 5.6517 | 6.3845 | 47.6559 | 28.76 |
| Symmetrical Uncertainty (BCHANGE) | 5.4185 | 8.8508 | 47.3143 | 32.9908 |

Table 10. RRS in prediction of Numerical Change Attribute

| FSS | Nearest Neighbor Algorithm | K* | LWL | Multilayer Perceptron |
|---|---|---|---|---|
| Original Dataset (CHANGE) | 121.5123 | 110.3649 | 98.3735 | 225.151 |
| Original Dataset (BCHANGE) | 75.6944 | 89.8598 | 70.2922 | 89.7419 |
| Correlation based Algorithm (CHANGE) | 0.5135 | 3.0473 | 76.0392 | 70.9109 |
| Gain Ratio (BCHANGE) | 28.2226 | 32.7936 | 68.1604 | 59.976 |
| Information Gain (BCHANGE) | 20.6298 | 27.346 | 68.5395 | 58.4075 |
| OneR (BCHANGE) | 21.1845 | 23.8333 | 68.2302 | 53.133 |
| Symmetrical Uncertainty (BCHANGE) | 20.6298 | 27.346 | 68.5395 | 57.3414 |

RQ3: Identify the best Feature Subset Selection technique and Machine learning algorithm combination
under various circumstances.
As we can see in table 7, when MAE is used as measure for Error Estimation, OneR FSS method used with
K* learning gives minimum error. While using RMS as measure for error estimation, as in table 8, nearest
neighbor based learning used either with symmetric uncertainty based FSS or information gain based
FSS method gives minimum error in prediction. If RAE is used as measure to calculate error then
according to table 9, correlation based FSS along with nearest neighbor based learning gives best results.
If RRS is used as a measure to calculate error then according to table 10, correlation based FSS along with
nearest neighbor based learning gives more accurate results.
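
Finding the best combination for a given error measure amounts to locating the smallest entry in the corresponding table. The sketch below does this for the RAE values of Table 9; the helper name `best_combination` is hypothetical, and the same function could be applied to the other tables.

```python
# RAE values from Table 9, keyed by FSS technique and then by learner.
learners = ["Nearest Neighbor", "K*", "LWL", "Multilayer Perceptron"]
rae_rows = {
    "Original Dataset (CHANGE)": [83.3185, 78.1576, 88.4288, 192.3218],
    "Original Dataset (BCHANGE)": [49.4126, 46.299, 48.743, 47.9335],
    "Correlation based Algorithm (CHANGE)": [0.2059, 1.8587, 71.1378, 64.2586],
    "Gain Ratio (BCHANGE)": [9.0656, 11.9788, 47.386, 32.8358],
    "Information Gain (BCHANGE)": [5.4185, 8.8508, 47.3143, 30.4991],
    "OneR (BCHANGE)": [5.6517, 6.3845, 47.6559, 28.76],
    "Symmetrical Uncertainty (BCHANGE)": [5.4185, 8.8508, 47.3143, 32.9908],
}


def best_combination(rows, columns):
    """Return (fss, learner, error) for the smallest error in the table."""
    return min(
        ((fss, learner, err)
         for fss, errs in rows.items()
         for learner, err in zip(columns, errs)),
        key=lambda item: item[2],
    )


best = best_combination(rae_rows, learners)
print(best)
```

For Table 9 the smallest entry is the correlation based algorithm combined with nearest neighbor learning, in line with the statement above.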
VI. CONCLUSION AND FUTURE SCOPE
As we have already seen, the error in change prediction is reduced after applying various FSS techniques.
Apart from reducing the chance of wrong prediction, the effort required to analyze a smaller set of
metrics will be much less than that required to analyze all the metrics to estimate the change for a
large-scale project. As per the results collected in this study we can conclude that:
1. RFC, REJ, COM, NOM, L_C, ACOM and MVG are the most significant metrics that can predict the
structural quality of a software.
2. After applying FSS using OneR we observed that prediction accuracy is improved significantly
with every machine learning algorithm when MAE, RAE and RMS prediction accuracy measures
are used.
3. Correlation based technique for FSS gives more accurate results if RRS is used as measure for
detection of error.
4. If MAE is used as a measure to calculate error then OneR based FSS used along with K* learning
gives more accurate results.
5. If RMS is used as a measure to calculate error then nearest neighbor based learning used either
with symmetric uncertainty based FSS or information gain based FSS method gives more accurate
results.
6. We observed that correlation based FSS used along with nearest neighbor based learning gives
more accurate results when RAE and RRS are used as measure to calculate error.
This study finds that the correlation based technique, i.e. CFS, is the best technique for FSS when
machine learning is applied for prediction. However, more study needs to be carried out to verify the
results. In this paper we have taken 30 metrics for analysis, and their impact is analyzed by an automated
tool. One direction for future work is therefore to include a wider range of metrics in the study. To
obtain more generalized results, more open source software systems can be included in the analysis.
Moreover, more FSS and machine learning algorithms can be used for better analysis. Genetic algorithms
can also be used to get a wider range of results.
VII. REFERENCES

[1] Kan S. H., Metrics and Models in Software Quality Engineering, Addison-Wesley Publishing Company, Reading, Massachusetts, USA, 1995.

[2] Fenton N. E., Pfleeger S. L., Software Metrics: A Rigorous and Practical Approach, 2nd Edition, PWS Publishing Company, Boston, USA, 1997.

[3] Kohavi R., John G. H., Wrappers for feature subset selection, Artificial Intelligence 97, pp. 273-324, 1997.

[4] Zhang T., On the Consistency of Feature Selection using Greedy Least Squares Regression, Journal of Machine Learning Research, 2008.

[5] Dash M., Liu H., Consistency-based search in feature selection, Artificial Intelligence 151, pp. 155-176, 2003.

[6] Bittencourt H. R., Clarke R. T., Feature Selection by using Classification and Regression Trees (CART), The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2004.

[7] Gunnalan R., Menzies T., Appukutty K., Srinivasan A., Hu Y., Feature Subset Selection with TAR2less, 2003.

[8] Yang J., Honavar V., Feature Subset Selection using a Genetic Algorithm, Feature Extraction, Construction and Selection, Springer US, pp. 117-136, 1998.

[9] Inza I., Larranaga P., Etxeberria R., Sierra B., Feature Subset Selection by Bayesian network-based optimization, Artificial Intelligence 123, pp. 157-184, 2000.

[10] Witten I. H., Frank E., Hall M. A., Data Mining: Practical Machine Learning Tools and Techniques, Third Edition.

[11] Andrieu C., Freitas N. D., Doucet A., Jordan M. I., An Introduction to MCMC for Machine Learning, Machine Learning, 50, pp. 5-43, 2003.

[12] Kubat M., Holte R., Matwin S., Machine Learning for the Detection of Oil Spills in Satellite Radar Images, Machine Learning, 30, pp. 195-215, 1998.

[13] Freitag D., Machine Learning for Information Extraction in Informal Domains, Machine Learning, 39, pp. 169-202, 2000.

[14] Gupta V., Chhabra J. K., Measurement of Dynamic Metrics Using Dynamic Analysis of Programs, Applied Computing Conference (ACC '08), Istanbul, Turkey, 2008.

[15] Aggarwal K. K., Singh Y., Kaur A., Malhotra R., Empirical Study of Object-Oriented Metrics, Journal of Object Technology, 2006.

[16] Singhal S., Jena M., A Study on WEKA Tool for Data Preprocessing, Classification and Clustering, International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 2, Issue 6, 2013.

[17] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. H., The WEKA Data Mining Software: An Update, SIGKDD Explorations, vol. 11, Issue 1, 2009.

[18] Hall M. A., Correlation-based Feature Subset Selection for Machine Learning, Hamilton, New Zealand, 1998.

[19] Aha D., Kibler D., Instance-based learning algorithms, Machine Learning, vol. 6, pp. 37-66, 1991.

[20] Cleary J. G., Trigg L. E., K*: An Instance-based Learner Using an Entropic Distance Measure, 12th International Conference on Machine Learning, pp. 108-114, 1995.

[21] Frank E., Hall M. A., Pfahringer B., Locally Weighted Naive Bayes, 19th Conference in Uncertainty in Artificial Intelligence, pp. 249-256, 2003.
