IEEE TRANSACTIONS ON RELIABILITY, VOL. 56, NO. 2, JUNE 2007

A Multi-Objective Software Quality Classification Model Using Genetic Programming


Taghi M. Khoshgoftaar, Member, IEEE, and Yi Liu, Member, IEEE
Abstract: A key factor in the success of a software project is achieving the best-possible software reliability within the allotted time & budget. Classification models which provide a risk-based software quality prediction, such as fault-prone & not fault-prone, are effective in providing a focused software quality assurance endeavor. However, their usefulness largely depends on whether all the predicted fault-prone modules can be inspected or improved by the allocated software quality-improvement resources, and on the project-specific costs of misclassifications. Therefore, a practical goal of calibrating classification models is to lower the expected cost of misclassification while providing a cost-effective use of the available software quality-improvement resources. This paper presents a genetic programming-based decision tree model which facilitates a multi-objective optimization in the context of the software quality classification problem. The first objective is to minimize the Modified Expected Cost of Misclassification, which is our recently proposed goal-oriented measure for selecting & evaluating classification models. The second objective is to optimize the number of predicted fault-prone modules such that it is equal to the number of modules which can be inspected by the allocated resources. Some commonly used classification techniques, such as logistic regression, decision trees, and analogy-based reasoning, are not suited for directly optimizing multi-objective criteria. In contrast, genetic programming is particularly suited for the multi-objective optimization problem. An empirical case study of a real-world industrial software system demonstrates the promising results, and the usefulness of the proposed model.

Index Terms: Cost of misclassification, genetic programming, multi-objective optimization, software faults, software metrics, software quality estimation.

ACRONYM¹

ECM    expected cost of misclassification
fp     fault-prone
GP     genetic programming
MECM   modified expected cost of misclassification
nfp    not fault-prone
SQA    software quality assurance
SQC    software quality classification
STGP   strongly typed genetic programming

¹The singular and plural of an acronym are always spelled the same.

NOTATION

Type I error     an error of misclassifying a nfp module as a fp module
Type II error    an error of misclassifying a fp module as a nfp module
R, or Red        fp module
G, or Green      nfp module
$C_I$            cost of a Type I error
$C_{II}$         cost of a Type II error
$c$              cost ratio of a Type II error over a Type I error, i.e., $C_{II}/C_I$
$N_{G\to R}$     # of Type I errors
$N_{R\to G}$     # of Type II errors
$M_{max}$        the maximum number of modules which can be tested
$N_x$            # of modules classified as members of class $x$, where $x$ can be either R, or G
$N_{x\to y}$     # of modules of class $x$ predicted as members of class $y$, where $x$, and $y$ can be either R, or G
$N$              # of the modules in the data set
$\bar{x}$        a vector of independent variables
$f(\bar{x})$     an objective function
$k$              number of objective functions to be considered
$\bar{x}^*$      a Pareto-optima
$P$              size of the Pareto-optima set
$F$              fitness value for each individual
$g$              generation
$P_g$            the Pareto-optima set in generation $g$
$S$              the set of Pareto-optima solutions
$f_i$            fitness function for objective $i$
$r$              the ratio of $N_R$ over $N$
$p$              the ratio of $M_{max}$ over $N$
INSP             # of times the source file was inspected prior to the system test release
LOCB             # of lines for the source file prior to the coding phase
LOCT             # of lines of code for the source file prior to the system test release
LOCA             # of lines of commented code for the source file prior to the coding phase
LOCAT            # of lines of commented code for the source file prior to the system test release
$\eta$           the ratio of $N_{R\to R}$ over $p \times N$

Manuscript received August 21, 2003; revised December 18, 2003 and April 8, 2004; accepted April 30, 2004. This work was supported in part by the NSF grant CCR-9970893. Associate Editor: M. Xie. T. M. Khoshgoftaar is with the Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: taghi@cse.fau.edu). Y. Liu is with Georgia College & State University, Milledgeville, GA 31061 USA (e-mail: yi.liu@gcsu.edu). Digital Object Identifier 10.1109/TR.2007.896763


I. INTRODUCTION

In the context of software quality & reliability improvement endeavors, the software quality assurance (SQA) team is often faced with the difficult task of working with finite & limited resources. The team aims to make the best of available software inspection & quality improvement resources. A logical approach for achieving cost-effective software quality improvement is a risk-based targeting of software modules depending on their predicted quality [1]. Software metrics-based quality classification models have proven their usefulness for timely software quality improvements [2]-[4]. Typically, software modules are categorized by a classification model into two risk-based groups, such as fault-prone (fp), and not fault-prone (nfp) [5], [6]. A classification model is calibrated or fitted using software metrics, and fault data collected from a previously developed system release, or similar software project. Subsequently, the fitted classification model can be applied to estimate the quality of modules which are currently under development.

However, existing software quality classification (SQC) techniques do not consider the limited resource-availability factor during their model calibration process. The usefulness of an SQC model depends on whether enough resources are available to inspect all modules predicted as fp. For example, if a model predicts 30% of the modules as fp, but available resources can only inspect 10% of the modules, then how does one choose which of the predicted fp modules to inspect? It is therefore important that a resource-based SQC classification model is calibrated.² Moreover, because it is unrealistic to obtain a model which yields perfect classification, i.e., all predicted fp modules are actually fp, a classification technique should aim to minimize the expected cost of misclassification in the context of the software system & application domain [7]. In addition to resource usage & the misclassification costs, other factors also need to be considered, such as model simplicity, model interpretation, etc.

We have mentioned that, to calibrate a goal- & objective-oriented classification model, multiple criteria have to be optimized. However, commonly used classification techniques, such as logistic regression [8] & decision trees [5], cannot simultaneously attain a multi-objective optimization during their modeling process. Modeling complicated engineering problems, such as software quality prediction, with traditional mathematical optimization methods is practically infeasible [9]. This paper presents a genetic programming (GP)-based decision tree model that is capable of obtaining a multi-objective optimization. As a member of the evolutionary computational methods [10]-[13], GP has been explored to solve some multi-objective optimization problems [14], [15]. This study is a continuation of our recent research efforts with GP-based software quality classification models [16], [17], and focuses on the simultaneous optimization of decision trees with respect to the following two factors:

1) the Modified Expected Cost of Misclassification (MECM) [18] (see Section II-A), and
2) the number of predicted fp modules, such that it is equal to the number which can be inspected by the allocated software quality improvement resources.

Although MECM contains a provision to penalize the models whose number of fp modules is greater than that which can be tested by the available resources, GP can still find several models which have very similar, or the same, MECM values, but have a different number of predicted fp modules. Hence, given two such models, GP will give preference, based on the second objective, to the one which provides approximately the same number of fp modules as that required by available resources. Moreover, to accelerate the run times of GP, and yield a relatively simple model, a third optimization factor, i.e., minimizing the size of the decision tree, is introduced in our GP-based decision tree modeling process. A method based on GP for finding the solutions in the Pareto set [9] is proposed when building the classification model.

Assessing the usefulness of a classification model based solely on its misclassification error rates, i.e., Type I & Type II, is inappropriate because of the disparate misclassification costs associated with the individual error types.³ In an earlier study [19], we investigated the application of the Expected Cost of Misclassification (ECM) as a singular unified model-evaluation measure which can be used to incorporate the project-specific misclassification costs. Though ECM-based model evaluation demonstrates effective results, it does not reflect the performance of the model in the context of allocated resources. More specifically, such an approach is based on the assumption that the project has enough resources to inspect all the modules predicted as fp. To overcome this limitation of ECM, in a recent study we proposed an improved version called MECM [18], which facilitates achieving resource-based SQC models for a given resource allocation. The basic functionality of MECM is that it penalizes a classification model, in terms of the costs of misclassifications, if the model predicts more fp modules than the number which can be inspected by the allocated resources. Therefore, at the time of model calibration, the SQA team can provide information regarding how many modules can be inspected & improved with the allotted resources, and the approximate costs of misclassifications. Estimating the actual costs of misclassifications at the time of modeling is a difficult problem. However, an educated approximation is usually made based on heuristic software engineering knowledge gained from previously developed similar software projects. Based on the project-specific knowledge of available resources, and the costs of misclassifications, the MECM measure can be used to yield a resource-based SQC model. Consequently, the best possible & practical usage of the available resources can be achieved.

The empirical case study used to illustrate the GP-based multi-objective SQC model consists of software metrics, and fault data collected from two embedded software applications from the wireless telecommunications industry.
³A Type I error occurs when a nfp module is misclassified as fp, whereas a Type II error occurs when a fp module is misclassified as nfp.

²Some classifiers, such as count models, provide a probability that a module has a given number of faults. However, they are not suited for a multi-objective optimization problem such as that being addressed in this paper.


Other systems were also studied; however, their results are not presented. The paper continues in Section II with a discussion on ECM & MECM as performance measures. Section III provides an overview of GP-based modeling, including multi-objective optimization & our fitness functions. An empirical case study, and results are discussed in Section IV, and Section V, respectively.

II. EXPECTED COST OF MISCLASSIFICATION

The application of ECM [20] in software quality classification modeling was initially introduced by our research team in the context of controlling the overfitting tendencies of classification trees [5]. By effectively incorporating the costs of misclassifications, ECM provides a more practical insight into the performance of a classification model within its application domain. If we denote a fp module as Red, a nfp module as Green, $C_I$ as the cost of a Type I error, $C_{II}$ as the cost of a Type II error, $N_{G\to R}$ as the number of actual nfp modules misclassified as fp, and $N_{R\to G}$ as the number of actual fp modules misclassified as nfp, then ECM is given by

$$\text{ECM} = \frac{C_I\, N_{G\to R} + C_{II}\, N_{R\to G}}{N} \qquad (1)$$

where $N$ is the number of modules in the data set. To express the expected cost in terms of the cost ratio $c = C_{II}/C_I$, we normalize (1) with respect to $C_I$, which yields $\text{ECM} = (N_{G\to R} + c\, N_{R\to G})/N$.

A. Modified Expected Cost of Misclassification

The advantages of MECM over ECM can be observed via an example. Consider two classification models, i.e., Model A, and Model B, calibrated for a software project which has resources to inspect only 20% of the program modules. In addition, assume Model A predicts 35% of the modules as fp, while Model B predicts 15% of the modules as fp. Because Model A classifies a larger proportion of modules as fp, and hence has a lower Type II error rate, Model A is likely to have a lower ECM (1) than Model B, leading to the conclusion that Model A is better than Model B. However, if Model A is considered to be the preferred classification model, how does the SQA team justify which of the 35% modules (predicted as fp) will be subjected to quality improvements? If the team randomly picks modules from the predicted fp pool, then a best return-on-investment for the software quality improvement initiatives is not assured. On the other hand, if all modules predicted as fp, i.e., 35%, are inspected, then lower-than-usual resources may be deployed to the high-risk software modules. Therefore, evaluating classification models based solely on the ECM computed by (1) is not practical in regards to a cost-effective resource utilization. The proposed MECM measure provides an effective solution to such a problem.

Depending on the likely resource-availability scenario, the appropriate value of MECM can be computed as shown by the following cases [18]. The notations $M_{max}$, and $N_x$ respectively represent the maximum number of modules which can be tested, and the number of modules classified as members of class $x$. In addition, $N_{x\to y}$ represents the number of modules of class $x$ predicted as members of class $y$, where $x$, and $y$, can be either R, or G.

Case 1: $N_R = M_{max}$. This scenario indicates that the allocated software quality improvement resources are enough to inspect all modules predicted as fp, i.e., the assumption made by the original ECM measure. Therefore,

$$\text{MECM} = \text{ECM} \qquad (2)$$

Case 2: $N_R > M_{max}$. This scenario indicates that the allocated resources can inspect only a portion of the modules predicted as fp. Given this resources constraint, the selected classification model should aim to predict a number of fp modules (preferably with a minimum number of nfp modules) which is approximately equal to $M_{max}$. Using the ECM measure for such a scenario would be inappropriate because it assumes all predicted fp modules will be inspected. To account for the limited resource-availability factor, the following modifications need to be made to ECM.

1) Penalize a model for classifying surplus fp modules which cannot be inspected, and are actually fp, i.e., $(N_R - M_{max})\frac{N_{R\to R}}{N_R}$. This corresponds to an increase in the expected cost of misclassification.
2) Subtract a penalty for classifying surplus fp modules which cannot be tested, and are actually nfp, i.e., $(N_R - M_{max})\frac{N_{G\to R}}{N_R}$. This factor is removed because it is already included in the ECM term of (2), and corresponds to a decrease in the expected cost of misclassification.

The appropriate MECM for this case is given by

$$\text{MECM} = \text{ECM} + \frac{1}{N}\left( c\,(N_R - M_{max})\frac{N_{R\to R}}{N_R} - (N_R - M_{max})\frac{N_{G\to R}}{N_R} \right) \qquad (3)$$

Case 3: $N_R < M_{max}$. This scenario indicates that the allocated resources are such that, besides inspecting all the modules predicted as fp, a few modules from the predicted nfp group of modules can also be inspected. The problem with such a scenario is related to the strategy used for selecting the additional modules to inspect, because it is likely that, among the predicted nfp modules, only a very small fraction is actually fp. If the additional modules are selected such that most of them are actually nfp, then the return-on-investment of quality inspection resources is not maximized. To simplify the selection of the surplus modules for inspection, in our study the additional modules are picked randomly from the predicted nfp pool. Because the original ECM measure does not account for the resource-availability scenario for this case, the following adjustments are made to yield an equation for MECM.

1) A penalty is added for the predicted nfp modules which are tested using the surplus resources (i.e., $M_{max} - N_R$), but are actually nfp, i.e., $(M_{max} - N_R)\frac{N_{G\to G}}{N_G}$. This corresponds to an increase in the expected cost of misclassification.
2) A penalty is subtracted for the predicted nfp modules which are tested using the surplus resources, but are actually fp, i.e., $(M_{max} - N_R)\frac{N_{R\to G}}{N_G}$.


This factor corresponds to a decrease in the expected cost of misclassification. The appropriate MECM for this case is given by

$$\text{MECM} = \text{ECM} + \frac{1}{N}\left( (M_{max} - N_R)\frac{N_{G\to G}}{N_G} - c\,(M_{max} - N_R)\frac{N_{R\to G}}{N_G} \right) \qquad (4)$$
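To make the three cases concrete, the following is a minimal sketch (our own illustration, not the authors' implementation), with costs normalized by $C_I$ as in the text; the case boundaries and variable names follow the NOTATION list and our reading of (2)-(4):

```python
def mecm(n_gr: int, n_rg: int, n_rr: int, n_gg: int,
         m_max: int, c: float) -> float:
    """Modified expected cost of misclassification, in units of C_I.

    n_gr:  actual-nfp modules predicted fp (Type I errors)
    n_rg:  actual-fp modules predicted nfp (Type II errors)
    n_rr:  actual-fp modules predicted fp
    n_gg:  actual-nfp modules predicted nfp
    m_max: maximum number of modules the resources can inspect
    c:     cost ratio C_II / C_I
    """
    n = n_gr + n_rg + n_rr + n_gg      # all modules in the data set
    n_r = n_rr + n_gr                  # modules predicted fp
    n_g = n_gg + n_rg                  # modules predicted nfp
    ecm = (n_gr + c * n_rg) / n        # eq. (1), normalized by C_I

    if n_r == m_max:                   # case 1, eq. (2): resources match
        return ecm
    if n_r > m_max:                    # case 2, eq. (3): surplus fp predictions
        s = n_r - m_max                # predicted-fp modules left uninspected
        return ecm + (c * s * n_rr / n_r - s * n_gr / n_r) / n
    s = m_max - n_r                    # case 3, eq. (4): leftover resources
    return ecm + (s * n_gg / n_g - c * s * n_rg / n_g) / n

# Toy numbers echoing the Model A/B example: 100 modules, resources for 20.
# Model A predicts 35 fp, Model B predicts 15 fp (confusion counts invented
# purely for illustration). A's MECM rises well above its ECM under the
# case-2 penalty; B's barely changes.
print(mecm(n_gr=10, n_rg=5, n_rr=25, n_gg=60, m_max=20, c=10.0))   # Model A
print(mecm(n_gr=3, n_rg=18, n_rr=12, n_gg=67, m_max=20, c=10.0))   # Model B
```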

III. GENETIC PROGRAMMING

An inherent advantage of GP is that it can evolve a solution automatically from the training data [10], and does not require an assumption regarding the mathematical model of the structure, or the size, of the decision tree-based solution. The evolutionary process of GP attempts to imitate the Darwinian principle of survival of the fittest individuals. The fitness value of an individual is an indicator of its quality, and hence, provides a probability of which individuals can be selected for mating, and the subsequent reproduction of the next generation. We note that an in-depth discussion of GP is avoided. However, some key elements of GP are briefly explained [10].

GP uses the basic unit of the associated problem to assemble each individual. The basic unit may include function sets, and terminal sets. The typical structure of each individual can be seen as a tree-shaped structure, and each individual may be unique. There are three main operators for GP: reproduction, crossover, and mutation. Each uses random processing on one or more individuals. These three operators are used by GP for a fitness-based evolution. The fitness factor is a measure, used by GP during its simulated evolution, of how well a program (individual) has learned to yield the correct output based on the given inputs.

Decision tree-based classification models have been recognized as useful tools for data mining purposes [21]. This study is a continuation of our previous efforts [16], [17] in which an approach for building GP-based decision tree models for classifying software modules either as fp or nfp was presented. The commonly used standard GP requires that the function set, and the terminal set have closure properties, i.e., all functions in the function set must accept all kinds of data types & data values as function arguments [22]. However, this requirement does not guarantee that GP will generate a useful decision tree model. Montana recognized this problem, and proposed the Strongly Typed Genetic Programming (STGP) approach [23]. It relaxes the closure property requirement of standard GP by introducing additional criteria for genetic operations. More specifically, given a precise description of the permissible data types for function arguments, STGP will only generate individuals based on the constraint that the arguments of all functions are of the correct type [23]. We only discuss the modifications made to standard GP in order to build a decision tree model for software quality classification purposes; an illustrative sketch follows the list below.

1) Constraint: Each function & terminal is assigned a type. Different types may not crossover or mutate under certain problem-specific constraints. The leaf node is a function which returns the class membership of a module, and the decision nodes (non-leaf nodes) are simple logical equations which return either true, or false. Hence, only constants & independent variables appear in the decision nodes. The function which returns the class membership of a module is not used in the internal nodes. Moreover, a root node can be either a leaf node, or an internal node.
2) Crossover: In this stage, additional limitations are applied to the genetic operation. In the context of GP-based decision trees, we define that the type of a subtree is the type of its root node. Therefore, when two subtrees are selected for crossover, they are required to have the same type so that a proper decision tree is generated.
3) Mutation: If a subtree is selected for mutation purposes, the replacement tree must have the same type, or at least a similar type. A subtree of a similar type is one which, when used as a replacement, i.e., mutation, yields a new tree which is a permissible or proper decision tree. For example, the leaf node is a function which returns the class membership of a module, and can be replaced by a new subtree whose root node is a logical equation.
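A minimal sketch of such a typed decision-tree representation, with a type-checked crossover (our own illustration; the metric names follow the case study, and the structure is an assumption rather than the authors' lilgp implementation):

```python
import random
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str                 # class membership: "fp" or "nfp"

@dataclass
class Test:
    metric: str                # independent variable, e.g., "LOCA"
    threshold: float           # constant used in the logical equation
    left: "Node"               # followed when metric <= threshold
    right: "Node"              # followed when metric > threshold

Node = Union[Leaf, Test]

def classify(node: Node, module: dict) -> str:
    """Traverse the tree for one module's software-metric values."""
    while isinstance(node, Test):
        node = node.right if module[node.metric] > node.threshold else node.left
    return node.label

def collect(node: Node, out: list) -> list:
    """Collect every subtree (node) in depth-first order."""
    out.append(node)
    if isinstance(node, Test):
        collect(node.left, out)
        collect(node.right, out)
    return out

def typed_crossover(a: Test, b: Test) -> None:
    """Swap the contents of two same-type subtrees, mirroring the STGP rule
    that crossover points must share a type so each offspring remains a
    proper decision tree."""
    x = random.choice([m for m in collect(a, []) if isinstance(m, Test)])
    y = random.choice([m for m in collect(b, []) if isinstance(m, Test)])
    x.metric, y.metric = y.metric, x.metric
    x.threshold, y.threshold = y.threshold, x.threshold
    x.left, y.left = y.left, x.left
    x.right, y.right = y.right, x.right

# A small hypothetical tree: LOCA > 3 sends a module to the right subtree,
# otherwise the module is classified as nfp.
tree = Test("LOCA", 3, Leaf("nfp"), Test("INSP", 2, Leaf("nfp"), Leaf("fp")))
print(classify(tree, {"LOCA": 5, "INSP": 4}))   # -> fp
```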

A. Multi-Objective Optimization

A multi-objective optimization solution usually aims at obtaining a set of Pareto-optima [9], which represents feasible solutions where the solution for a single objective cannot be improved without sacrificing the solution for some (or more) other criteria. Let $\bar{x}$ be a vector of independent variables, $f_i(\bar{x})$ be an objective function, and $k$ be the number of objective functions to be considered for optimization. Assuming all objectives are to be minimized, a vector $\bar{x}^*$ is a Pareto-optima iff there is no vector $\bar{x}$ which exists with the characteristics

$$f_i(\bar{x}) \le f_i(\bar{x}^*) \quad \text{for all } i \in \{1, \dots, k\}, \text{ and}$$
$$f_j(\bar{x}) < f_j(\bar{x}^*) \quad \text{for at least one } j \in \{1, \dots, k\}.$$

In the case of all non-Pareto-optima vectors, the solution for at least one objective function $f_i$ can be improved without sacrificing the solutions for any of the other objective functions. The most frequently used methods for generating Pareto-optima are based on the notion of replacing the multi-objective problem with a parameterized scalar problem [14]. Typically, by varying the value of each parameter for each objective, it is possible to generate all, or parts, of the Pareto-optima set.

The multi-objective optimization problem addressed in this study includes three objectives: 1) to minimize the modified expected cost of misclassification (MECM), 2) to obtain the number of modules predicted as fp such that it is equal to the number of modules which can be inspected with the available software quality improvement resources, and 3) to minimize the size of the decision tree model. The third objective addresses the important issue of simplicity, and ease in model interpretation & comprehension. In addition, limiting the size of the decision tree assists in the acceleration of GP run times. The optimization method adopted in our study is based on the Non-dominated Sorting Genetic Algorithm, as proposed by Srinivas and Deb [24]. Because minimizing the MECM value is the most important objective, sorting by the first objective ensures that it is the least likely to be violated. On the other hand, the third objective, i.e., minimizing the size of the decision tree, is the one most likely to be violated during the process of each run.
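As an illustration, here is a sketch of the dominance test, and a layered non-dominated ranking in the spirit of Steps 1-3 below (our own simplified version, not the paper's exact procedure):

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if objective vector a Pareto-dominates b (all objectives are
    minimized): a is nowhere worse, and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def rank_layers(objs: List[Sequence[float]]) -> List[int]:
    """Assign each individual the index of its non-dominated layer: 1 for
    the non-dominated front, 2 for the front that remains once layer 1 is
    removed, and so on (lower layers are preferred for breeding)."""
    remaining = set(range(len(objs)))
    ranks = [0] * len(objs)
    layer = 1
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i])
                            for j in remaining if j != i)]
        for i in front:
            ranks[i] = layer
        remaining -= set(front)
        layer += 1
    return ranks

# Objective vectors (MECM, |r - p|, tree size); lower is better everywhere.
# The second individual is dominated; the other two are not.
print(rank_layers([(0.10, 0.00, 9), (0.12, 0.02, 9), (0.12, 0.00, 5)]))
# -> [1, 2, 1]
```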


Let $P$ be the size of the Pareto-optima set, $F$ be the fitness value for each individual, $g$ be the generation, $P_g$ be the Pareto-optima set in generation $g$, and $S$ be the set of Pareto-optima solutions we are interested in after the given run. At the beginning of a run, $g = 0$, $F = 1$, and $S$ is an empty set. Each GP run consists of the following steps:

Step 1: Sort the population according to the increasing values of the first objective. If several individuals have the same values, then order them successively by the increasing values of the second, and third objectives.
Step 2: Select all non-dominated individuals from the population based on their sorted order. Assign $F$ to the fitness of the non-dominated individuals. Then increase $F$ by 1.
Step 3: The individuals which are selected & assigned a fitness in the previous step are ignored. Repeat the process of selecting the non-dominated individuals, and return to Step 2. If all individuals have already been selected, then continue to Step 4.
Step 4: Save the first $P$ individuals to $P_g$ as the Pareto-optima set of the current generation, and compare each solution in $P_g$ with each solution of $S$. The first $P$ individuals among $P_g$ & $S$ are saved into $S$ again.

When the computation is completed, each individual will be selected to breed according to the probability defined by the fitness value obtained from the above steps. Subsequent to the last generation, $S$ will represent the best models of the run.

B. Fitness Functions

1) Minimizing MECM: A lower MECM value represents a preferred classification model. Because misclassification costs are specific to the software project & the development organization, the SQA team can realize an educated approximation of the cost ratio, i.e., $c = C_{II}/C_I$, based on heuristic software engineering knowledge gained from previously developed similar software projects. The applied penalty of misclassification is defined as follows: if a nfp module is misclassified as fp, a penalty of 1 will be applied to the fitness of that particular classification model. By the same token, if a fp module is misclassified as nfp, a penalty of $c$ will be applied to the respective fitness value. The fitness function for this objective is given by

$$f_1 = \text{MECM} \qquad (5)$$

computed with the normalized costs $C_I = 1$, and $C_{II} = c$.

2) Resource availability: The aim here is to penalize a model if it predicts a surplus (or a deficient) number of fp modules compared to the maximum number, i.e., $M_{max}$, that can be inspected or tested by the available resources. Therefore, if the total number of modules predicted as fp is equal to $M_{max}$, then the fitness function is equated to zero, implying that there is no penalty. Otherwise, the fitness function is given by

$$f_2 = |r - p| \qquad (6)$$

where $r = N_R/N$, and $p = M_{max}/N$. A lower value of (6) implies a better performance in regards to the second objective.

3) Size of decision tree: We define the size of a decision tree by its number of nodes; the smaller the size, the better is the fitness for the third objective. We select a threshold value of five, implying that if the tree size is less than five, then the fitness of that specific classification tree is five. The minimum size of the tree was empirically set to five nodes in order to prevent any loss of diversity in the GP population. A sketch combining the three fitness values is shown below.
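Pulling the three objectives together, the following sketch scores one candidate tree (it reuses the illustrative classify, collect, and mecm helpers sketched earlier; the exact pairing with (5) and (6) is our reading of the text, not the authors' code):

```python
def fitness_vector(tree, data, m_max: int, c: float):
    """Return (f1, f2, f3) for one candidate tree; smaller is better on
    every component. `data` is a list of (metrics_dict, actual_label)."""
    n_gr = n_rg = n_rr = n_gg = 0
    for module, actual in data:
        predicted = classify(tree, module)
        if actual == "nfp" and predicted == "fp":
            n_gr += 1                  # Type I error, penalty 1
        elif actual == "fp" and predicted == "nfp":
            n_rg += 1                  # Type II error, penalty c
        elif actual == "fp":
            n_rr += 1                  # fp correctly predicted fp
        else:
            n_gg += 1                  # nfp correctly predicted nfp
    n = len(data)
    f1 = mecm(n_gr, n_rg, n_rr, n_gg, m_max, c)            # eq. (5)
    n_r = n_rr + n_gr
    f2 = 0.0 if n_r == m_max else abs(n_r / n - m_max / n)  # eq. (6)
    f3 = max(len(collect(tree, [])), 5)  # objective 3: node count, floor 5
    return (f1, f2, f3)
```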

IV. EMPIRICAL CASE STUDY

A. System Description

The case study involved data collection efforts from two large Windows-based embedded system applications used for customizing the configuration of wireless telecommunications products. The two C++ applications provide similar functionality, and contain common source code. The primary difference is the type of wireless product that each supports. Both systems comprised over 1400 source code files, and contained more than several million lines of code each. Software metrics were obtained by observing the configuration management systems, while the problem reporting systems tracked & recorded the status of different problems. Information, such as how many times a source file was inspected prior to system tests, was recorded. The obtained software metrics reflected aspects of source files, and therefore, a software module for the systems consisted of a source file. The fault data collected represent the faults discovered during system tests.

Upon preprocessing & cleaning the software data, 1211 modules remained & were used for model calibration. Data preprocessing primarily included the removal of observations with missing information or incomplete data. The decision to remove certain modules from the data set was based on our discussion with the development team. Among the 1211 modules considered for modeling, over 66% (809) were observed to have no faults, while the remaining 402 modules had one or more faults.

The five software metrics used for this case study are: INSP, the number of times the source file was inspected prior to the system test release; LOCB, the number of lines for the source file prior to the coding phase; LOCT, the number of lines of code for the source file prior to the system test release; LOCA, the number of lines of commented code for the source file prior to the coding phase; and LOCAT, the number of lines of commented code for the source file prior to the system test release. The available data collection tools determined the number & selection of the metrics. The product metrics used are statement metrics for the source files. They primarily indicated the number of lines of source code prior to the coding phase (i.e., auto-generated code), and just before system tests. The process metric, INSP, was obtained from the problem reporting systems. The module-quality metric, i.e., the dependent variable, is the number of faults observed during system test.

The SQC models are dependent on the chosen threshold value (of the quality metric), which identifies modules as either fp or nfp.


The main stimulus for selecting the appropriate threshold value is to build the most useful & system-relevant SQC model possible. Therefore, the selection of the threshold value is usually dependent on the software quality improvement needs of the development team.

Because software metrics data from subsequent releases were not available, an impartial data splitting was applied to the data set to obtain the fit & test data sets. Consequently, the fit, and test data sets had 807, and 404 modules, respectively. The classification models for this case study are calibrated to classify a software module as either fp or nfp, based on a threshold of two faults; i.e., if a module has two or more faults, it is categorized as fp, and as nfp otherwise. According to the selected threshold value, the fit data set has 632 nfp modules, and 175 fp modules, whereas the test data set has 317 nfp modules, and 87 fp modules. The selection of a threshold value for the number of faults is specific to a given software project.

B. Empirical Settings

The modeling tool used for our empirical studies is lilgp (version 1.01), developed by D. Zongker & B. Punch of Michigan State University. It is implemented in C, and is based on the LISP works of J. Koza [22]. When applying lilgp for a GP application, each individual is organized as a decision tree in which a node is a C function pointer. The execution speed of lilgp is faster than that of interpreted LISP.

The below-mentioned modeling methodology is adopted for calibrating software quality classification models with a multi-objective optimization. The procedure is based on the given pair of inputs of 1) how many modules can be inspected or tested according to the available resources, and 2) the project-specific cost ratio.

1) Divide the modules in the fit & test data sets into two classes, i.e., fp & nfp, according to the chosen threshold value mentioned in the previous section.
2) Build a GP-based multi-objective classification model according to the procedure discussed in Sections III, III-A, and III-B.
3) Compute the quality-of-fit performance of the classification model based on the quality estimation of the modules in the fit data set, the number of predicted fp modules, and the size of the decision tree. The accuracy of a model, i.e., $\eta$, indicates, according to the allocated resources, how many of the actual fp modules are predicted as fp, i.e.,

$$\eta = \frac{N_{R\to R}}{p \times N} \qquad (7)$$

where $p$ is the ratio of the number of modules which can be inspected by the available resources to the total number of modules in the respective data set. The equation represents the inspection efficiency for the given amount of resources (a small sketch follows this list).
4) Apply the classification model to the test data set to evaluate its predictive performance. Validating the model on an independent data set can provide an indication regarding the accuracy of the classification model if it were applied to a currently under-development system release (or similar software project) with known software metrics, but unknown quality data.
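A small sketch of the accuracy check in step 3), under our reading of (7) in which $p \times N$ is rounded to a whole number of inspectable modules:

```python
def eta(n_rr: int, p: float, n: int) -> float:
    """Inspection efficiency per eq. (7): actual-fp modules among the
    predictions, relative to the (rounded) p*N modules that the available
    resources can inspect."""
    return n_rr / round(p * n)

# Echoing the first row of Table II discussed in Section V: p = 0.05 of the
# 807-module fit data set allows roughly 40 inspections; if 40 of the
# predicted fp modules are actually fp, the model scores a perfect 1.0.
print(eta(n_rr=40, p=0.05, n=807))   # -> 1.0
```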

TABLE I PARAMETERS FOR GP
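For reference, lilgp reads its settings from a plain-text parameter file. The following hypothetical file uses the parameter names cited in the text; the values, and the breed-phase lines, are illustrative assumptions rather than the study's actual settings:

```
pop_size = 500
max_generations = 100
init.method = half_and_half
init.depth = 2-6
max_depth = 17
output.bestn = 5
breed_phases = 3
breed[1].operator = crossover, select=fitness
breed[1].rate = 0.80
breed[2].operator = reproduction, select=fitness
breed[2].rate = 0.10
breed[3].operator = mutation, select=fitness
breed[3].rate = 0.10
```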

The independent variables are the five software metrics described earlier, whereas the dependent variable is the class membership based on the number of faults observed during system test. Because the actual cost of misclassifications is usually unknown at the time of modeling, it is beneficial to the project management team to calibrate classification models for a range of cost ratios which are likely to suit the project's needs. Accordingly, we consider different values for the cost ratio, denoted as $c$. The cost ratios considered in our study are 2, 5, 10, 17.5, 25, and 50. Based on our empirical software engineering knowledge, this set of values covers a broad spectrum of possible cost ratios. However, from a practical software engineering point of view, the cost ratios of 2 & 5 are very unlikely for the embedded systems being modeled in this study.

In each generation, our GP process will automatically select the five best models (parameter output.bestn) from the Pareto set it finds. If the Pareto set contains more than 5 models according to its non-domination order, the models will be sorted by using the relative importance of the objectives. The first five (from the top) models which have lower MECM values will be presented at the end of the GP run. We performed 20 runs for each combination of cost ratio $c$, and available resources $p$; thus, 100 models were recorded for each combination of $c$, and $p$. The best decision tree (lowest MECM value for fit data) which appears during the 20 runs is selected as the preferred classification model for the given combination.

In the context of GP-based modeling, certain parameters have to be assigned for a given set of input conditions. In our study, the parameters that were used for each value of $c$ are listed in Table I.


Some of the basic parameters, such as the depth (init.depth) of the initial GP population, and the method (init.method) for generating the initial population, were assigned the default values provided by John Koza. Other parameters, such as population size (pop_size), maximum number of generations (max_generations), maximum depth of the population (max_depth), crossover rate, reproduction rate, and mutation rate, were empirically varied during our study. The selection method for the three genetic operators is fitness-proportional selection. The function set only contains four functions (C1, and C2 represent the nfp, and fp classes, respectively) which are essential for building a binary decision tree model. We observed that our GP process was not sensitive to the change of the above-mentioned parameters as long as a reasonable range of values was used, such as population size 500, number of generations 100, and mutation rate 0.0. We note that a detailed analytic study to find an optimal combination of the GP parameters is out of scope for this study.

V. DISCUSSION

The performance of the preferred classification models for the cost ratios of 2, and 10 are presented in Tables II and III, respectively. The results for the other four cost ratios are not presented due to similarity of empirical conclusions, and paper size considerations. The first column, $p$, indicates the fraction of the total number of modules which can be inspected or tested according to the available software quality improvement resources. The second, and third columns indicate the Type I, and Type II misclassification error rates, respectively. The fourth, and fifth columns respectively indicate the number of modules predicted as fp, and the number of modules that can be inspected, i.e., $p \times N$. The sixth column indicates the modified expected cost of misclassification values for the respective classification models. The last column represents $\eta$, the performance accuracy of the classification model as defined by (7), i.e., the percentage of the predicted fp modules which are actually fp.

As an illustration, consider the classification model shown in the first row of Table II, i.e., for $p = 0.05$, and $c = 2$. In the case of the fit data set, the model predicts 41 modules as fp. When comparing this number to $p \times N = 40$, we observe that the model performs exceptionally well. Moreover, the performance of the model is perfect, i.e., $\eta = 1.0$, implying that all modules predicted as fp are actually fp. Hence, we see that the quality-of-fit of this model is excellent. The Type II error rate is very large because the model is forced to optimize the predicted number of fp modules to 40, implying that many actual fp modules are misclassified. Let us now examine the predictive capability of this model, i.e., its performance on the test data set. We observe that 23 modules are classified as fp as compared to $p \times N = 20$, implying the model performs very well. In addition, the performance value of $\eta$ implies that over 90% of the predicted fp modules are actually fp.

An example binary decision tree model obtained for the case study is presented in Fig. 1. The model corresponds to a cost ratio of 10, and resource availability of 0.40. The leaf nodes are labeled as either fp, or nfp.

TABLE II CLASSIFICATION MODELS FOR c = 2

Fig. 1. GP decision tree model for c = 10, and p = 0.40.

A non-leaf tree node shows a specific software metric, and its threshold, which is used to identify the subsequent traversal path. For example, the root node indicates that if LOCA (the number of lines of commented code for the module prior to the coding phase) is greater than 3, then the right subtree of the root node is traversed, where other conditions are tested; otherwise, the module is classified as nfp.


TABLE III CLASSIFICATION MODELS FOR c = 10

The performance details of the GP model for the fit & test data sets can be observed in Table III. The table indicates that, as $p$ increases, the performance $\eta$ decreases. Moreover, as $p$ increases, we observe an inverse relationship between the Type I, and Type II error rates; i.e., as $p$ increases, the Type I error rate increases while the Type II error rate decreases. For the different $p$ values, the number of modules predicted as fp by the model is very close to the number of modules which can be inspected, suggesting an optimized solution with respect to resource utilization. Similar observations were also made from the performances for the other cost ratios. An intuitive analysis of why the performance $\eta$ decreases as $p$ increases is now presented.

Assume that modules are scattered across a spectrum with respect to their predicted class, such that the left end represents the least faulty software module, while the right end represents the most faulty module. Based on this assumption, the actual fp modules are likely to be concentrated towards the right; the actual nfp modules are likely to be concentrated towards the left; and the middle portion of the spectrum will be comprised of an intermixed collection of fp & nfp modules. It is at this middle portion of the spectrum that the misclassification errors are more likely to occur, primarily because of data points that do not follow the general trend of the data set. Therefore, for inspection purposes for a given amount of resources, the needed modules would be picked starting from the right end of the spectrum.

When $p = 0.05$, it was observed that for both fit, and test data sets, the performance was very good, i.e., 90% to 100%, across all the cost ratios. This is analogous to selecting 5% of the modules starting from the right end of the spectrum, which are more likely to be actually fp. On the other hand, at the largest proportion of inspectable modules considered, we observed that the performance was only about 40% to 48% for all the cost ratios. The performance reduction is intuitive because, when a larger percentage of modules can be inspected, many nfp modules will invariably be flagged as fp (as per the spectrum analogy presented earlier), thus lowering the performance of the model, and correspondingly increasing the Type I error rate. The relationship of a decrease in $\eta$ with an increase in the Type I error rate is seen in the tables. Moreover, when comparing the MECM values of the quality-of-fit (fit data), and predictive-quality (test data) performances of the different classification models, we note that the respective models do not demonstrate over-fitting tendencies, and generally maintain the achieved quality-of-fit performance.

In a related empirical case study of the software system presented in Section IV-A, we applied our GP-based classification technique to build classification models which predict modules as either change-prone, or not change-prone. The module-quality metric (dependent variable) in that study was the number of lines of code churn, which was defined as the summation of the number of lines added, deleted, or modified during system test. Empirical observations & conclusions made from the study were similar to those presented in this paper. Due to the similarity of results, and paper-size concerns, we have not included those results in this paper.

VI. CONCLUSION

This study presents a genetic programming-based multi-objective optimization modeling technique for calibrating a goal-oriented software quality classification model geared toward a cost-effective resource utilization. Using case studies of two wireless configuration applications, software quality classification models are calibrated. The effective multi-objective optimization capability of GP is demonstrated through a representative case study in which models were calibrated to predict software modules as either fault-prone or not fault-prone. Multi-objective optimization is often a practical need in many real-world problems, such as software quality estimation modeling. An advantage of GP is that it does not require extensive mathematical assumptions about the size, and structure of the optimization problem.

In software engineering, the importance of making the best of the limited software inspection & testing resources is well founded. In addition to effective resource utilization, the project-specific costs of misclassifications also affect the usefulness of classification models. Building on our previously developed GP-based software quality classification modeling technique, this study focuses on calibrating models which optimize the following criteria


in a descending order of importance: 1) minimizing our recently proposed modified expected cost of misclassification, a goal-oriented model-selection, and model-evaluation measure for software quality classification models; 2) ensuring the number of predicted fp modules is equal to the number of modules which can be inspected by the allocated resources; and 3) controlling the size of the decision tree to facilitate comprehensibility in model interpretation, and provide faster GP runs. In addition to presenting a multi-objective classification modeling technique, a method of finding solutions in the Pareto-optima set is also proposed.

In summary, we have shown that the proposed model achieves very good performance in the context of optimization of the three modeling objectives. Therefore, a project management team can simply provide the proportion of software modules which can be inspected or tested, and the project-specific cost ratio, which can be approximated based on heuristic knowledge obtained from similar projects. Subsequently, the proposed modeling technique will yield a goal-oriented & resource-based classification model. Future research will focus on improving the performance of the model by optimizing the GP parameters.

ACKNOWLEDGMENT

The authors would like to thank the associate editor, Dr. Min Xie, and the anonymous reviewers for their comments. They thank Naeem Seliya for his suggestions, assistance with modifications, and patient editorial reviews of this paper. They also thank Kehan Gao for her reviews.

REFERENCES
[1] N. F. Schneidewind, "Body of knowledge for software quality measurement," IEEE Computer, vol. 35, no. 2, pp. 77-83, February 2002.
[2] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Trans. Software Engineering, vol. 28, no. 7, pp. 706-720, July 2002.
[3] Y. Ping, T. Systa, and H. Muller, "Predicting fault-proneness using OO metrics: An industrial case study," in Proceedings: 6th European Conference on Software Maintenance and Reengineering, T. Gyimothy and F. B. Abreu, Eds., Budapest, Hungary, March 2002, pp. 99-107.
[4] P. Runeson, M. C. Ohlsson, and C. Wohlin, "A classification scheme for studies on fault-prone components," Lecture Notes in Computer Science, vol. 2188, pp. 341-355, 2001, Springer.
[5] T. M. Khoshgoftaar, E. B. Allen, and J. Deng, "Controlling overfitting in software quality models: Experiments with regression trees and classification," in Proceedings: 7th International Software Metrics Symposium, London, UK, April 2001, pp. 190-198, IEEE Computer Society.
[6] T. M. Khoshgoftaar, E. B. Allen, and J. Deng, "Using regression trees to classify fault-prone software modules," IEEE Trans. Reliability, vol. 51, no. 4, pp. 455-462, December 2002.
[7] T. M. Khoshgoftaar and E. B. Allen, "A practical classification rule for software quality models," IEEE Trans. Reliability, vol. 49, no. 2, pp. 209-216, June 2000.
[8] N. F. Schneidewind, "Investigation of logistic regression as a discriminant of software quality," in Proceedings: 7th International Software Metrics Symposium, London, UK, April 2001, pp. 328-337, IEEE Computer Society.

[9] H. Eschenauer, J. Koski, and A. Osyczka, Multicriteria Design Optimization: Procedures and Applications, 2nd ed. Germany: Springer-Verlag, 1990.
[10] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Application. New York: PWS Publishing Company, 1998.
[11] J. Holland, Adaptation in Natural and Artificial Systems, 2nd ed. Ann Arbor: University of Michigan Press, 1992.
[12] H. Iba, H. de Garis, and T. Sato, "Genetic programming using a minimum description length principle," in Advances in Genetic Programming. Cambridge: MIT Press, 1996.
[13] P. J. Angeline and K. E. Kinnear, Eds., Advances in Genetic Programming. Cambridge: MIT Press, 1996, vol. II.
[14] Z. Michalewicz, D. Dasgupta, R. G. L. Riche, and M. Schoenauer, "Evolutionary algorithms for constrained engineering problems," Computers and Industrial Engineering, vol. 30, pp. 851-870, 1996.
[15] A. Osyczka, Evolutionary Algorithms for Single and Multicriteria Design Optimization. New York: Physica-Verlag Heidelberg, 2002.
[16] T. M. Khoshgoftaar, Y. Liu, and N. Seliya, "Genetic programming-based decision trees for software quality classification," in Proceedings of the 15th International Conference on Tools with Artificial Intelligence, Sacramento, California, USA, November 2003, pp. 374-383, IEEE Computer Society.
[17] Y. Liu and T. M. Khoshgoftaar, "Genetic programming model for software quality prediction," in Proceedings: 6th International High Assurance Systems Engineering Symposium, Boca Raton, Florida, USA, October 2001, pp. 127-136, IEEE Computer Society.
[18] T. M. Khoshgoftaar, N. Seliya, and A. Herzberg, "Resource-oriented software quality classification models," The Journal of Systems and Software, vol. 76, no. 2, pp. 111-126, 2005.
[19] T. M. Khoshgoftaar and N. Seliya, "Comparative assessment of software quality classification techniques: An empirical case study," Empirical Software Engineering Journal, vol. 9, no. 3, pp. 229-257, 2004.
[20] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 2nd ed. Englewood Cliffs, NJ, USA: Prentice Hall, 1992.
[21] R. S. Michalski, I. Bratko, and M. Kubat, Machine Learning and Data Mining: Methods and Applications. John Wiley and Sons, 1998.
[22] J. R. Koza, Genetic Programming. Cambridge: MIT Press, 1992, vol. I.
[23] D. J. Montana, "Strongly typed genetic programming," Evolutionary Computation, vol. 3, no. 2, pp. 199-230, 1995.
[24] N. Srinivas and K. Deb, "Multiobjective optimization using non-dominated sorting in genetic algorithms," Evolutionary Computation, vol. 2, pp. 221-248, 1994.

Taghi M. Khoshgoftaar is a professor of the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering Laboratory, and the Data Mining and Machine Learning Laboratory. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair, and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004, and 2005, respectively. He has served on technical program committees of various international conferences, symposia, and workshops. Also, he has served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality, and Fuzzy Systems.

Yi Liu received the Ph.D. degree in Computer Science from the Department of Computer Science and Engineering at Florida Atlantic University in 2003. She is currently an assistant professor in the Department of Mathematics and Computer Science at Georgia College and State University. Her research interests include software engineering, software metrics, software reliability and quality engineering, computer performance modeling, genetic programming, and data mining.
