Bug Prediction Optimization with ML
Tripti Lamba
CSE
Chandigarh University
Punjab, India
[email protected]
Abstract—Predicting a bug and attaining a successful application is critical in today's scenario during the development phase of a program. This can only be accomplished by foreseeing some of the shortcomings in the early stages of development, resulting in software that is dependable, efficient, and of high quality. A challenging aspect is to develop a sophisticated model capable of determining the error and producing effective software. A few ML methods are utilized to achieve this, and they produce accuracy on both training and test datasets. The novelty of this approach is to demonstrate the applicability of machine learning algorithms, namely Neural Network, SVM, Decision Tree and Cubist, using different performance metrics, i.e. R, R square, Root Mean Square Error and Accuracy, and to obtain the optimal outcome-based algorithm for bug reports on diverse datasets from the PROMISE repository. Findings reveal that SVM gives significantly higher accuracy than all the other algorithms on the ANT dataset. The work integrates the existing work on detecting a bug in software by providing information about the various aforementioned methods for bug prediction. The proposed work highlights the accuracy obtained by the current approaches, which is significant for research scholars and solution providers.

Keywords— Machine Learning, SVM, Software Performance Metrics, Accuracy, Bug Prediction.

I. INTRODUCTION
To address issues related to ever-larger and more complicated data sets, data science and machine learning approaches, i.e. supervised learning[1], unsupervised learning[2] and reinforcement learning[3], are now widely used throughout the science and engineering fields[4]–[6]. The optimization, analysis, control, and design of the proposed system or process that provides data collection are frequent issues posed to scientists and engineers[7].

Data sets with millions of samples and as many features as feasible can be used to describe many machine learning tasks. To select an efficient strategy and effectively simulate the system at hand, it is essential to comprehend the nature of the problem. Self-driving automobiles, speech recognition, and facial recognition are a few examples of complicated issues that call for numerous ways to be solved[8].
Bug prediction can be accomplished with the aid of machine learning and predictive analysis. Developers can make improvements as they create code by integrating the prediction models into their development environments. Even so, models cannot be created in a manner that comes close to perfection, and some inaccurate predictions are unavoidable. There are two types of these incorrect predictions: those that incorrectly label clean code as buggy and those that incorrectly label buggy code as clean. Obtaining an ideal model that balances the incorrect predictions is crucial to inspiring developers to trust the model. The models have been studied in terms of their level of accuracy and complexity despite the lack of common benchmarks for model comparison[9]. The accuracy of the model is greatly influenced by the selected metrics, and this becomes the most important step in bug prediction. The method becomes more difficult as the number of metrics in the model increases, and the inclusion of pointless measurements can significantly reduce accuracy.

Software development challenges represent a learning process that varies depending on the conditions and the stage of development in which we find, and can easily detect, the problem. Fig. 1 shows how the data development process is carried out on three levels, i.e. Level 1, Level 2, and Level 3. At Level 1, data filtration and extraction are performed; the extracted data is then dissected into Training, Testing, and Validation sets at Level 2 and Level 3, which provides the actual notable data for an analyst to work on and compare to the entire system. It is always recommended to perform the resource-intensive, time-consuming, and expensive validation activities[10].

Fig. 1: Notable data development process.

The study's objective is to determine the best bug detection algorithm using machine learning, evaluate the accuracy of all the algorithms, and compare them. The optimal algorithm will make it simple for the user to evaluate the findings[11].

The paper is structured as follows: Section II addresses related literature, and Section III elaborates the OPABP model structure. Section IV includes all of the statistical analysis used for ML, Section V includes the analysis results evaluated using ML, and Section VI contains the conclusion and highlights the future work of the research.

II. RELATED WORK

For the software defect problem, a hybrid classifier has been suggested for five NASA datasets[12], and the suggested classifier's performance is contrasted with that of competing algorithms. Instead of trying to find a better classifier, more attention should be paid to data pretreatment, feature selection, and other data mining approaches[13]. Due to its long-term practical need, High Impact Bug Report prediction is an essential research issue. X. Wu et al.[14] discussed a high impact bug predictor, an automated method for locating particular categories of bug reports in huge bug repositories; for data labelling, a computer-human interaction mode and active learning are used to reduce effort. The most statistically effective strategy frequently comes from one of the many “newly" produced combinations, indicating that the state-of-the-art transfer learning and classification combinations are still far from being fully developed. The findings of Ke Li et al.[15] offer insightful information that practitioners in this particular research sector can use; they also discussed a sophisticated optimizer for Cross Project Defect Prediction that explores the parameter space of the transfer learning part. U. Ali et al.[16] suggested a classification framework for the identification of software modules that are likely to contain defects. The researchers' main efforts to boost performance were feature selection and variant-based ensemble classification, and the framework's findings are contrasted with those of other popular supervised classifiers from academic studies. A. Panichella et al.[17] make the case that because present defect prediction techniques are trained on tasks that are unrelated to their intended use, they may not perform to their full potential: while the true goal is to rate the artifacts and make affordable forecasts, current approaches based on statistical models are trained to find the best match to estimate the raw number of flaws in artifacts.
III. PROPOSED MODEL

Regression is one of the machine learning techniques for determining the relationship between relevant variables; specifically, regression allows for the selection of the curve that best fits the available data. Many regression techniques[18] are available for resolving the engineering problem. The goal of regression is to reduce the total squared error (least squares)[9].
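In symbols (our notation, not the paper's), for observations (x_i, y_i), i = 1, ..., n, and a fitted curve f(x; \beta), the least-squares objective and the root mean square error used later as a performance metric are

\min_{\beta} \sum_{i=1}^{n} \bigl( y_i - f(x_i; \beta) \bigr)^2 , \qquad \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2 } .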
In the OPABP model the entire process is divided into four different stages, which are shown in Fig. 2: in the first stage the data is acquired from the reputed PROMISE repository; in the next stage data preprocessing is performed, which can be achieved by applying a feature selection technique; the next stage is data modelling, which works with the data metrics; and in the last stage data visualization is performed.

Fig. 2 Working layout used in OPABP.

A. DATA ACQUISITION

For the purpose of this research, the Bug dataset[19] is used, where 20 metrics, i.e. WMC, MFA, DIT, CAM, NOC, IC, CBO, CBM, RFC, AMC, LCOM, Ca, LCOM3, Ce, NPM, Max_cc, DAM, Avg_CC, MOA and LOC, are used as the features (i.e., independent variables) and the metric “bug” is used as the response or dependent variable; a detailed illustration of the same is given in TABLE 1. The feature variables (i.e., independent variables) are the 20 metrics, and the response (or dependent) parameter is the number of bugs.

TABLE 1. Dataset metrics (excerpt): … cyclomatic complexity; 19. MOA – Measure of Aggregation; 20. LOC – Lines of Code; 21. Bug – number of bugs.
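As a purely illustrative sketch (not code from the paper), one way to realize this acquisition step is to load a PROMISE-style CSV export and separate the 20 metric columns from the bug response; the file name ant.csv, the lowercase column names, and the use of pandas are assumptions made for this example.

# Hypothetical sketch: load a PROMISE-style dataset and separate the 20
# metric features (independent variables) from the "bug" response.
import pandas as pd

METRICS = ["wmc", "mfa", "dit", "cam", "noc", "ic", "cbo", "cbm", "rfc", "amc",
           "lcom", "ca", "lcom3", "ce", "npm", "max_cc", "dam", "avg_cc", "moa", "loc"]

def load_bug_dataset(path: str):
    """Return (X, y): X holds the 20 code metrics, y the bug count per module."""
    df = pd.read_csv(path)          # e.g. "ant.csv" exported from the repository (assumed file)
    X = df[METRICS]                 # independent variables
    y = df["bug"]                   # dependent variable: number of bugs
    return X, y

X, y = load_bug_dataset("ant.csv")  # illustrative file name only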
B. DATA PREPROCESSING

The generalization performance of an ML algorithm is frequently influenced by data preprocessing. One of the most challenging inductive ML problems is the removal of noise instances[20]. Another frequently addressed concern in data preprocessing is the handling of missing data. Well-known data preprocessing methods include data normalization, feature selection, and the training and testing split of the data. Feature selection forms the foundation for ML; it contributes the feature measure or assessment criterion in the data model[21]. Boruta[22] deals with the issue by increasing the system's randomization. The basic concept is pretty straightforward: simply duplicate the system using randomization, combine it with the original, and then develop a classifier for this expanded system[23]. The importance of each original variable is then contrasted with that of the randomized variables to determine the variable's significance in the original system; variables are considered important only if their importance exceeds that of the randomized variables[24].
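A minimal single-pass sketch of this shadow-feature idea is given below; the scikit-learn random forest and the binarized bug labels are our assumptions for illustration, and the actual Boruta package iterates this comparison with statistical tests rather than applying a single cut-off.

# Hypothetical sketch of the Boruta idea: duplicate every feature as a
# shuffled "shadow" copy, fit a classifier on the expanded data, and keep
# only the features whose importance exceeds the best shadow importance.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def boruta_like_selection(X: pd.DataFrame, y, random_state: int = 0):
    rng = np.random.default_rng(random_state)
    shadows = pd.DataFrame(
        {"shadow_" + c: rng.permutation(X[c].values) for c in X.columns},
        index=X.index,
    )
    expanded = pd.concat([X, shadows], axis=1)              # original + randomized system

    forest = RandomForestClassifier(n_estimators=300, random_state=random_state)
    forest.fit(expanded, y)

    importances = pd.Series(forest.feature_importances_, index=expanded.columns)
    threshold = importances[list(shadows.columns)].max()    # best shadow importance
    return [c for c in X.columns if importances[c] > threshold]

# X and y as loaded in the data-acquisition sketch above; bug counts binarized.
important = boruta_like_selection(X, (y > 0).astype(int))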
After implementing Boruta in OPABP on the dataset metrics, which helps in selecting and finding the important variables shown in Fig. 3 to Fig. 9 for all the datasets used in the paper, it is seen that the datasets have different numbers of important variables: in the ANT dataset 13 metrics are selected, in the Camel 1.6 dataset 12 metrics, in the Lucene dataset 16 metrics, in the Poi3 dataset 14 metrics, in the Synapse dataset 10 metrics, in the Tomcat dataset only 7 metrics, and in the Velocity dataset 11 metrics. After the feature selection process, training and testing sets are built by random sampling that keeps the ratio of bugged and not-bugged instances. Training an ML algorithm to predict labels from characteristics, tweaking it for the business need, and verifying it on outlier data are all part of the modeling process. A training and testing ratio of 80:20 has been taken into consideration, which always helps in enhancing the learning procedure.
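One way such a stratified 80:20 split could be realized is sketched below; the use of scikit-learn's train_test_split and the binarization of the bug count are our assumptions, not the paper's stated tooling.

# Hypothetical sketch: random 80:20 train/test split that preserves the
# ratio of bugged vs. not-bugged instances in both partitions.
from sklearn.model_selection import train_test_split

buggy = (y > 0).astype(int)            # 1 = bugged, 0 = not bugged (y from the earlier sketch)
X_train, X_test, y_train, y_test = train_test_split(
    X[important], y,                   # keep only the Boruta-selected metrics
    test_size=0.20,                    # 80:20 training/testing ratio
    stratify=buggy,                    # preserve the bugged / not-bugged proportion
    random_state=42,
)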
C. DATA MODELING
Regression algorithms are used in this research to obtain the expected outcomes from the existing data. The different regression algorithms used in this paper are Neural Network (NN), Support Vector Machine (SVM), Decision Tree (DT) and Cubist, and the results are evaluated with the help of all of these algorithms.
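As an illustration only (the paper does not list its implementation), the sketch below fits three of the four regressors with scikit-learn and reports the metrics named in the abstract; Cubist is noted in a comment because it needs an external implementation (for example R's Cubist package), and all model settings shown are assumptions.

# Hypothetical sketch: fit the regressors named above on the 80:20 split and
# report the performance metrics used in the paper (R, R-squared, RMSE).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_squared_error

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "SVM": SVR(kernel="rbf"),
    "Neural Network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=42),
    # Cubist is not part of scikit-learn; it would require an external
    # implementation such as R's Cubist package.
}

for name, model in models.items():
    model.fit(X_train, y_train)                        # data from the 80:20 split above
    pred = model.predict(X_test)
    r = np.corrcoef(y_test, pred)[0, 1]                # correlation coefficient R
    r2 = r2_score(y_test, pred)                        # coefficient of determination R^2
    rmse = np.sqrt(mean_squared_error(y_test, pred))   # root mean square error
    print(f"{name}: R={r:.3f}  R2={r2:.3f}  RMSE={rmse:.3f}")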
REFERENCES

[1] Y. M. Goh, C. U. Ubeynarayana, K. L. X. Wong, and B. H. W. Guo, “Factors influencing unsafe behaviors: A supervised learning approach,” Accid. Anal. Prev., vol. 118, pp. 77–85, 2018.
[2] C. L. Philip Chen and S. R. LeClair, “Integration of design and manufacturing: solving setup generation and feature sequencing using an unsupervised-learning approach,” Comput. Des., vol. 26, no. 1, pp. 59–75, 1994.
[22] M. B. Kursa and W. R. Rudnicki, “Feature selection with the Boruta package,” J. Stat. Softw., vol. 36, no. 11, pp. 1–13, 2010.
[23] C. Selvaraj, N. Bhalaji, and K. B. Sundhara Kumar, “Empirical study of feature selection methods over classification algorithms,” Int. J. Intell. Syst. Technol. Appl., vol. 17, no. 1/2, p. 98, 2018.
[24] M. B. Kursa, A. Jankowski, and W. R. Rudnicki, “Boruta - A system for feature selection,” Fundam. Informaticae, vol. 101, no. 4, pp. 271–285, 2010.