Where Should the Bugs Be Fixed?
More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports

Jian Zhou and Hongyu Zhang* (Tsinghua University); David Lo (Singapore Management University, [email protected])

* Corresponding author

ICSE 2012: 34th International Conference on Software Engineering, Zurich, June 2-9, 2012, p. 14-24.
Available at: https://ink.library.smu.edu.sg/sis_research/1531
Abstract—For a large and evolving software system, the project team could receive a large number of bug reports. Locating the source code files that need to be changed in order to fix the bugs is a challenging task. Once a bug report is received, it is desirable to automatically point out the files that developers should change in order to fix the bug. In this paper, we propose BugLocator, an information retrieval based method for locating the relevant files for fixing a bug. BugLocator ranks all files based on the textual similarity between the initial bug report and the source code using a revised Vector Space Model (rVSM), taking into consideration information about similar bugs that have been fixed before. We perform large-scale experiments on four open source projects to localize more than 3,000 bugs. The results show that BugLocator can effectively locate the files where the bugs should be fixed. For example, the relevant buggy files for 62.60% of Eclipse 3.1 bugs are ranked in the top ten among 12,863 files. Our experiments also show that BugLocator outperforms existing state-of-the-art bug localization methods.

Keywords-bug localization; information retrieval; feature location; bug reports

I. INTRODUCTION

Software quality is vital for the success of a software project. Although many software quality assurance activities (such as testing, inspection, static checking, etc.) have been proposed to improve software quality, in reality software systems are often shipped with defects (bugs). For a large and evolving software system, the project team could receive a large number of bug reports over a long period of time. For example, around 4414 bugs were reported for the Eclipse project in 2009.

Once a bug report is received and confirmed, the project team should locate the source code files that need to be changed in order to fix the bug. However, it is often costly to manually locate the files to be changed based on the initial bug reports, especially when the numbers of files and reports are large. For a large project consisting of hundreds or even thousands of files, manual bug localization is a painstaking and time-consuming activity. As a result, the bug fix time is often prolonged, maintenance cost is increased, and the customer satisfaction rate is hampered.

In recent years, some researchers have applied information retrieval techniques to automatically search for relevant files based on bug reports [16, 25, 31, 32]. They treat an initial bug report as a query and rank the source code files by their relevance to the query. The developers can then examine the returned files and fix the bug. These methods are information retrieval based bug localization methods. Unlike spectrum-based fault localization techniques [1, 18, 19, 22, 23], information retrieval (IR) based bug localization does not require program execution information (such as passing and failing traces); it locates the bug-relevant files based on initial bug reports.

Many of the existing IR-based bug localization methods are proposed in the context of feature/concept location, using a small number of selected bug reports [16, 24, 31]. For example, Poshyvanyk et al. proposed a feature location method called PROMESIR, which utilizes an information retrieval technique (Latent Semantic Indexing) and a probabilistic ranking technique [31]. They applied their method to locate 3 bugs in Eclipse and 5 bugs in Mozilla. Gay et al. proposed an approach to augment IR-based concept location via an explicit relevance feedback (RF) mechanism [16]. They applied their bug localization approach on 9 bug reports. Recently, Lukins et al. performed a study on applying LDA (Latent Dirichlet Allocation) to search for bug-related methods and files [25]. They used 322 bugs across 25 versions of three projects (Eclipse, Mozilla and Rhino) for the evaluation. In each version, only a small number of bugs were selected (fewer than 20 on average).

Besides the problem of small-scale evaluations, the performance of the existing bug localization methods can be further improved too. For example, using Latent Dirichlet Allocation (LDA), the buggy files for only 22% of Eclipse 3.1 bug reports are ranked in the top 10 [25]. More detailed discussions about the current methods and their limitations are given in the next section.

In this paper, we propose BugLocator, a new method that can automatically search for relevant buggy files based on initial bug reports. We propose a revised Vector Space Model (rVSM) to rank all source code files based on an initial bug report. In rVSM, we take the document length into consideration.
Query Construction: Bug localization considers a bug report as a query, and uses it to search for relevant files in the indexed source code corpus. It extracts tokens from the bug title and description, removes stop words, stems each word, and forms the query.

Retrieval and Ranking: Retrieval and ranking of relevant buggy files is based on the textual similarity between the query and each of the files in the corpus. Various approaches can be used to compute a relevance score for each file in the corpus given an input bug report.
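To make the pipeline concrete, the query-construction step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the stop-word list is tiny and the suffix stripper is a stand-in for a real Porter stemmer.

```python
import re

# Tiny illustrative stop-word list; a real system would use a full one.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "in", "and", "when"}

def crude_stem(word):
    # Naive suffix stripping, a stand-in for Porter stemming.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_query(bug_title, bug_description):
    # Extract tokens from the bug title and description,
    # remove stop words, and stem each remaining word.
    tokens = re.findall(r"[a-z]+", (bug_title + " " + bug_description).lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(build_query("NPE when comparing bindings",
                  "BindingComparator throws a NullPointerException"))
```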
C. Information Retrieval Models Used in Existing Bug Localization Methods

Many bug localization approaches have been proposed. These approaches mainly differ in the retrieval and ranking of the results. There are many retrieval and ranking models that have been used in prior studies on IR-based bug localization. Due to space constraints, we briefly describe some important ones here:

SUM: The Smoothed Unigram Model (SUM) is a statistical model that fits a single multinomial distribution to the frequencies of words in each file in the corpus [27]. The unigram model (UM) derived directly from the word frequency counts may have some problems, especially when confronted with words that have not explicitly been seen before: the probabilities of those unseen words are zero. SUM smoothes the probability distributions by assigning non-zero probabilities to the unseen words [17, 36]. SUM was used for bug localization by Rao and Kak [32] and was found to be the best performing model.

LDA: Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora [11]. It is a Bayesian model, which extracts latent topics from a collection of documents. Each topic is a collection of tokens with attached probabilities, and each document is represented by a probabilistic mixture of topics. It was used by Lukins et al. [25] for bug localization.
LSI: Latent Semantic Indexing (LSI) [12], also called latent semantic analysis (LSA), is an indexing and retrieval method that can identify the relationships between the terms and concepts contained in an unstructured collection of text by using mathematical techniques such as Singular Value Decomposition (SVD). This method was used by Poshyvanyk et al. for bug localization [30, 31].

• Small-scale experiments: Many of the existing static bug localization methods only used a small number of selected bug reports in their evaluation.

III. THE PROPOSED APPROACH

A. Analysis of the Bug Localization Problem

To improve bug localization performance, we leverage the following observations:

Source code files: A project's source code repository contains source code files. As illustrated in Figure 1, source code files may contain words that are similar to those occurring in the bug reports. Therefore, analyzing source code files can help determine the locations where the bug has an impact, i.e., the buggy files.

Similar bugs: Once a new bug report is received, we can examine similar bugs that were reported and fixed before. The information on the locations where past similar bugs were fixed could help us locate the relevant files for the new bug.

Software size: When two source files have similar scores, we need to determine which one should be ranked higher. From our experience in software defect prediction [37] and from other people's work on the quantitative analysis of fault distributions [14, 29], we know that, statistically, larger files are more likely to contain bugs. Therefore, for bug localization we need to assign higher scores to larger files.

The source code file information has been used by existing bug localization methods [16, 25, 31, 32]. However, to the best of our knowledge, the information about similar bugs and software sizes has not been well utilized. During the design of our approach, we take this information into consideration.

[Figure 2: overview of BugLocator. Source Code Files → Indexing → Index; New Bug Report → Query Construction → Query; Retrieval & Ranking with rVSM and File Size → Ranked Files (rVSMRank).]
bugs that have been fixed before, and rank the relevant files by analyzing past similar bugs and their fixes. Finally, we combine the ranks obtained from the query on source code files and from the analysis of past similar bugs, and return to the users the combined ranks. The users can then examine the returned files in descending order to locate the bug. We describe the detailed procedures in the following subsections.

C. Ranking Based on Source Code Files

We consider source code files as a text corpus, and the initial bug report as a query. We can then apply information retrieval techniques to create a model for searching source code files based on the bug report. The similarity between each file and the bug report is computed. The files are then ranked by the similarity values and returned as output.

We propose a revised Vector Space Model (rVSM) to index and rank the source code files. In a classic VSM, the relevance score between a document d and a query q is computed as the cosine similarity between their corresponding vector representations:

    Similarity(q, d) = \cos(q, d) = \frac{V_q \cdot V_d}{|V_q| \, |V_d|}    (1)

where V_d and V_q are the vectors of term weights for the document d and the query q, respectively, and V_q \cdot V_d represents the inner product of the two vectors. The term weight w is computed based on the term frequency (tf) and the inverse document frequency (idf). The basic idea is that the weight of a term in a document increases with its occurrence frequency in this specific document and decreases with its occurrence frequency in other documents. In classic VSM, tf and idf are defined as follows:

    tf(t, d) = \frac{f_{td}}{\#terms}, \quad idf(t) = \log\left(\frac{\#docs}{n_t}\right)    (2)

where f_{td} refers to the number of occurrences of a term t in document d, n_t refers to the number of documents that contain the term t, \#terms represents the total number of terms in document d, and \#docs represents the total number of documents in the corpus. Over the years, many variants of tf(t, d) have been proposed to improve the performance of the VSM model [26]. These include logarithm, augmented, and Boolean variants of the classic VSM. It is observed that the logarithm variant can lead to better performance [8, 13]:

    tf(t, d) = \log(f_{td}) + 1    (3)

In rVSM, we use Equation (3) to define tf. Thus, in Equation (1), each term weight w in the document vector V_d and its norm |V_d| are calculated as follows:

    w_{t,d} = tf_{td} \times idf_t = (\log f_{td} + 1) \times \log\left(\frac{\#docs}{n_t}\right)

    |V_d| = \sqrt{\sum_{t \in d} \left((\log f_{td} + 1) \times \log\frac{\#docs}{n_t}\right)^2}    (4)

In a similar way, we obtain the vector of term weights for the query, V_q, and its norm |V_q|.
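For illustration, Equations (3) and (4) translate directly into code. The sketch below assumes hypothetical inputs: doc_tokens (the token list of one file), doc_freq (term → number of documents containing it) and num_docs (#docs).

```python
import math
from collections import Counter

def term_weights(doc_tokens, doc_freq, num_docs):
    # Equation (4): w_{t,d} = (log f_td + 1) * log(#docs / n_t),
    # using the logarithmic tf variant of Equation (3).
    counts = Counter(doc_tokens)
    return {term: (math.log(f_td) + 1) * math.log(num_docs / doc_freq[term])
            for term, f_td in counts.items()}

def vector_norm(weights):
    # |V_d| in Equation (4): the Euclidean norm of the weight vector.
    return math.sqrt(sum(w * w for w in weights.values()))
```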
Classical VSM favours small documents during ranking: long documents are often poorly represented because they have poor similarity values [15]. According to previous studies [14, 29, 37], larger source code files tend to have a higher probability of containing a bug. Therefore, we should rank larger files higher in the case of bug localization. We thus define a function g (Equation 5) to model the document length in rVSM:

    g(\#terms) = \frac{1}{1 + e^{-N(\#terms)}}    (5)

Equation (5) is a logistic function (i.e., an inverse logit function) that ensures that larger documents are given higher scores during ranking. We use Equation (5) to compute the length value for each source file according to the number of terms the file contains.

In Equation (5), we use the normalized value of \#terms as the input to the exponential function e^{-x}. The normalization function is defined as follows: suppose that X is a set of data, and x_{max} and x_{min} are the maximum and minimum data in X; then the normalized value of any x in X is:

    N(x) = \frac{x - x_{min}}{x_{max} - x_{min}}    (6)
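Equations (5) and (6) are a one-line function each; in the sketch below, the guard for a degenerate range is our addition, not part of Equation (6).

```python
import math

def normalize(x, x_min, x_max):
    # Equation (6): min-max normalization of x over the data set X.
    if x_max == x_min:
        return 0.0  # degenerate-range guard (our addition)
    return (x - x_min) / (x_max - x_min)

def length_score(num_terms, min_terms, max_terms):
    # Equation (5): logistic function of the normalized #terms,
    # so that larger files receive higher scores.
    return 1.0 / (1.0 + math.exp(-normalize(num_terms, min_terms, max_terms)))
```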
Combining the above analysis, we thus propose a new scoring algorithm for rVSM as follows:

    rVSMScore(q, d) = g(\#terms) \times \cos(q, d)
                    = \frac{1}{1 + e^{-N(\#terms)}} \times \frac{\sum_{t \in q \cap d} (\log f_{tq} + 1) \times (\log f_{td} + 1) \times \left(\log\frac{\#docs}{n_t}\right)^2}{\sqrt{\sum_{t \in q} \left((\log f_{tq} + 1) \log\frac{\#docs}{n_t}\right)^2} \times \sqrt{\sum_{t \in d} \left((\log f_{td} + 1) \log\frac{\#docs}{n_t}\right)^2}}    (7)

Given a bug report, we use Equation (7) to determine the relevance score (rVSMScore) between each source code file and the bug report. A ranked list (rVSMRank) can be obtained according to the scores (the first returned result has the highest score).
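In code, Equation (7) is just the cosine similarity of Equation (1), computed with the Equation (4) weights and scaled by the length score of Equation (5). A sketch that reuses the illustrative term_weights, vector_norm and length_score helpers from the previous snippets:

```python
def rvsm_score(query_weights, doc_weights, num_terms, min_terms, max_terms):
    # Equation (7): rVSMScore(q, d) = g(#terms) * cos(q, d).
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    denom = vector_norm(query_weights) * vector_norm(doc_weights)
    cosine = dot / denom if denom else 0.0
    return length_score(num_terms, min_terms, max_terms) * cosine
```

Sorting all files by rvsm_score in descending order then yields rVSMRank.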
D. Ranking Based on Similar Bugs

For a new bug report, we also examine similar bugs that have been fixed before in order to adjust the rankings of the relevant files. The assumption here is that similar bugs tend to fix similar files. We propose a method for ranking relevant files based on similar bugs as follows:

We first construct a three-layer heterogeneous graph, as shown in Figure 3. The top layer (layer 1) contains one node representing a newly reported bug B. The second layer
contains nodes representing previously fixed bugs S that are similar to B. In our approach, we do not enforce a similarity threshold: a link between B and a bug in layer 2 indicates that there is a non-zero similarity value between their bug reports. The third layer contains nodes representing all source code files F. If a bug in layer 2 is fixed in a file in layer 3, a link between them is established, indicating that the bug has an impact on the file.

[Figure 3. Heterogeneous Bug-File graph: layer 1 holds B (the bug to be located); layer 2 holds S (all similar bugs of B), where a link represents the similarity between Si and B; layer 3 holds F (the source code files), where a link indicates the impact of a bug on a file.]

The weight of each node in layer 2 (Si) represents the degree of similarity between Si and the newly reported bug B. This similarity is computed by Equation (1). The weight of each node in layer 3 (Fj) represents the degree of relevance between a source code file and the bug B, which is computed as follows:

    SimiScore(F_j) = \sum_{\text{all } S_i \text{ that connect to } F_j} \frac{Similarity(B, S_i)}{n_i}    (8)

where S_i is a node in layer 2 that connects to F_j, and n_i is the total number of connections S_i has to layer 3 (i.e., the number of files that were modified to fix the bug S_i).

After computing the SimiScore for each file using Equation (8), we then rank all files based on the SimiScore values and obtain a ranking of relevant files, SimiRank.
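Equation (8) can be sketched as follows. The inputs are hypothetical: similar_bugs maps each previously fixed bug S_i to Similarity(B, S_i) (computed with Equation (1)), and fixed_files maps each bug to the files changed by its fix.

```python
from collections import defaultdict

def simi_scores(similar_bugs, fixed_files):
    # Equation (8): SimiScore(F_j) = sum over similar bugs S_i touching F_j
    # of Similarity(B, S_i) / n_i.
    scores = defaultdict(float)
    for bug, similarity in similar_bugs.items():
        files = fixed_files[bug]
        n_i = len(files)  # number of files modified to fix bug S_i
        for f in files:
            scores[f] += similarity / n_i
    return dict(scores)
```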
E. Combining Ranks

Having computed the scores obtained from querying source code files (rVSMScore) and from similar bug analysis (SimiScore), we then combine these two scores for each file as follows:

    FinalScore = (1 - \alpha) \times N(rVSMScore) + \alpha \times N(SimiScore)    (9)

where α is a weighting factor and 0 ≤ α ≤ 1. The FinalScore is a weighted sum of rVSMScore and SimiScore. The source code files, ranked by FinalScore in descending order, are returned to users (FinalRank). Files that are ranked higher are the more relevant ones, i.e., more likely to contain the newly reported bug B.

Before we combine rVSMScore and SimiScore, we normalize them to the range of 0 to 1, using the normalization function defined in Equation (6). The parameter α adjusts the weights of the two rankings. The value of α can be set empirically; our experience shows that when α is between 0.2 and 0.3, the proposed method performs best.
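A sketch of the combination in Equation (9), reusing the illustrative normalize helper from the Equation (6) snippet; the default alpha = 0.3 reflects the empirically good range reported above.

```python
def final_rank(rvsm, simi, alpha=0.3):
    # Equation (9): FinalScore = (1 - alpha) * N(rVSMScore) + alpha * N(SimiScore).
    # Files missing from one score map receive that map's minimum.
    r_min, r_max = min(rvsm.values()), max(rvsm.values())
    s_min, s_max = min(simi.values()), max(simi.values())
    scores = {}
    for f in set(rvsm) | set(simi):
        r = normalize(rvsm.get(f, r_min), r_min, r_max)
        s = normalize(simi.get(f, s_min), s_min, s_max)
        scores[f] = (1 - alpha) * r + alpha * s
    # FinalRank: files sorted by FinalScore in descending order.
    return sorted(scores, key=scores.get, reverse=True)
```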
IV. EXPERIMENTAL SETUP

A. Subject Systems

To evaluate the effectiveness of BugLocator, we use four open source projects, as shown in Table I. All projects have a complete bug database and change history, and have different numbers of bugs and source code files. We choose Eclipse³ in our evaluation because it is a well-known large-scale open source system and is widely used in empirical software engineering research. The AspectJ project is a part of the iBUGs public dataset provided by the University of Saarland⁴ [9, 10]. It is also the subject used for evaluating various IR models for bug localization [32]. Both Eclipse and AspectJ use the Bugzilla bug tracking system and the CVS/SVN version control system. We also investigate the SWT⁵ component of Eclipse, to evaluate the bug localization performance at the subproject level. To further evaluate the generality of our approach, we choose an Android project, ZXing⁶, which is maintained by Google's bug tracking system and version control system.

TABLE I. THE STUDIED PROJECTS

Project | Description | Study Period | #Fixed Bugs | #Source Files
Eclipse (v3.1) | An open development platform for Java | Oct 2004 - Mar 2011 | 3075 | 12863
SWT (v3.1) | An open source widget toolkit for Java | Oct 2004 - Apr 2010 | 98 | 484
AspectJ | An aspect-oriented extension to the Java programming language | Jul 2002 - Oct 2006 | 286 | 6485
ZXing | A barcode image processing library for Android applications | Mar 2010 - Sep 2010 | 20 | 391

³ http://www.eclipse.org
⁴ http://www.st.cs.uni-saarland.de/ibugs/
⁵ http://www.eclipse.org/swt/
⁶ http://code.google.com/p/zxing/

B. Data Collection

For each subject system, we collect its initial bug reports from the bug tracking system (such as Bugzilla). To evaluate the bug localization performance, we only collect the bug reports of fixed bugs.

To establish the links between bug reports and source code files, we adopt the traditional heuristics proposed by Bachmann and Bernstein [5]:

1) Scan through the change logs for bug IDs in a given format (e.g., "issue 681", "bug 239", and so on).
2) Exclude all false-positive bug numbers (e.g., "r420", "2009-05-07 10:47:39 -0400", and so on).
3) Check if there are other potential bug number formats or false-positive number formats; add the new formats and scan the change logs iteratively.
4) Check if the potential bug numbers exist in the bug-tracking database with their status marked as fixed.

Based on these heuristics, we mine the source code repository (such as CVS and SVN) for links between source code files and bug reports.
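The mining heuristics above can be sketched as follows; the regular expression is illustrative, not the exact pattern set used in the study.

```python
import re

BUG_ID = re.compile(r"\b(?:bug|issue)\s*#?\s*(\d+)\b", re.IGNORECASE)

def link_commits_to_bugs(change_logs, fixed_bug_ids):
    # Heuristics 1), 2) and 4): scan change logs for bug IDs and keep only
    # numbers that exist in the bug-tracking database with status fixed
    # (which also filters out revision numbers, dates and other
    # false positives).
    links = {}
    for i, log in enumerate(change_logs):
        ids = {int(m) for m in BUG_ID.findall(log)}
        confirmed = sorted(ids & fixed_bug_ids)
        if confirmed:
            links[i] = confirmed
    return links

print(link_commits_to_bugs(["Fix bug 239: NPE in comparator", "r420 cleanup"],
                           {239, 681}))  # -> {0: [239]}
```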
C. Research Questions

Our experiments are designed to address the following research questions:

RQ1: How many bugs can be successfully located by BugLocator?
To answer this question, we run BugLocator on the four subject systems described in Section IV.A. For each bug report, we first obtain the relevant files that have been modified to fix the bug, using the method described in Section IV.B. We then check the ranks of these files in the query results returned by BugLocator. If the files are ranked in the top 1, top 5 or top 10, we consider the bug report to have been effectively localized. We perform the experiment for all bug reports and calculate the percentage of bugs that have been successfully located. We also compute the Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) measures (described in Section IV.D) to further evaluate bug localization performance.

RQ2: Does the revised Vector Space Model (rVSM) improve the bug localization performance?
In Section III, we propose rVSM, a revised vector space model (Equation 7), for retrieving relevant files from a source code repository. rVSM adjusts the ranks of large files and incorporates a more effective term-frequency variant. To evaluate the effectiveness of rVSM, we perform bug localization on the subject systems using the classic and the revised VSM, and compare the results.

RQ3: Does the consideration of similar bugs improve the bug localization performance?
In Section III, we propose to use similar bugs to adjust the ranks obtained by rVSM. To evaluate the usefulness of the proposed similar bug analysis, we perform bug localization on the four subject systems with and without the rankings learned from past similar bugs. Furthermore, according to Equation (9), the parameter α adjusts the weights of the two rankings. When α = 0, the final rank is only dependent on the queries of source code files. When the value of α is between 0 and 1, the final rank is a combination of the two ranking results. In our experiments, we also evaluate the effect of different α values.

RQ4: Can BugLocator outperform other bug localization methods?
Bug localization has attracted much research interest in recent years. In our experiments, we compare BugLocator to bug localization methods implemented using the following IR techniques:

• LDA, which was used by Lukins et al. [25] for bug localization. Following their LDA configuration, in our experiment, for AspectJ, SWT and ZXing we set K (the number of topics) to 100, α (the hyper-parameter for the per-document topic distribution) to 0.5 (this is the default value computed by the standard formula 50/K), and β (the hyper-parameter for the per-topic word distribution) to 0.1. For Eclipse, as it is a large system consisting of many files, we set K to 500, α to 0.1 and β to 0.1, which can lead to better performance. We use JGibbLDA⁷, an open source tool written in Java, to implement the LDA model.
• SUM, which was used by Rao and Kak [32] for bug localization. In their study, SUM is shown to be the best IR model for bug localization, outperforming sophisticated models like LDA and LSI.
• VSM, which was also used by Rao and Kak [32] for bug localization. In their study, VSM was the second best IR approach for bug localization.
• LSI, which was used by Poshyvanyk et al. [30, 31] for bug localization. Previous experiments [25, 32] show that the performance of SUM, VSM or LDA is better than that of LSI.

⁷ http://jgibblda.sourceforge.net

Following Rao and Kak [32], we use the KL divergence [21] to compute the similarity measures for LDA and SUM. We use the cosine similarity measure for VSM and LSI.

D. Evaluation Metrics

To measure the effectiveness of the proposed bug localization method, we use the following metrics:

• Top N Rank, which is the number of bugs whose associated files are ranked in the top N (N = 1, 5, 10) of the returned results. Given a bug report, if the top N query results contain at least one file at which the bug should be fixed, we consider the bug located. The higher the metric value, the better the bug localization performance.
• MRR (Mean Reciprocal Rank), which is a statistic for evaluating a process that produces a list of possible responses to a query [34]. The reciprocal rank of a query is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results of a set of queries Q:

    MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}    (10)

The higher the MRR value, the better the bug localization performance.
• MAP (Mean Average Precision), which provides a single-figure measure of quality of information retrieval [26] when a query may have multiple relevant documents. The Average Precision of a single query (AvgP) is the average of the precision values obtained for the query, which is computed as follows:

    AvgP = \sum_{j=1}^{M} \frac{P(j) \times pos(j)}{\text{number of positive instances}}    (11)

where j is the rank, M is the number of instances retrieved, and pos(j) indicates whether the instance at rank j is relevant or not. P(j) is the precision at the given cut-off rank j and is defined as follows:
    P(j) = \frac{\text{number of positive instances in top } j \text{ positions}}{j}    (12)

Then the MAP for a set of queries is the mean of the average precision values of all queries. In bug localization, a bug may be relevant to multiple files. We use MAP to measure the average performance of BugLocator for locating all relevant files. The higher the MAP value, the better the bug localization performance.
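All three metrics can be computed from the per-bug ranks of the relevant files. In the sketch below, each element of results is the ascending list of 1-based ranks at which one bug's relevant files appear:

```python
def top_n(results, n):
    # Top N Rank: bugs whose best-ranked relevant file is within the top n.
    return sum(1 for ranks in results if ranks[0] <= n)

def mrr(results):
    # Equation (10): mean of 1 / (rank of the first relevant file).
    return sum(1.0 / ranks[0] for ranks in results) / len(results)

def mean_average_precision(results):
    # Equations (11)-(12): the j-th relevant file (0-based j), found at
    # rank r, contributes precision (j + 1) / r; AvgP averages these per
    # bug, and MAP averages AvgP over all bugs.
    avg_ps = [sum((j + 1) / r for j, r in enumerate(ranks)) / len(ranks)
              for ranks in results]
    return sum(avg_ps) / len(avg_ps)

results = [[1, 4], [3], [2, 5, 9]]
print(top_n(results, 10), round(mrr(results), 2),
      round(mean_average_precision(results), 2))  # -> 3 0.61 0.5
```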
V. EXPERIMENTAL RESULTS

A. Experimental Results for Research Questions

For ZXing, BugLocator achieves similarly good performance. The percentages of bugs whose relevant files are ranked in the top 1, top 5, and top 10 are 40%, 60%, and 70%, respectively.

In summary, the experimental results show that BugLocator can help locate a large percentage of bugs by examining a small number of source files.

[Figure: comparison of BugLocator and SUM (y-axis: Percentage).]
RQ3: Does the consideration of similar bugs improve the bug localization performance?

Table IV below shows the experimental results of bug localization without using information from similar bugs (i.e., the weighting factor α is 0). Comparing Table II and Table IV, we can see that the information of similar bugs can indeed improve the bug localization performance. For example, for the Eclipse project, utilizing similar bugs we can locate relevant source files at top 1 for 896 bugs (29.14%), and within the top 10 for 1925 bugs (62.60%). The MRR and MAP values are 0.41 and 0.30, respectively. Without considering similar bugs, only 749 bugs (24.36%) have their relevant files ranked at top 1, and 1719 bugs (55.90%) have their relevant files ranked within the top 10. The MRR and MAP values are only 0.35 and 0.26, respectively. Similar results are observed for the other projects as well.

TABLE IV. THE PERFORMANCE OF BUG LOCALIZATION WITHOUT USING SIMILAR BUGS

System | Top 1 | Top 5 | Top 10 | MRR | MAP
ZXing | 8 (40%) | 11 (55%) | 14 (70%) | 0.48 | 0.41
SWT | 31 (31.63%) | 64 (65.31%) | 76 (77.55%) | 0.47 | 0.40
AspectJ | 65 (22.73%) | 117 (40.91%) | 159 (55.59%) | 0.33 | 0.17
Eclipse | 749 (24.36%) | 1419 (46.15%) | 1719 (55.90%) | 0.35 | 0.26

We also evaluate the impact of similar bug information on bug localization performance with different α values. We find that, at the beginning, the bug localization performance increases as the α value increases. However, after a certain point, a further increase of the α value decreases the performance. As an example, Figure 5 below shows the bug localization performance (measured in terms of MAP and MRR) for the Eclipse project. When the α value increases from 0 to 0.3, both the MAP and MRR values increase. Increasing the α value further from 0.4 to 0.9, however, leads to lower performance. When α is between 0.2 and 0.3, we obtain the best bug localization performance.

[Figure 5: MRR (left) and MAP (right) for eclipse, aspectj, swt and zxing as α varies from 0 to 0.9.]

RQ4: Can BugLocator outperform other bug localization methods?

We implement bug localization methods using the VSM, LDA, SUM and LSI models and perform experiments on all subject systems. We then compare the performance of BugLocator with the related methods. Figure 6 shows the percentage of bugs that can be located in the top 1 and top 10 returned files. Clearly, BugLocator outperforms all the other methods. For example, using BugLocator we can locate 29.14% of Eclipse bugs in the first returned (top 1) files, while using the VSM, LDA, SUM and LSI models, we can only locate 6.86%, 0.32%, 1.72% and 4.23% of Eclipse bugs in the first returned files, respectively. BugLocator also outperforms the other models when the performance is measured in terms of MAP and MRR. For example, for ZXing, the MAP and MRR values are 0.44 and 0.50 respectively, which are much higher than those of the second best model (i.e., SUM), whose MAP and MRR values are 0.30 and 0.37, respectively. Detailed results are omitted due to space constraints. T-tests at the 95% confidence level confirm that our method statistically significantly outperforms the others.

B. Discussions of the Results

1) Why does the proposed rVSM method work?

Our experimental results described in the previous section show that the proposed rVSM performs better than the classical VSM when used for bug localization. In this section, we discuss why the proposed rVSM can achieve better performance.

The differences between rVSM and VSM are in Equations (4) and (5). Equation (4) uses the logarithm of the original tf value. This is because terms with high frequency may have a negative impact on information retrieval performance. It is often not the case that the importance of a term is proportional to its occurrence frequency. The logarithm variant of tf can help smooth the impact of the highly frequent terms [8, 13].

Equation (5) adjusts the ranking results based on file sizes. This is based on the findings of our earlier study [37] that larger files tend to be more defect-prone than smaller files.
[Figure 6: percentage of bugs located in the Top 1 (left) and Top 10 (right) returned files for BugLocator, VSM, LDA, SUM and LSI on Eclipse, AspectJ, SWT and ZXing.]
In [37], we found that a small number of the largest files account for a large proportion of the defects. For example, in Eclipse 3.0, the 20% largest files are responsible for 62.29% of the pre-release defects and 60.62% of the post-release defects. A similar phenomenon is also observed by many others, including Ostrand et al. [29]. They studied the "ranking ability" of LOC for two industrial systems and found that the 20% largest files contain 73% and 74% of the bugs for the two systems. In summary, the empirical studies confirm that by ranking larger files higher we can locate more bugs.

Equation (5) uses a logistic function g to adjust the ranking results. We also experiment with other length functions, including linear, square root and exponential functions (Table V); these functions weight files of different sizes differently. Our experimental results (Table V) show that the logistic function achieves the best overall MAP and MRR values, outperforming the other length functions.

TABLE V. THE COMPARISON OF DIFFERENT LENGTH FUNCTIONS

Length function g | Expression | MAP | MRR
Logistic | f(x) = 1 / (1 + e^{-x}) | 0.26 | 0.35
Exponential | f(x) = e^x - 0.5 | 0.25 | 0.33
Square root | f(x) = 1 - 1/\sqrt{x} | 0.19 | 0.26
Linear | f(x) = x/2 + 0.5 | 0.25 | 0.34
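For reference, the four candidate length functions in Table V transcribe directly into code (our transcription of the Expression column; the exact input scaling for each variant is not spelled out in the table):

```python
import math

# Candidate length functions g from Table V (x assumed positive).
LENGTH_FUNCTIONS = {
    "logistic":    lambda x: 1.0 / (1.0 + math.exp(-x)),
    "exponential": lambda x: math.exp(x) - 0.5,
    "square root": lambda x: 1.0 - 1.0 / math.sqrt(x),
    "linear":      lambda x: x / 2.0 + 0.5,
}
```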
2) Why can similar bugs help improve bug localization performance?

We also explore why similar bugs can improve bug localization. We find that, for many bugs, the associated files overlap with the associated files of their similar bugs. For example, in Eclipse, 1207 (39.3%) bugs have at least one relevant file that is common to the files of their top 10 most similar bugs. For 602 (19.6%) bugs, all their relevant files are covered by their top 10 most similar bugs. These results suggest that similar bugs can improve the bug localization performance.

The analysis of similar bugs becomes more important when the textual similarity between bug reports and source code is low. As an example, the Eclipse bug 89014, reported on March 24, 2005, was fixed in the file BindingComparator.java. Using rVSM, the relevant file BindingComparator.java is ranked only 2527, because the textual similarity between the source code and the bug report is low. However, the analysis of similar bugs found that this bug is actually similar to the previously fixed bugs 83817, 79609 and 79544, which all introduced bug-fixing changes to the file BindingComparator.java. Therefore, BugLocator combines the scores obtained from rVSM and similar bug analysis based on Equation (9), and the final rank of the file BindingComparator.java becomes 7.

3) The percentage of code to be examined for bug localization

Our experimental results reported in the previous sections only evaluate the performance of bug localization in terms of the number of relevant files retrieved. In practice, developers are also interested in the actual lines of code that need to be examined in order to locate a bug. This is of particular concern as the proposed rVSM model ranks larger source files higher via the length function defined in Equation (5).

We perform further experiments to evaluate how many lines of code are required to be examined in order to locate the bugs. For each bug, we count the number of files to be examined before locating the bug, and compute the lines of code for each file. The results show that BugLocator is still effective when its performance is measured in terms of lines of code to be examined. For example, by examining 1% of the lines of code, BugLocator can locate nearly 80% of the bugs in Eclipse and 60% of the bugs in AspectJ.

BugLocator can also locate more bugs than SUM when the same number of lines of code is examined. For Eclipse, using BugLocator we can locate more than 95% of the bugs by examining 10% of the code, while using SUM (the best performing method described in [32]) we can only locate
about 81% of the bugs by examining the same amount of code. For the other systems, we obtain similar results.

VI. THREATS TO VALIDITY

There are potential threats to the validity of our work:

• All datasets used in our experiments are collected from open source projects. The nature of the data in open source projects may be different from that in projects developed by well-managed software organizations. We need to evaluate whether our solution can be directly applied to commercial projects. We leave this as future work.
• A limitation of our approach is that we rely on good programming practices in naming variables, methods and classes. If a developer uses non-meaningful names, the performance of bug localization would be affected. However, in our experiments we notice that in most well-managed projects, developers generally follow good naming conventions.
• Bug reports provide crucial information for developers to fix bugs. A "bad" bug report could cause a delay in bug fixing. Our approach also relies on the quality of bug reports. If a bug report does not provide enough information, or provides misleading information, the performance of BugLocator is adversely affected.

VII. RELATED WORK

Bug fixing is an important but still costly activity in software development. Spectrum-based fault localization techniques [1, 18, 19, 22, 23] can help developers locate faults by examining a small portion of code. These techniques usually contrast the program spectra information (such as execution statistics) between passed and failed executions to compute the fault suspiciousness of individual program elements (such as statements, branches, and predicates), and rank these program elements by their fault suspiciousness. Developers may then locate faults by examining a list of program elements sorted by their suspiciousness. Examples of spectrum-based fault localization techniques include Tarantula [18, 19], Jaccard and Ochiai [1]. The spectrum-based fault localization techniques require program runtime execution traces. Our approach is based on querying bug reports against the source code repository, which does not require the collection of passing and failing execution traces. There are also other techniques that help developers automatically locate bugs, such as delta debugging [39] and dynamic slicing [38]. Unlike these techniques, our approach is a static approach, which does not require the execution of the programs.

In recent years, many information retrieval based bug localization methods have been proposed [25, 28, 32]. As described in the previous sections, BugLocator performs better than the related methods because of the utilization of rVSM and similar bug information. This area of work is also closely related to feature/concept location [2, 3, 24, 28], which is about identifying the parts of the source code that correspond to a specific functionality. The results can be used as starting points in change impact analysis. The problem of locating bug-related code could also be treated as a feature/concept location problem. Poshyvanyk et al. [31] presented a feature location method called PROMESIR, which combines results from both dynamic analysis and information retrieval. They applied PROMESIR to locate 8 bugs in the Mozilla and Eclipse systems. Gay et al. [16] also proposed a concept location approach that augments information retrieval based concept location via an explicit relevance feedback mechanism. They evaluated their approach using 7 Eclipse bug reports. Our approach is dedicated to bug localization. We perform large-scale evaluations using more than 3000 bug reports from four different systems. Unlike the work described in [31] and [16], we do not require program execution or user interaction.

Our work is also related to research on mining software repositories. The existence of a large amount of data stored in bug tracking systems provides many opportunities for automated software quality analysis and improvement. Many researchers mine bug report information to solve software engineering problems such as duplicate bug detection [33, 35], automatic bug triage [4, 7], bug report quality analysis [6, 7], and defect prediction [20, 37]. Because of the large number of bugs, such problems cannot be effectively solved by manual effort. In our approach, we utilize bug report information to automatically locate buggy files.

VIII. CONCLUSIONS

Once a new bug report comes in, developers need to know which files should be modified in order to fix the bug. For a large software project, they may need to examine a large number of source code files to locate the bug, which could be a tedious and costly task. In this paper, we have proposed an IR-based method named BugLocator for locating relevant source code files based on initial bug reports. BugLocator utilizes a revised Vector Space Model (rVSM) as well as similar bug information. The evaluation results on four real-world open source projects show that BugLocator can perform bug localization effectively. The results also show that BugLocator outperforms existing methods such as those based on VSM, LDA, LSI, and SUM.

In the future, we will explore whether program execution information can be integrated into our approach to help further improve bug localization performance. We will also apply BugLocator to industrial projects to evaluate its effectiveness in practice.

Our tool and the experimental data are available at: http://code.google.com/p/bugcenter

ACKNOWLEDGMENT

This work is supported by NSFC grant 61073006 and Tsinghua University project 2010THZ0. We thank Rongxin Wu and Aihui Zhou for helping with data collection.

REFERENCES

[1] R. Abreu, P. Zoeteweij, R. Golsteijn, and A. van Gemund, A practical evaluation of spectrum-based fault localization. Journal of Systems and Software, 82(11), p. 1780-1792, 2009.
[2] G. Antoniol, G. Canfora, G. Casazza, and A. Lucia, Identifying the Starting Impact Set of a Maintenance Request: A Case Study, Proc. Fourth European Conf. Software Maintenance and Reeng. (CSMR '00), Zurich, Switzerland, p. 227-231, March 2000.
[3] G. Antoniol and Y. Guéhéneuc, Feature Identification: A Novel Approach and a Case Study, Proc. 21st IEEE Int'l Conf. Software Maintenance (ICSM '05), Budapest, Hungary, p. 357-366, September 2005.
[4] J. Anvik, L. Hiew, and G. C. Murphy, Who should fix this bug? In ICSE '06: Proceedings of the 28th International Conference on Software Engineering, p. 361-370, Shanghai, China, May 2006.
[5] A. Bachmann and A. Bernstein, Software process data quality and characteristics: a historical view on open and closed source projects. In IWPSE-Evol '09: Proceedings of the joint international and annual ERCIM workshops on Principles of software evolution (IWPSE) and software evolution (Evol) workshops, ACM, 2009.
[6] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann, What makes a good bug report? In Proceedings of the 16th International Symposium on Foundations of Software Engineering, Atlanta, GA, November 2008.
[7] N. Bettenburg, R. Premraj, T. Zimmermann, and S. Kim, Duplicate bug reports considered harmful... really? In Proceedings of the 24th IEEE International Conference on Software Maintenance, Beijing, China, September 2008.
[8] W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2010.
[9] V. Dallmeier and T. Zimmermann, Extraction of Bug Localization Benchmarks from History. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), Atlanta, Georgia, USA, November 2007.
[10] V. Dallmeier and T. Zimmermann, Automatic Extraction of Bug Localization Benchmarks from History. Technical Report, Saarland University, June 2007.
[11] D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, vol. 3, p. 993-1022, 2003.
[12] S. Deerwester et al., Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, p. 36-40, 1988.
[13] S. T. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments, and Computers, Psychonomic Society, p. 229-236, 1991.
[14] N. Fenton and N. Ohlsson, Quantitative Analysis of Faults and Failures in a Complex Software System, IEEE Trans. Software Eng., 26(8), p. 797-814, 2000.
[15] E. Garcia, Description, Advantages and Limitations of the Classic Vector Space Model, Oct 2006, available at: http://www.miislita.com/term-vector/term-vector-3.html
[16] G. Gay, S. Haiduc, A. Marcus, and T. Menzies, On the use of relevance feedback in IR-based concept location, Proc. of the 25th IEEE International Conference on Software Maintenance, Edmonton, Alberta, Canada, p. 351-360, September 2009.
[17] I. J. Good, The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4), p. 237-264, 1953.
[18] J. A. Jones and M. J. Harrold, Empirical evaluation of the Tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE 2005), Long Beach, California, p. 273-282, 2005.
[19] J. A. Jones, M. J. Harrold, and J. Stasko, Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering (ICSE 2002), Orlando, Florida, USA, ACM Press, p. 467-477, May 2002.
[20] S. Kim, T. Zimmermann, E. Whitehead Jr., and A. Zeller, Predicting Faults from Cached History, Proc. ICSE '07, Minneapolis, USA, May 2007.
[21] S. Kullback, K. P. Burnham, N. F. Laubscher, G. E. Dallal, L. Wilkinson, D. F. Morrison, M. W. Loyer, B. Eisenberg, et al., Letter to the Editor: The Kullback-Leibler distance. The American Statistician, 41(4), p. 340-341, 1987.
[22] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. Jordan, Scalable statistical bug isolation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2005), Chicago, IL, USA, p. 15-26, June 2005.
[23] C. Liu, L. Fei, X. Yan, S. P. Midkiff, and J. Han, Statistical debugging: a hypothesis testing-based approach. IEEE Transactions on Software Engineering, 32(10), p. 831-848, 2006.
[24] D. Liu, A. Marcus, D. Poshyvanyk, and V. Rajlich, Feature Location via Information Retrieval based Filtering of a Single Scenario Execution Trace. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), Atlanta, Georgia, November 5-9, p. 234-243, 2007.
[25] S. Lukins, N. Kraft, and L. Etzkorn, Bug localization using latent Dirichlet allocation. Information and Software Technology, Volume 52, Issue 9, p. 972-990, September 2010.
[26] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[27] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[28] A. Marcus, A. Sergeyev, V. Rajlich, and J. Maletic, An Information Retrieval Approach to Concept Location in Source Code. In Proceedings of the 11th IEEE Working Conference on Reverse Engineering (WCRE 2004), Delft, The Netherlands, p. 214-223, November 9-12, 2004.
[29] T. Ostrand, E. Weyuker, and R. Bell, Predicting the Location and Number of Faults in Large Software Systems, IEEE Trans. Software Eng., 31(4), p. 340-355, 2005.
[30] D. Poshyvanyk, Y.-G. Guéhéneuc, A. Marcus, G. Antoniol, and V. Rajlich, Combining Probabilistic Ranking and Latent Semantic Indexing for Feature Identification. In Proceedings of the 14th IEEE International Conference on Program Comprehension (ICPC 2006), Athens, Greece, p. 137-146, June 2006.
[31] D. Poshyvanyk, Y.-G. Guéhéneuc, A. Marcus, G. Antoniol, and V. Rajlich, Feature Location using Probabilistic Ranking of Methods based on Execution Scenarios and Information Retrieval, IEEE Transactions on Software Engineering, 33(6), p. 420-432, 2007.
[32] S. Rao and A. Kak, Retrieval from software libraries for bug localization: a comparative study of generic and composite text models. In Proceedings of the 8th Working Conference on Mining Software Repositories (MSR '11), ACM, Waikiki, Honolulu, Hawaii, p. 43-52, May 2011.
[33] C. Sun, D. Lo, X. Wang, J. Jiang, and S. C. Khoo, A discriminative model approach for accurate duplicate bug report retrieval. Proc. of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE '10), Cape Town, South Africa, May 2010.
[34] E. M. Voorhees, TREC-8 Question Answering Track Report, Proceedings of the 8th Text Retrieval Conference, p. 77-82, 1999.
[35] X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun, An approach to detecting duplicate bug reports using natural language and execution information. In Proceedings of the 30th International Conference on Software Engineering (ICSE '08), Leipzig, Germany, May 2008.
[36] I. H. Witten and T. C. Bell, The Zero-frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression, IEEE Transactions on Information Theory, 37(4), p. 1085-1094, 1991.
[37] H. Zhang, An Investigation of the Relationships between Lines of Code and Defects, Proc. of the 25th IEEE International Conference on Software Maintenance (ICSM '09), Edmonton, Canada, p. 274-283, September 2009.
[38] X. Zhang, H. He, N. Gupta, and R. Gupta, Experimental evaluation of using dynamic slices for fault location. In Automated and Algorithmic Debugging (AADEBUG), Monterey, California, USA, p. 33-42, 2005.
[39] A. Zeller and R. Hildebrandt, Simplifying and isolating failure-inducing input, IEEE Transactions on Software Engineering, 28(2), p. 183-200, 2002.