A Generalisability Study of Context-Aware Query Reformulation For IR-Based Bug Localization
Abstract—Software bug localization is the process of identifying the source code location of a defect in a software system. It is a crucial and challenging task in software maintenance: identifying the root cause of a bug is time-consuming, especially in large software systems. Among the different lines of research on bug localization, IR-based techniques have shown promising results in several studies, with some researchers reporting significant improvements in localization accuracy over other techniques. Current research areas for IR-based bug localization include query formulation and expansion, cross-project bug localization, analysis of unstructured data, and integration with other bug localization techniques. However, assessing the generalizability of such studies remains a challenge, as most studies are conducted on a specific dataset and their performance is not validated over multiple datasets. In this study, we address this gap by assessing the generalizability of one state-of-the-art IR-based bug localization technique on three public datasets, namely BLIZZARD [1], BuGL [2], and Bench4BL [3], which contain diverse bug reports in different formats. The localization results on the unseen datasets are comparable with other methods, but the overall accuracy is not as high as reported on the original benchmark dataset used in the reference study. We have also implemented a novel approach that aggregates the queries from multiple keyphrase extraction and text summarization methods; it performed well and demonstrated a 5%-11% improvement in terms of the HIT@10, MRR@10, and MAP@10 metrics. The reformulated aggregated query achieved similar improvements over the baseline queries as well. We also identified some shortcomings in the bug report classification method of the baseline BLIZZARD study and suggest improvements to them.
Index Terms—Bug localization, Information retrieval, Empirical study, Generalization study, Software debugging
I. INTRODUCTION
Bug localization [20] [25], which aims to locate the places in source code that are responsible for causing observed bugs, is an essential and challenging task in software maintenance [6] [20] [11]. Several techniques have been proposed to address this problem, including information retrieval (IR)-based and spectrum-based techniques [12]. IR-based techniques treat bug reports as queries and source code as documents, and rank the source code files according to their similarity to the query [14] [15]. Spectrum-based techniques, on the other hand, analyze the execution trace of the program and calculate the similarity between faulty and correct program executions [16] [18]. Despite the promising results achieved by existing bug localization techniques [24] [25], there is still room for improvement in terms of effectiveness and efficiency.
Recent qualitative and empirical studies [17] [25] report two major limitations of IR-based bug localization techniques. First, IR-based techniques cannot perform well without the presence of rich structured information (e.g., program entity names pointing to defects) in the bug reports. Second, they might also not perform well with a bug report that contains excessive structured information (e.g., stack traces, Fig. 1) [25]. One possible explanation of these limitations is that most of the contemporary IR-based techniques [12] [35] [24] [36] [42] [37] [15] use almost verbatim texts from a bug report as a query for bug localization. That is, they do not perform any meaningful modification to the query beyond limited natural language pre-processing (e.g., stop word removal, token splitting, stemming). As a result, their query can be either noisy due to excessive structured information (e.g., stack traces) or poor due to the lack of relevant structured information (e.g., Fig. 2). One way to overcome these challenges is to (a) refine the noisy query (e.g., Fig. 1) using appropriate filters and (b) complement the poor query (e.g., Fig. 2) with relevant search terms.
Fig. 1. A Noisy Bug Report
To solve this problem, Rahman et al. propose a technique, BLIZZARD [1], that locates software bugs in source code by employing context-aware query reformulation and information retrieval. Their technique (1) first determines the quality (i.e., the prevalence of structured entities or lack thereof) of a bug report (i.e., query) and classifies it as either noisy, rich, or poor, (2) then applies an appropriate reformulation to the query, and (3) finally uses the improved query for bug localization with information retrieval. Unlike earlier approaches [26] [36] [37], it either refines a noisy query or complements a poor query for effective information retrieval. Thus, BLIZZARD has a high potential for improving IR-based bug localization. The initial evaluation of BLIZZARD on a limited dataset of six Eclipse projects showed that it outperformed several state-of-the-art bug localization techniques.
Fig. 2. A Poor Bug Report
However, it is still unclear how effective BLIZZARD is when applied to new or unseen bug reports from large software projects. In this study, we aim to evaluate the effectiveness of BLIZZARD on a large collection of bug reports from popular projects of two well-known datasets, Bench4BL and BuGL. Specifically, we conduct a generalisability study to answer the following research questions:
• How well does the BLIZZARD technique generalize to new and unseen datasets?
• How does BLIZZARD perform compared to several other state-of-the-art keyword extraction techniques in terms of query reformulation?
• Can BLIZZARD's performance be improved by aggregating it with other state-of-the-art keyword extraction techniques?
• What are the characteristics of the bug reports that impact the performance of BLIZZARD?
The answers to these research questions can provide insights into the effectiveness and limitations of BLIZZARD and can guide future research on improving bug localization techniques.
II. LITERATURE REVIEW
Bug localization is a well-established research area, with existing studies broadly categorized into two groups: spectra-based and information retrieval (IR)-based [12] [25]. While spectra-based techniques are costly and lack scalability [14] [25], most recent studies adopt IR-based methods such as Latent Semantic Indexing (LSI) [41], Latent Dirichlet Allocation (LDA) [35] [24], and the Vector Space Model (VSM) [39] [14] [36] [42] [15] for bug localization. These methods use the shared vocabulary between bug reports and source code entities for bug localization. However, several studies [17] [25] show that IR-based methods are subject to the quality of bug reports and can be costly and less scalable when combined with other techniques or external information sources. In recent times, several studies have combined conventional IR-based bug localization with additional techniques such as spectra-based analysis [12], machine learning [40] [47], and the mining of various repositories such as bug report history [26], version control history [42] [37], code change history [44] [48], and bug reporter history [46]. The study of Wang and Lo [46] is one of the most successful in combining bug report content with three external repositories to outperform five previous IR-based bug localization techniques [26] [42] [37] [15], making it the current state-of-the-art. In short, contemporary studies advocate combining (1) multiple localization approaches (e.g., dynamic trace analysis [12], deep learning [40], learning to rank [47]) and (2) multiple external information sources with classic IR-based localization, and thereby improve localization performance. However, such solutions can be costly (i.e., multiple repository mining) and less scalable (i.e., dependent on external information sources), and hence can be infeasible to use in practice. That is why most recent studies focus on better leveraging the potential of the resources at hand (i.e., the bug report and the source code), which are the two core factors for IR-based bug localization. The latest studies [33] [32] [23] [45] focus on reformulating the bug report queries by extracting important terms from bug reports, and some studies also use the source code corpus to augment the bug report queries. However, these approaches mostly deal with unstructured natural language texts. Kim et al. [52] propose a word embedding-based automatic query expansion technique (WEQE) in which words are embedded from both a global corpus and a project-specific corpus. The initial query is extended by adding words semantically similar to it based on vector representations from the embedding model. The study validated the effectiveness of WEQE using 4,583 bug reports from seven projects, four IRBL models, and two embedding models.
Khatiwada et al. [53] investigate the impact of combining various IR methods on the retrieval accuracy of bug localization engines and show that optimized IR hybrids can significantly outperform individual methods as well as other unoptimized and hybrid methods, and that hybrid methods achieve their best performance when utilizing information-theoretic IR methods. Lam et al. [54] propose a deep learning-based model composed of an enhanced convolutional neural network (CNN) that considers bug-fixing recency and frequency, together with word-embedding and feature-detecting techniques, and makes full use of semantic information.
Among other studies on query reformulation, Xiao et al. [55] propose a method for bug localization with word embedding and enhanced convolutional neural networks, while Kim et al. [56] use word embedding for expanding queries for IR-based bug localization. Florez et al. [58] combine query reduction and expansion for text-retrieval-based bug localization. Chaparro et al. [59] propose and evaluate a set of query reformulation strategies based on the selection of existing information in bug descriptions and the removal of irrelevant parts from the original query. Chaparro et al. [60] propose three query reformulation strategies that require users to simply select from the bug report the description of the software's observed behavior and/or the bug title, and combine them to issue a new query. Liu et al. [61] propose NQE (Neural Query Expansion), a neural model that takes in a set of keywords and predicts a set of keywords to expand the query for NCS (Neural Code Search). NQE learns to predict keywords that co-occur with the query keywords in the underlying corpus, which helps expand the query in a productive way.
Fang et al. [57] propose a classification model for classifying a bug report as either uninformative or informative before running an IR-based bug localization system. The model is based on implicit features learned from bug reports using neural networks and explicit features defined manually.
Some other studies focus on bug localization by utilizing execution traces. Pathidea [50] leverages logs in bug reports to reconstruct execution paths and helps improve the results of bug localization. It uses static analysis to create a file-level call graph and reconstructs the call paths from the reported logs. A similar approach was taken by Pradel et al. [51] to propose a scalable bug localization technique by reconstructing the execution paths.
For our study, we considered the BLIZZARD technique [1], which approaches the problem differently, using several of the ideas from the above-mentioned studies. In particular, it refines noisy queries (i.e., those containing stack traces) and complements poor queries (i.e., those lacking structured items), offering effective information retrieval, unlike the earlier studies. Thus, issues raised by low-quality bug reports [25] have been significantly addressed by their technique, and their experimental findings support this conjecture. They compare with three existing studies, including the state-of-the-art [58]. However, they used only one dataset to measure performance, and no substantial generalizability study was performed on their tool using new and unseen datasets.
In this study, we checked the generalizability of BLIZZARD [1] with multiple publicly available datasets containing bug reports from various bug repositories and in different formats. We also propose an approach to generate aggregated queries by combining BLIZZARD with several keyphrase extraction and text summarization methods to improve overall bug localization performance.
III. EXPERIMENT DETAILS
A. Dataset
Since one of our main research goals is to test the generalizability of a state-of-the-art IR-based bug localization technique, we used three different datasets from previous studies, namely BLIZZARD [5], Bench4BL [3], and BuGL [2]. BLIZZARD contains 5,139 unique bug reports from 6 Java projects hosted on GitHub, Bench4BL consists of 10,017 bug reports collected from 51 open-source projects, and BuGL has 10,187 bug reports collected from 54 open-source projects. Figure 3 presents an overview of the datasets.
Fig. 3. Dataset summary
Dataset Collection: We collected a total of 25,343 bug reports from the three datasets of the earlier studies: BLIZZARD [5], Bench4BL [3], and BuGL [2]. First, all the resolved (i.e., marked as RESOLVED or "closed") bug reports of each project were collected from the datasets. To ensure a fair evaluation, we also discarded bug reports for which no source code files (e.g., Java classes) were changed or for which no relevant source files exist in the collected system snapshot. Then we converted the datasets into a format that works as input for the BLIZZARD tool as well as for the other methods we use for comparison; a minimal sketch of this preparation step follows the list below.
• Raw Bug Reports: We first parse the ".xml/.json/.txt" bug report files to extract the individual bug reports and store them as ".txt" files with the bug-id as the file name. Each bug report contains the Bug-ID, Title, and Description as content.
• Goldset Development: We collect a changeset (i.e., the list of changed files) for each selected bug from each bug-fixing commit recorded in the datasets, and develop a goldset in which each bug report has a ".txt" file, named by its bug-id, containing the changeset paths of that report. Multiple changesets for the same bug were merged together.
• Lucene Index: Since BLIZZARD uses a Lucene index for searching through documents, we created a Lucene index for each project from the source code corpus of that project. We first pre-processed the ".java" corpus files and then built the Lucene index. The pre-processing includes stop word removal, Java keyword removal, and punctuation removal. We used Lucene version 6.2.0 [49], as it is the version used in the BLIZZARD tool. (A Lucene index is a data structure that stores textual information in a way that makes it easy to search and retrieve documents.)
• Lucene Index to File Mapping: We created a mapping file for the Lucene index which contains the index of each file and its path with the filename; this is used when generating results.
• Corpus: Finally, we renamed all the files according to their mapping index and stored them in one directory per project.
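A minimal Python sketch of the conversion step is shown below for illustration. The input layout, the field names (id, title, description, changesets), and the output paths are assumptions, not the exact format of the released datasets.

Listing: Dataset preparation (Python sketch)
import json
from pathlib import Path

OUT = Path("dataset/ecf")                      # hypothetical project folder
(OUT / "reports").mkdir(parents=True, exist_ok=True)
(OUT / "goldsets").mkdir(parents=True, exist_ok=True)

def export_report(bug):
    """Write one bug report as <bug-id>.txt with Bug-ID, Title, and Description."""
    body = "{}\n{}\n{}".format(bug["id"], bug["title"], bug["description"])
    (OUT / "reports" / "{}.txt".format(bug["id"])).write_text(body)

def export_goldset(bug_id, changesets):
    """Merge all changesets of a bug into one goldset file of changed file paths."""
    paths = sorted({p for cs in changesets for p in cs if p.endswith(".java")})
    (OUT / "goldsets" / "{}.txt".format(bug_id)).write_text("\n".join(paths))

for bug in json.load(open("raw/ecf.json")):    # input layout is an assumption
    export_report(bug)
    export_goldset(bug["id"], bug.get("changesets", []))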
B. Bug Report Classification
Like the benchmark study, we adopt a semi-automated approach to classifying the bug reports (i.e., the queries). For each bug report, regular expressions are used to classify it into one of three categories: ST, PE, or NL. For many bug reports, however, this automated step fails due to the ill-defined structure of the report, and we determine the class by manual analysis. This is especially true when there are multiple types of information, such as method invocations/program elements and stack traces/natural language, in the bug report. The regular expressions fail to correctly classify many such cases, and we manually verify and classify them. Given the explicit nature of the structured entities, human developers can identify the class easily. The contents of each bug report are considered as the initial queries, which are reformulated in the next few steps. We categorize the reports into the following three categories; a minimal classification sketch follows the list.
• Bug Report Stack Traces (BRST): If a bug report contains one or more stack traces besides the regular texts or program elements, it is classified into BRST. We apply the following regular expression [14] to locate the trace entries in the report content.

Regex for Stack Trace (ST)
(.*)?(.+)\\.(.+)(\\((.+)\\.java:\\d+\\)|\\(Unknown Source\\)|\\(Native Method\\))

• Bug Report Program Elements (BRPE): If a bug report contains one or more program elements (e.g., method invocations, package names, source file names) but no stack traces in the texts, it is classified into BRPE. We use appropriate regular expressions [19] to identify the program elements in the texts.

Regex for Program Elements (PE)
((\\w+)?\\.[\\s\\n\\r]*[\\w]+)[\\s\\n\\r]*(?=\\(.*\\))|([A-Z][a-z0-9]+){2,}

• Bug Report Natural Language (BRNL): If a bug report contains neither program elements nor stack traces, it is classified into BRNL. That is, it contains only an unstructured natural language description of the bug.
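The automated part of this step can be sketched in a few lines of Python. The patterns below are the paper's Java string literals rewritten as raw strings; as noted above, reports that mix several kinds of structure still need manual review.

Listing: Bug report classification (Python sketch)
import re

# Regexes from the paper (Java string literals converted to Python raw strings).
ST_REGEX = re.compile(
    r"(.*)?(.+)\.(.+)(\((.+)\.java:\d+\)|\(Unknown Source\)|\(Native Method\))"
)
PE_REGEX = re.compile(
    r"((\w+)?\.[\s\n\r]*[\w]+)[\s\n\r]*(?=\(.*\))|([A-Z][a-z0-9]+){2,}"
)

def classify_report(text):
    """Classify a bug report as BRST, BRPE, or BRNL."""
    if ST_REGEX.search(text):
        return "BRST"   # contains at least one stack-trace entry
    if PE_REGEX.search(text):
        return "BRPE"   # program elements but no stack traces
    return "BRNL"       # plain natural-language description only

# Example (hypothetical report text):
print(classify_report("NPE at org.eclipse.ui.PartSite.getId(PartSite.java:42)"))  # BRST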
used to reformulate a query for searching relevant source
C. Query Reformulation code files. By extracting relevant keywords, YAKE helps
Query Reformulation is a process of refining a user’s developers to better understand the bug and locate the
original query to improve the accuracy and relevance of relevant code files more efficiently.
search results. In the context of IR-based bug localization, • PageRank: It is also a graph-based approach [30] [31] that
Query Reformulation [13] [23] [32] [34] aims to enhance uses algorithms similar to those used by search engines
the precision of bug localization by improving the accuracy to identify the most relevant keywords and phrases. It
of the bug reports query used to search for relevant code constructs a graph where nodes represent unique words in
files or in other words location of the bug. The process of the text and edges represent the co-occurrence of words
Query Reformulation involves analyzing the original query and in the same sentence. The PageRank algorithm is then
identifying keywords and phrases that are most relevant to the used to compute a score for each word, which indicates
task at hand. These keywords are then used to reformulate its relative importance in the text. The top n words with
the query in a way that will yield more accurate and relevant the highest scores are returned as the extracted keywords
results. The most effective way to extract keywords for Query which are used for query reformulation.
Reformulation is to use techniques such as Blizzard, TextRank, • Skip-gram: It is another machine-learning approach [33]
TopicRank, Yake, PageRank, and Skip-gram. These techniques that is commonly used for keyword extraction. It is a type
are designed to identify the most important words and phrases of neural network that is used for language modeling and
in a given document or corpus and rank them according to their word embedding. It works by training a model to predict
relevance. The following techniques we have considered for the context of a given word, based on the other words
our experimental study: that appear around it. We used the source code corpus
• Blizzard technique: The Blizzard technique [1] is a graph- to train the Skip-gram model then we used it to identify
based approach [38] to keyword extraction that uses a relevant terms or keywords that are related to the bug
report, which was then used to reformulate the query for E. Performance Metrics
the bug localization process. In order to have the comparability with the benchmark
• Baseline Queries: In this technique [1], we have just used study [1], we used similar metrics namely Hit@K, Mean
a pre-processed version of bug reports. To prepare base- Average Precision@K (MAP@K), Mean Reciprocal Rank@K
line queries for our datasets we performed Punctuation (MRR@K) and Effectiveness (E).
removal, Stop word removal, and Splitting of complex Hit@K: It is defined as the percentage of queries for
tokens. which at least one buggy file (i.e., from the goldset) is
For all the mentioned techniques above at first, we correctly returned within the Top-K results. It is also called
preprocessed the bug reports by removing stop words, Recall@Top-K [36] and Top-K Accuracy [45] in literature.
punctuation, and Splitting of complex tokens. Then used Mean Average Precision@K (MAP@K): Unlike regular
the algorithm to generate reformulated queries. precision, this metric considers the ranks of correct results
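As an illustration of two of these extractors, the sketch below uses the yake package and a TextRank-style scorer built as PageRank over a token co-occurrence graph with networkx. The window size, token filter, n-gram setting, and report path are illustrative assumptions; the exact parameter settings in our experiments may differ.

Listing: Keyword extraction (Python sketch)
import re
import networkx as nx   # pip install networkx
import yake             # pip install yake

def yake_keywords(text, k=10):
    """YAKE returns (phrase, score) pairs; LOWER scores mean more relevant."""
    extractor = yake.KeywordExtractor(lan="en", n=2, top=k)
    return [phrase for phrase, _ in extractor.extract_keywords(text)]

def textrank_keywords(text, k=10, window=3):
    """Rank terms by PageRank over a token co-occurrence graph (TextRank-style)."""
    tokens = re.findall(r"[A-Za-z]{3,}", text.lower())
    graph = nx.Graph()
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if token != other:
                graph.add_edge(token, other)
    scores = nx.pagerank(graph)
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

report = open("reports/BUG-1234.txt").read()   # hypothetical report path
query = " ".join(dict.fromkeys(yake_keywords(report) + textrank_keywords(report)))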
D. Bug Localization
Code Search: To perform the code search, we used Lucene [13] [43]. Lucene is a widely adopted search engine for document search that combines Boolean and VSM-based search methodologies (e.g., TF-IDF [28]). First, we create a Lucene index for each project by providing the pre-processed (i.e., stop word removal, Java keyword removal, and punctuation removal) source code files as corpus documents. Then, once a query is reformulated, we submit it to Lucene. In particular, we employ the Okapi BM25 similarity from the engine, use the reformulated query for the code search, and collect the results (the top-k document list). These resultant, potentially buggy source code documents are then presented as a ranked list to the developer for manual analysis. A minimal stand-in sketch of this search step follows.
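Our implementation uses Lucene 6.2.0 as described above; as a library-agnostic illustration of the same BM25 ranking step, here is a minimal Python stand-in using the rank_bm25 package. The corpus path and whitespace tokenization are simplified assumptions.

Listing: BM25 code search (Python sketch)
from pathlib import Path
from rank_bm25 import BM25Okapi   # pip install rank-bm25

# One pre-processed document per source file (a stand-in for the Lucene index).
paths = sorted(Path("corpus/eclipse.jdt.core").glob("**/*.java"))  # hypothetical layout
docs = [p.read_text(errors="ignore").split() for p in paths]
bm25 = BM25Okapi(docs)

def search(query, k=10):
    """Rank source files against a reformulated query with Okapi BM25."""
    scores = bm25.get_scores(query.split())
    ranked = sorted(zip(paths, scores), key=lambda pair: -pair[1])
    return ranked[:k]   # top-k potentially buggy files

for path, score in search("NullPointerException PartSite getId"):
    print("{:8.3f}  {}".format(score, path.name))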
E. Performance Metrics
For comparability with the benchmark study [1], we used the same metrics, namely Hit@K, Mean Average Precision@K (MAP@K), Mean Reciprocal Rank@K (MRR@K), and Effectiveness (E).
Hit@K: The percentage of queries for which at least one buggy file (i.e., from the goldset) is correctly returned within the Top-K results. It is also called Recall@Top-K [36] and Top-K Accuracy [45] in the literature.
Mean Average Precision@K (MAP@K): Unlike regular precision, this metric considers the ranks of correct results within a ranked list. Precision@K calculates the precision at the occurrence of each buggy file in the list. Average Precision@K (AP@K) is defined as the average of Precision@K over all the buggy files in a ranked list for a given query. Thus, Mean Average Precision@K is defined as the mean of AP@K over all queries:

AP@K = \frac{\sum_{k=1}^{D} P_k \times buggy(k)}{|S|}, \qquad MAP@K = \frac{\sum_{q \in Q} AP@K(q)}{|Q|}

Here, the function buggy(k) determines whether the k-th file (or result) is buggy (returns 1) or not (returns 0), and P_k provides the precision at the k-th result. D refers to the total number of results, S is the true positive result set of a query, and Q is the set of all queries. The bigger the MAP@K value is, the better the technique is [1].
Mean Reciprocal Rank@K (MRR@K): Reciprocal Rank@K is defined as the multiplicative inverse of the rank of the first correctly returned buggy file (i.e., from the goldset) within the Top-K results. Mean Reciprocal Rank@K averages this measure over all queries in the dataset:

MRR@K(Q) = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{firstRank(q)}
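All three metrics are straightforward to compute from a ranked result list and a goldset. A minimal Python sketch following the definitions above, with |S| taken as the set of correctly returned buggy files per the AP@K definition:

Listing: Evaluation metrics (Python sketch)
def hit_at_k(ranked, goldset, k=10):
    """1 if at least one goldset file appears in the Top-K results, else 0."""
    return int(any(f in goldset for f in ranked[:k]))

def ap_at_k(ranked, goldset, k=10):
    """AP@K: mean of Precision@K at each buggy hit; |S| = correctly returned files."""
    hits, precisions = 0, []
    for i, f in enumerate(ranked[:k], start=1):
        if f in goldset:                    # buggy(k) = 1
            hits += 1
            precisions.append(hits / i)     # P_k at this position
    return sum(precisions) / len(precisions) if precisions else 0.0

def rr_at_k(ranked, goldset, k=10):
    """Reciprocal Rank@K: 1 / firstRank(q), or 0 if no hit within the Top-K."""
    for i, f in enumerate(ranked[:k], start=1):
        if f in goldset:
            return 1.0 / i
    return 0.0

def evaluate(results, goldsets, k=10):
    """results: query-id -> ranked file list; goldsets: query-id -> set of buggy files."""
    qs = list(results)
    return {
        "Hit@%d" % k: 100.0 * sum(hit_at_k(results[q], goldsets[q], k) for q in qs) / len(qs),
        "MAP@%d" % k: sum(ap_at_k(results[q], goldsets[q], k) for q in qs) / len(qs),
        "MRR@%d" % k: sum(rr_at_k(results[q], goldsets[q], k) for q in qs) / len(qs),
    }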
IV. RESULTS
The description of a bug report was chosen as the baseline query, and Lucene was selected as the baseline technique. The performance of Lucene with the baseline queries was taken as the baseline performance for IR-based bug localization in this study (see Figure 5). We considered two situations while generating the reformulated query; a sketch of the feedback loop used in the first situation follows the list.
• With Query Augmentation: Here, the reformulated queries for bug reports containing only natural language text are further enriched with important keywords from the source code of the project using pseudo-relevance feedback, after which term-based graph weighting is employed. In pseudo-relevance feedback, the Top-K result documents returned by a given query are naively considered relevant and hence are selected for query reformulation [63] [64]. We used the implementation from the benchmark study [1] with their own and two additional datasets, and also generated two aggregated queries by combining it with two other popular keyword extraction techniques: PageRank (BLIZZARD + PageRank) and YAKE (BLIZZARD + YAKE).
• Without Query Augmentation: Here, we reformulate the query based on the natural language texts from the bug reports only and do not complement them with terms from the source code.
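A minimal sketch of the pseudo-relevance feedback loop, reusing the search() and textrank_keywords() helpers from the earlier sketches; the number of feedback documents and added terms are illustrative choices:

Listing: Pseudo-relevance feedback (Python sketch)
def prf_expand(query, bm25_search, k=10, extra_terms=5):
    """Treat the Top-K files returned by the initial query as relevant,
    score their terms with graph-based weighting, and append the best new ones."""
    top_docs = " ".join(path.read_text(errors="ignore")
                        for path, _ in bm25_search(query, k))
    candidates = textrank_keywords(top_docs, k=extra_terms)
    seen = set(query.split())
    return query + " " + " ".join(t for t in candidates if t not in seen)

# e.g., prf_expand("save editor layout", search)  # 'search' from the BM25 sketch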
Answering RQ1: Generalizability of BLIZZARD across new, unseen datasets. On its own benchmark dataset, the benchmark technique [1] could localize 70.67% of the bugs, with a mean reciprocal rank@10 of 44.79% and a mean average precision@10 of 42.13%. This result is respectively 3% (HIT@10) and 5% (MAP@10) higher than the baseline. When we combined the methods BLIZZARD + PageRank and BLIZZARD + YAKE, we achieved even better results (6% higher than BLIZZARD and 9% higher than the baseline) on the same BLIZZARD dataset.
On the contrary, when we applied the same methods to the new datasets, the baseline and combined methods performed better than BLIZZARD (3-6% higher) in HIT@10, MAP@10, and MRR@10. This result is significantly lower than on the dataset used in the benchmark study [1]; therefore, the generalizability of the benchmark study is in question and needs to be improved.
Answering RQ2: BLIZZARD compared to several other state-of-the-art keyword extraction techniques for query reformulation. Figure 6 compares the performance of BLIZZARD to the other keyword extraction methods and to the aggregated techniques we applied in our experiment. Again, when tested on BLIZZARD's own dataset, BLIZZARD performed slightly better than single methods like PageRank, TopicRank, and YAKE but fell short compared to the aggregated techniques. When experimenting with all three datasets, all the other methods performed better than BLIZZARD itself by an average of 2%-6%.
Fig. 6. Performance comparison over the different datasets
Answering RQ3: Improving BLIZZARD's performance by aggregating it with other state-of-the-art keyword extraction techniques. Figures 7 and 8 present a graphical comparison of HIT@10 (Accuracy) among the different techniques we experimented with in our study. As seen in Figures 5 and 6, the aggregated techniques consistently achieved improved results on the HIT@10, MRR@10, and MAP@10 metrics. This again supports our hypothesis that aggregated methods should be used for query reformulation instead of a single method; a sketch of a rank-based fusion for building such aggregated queries follows.
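A minimal sketch of rank-based fusion for aggregated queries (e.g., BLIZZARD + PageRank, BLIZZARD + YAKE); reciprocal-rank weighting is one concrete choice for illustration, and the weighting used in our full implementation may differ:

Listing: Aggregated query construction (Python sketch)
def aggregate_queries(term_lists, top_n=10):
    """Fuse ranked keyword lists from multiple extractors by reciprocal-rank scoring."""
    scores = {}
    for terms in term_lists:
        for rank, term in enumerate(terms, start=1):
            scores[term] = scores.get(term, 0.0) + 1.0 / rank   # earlier rank, more weight
    fused = sorted(scores, key=scores.get, reverse=True)
    return " ".join(fused[:top_n])

# e.g., aggregate_queries([blizzard_terms, yake_keywords(report)])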
Fig. 7. Accuracy comparison of reformulated queries without augmentation
V. LIMITATIONS
The benchmark study's bug report classification relies on regular expressions that misclassify many ill-structured reports (Section III-B), and its measurement statistics generation process is also not clearly described. We addressed these issues by manual verification and correction in our work, yet our implementation is still impacted by some of them. We intend to address these limitations in our future work.
VI. CONCLUSION AND FUTURE WORK
We made the following contributions with our study:
• We have taken a step towards generalization of the state-of-the-art IR-based bug localization tool BLIZZARD over three public bug report datasets (BLIZZARD, Bench4BL, and BuGL) widely used across multiple other studies. Our results reveal that the performance of the benchmark study [1] decreases on unknown or ill-formatted datasets and needs further enhancement to improve the results. Through this process, we have developed tools for dataset preparation and conversion which can be useful for validating the results of other studies and evaluating them over a broad range of bug reports from different sources.
• We have proposed a novel approach to aggregate multiple keyword extraction techniques for query reformulation, which demonstrated superior performance compared to the benchmark [1] and the other individual keyword extraction methods. We have identified and addressed a few limitations of the state-of-the-art BLIZZARD and provided manual fixes for them. We will automate these tasks in our future effort.
In the future, we plan to continue and improve our study with the following tasks:
REFERENCES
[1] Rahman, Mohammad Masudur, and Chanchal K. Roy. "Improving IR-based bug localization with context-aware query reformulation." Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2018.
[2] Muvva, Sandeep, A. Eashaan Rao, and Sridhar Chimalakonda. "BuGL – A Cross-Language Dataset for Bug Localization." arXiv preprint arXiv:2004.08846 (2020).
[3] Lee, Jaekwon, et al. "Bench4BL: Reproducibility study on the performance of IR-based bug localization." Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. 2018.
[4] Stop words. https://code.google.com/p/stop-words (2011). Accessed: April 2023.
[5] BLIZZARD: Replication package. https://github.com/masud-technope/BLIZZARD-Replication-Package-ESEC-FSE2018 Accessed: April 2023.
[6] Anvik, J., L. Hiew, and G. Murphy. "Who should fix this bug?" ICSE, 2006.
[7] Rahman, Mohammad Masudur, and Chanchal K. Roy. "TextRank based search term identification for software change tasks." 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 2015.
[8] Ashok, Balasubramanyan, et al. "DebugAdvisor: A recommender system for debugging." Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering. 2009.
[9] Chen, Fuxiang, and Sunghun Kim. "Crowd debugging." Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 2015.
[10] Gu, Zhongxian, et al. "Reusing debugging knowledge via trace-based bug search." ACM SIGPLAN Notices 47.10 (2012): 927-942.
[11] Xia, Xin, et al. "'Automated debugging considered harmful' considered harmful: A user study revisiting the usefulness of spectra-based fault localization techniques with professionals using real bugs from large systems." 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2016.
[12] Le, Tien-Duy B., Richard J. Oentaryo, and David Lo. "Information retrieval and spectrum based bug localization: Better together." Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 2015.
[13] Haiduc, Sonia, et al. "Automatic query reformulations for text retrieval in software engineering." 2013 35th International Conference on Software Engineering (ICSE). IEEE, 2013.
[14] Moreno, Laura, et al. "On the use of stack traces to improve text retrieval-based bug localization." 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE, 2014.
[15] Zhou, Jian, Hongyu Zhang, and David Lo. "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports." 2012 34th International Conference on Software Engineering (ICSE). IEEE, 2012.
[16] Abreu, Rui, Peter Zoeteweij, and Arjan J. C. Van Gemund. "On the accuracy of spectrum-based fault localization." Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007). IEEE, 2007.
[17] Rahman, Mohammad Masudur, and Chanchal K. Roy. "Improving bug localization with report quality dynamics and query reformulation." Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings. 2018.
[18] Keller, Fabian, et al. "A critical evaluation of spectrum-based fault localization techniques on a large-scale software system." 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2017.
[19] Rigby, Peter C., and Martin P. Robillard. "Discovering essential code elements in informal documentation." 2013 35th International Conference on Software Engineering (ICSE). IEEE, 2013.
[20] Parnin, Chris, and Alessandro Orso. "Are automated debugging techniques actually helping programmers?" Proceedings of the 2011 International Symposium on Software Testing and Analysis. 2011.
[21] Wang, Qianqian, Chris Parnin, and Alessandro Orso. "Evaluating the usefulness of IR-based fault localization techniques." Proceedings of the 2015 International Symposium on Software Testing and Analysis. 2015.
[22] Mens, Tom, et al. "Predicting bugs from history." Software Evolution (2008): 69-88.
[23] Kevic, Katja, and Thomas Fritz. "Automatic search term identification for change tasks." Companion Proceedings of the 36th International Conference on Software Engineering. 2014.
[24] Rao, Shivani, and Avinash Kak. "Retrieval from software libraries for bug localization: A comparative study of generic and composite text models." Proceedings of the 8th Working Conference on Mining Software Repositories. 2011.
[25] Wang, Qianqian, Chris Parnin, and Alessandro Orso. "Evaluating the usefulness of IR-based fault localization techniques." Proceedings of the 2015 International Symposium on Software Testing and Analysis. 2015.
[26] Saha, Ripon K., et al. "On the effectiveness of information retrieval based bug localization for C programs." 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE, 2014.
[27] Bougouin, Adrien, Florian Boudin, and Béatrice Daille. "TopicRank: Graph-based topic ranking for keyphrase extraction." International Joint Conference on Natural Language Processing (IJCNLP). 2013.
[28] Sparck Jones, Karen. "A statistical interpretation of term specificity and its application in retrieval." Journal of Documentation 28.1 (1972): 11-21.
[29] Campos, Ricardo, et al. "YAKE! Keyword extraction from single documents using multiple local features." Information Sciences 509 (2020): 257-289.
[30] Brin, Sergey, and Lawrence Page. "The anatomy of a large-scale hypertextual web search engine." Computer Networks and ISDN Systems 30.1-7 (1998): 107-117.
[31] Haveliwala, Taher H. "Topic-sensitive PageRank." Proceedings of the 11th International Conference on World Wide Web. 2002.
[32] Hill, Emily, Lori Pollock, and K. Vijay-Shanker. "Automatically capturing source code context of NL-queries for software maintenance and reuse." 2009 IEEE 31st International Conference on Software Engineering. IEEE, 2009.
[33] Ye, Xin, et al. "From word embeddings to document similarities for improved information retrieval in software engineering." Proceedings of the 38th International Conference on Software Engineering. 2016.
[34] Chaparro, Oscar, and Andrian Marcus. "On the reduction of verbose queries in text retrieval based software maintenance." Proceedings of the 38th International Conference on Software Engineering Companion. 2016.
[35] Nguyen, Anh Tuan, et al. "A topic-based approach for narrowing the search space of buggy files from a bug report." 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011). IEEE, 2011.
[36] Saha, Ripon K., et al. "Improving bug localization using structured information retrieval." 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2013.
[37] Wang, Shaowei, and David Lo. "Version history, similar report, and structure: Putting them together for improved bug localization." Proceedings of the 22nd International Conference on Program Comprehension. 2014.
[38] Blanco, Roi, and Christina Lioma. "Graph-based term weighting for information retrieval." Information Retrieval 15 (2012): 54-92.
[39] Kim, Dongsun, et al. "Where should we fix this bug? A two-phase recommendation model." IEEE Transactions on Software Engineering 39.11 (2013): 1597-1610.
[40] Lam, An Ngoc, et al. "Bug localization with combination of deep learning and information retrieval." 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). IEEE, 2017.
[41] Poshyvanyk, Denys, et al. "Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval." IEEE Transactions on Software Engineering 33.6 (2007): 420-432.
[42] Sisman, Bunyamin, and Avinash C. Kak. "Incorporating version histories in information retrieval based bug localization." 2012 9th IEEE Working Conference on Mining Software Repositories (MSR). IEEE, 2012.
[43] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems 26 (2013).
[44] Wen, Ming, Rongxin Wu, and Shing-Chi Cheung. "Locus: Locating bugs from software changes." Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 2016.
[45] Rahman, Mohammad Masudur, and Chanchal K. Roy. "STRICT: Information retrieval based search term identification for concept location." 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2017.
[46] Wang, Shaowei, and David Lo. "AmaLgam+: Composing rich information sources for accurate bug localization." Journal of Software: Evolution and Process 28.10 (2016): 921-942.
[47] Ye, Xin, Razvan Bunescu, and Chang Liu. "Learning to rank relevant files for bug reports using domain knowledge." Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 2014.
[48] Youm, Klaus Changsun, et al. "Bug localization based on code change histories and bug reports." 2015 Asia-Pacific Software Engineering Conference (APSEC). IEEE, 2015.
[49] Apache Lucene. https://lucene.apache.org/core/downloads.html Accessed: April 2023.
[50] Chen, An Ran, Tse-Hsun Chen, and Shaowei Wang. "Pathidea: Improving information retrieval-based bug localization by re-constructing execution paths using logs." IEEE Transactions on Software Engineering 48.8 (2021): 2905-2919.
[51] Pradel, Michael, Vijayaraghavan Murali, Rebecca Qian, Mateusz Machalica, Erik Meijer, and Satish Chandra. "Scaffle: Bug localization on millions of files." Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 2020. 225-236.
[52] Kim, Misoo, Youngkyoung Kim, and Eunseok Lee. "A novel automatic query expansion with word embedding for IR-based bug localization." 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2021. 276-287.
[53] Khatiwada, Saket, Miroslav Tushev, and Anas Mahmoud. "On combining IR methods to improve bug localization." Proceedings of the 28th International Conference on Program Comprehension. 2020. 252-262.
[54] Lam, An Ngoc, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. "Bug localization with combination of deep learning and information retrieval." 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). IEEE, 2017. 218-229.
[55] Xiao, Yan, Jacky Keung, Kwabena E. Bennin, and Qing Mi. "Improving bug localization with word embedding and enhanced convolutional neural networks." Information and Software Technology 105 (2019): 17-29.
[56] Kim, Misoo, Youngkyoung Kim, and Eunseok Lee. "A novel automatic query expansion with word embedding for IR-based bug localization." 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2021. 276-287.
[57] Fang, Fan, John Wu, Yanyan Li, Xin Ye, Wajdi Aljedaani, and Mohamed Wiem Mkaouer. "On the classification of bug reports to improve bug localization." Soft Computing 25 (2021): 7307-7323.
[58] Florez, Juan Manuel, Oscar Chaparro, Christoph Treude, and Andrian Marcus. "Combining query reduction and expansion for text-retrieval-based bug localization." 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2021. 166-176.
[59] Chaparro, Oscar, Juan Manuel Florez, and Andrian Marcus. "Using bug descriptions to reformulate queries during text-retrieval-based bug localization." Empirical Software Engineering 24 (2019): 2947-3007.
[60] Chaparro, Oscar, Juan Manuel Florez, Unnati Singh, and Andrian Marcus. "Reformulating queries for duplicate bug report detection." 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019. 218-229.
[61] Liu, Jason, Seohyun Kim, Vijayaraghavan Murali, Swarat Chaudhuri, and Satish Chandra. "Neural query expansion for code search." Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 2019. 29-37.
[62] Kim, Misoo, and Eunseok Lee. "A novel approach to automatic query reformulation for IR-based bug localization." Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. 2019. 1752-1759.
[63] Carpineto, Claudio, and Giovanni Romano. "A survey of automatic query expansion in information retrieval." ACM Computing Surveys (CSUR) 44.1 (2012): 1-50.
[64] Kevic, Katja, and Thomas Fritz. "Automatic search term identification for change tasks." Companion Proceedings of the 36th International Conference on Software Engineering. 2014. 468-471.