
Critical Review of BugSwarm for Fault Localization and Program Repair

Thomas Durieux, Rui Abreu
INESC-ID and IST, University of Lisbon, Portugal

arXiv:1905.09375v1 [cs.SE] 22 May 2019

Abstract—Benchmarks play an essential role in evaluating the efficiency and effectiveness of solutions that automate several phases of the software development lifecycle. Moreover, if well designed, they also serve as an important artifact for comparing different approaches with one another. BugSwarm is a recently published benchmark that contains 3,091 pairs of failing and passing continuous integration builds. According to its authors, the benchmark has been designed with the automatic program repair and fault localization communities in mind. Given that a benchmark targeting these communities ought to have several characteristics (e.g., a buggy statement needs to be present), we have dissected BugSwarm to fully understand whether the benchmark suits these communities well. Our critical analysis shows several limitations in the benchmark: only 112/3,091 pairs (3.6%) are suitable for evaluating automatic fault localization or program repair techniques.

I. INTRODUCTION

Empirical software engineering focuses on gathering evidence, mainly via measurements and experiments involving software artifacts. The information collected is then used to form the basis of theories about the processes under study. Benchmarks therefore play an essential role in empirical studies since, in many situations, they are the only source of information available to evaluate new approaches. In particular, benchmarks are used to evaluate approaches to testing (e.g., automatic test generation), automatic fault localization (FL), and automatic program repair (APR). The latter two research fields require benchmarks of (behavioral) bugs to evaluate, respectively, the precision of fault localization and the ability to generate correct patches.

In the past year, the research community has put considerable effort into developing new benchmarks for these fields. Indeed, four new benchmarks have been presented in the past months: BUGSJS [1], Bears [2], Code4Bench [3] and BugSwarm [4]. They complement the existing benchmarks: Defects4J [5], IntroClass [6], ManyBugs [6], IntroClassJava [7], Bugs.jar [8] and QuixBugs [9]. As the number of benchmarks grows, it becomes increasingly important to have a clear picture of the characteristics of each benchmark. This is important to ensure the quality of the evaluations that use such benchmarks and to guide researchers to the benchmarks that best suit their needs.

In this paper, we present a critical review of the BugSwarm benchmark, recently published at the International Conference on Software Engineering (ICSE'19) [4]. BugSwarm is a benchmark of 3,091 pairs of failing and passing builds, designed for the automatic fault localization and program repair communities. Its authors succeeded in reproducing 3.05% (3,091/101,265) of the pairs of builds they considered. They ended up with 1,827 pairs of Java builds and 1,264 pairs of Python builds that are between one and four years old. Those builds are extracted from 108 Java projects and 52 Python projects, with an average of 19.3 pairs of builds per project. In this paper, we characterize the human patches, the failures and the usage of BugSwarm. We then focus our analysis on the applicability of BugSwarm to evaluating state-of-the-art automatic fault localization and program repair research.

In our analysis, we find that a large number of pairs of builds are ill suited for the automatic fault localization and program repair fields. Indeed, we observe that BugSwarm contains, for example, duplicate commits and builds that fail due to non-behavioral problems. Our analysis shows that only 112/3,091 (a mere 3.6%) meet the criteria of those research fields. Broken down by programming language, only 50 suitable entries remain for Java and 62 for Python. This difference is explained by the fact that BugSwarm is a benchmark of failing and passing builds and not a benchmark of bugs. Builds can fail for many reasons other than regression bugs, while automatic program repair and fault localization rely on bugs for their evaluation. Using all the builds in BugSwarm would therefore result in poor repair and fault localization precision rates and in misleading analyses.

The contributions of this paper are:
• a critical review of the BugSwarm benchmark with respect to automatic fault localization and program repair;
• a characterization of BugSwarm's content in terms of build execution, diff analysis, failure type and patch type;
• a set of lessons learned for designing future benchmarks for automatic program repair and fault localization;
• the source code of the 3,091 BugSwarm pairs of builds, the TravisCI logs for each build and the diffs between the passing and failing builds;
• an interactive website [10] to browse and filter BugSwarm's pairs of builds.

This paper is an analysis of the content and usage of BugSwarm. It provides guidelines for researchers on how to use BugSwarm and how to prevent misuse or incorrect recommendations.

The remainder of this paper is organized as follows. Section II presents the background of this paper: the BugSwarm benchmark and the requirements of automatic program repair and fault localization tools. Section III contains our analysis of the BugSwarm benchmark. Section IV discusses the lessons learned from this analysis and the threats to validity. Section V presents the related works and Section VI concludes this paper.
Figure 1. Methodology followed to create the BugSwarm benchmark: (1) mine a list of GitHub projects, (2) extract pairs of builds, (3) filter pairs of builds, (4) create Docker images, (5) reproduce the Docker images (5x), (6) package the reproduced BugSwarm Docker images.

II. BACKGROUND

A. What is BugSwarm?

BugSwarm is a benchmark of 3,091 pairs of builds that has been published by Dmeiri et al. at ICSE'19 [4]. The idea of BugSwarm is to collect pairs of failing/passing builds from a continuous integration service (in this case TravisCI). The failing build contains the incorrect behavior to fix and the passing build contains the oracle of the fix, i.e., the human modification that makes the failing build pass. This approach is also used by the Bears benchmark [2] but differs from the benchmarks that are traditionally used by the automatic program repair and fault localization fields, such as Defects4J [5] and QuixBugs [9].

Each pair of builds is composed of a failing build and a passing build that are encapsulated in a Docker image. The Docker image provides the source code and the scripts to reproduce the passing and failing executions. BugSwarm is accompanied by a command line interface that downloads and starts the Docker images, as well as an infrastructure to execute scripts inside the Docker image.

Figure 1 presents the six main steps that BugSwarm's authors followed to create it.
1) Mine a list of GitHub projects. The first step consists of collecting a set of GitHub projects that use TravisCI.
2) Extract pairs of builds. During the following step, they analyze the build history of each project and collect all pairs of failing/passing builds. The complexity of this task resides in recreating the correct history of the branches, since the Git history is a tree while the TravisCI history is linear. They needed an additional step to match the Git history with the TravisCI history.
3) Filter pairs of builds. The next step consists of keeping only the pairs of builds that have a chance of being reproducible. In this case, they only consider the builds where TravisCI uses a Docker image to run the build.
4) Create Docker images. The fourth step is to create a new Docker image that contains the two builds (failing and passing) and the scripts that execute them.
5) Reproduce Docker images. This fifth step is an important one. It consists of executing each Docker image created in the previous step five times, to ensure that the behavior of the builds is consistent. They parsed the TravisCI logs and the logs produced by the new Docker image to extract the number of failing tests. They consider a build reproduced when the number of failing tests is identical.
6) BugSwarm. The final step is to create an infrastructure that makes it possible to use BugSwarm and its 3,091 pairs of builds.

Table I presents the main metrics reported in BugSwarm's paper. BugSwarm contains 3,091 pairs of builds: 1,827 from Java applications and 1,264 from Python applications. Those builds come from 160 different GitHub projects: 108 Java projects and 52 Python projects. The usage information is available on BugSwarm's website, http://bugswarm.org, and for further details we recommend reading the BugSwarm paper [4].

Table I
THE ORIGINAL METRICS PRESENTED IN BUGSWARM'S PAPER.

Metric              Java    Python  All
# Pairs of builds   1,827   1,264   3,091
# Projects          108     52      160

B. APR and FL Requirements

The current state of the art of automatic program repair and fault localization techniques has a set of requirements for the buggy programs that they receive as inputs. We identify two categories of requirements: the first category contains the requirements related to the execution itself, and the second category ensures the fairness of empirical evaluations. The requirements for automatic program repair and fault localization are:
1) Bug type. Current APR and FL techniques only target behavioral bugs that are present in the source code of the program. This means that bugs in configuration or external files are currently not compatible with APR and FL.
2) Specification of the program. The test suite of the application is currently used as the main specification in APR and FL. The passing tests specify the correct behavior of the application and a failing test describes the incorrect behavior.
3) Program setup. The execution setup of the program has to be known, such as the path of the sources, the path of the tests, the path of the binaries, the classpath (for Java) and the version of the source.
4) Uniqueness of the bug. This requirement is important to ensure that a technique does not overfit one specific bug, which would introduce bias in the analysis of the results.
5) Human patch. The human patch is currently used in APR and FL evaluations as the perfect oracle. It provides the solution of how to fix the bug and provides the location of the bug.

Those requirements define the necessary conditions to be able to use and evaluate state-of-the-art APR and FL approaches on bugs.

III. BUGSWARM ANALYSIS

This section contains our analysis of BugSwarm.

A. Research Questions

In this study, we address the following three research questions:
RQ1. What are the main characteristics of BugSwarm's pairs of builds regarding the requirements for FL and APR? In this research question, we analyze the pairs of builds along three different axes: 1. the failing builds, 2. the human patches, and 3. the failure reasons, in order to identify pairs of builds that match the APR and FL requirements.
RQ2. What is the execution and storage cost of BugSwarm? In the second research question, we describe and analyze the usage of BugSwarm and we estimate the execution and storage cost of running an experiment with BugSwarm on the Amazon cloud.
RQ3. Which pairs of builds meet the requirements of Automatic Program Repair (APR) and Fault Localization (FL) techniques? In the final research question, we put BugSwarm's pairs of builds in perspective with state-of-the-art automatic program repair and fault localization techniques and we identify which ones could be used by those fields.

B. Protocol

In this section, we present the protocol that we followed to collect the artifacts used to answer our three research questions. We identify the following four artifacts that need to be collected:
1) The source code of each pair of builds.
2) The diff between the failing and passing builds.
3) The build information from TravisCI (including the execution logs).
4) The Docker image information from DockerHub.

Figure 2. The protocol that we follow to extract the buggy and passing builds from BugSwarm's Docker images and the protocol to extract the metrics that we used to answer the research questions.

Figure 2 describes our protocol. We first get the information for all pairs of builds using BugSwarm's API¹ and we iterate over it. For each build, we use the BugSwarm command line to run the Docker image. Inside the Docker image, we first prepare the buggy and passed versions of the application by removing the temporary files that were introduced during the creation of BugSwarm; the temporary files are duplicate versions of the repository. Then, we create a new Git repository where we commit the buggy and passed versions. The repository is then pushed to our GitHub repository in its own branch [11]. Then, we download the failing and passing execution logs from TravisCI and the Docker manifest of the image via DockerHub.

Once all the pairs of builds are pushed on GitHub, we download the diffs between each failing/passing version using GitHub's API. Finally, we compute metrics on the collected artifacts, such as the number of changed files, the number of changed lines, the types of files that have been changed, or the time between the failing and passing commits. All the collected artifacts and the scripts to collect and analyze them are publicly available on our GitHub repository [11]. The repository also offers access to the source of each pair of builds; this is rather convenient when the execution of the builds is not required for the analysis. We also created a website that presents the human diffs and the collected metrics for each artifact [10].

¹ BugSwarm's API request to get the list of pairs of builds: http://www.api.bugswarm.org/v1/artifacts/?where={"reproduce_successes":{"$gt":4,"lang":{"$in":["Java","Python"]}}}
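To make the first step of this protocol concrete, the following Python sketch queries the BugSwarm artifact API and iterates over the returned pairs of builds. It is a minimal illustration only: the endpoint is the one from footnote 1, the filter is restated with the language constraint as a separate top-level field (as in typical MongoDB-style queries), the pagination format is assumed to follow an Eve-style payload, and the per-artifact field names (e.g., `image_tag`, `lang`) are assumptions.

```python
import json
import urllib.parse
import urllib.request

# Filter derived from footnote 1: artifacts reproduced more than 4 times, Java or Python.
WHERE = {"reproduce_successes": {"$gt": 4}, "lang": {"$in": ["Java", "Python"]}}
BASE = "http://www.api.bugswarm.org/v1/artifacts/"

def list_artifacts():
    """Yield every artifact matching the filter, following pagination links if present."""
    url = BASE + "?where=" + urllib.parse.quote(json.dumps(WHERE))
    while url:
        with urllib.request.urlopen(url) as response:
            page = json.load(response)
        # Assumed Eve-style payload: items under "_items", pagination under "_links".
        for artifact in page.get("_items", []):
            yield artifact
        next_link = page.get("_links", {}).get("next", {}).get("href")
        url = urllib.parse.urljoin(BASE, next_link) if next_link else None

if __name__ == "__main__":
    artifacts = list(list_artifacts())
    print(len(artifacts), "pairs of builds")
    for artifact in artifacts[:5]:
        # 'image_tag' and 'lang' are assumed field names, used here for illustration only.
        print(artifact.get("image_tag"), artifact.get("lang"))
```

Each artifact returned by such a query is then processed with the BugSwarm command line as described above.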
C. RQ1. Characteristics of BugSwarm's Pairs of Builds

In this section, we analyze BugSwarm from three different angles: 1) a general analysis of the pairs of builds, 2) an analysis of the human patch and its diff, and 3) an analysis of the failures. The analyses use the data collected with the protocol described in Section III-B.

1) Buggy Builds: For the first angle, we perform a general analysis of the pairs of builds of BugSwarm. Table II presents the main metrics of this analysis. The table is divided into four columns: the first column presents the name of the metric, the second contains the numbers for the Java pairs of builds, the third for Python, and the last one for the Java and Python pairs of builds combined.

Table II
NUMBER OF PAIRS OF BUILDS IN BUGSWARM.

Metrics                                  Java    Python  All
# Pairs of builds in BugSwarm's paper    1,827   1,264   3,091
# Pairs of builds reproduced 5 times     1,699   1,250   2,949 (95.41%)
# Pairs of builds with unique commit     998     769     1,767 (57.17%)
# Docker images not available            3       5       8 (0.26%)

Our first observation is that the number of builds reported in this paper (2,949) and in BugSwarm's paper (3,091) differ (lines one and two of Table II). Indeed, we considered all pairs of builds that are reproduced successfully five times, as described in BugSwarm's paper (see Section 4-B in [4]). Surprisingly, BugSwarm's authors did not apply this criterion in their final selection of the pairs of builds, and consequently the reported number is in contradiction with the paper. We also observe that this number changes over time, between 2,042 and 2,949, so it is possible that the API will not return the same number in the future. We contacted the authors about this point and they told us that the property had been overwritten by mistake in the database and that they fixed it manually.

Our second observation is that 40.08% ((2,949 − 1,767)/2,949) of the builds have a duplicate failing commit id. This means that those 40.08% should not be considered by approaches that only look at the source code of the application, since doing so would introduce misleading results (see requirement 4 in Section II-B). It also shows that Java pairs of builds are slightly more impacted by duplicate commits than Python builds (41.26% vs. 38.48%). Finally, we observed that eight Docker images are unavailable: Adobe-Consulting-Services-acs-aem-commons-315891915, Adobe-Consulting-Services-acs-aem-commons-358605971, SonarSource-sonar-java-295863948, paramiko-paramiko-306104686, paramiko-paramiko-306104687, paramiko-paramiko-306104688, paramiko-paramiko-306104689 and paramiko-paramiko-306104690. We provided this list to BugSwarm's authors, and we expect the missing Docker images to become available in the following weeks.

By only considering the pairs of builds that are available, reproduced at least five times successfully and based on a unique commit, we end up with 1,759 pairs of builds taken from 156 GitHub repositories, with builds that are 2.56 years old on average.

2) Human Patches: For the second angle, we look at the human patches. We analyze the diffs between the buggy source code and the patched source code, and the time needed by the human to create a patch. For this analysis, we only consider the 1,767 builds that have a unique commit: since the builds that have duplicate commits have the same diff, including them would bias this analysis.

Table III presents the main metrics on the diffs between the buggy and passing builds.

Table III
NUMBER OF PAIRS OF BUILDS THAT HAVE DUPLICATE CONTENT AND THAT CHANGE SOURCE CODE. THOSE METRICS ARE USED TO VERIFY REQUIREMENTS ONE AND FOUR FOR AUTOMATIC PROGRAM REPAIR.

Metrics                           Java   Python  All
# Empty diffs                     1      1       2 (0.11%)
# Duplicate diffs                 101    97      198 (11.21%)
# Duplicate messages              178    154     332 (18.79%)
# Diffs that change source        827    445     1,272 (71.99%)
# Diffs that only change source   305    161     466 (26.37%)

The first row presents the number of empty diffs, i.e., no change between the buggy and the patched source. For one build, web2py-web2py-61468453, the diff is empty because the modification consists of a change in the configuration of a submodule that does not result in a code change. We did not find a reasonable explanation for the build checkstyle-checkstyle-211109551: the original diff in the project repository is not empty, but the diff generated inside the Docker image is. The second row contains the number of diffs that are duplicated, i.e., an exact match of the diff according to the md5 hash function. For the third row, we look at the commit messages of the passing builds and count how many of them are unique. The fourth row shows the number of diffs that change at least one source file, i.e., a Java or Python file. The following row contains the number of diffs that only change source files, i.e., that do not also change, for example, a configuration file.

Table IV presents time-related metrics: first the average and median time required by the developers to fix their builds, then the average and median execution times of the failing and passing builds.

Table IV
AVERAGE AND MEDIAN TIME FOR THE DEVELOPERS TO FIX THE BUILDS AND THE EXECUTION TIME OF THE PASSING AND FAILING BUILDS. IT SHOWS THAT BUILDS ARE FIXED MUCH MORE QUICKLY THAN TRADITIONAL BUGS. THE EXECUTION TIME IS SMALLER FOR FAILING BUILDS (1 MIN 12 SEC), WHICH INDICATES THAT THE PROBLEMS IN THE BUILDS HAVE A SIGNIFICANT IMPACT.

Metrics                        Java          Python       All
Avg. fix time                  1 day, 10.5h  1 day, 2.8h  1 day, 7.1h
Med. fix time                  28:54         39:52        32:45
Avg. execution time passing    04:43         04:54        04:48
Med. execution time passing    03:55         02:57        03:27
Avg. execution time failing    02:55         04:30        03:36
Med. execution time failing    02:26         02:46        02:37

The takeaway of these tables is that BugSwarm contains duplicate diffs (198) even when we only consider the pairs of builds that have different commit ids. Developers also frequently reuse the same commit message (332 duplicates): for example, the message "Added missing javadoc" occurs 34 times, "Added hint for findbugs." 27 times, "Fixed test." 19 times, "Fixed javadoc." 17 times, and "Fix build" 17 times. Most of the pairs of builds modify at least one source code file, but it is less frequent that the developers change only source files. This indicates that BugSwarm contains similar types of changes, which could lead APR and FL techniques to overfit. It also shows that 73.63% (1,301) of the builds change at least one non-source file. Consequently, the techniques have to support multiple file types in order to be evaluated on BugSwarm.
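The duplicate detection described above boils down to hashing each diff and counting collisions. The sketch below illustrates this step under the assumption that the diffs have already been downloaded to local files; the directory layout and file naming are hypothetical.

```python
import hashlib
from collections import Counter
from pathlib import Path

def md5_of(path: Path) -> str:
    """Return the md5 hex digest of a file's content."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def count_duplicates(diff_dir: str) -> None:
    # Hypothetical layout: one "<image-tag>.diff" file per pair of builds.
    diffs = sorted(Path(diff_dir).glob("*.diff"))
    hashes = Counter(md5_of(p) for p in diffs)

    empty = sum(1 for p in diffs if p.stat().st_size == 0)
    # A diff counts as a duplicate when its hash already appeared for another pair.
    duplicates = sum(count - 1 for count in hashes.values() if count > 1)

    print(f"# Pairs of builds: {len(diffs)}")
    print(f"# Empty diffs:     {empty}")
    print(f"# Duplicate diffs: {duplicates}")

if __name__ == "__main__":
    count_duplicates("diffs/")
```

The same counting applies to commit messages by hashing the message strings instead of the diff files.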
The median time to fix a build is low, at 33 minutes, especially when compared with the median time required to fix a bug as reported by Valdivia Garcia et al. [12], who indicate that for eight open-source projects it takes between 35 and 204 days. This highlights a big difference between build fixes and bug fixes. For example, we observe that some build fixes consist of ignoring or commenting out tests to make the build pass. This type of fix does not repair the regression in the application but keeps the build status green; this is the case, for example, for the build petergeneric-stdlib-160464757. Figure 3 shows the distribution of the fixing time. It shows that 58% (1,024/1,767) of the builds are fixed within one hour and that 86.59% of the builds are fixed in less than one day. Only 54 builds are fixed after more than one week.

Figure 3. Amount of time between the buggy commit and the fixing commit for each pair of builds. Legend: 'h' means hour, 'w' week and 'm' month. This figure indicates that 86.59% of the builds have been fixed in less than one day, which is much faster than an average bug fix time.

Finally, the failing executions finish faster than the passing builds: they take on average one minute and ten seconds (22.5%) less to execute. This indicates that the changes between the buggy and passing builds are substantial; they significantly impact the execution time of the failing builds. Therefore, BugSwarm is a challenging target for APR and FL.

We now focus our analysis on the diff itself. Table V and Table VI present, respectively, the total number of changed files and the total number of changed lines in the BugSwarm benchmark, without considering duplicate commits.

Table V
NUMBER OF MODIFIED, ADDED AND REMOVED FILES IN THE HUMAN PATCHES, CONSIDERING UNIQUE COMMITS. THIS SHOWS THAT APR NEEDS MULTILOCATION REPAIR ABILITY TO TARGET BUGSWARM.

Metric                 Java    Python  All
# Modified files       9,756   3,321   13,077
# Added files          1,086   470     1,556
# Removed files        872     49      921
Avg. # changed files   9.97    4.25    7.71

Table VI
NUMBER OF MODIFIED, ADDED AND REMOVED LINES IN THE HUMAN PATCHES, CONSIDERING UNIQUE COMMITS ONLY.

Metric                              Java     Python   All
# Added lines in modified files     431,880  182,707  614,587
# Removed lines in modified files   410,086  326,831  736,917
# Added lines in added files        102,097  121,607  223,704
# Removed lines in removed files    235,225  2,551    237,776
Avg. patch size                     1,182    824      1,026

As expected, Table V shows that it is mostly existing files that are modified in BugSwarm; only a small number of files are added or removed. Figure 4 details this analysis by listing the top 10 modified file types. It shows that the main source files are indeed the most frequently modified, with 6,975 Java files (53.3%) and 2,037 Python files (15.6%). It also highlights the fact that Java projects tend to modify more files than Python projects.

Figure 4. Top 10 most frequently modified file types: Java files (6,975, 53.3%) and Python files (2,037, 15.6%) dominate, followed by Whiley, RST, XML, JSON, HTML, SQL, JSP and console-output files.

Table VI shows that the numbers of added vs. removed lines in existing files are relatively similar in total, but differ per language: Java diffs contain more added lines than removed ones, and it is the opposite for Python. This is both good and bad news for automatic program repair and fault localization. It means that the majority of the changes are in file types that are handled by the tools, but it also shows that the diffs are big. Most of the current approaches only handle changes in one location, which is not the case for the majority of BugSwarm's pairs of builds.

Furthermore, we observe that the diffs from the Docker images are not always identical to the original diff generated by GitHub between the failing and passing commits. For example, for the build ansible-ansible-79500861, GitHub sees 494 changes in the file CHANGELOG.md² but none is visible inside the BugSwarm image.³ We did not manage to find an explanation for this difference.

² GitHub diff for the ansible-ansible-79500861 builds: https://github.com/ansible/ansible/compare/c747109db9e6bcd7185a3e1e2d451494c035f402..e0a50dbd9287e7fc3d81bfe7fb49972cb4900599
³ Generate the diff inside the BugSwarm image: bugswarm run --image-tag ansible-ansible-79500861 --pipe-stdin <<< "cd; cd build; diff -r failed passed"

3) Reasons for Failures: For the final angle, we analyze the reasons for the build failures. In this study, we analyze the failing execution log to extract the reasons for the failures. We identify nine different reasons, presented in Table VII, plus a catch-all category. 1) Test failure: this category contains all the builds that finish with a test failure or a test in error. 2) Checkstyle: those builds failed because of a checkstyle checker. 3) Compilation error: the syntax of the code leads to a compilation error or an invalid syntax exception. 4) Doc generation: the build stops because an error is detected in the documentation. 5) Missing license: some files of the build contain an invalid or missing license header. 6) Dependency error: the build did not succeed in downloading one or several dependencies. 7) Regression detection: the build introduces regressions in the API compared to the previous version. 8) Unable to clone: during the build, a submodule could not be cloned. 9) Missing main file: the main file needed to execute the build is missing or invalid. 10) Unknown: the last category contains all the builds that we did not succeed in categorizing into one of the nine previous categories.
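Our categorization is derived from the failing TravisCI logs. A simplified version of this step is sketched below: it scans each log for characteristic markers and assigns the first matching category. The marker strings are illustrative assumptions, not the exact patterns used in our scripts, and real logs need considerably more rules (which is why an Unknown bucket remains).

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative (category, regex) pairs; order matters, the first match wins.
RULES = [
    ("Test failure", re.compile(r"Tests run:.*Failures: [1-9]|FAILED \(|=+ FAILURES =+")),
    ("Checkstyle", re.compile(r"checkstyle", re.IGNORECASE)),
    ("Compilation error", re.compile(r"COMPILATION ERROR|SyntaxError")),
    ("Doc generation", re.compile(r"sphinx|javadoc.*error", re.IGNORECASE)),
    ("Missing license", re.compile(r"license", re.IGNORECASE)),
    ("Dependency error", re.compile(r"Could not resolve dependencies|No matching distribution")),
    ("Unable to clone", re.compile(r"fatal: clone of .* failed")),
]

def categorize(log_text: str) -> str:
    for category, pattern in RULES:
        if pattern.search(log_text):
            return category
    return "Unknown"

def categorize_logs(log_dir: str) -> Counter:
    # Hypothetical layout: one "<image-tag>.log" file per failing build.
    return Counter(
        categorize(p.read_text(errors="ignore")) for p in Path(log_dir).glob("*.log")
    )

if __name__ == "__main__":
    for category, count in categorize_logs("failing-logs/").most_common():
        print(f"{category}: {count}")
```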
Based on this categorization, we observe that test failures are by far the most common reason for failure, with 1,838 builds failing for this reason. They are followed by checkstyle errors and compilation errors, with 328 and 296 occurrences, respectively. The next reason, Doc generation, is only present in one project: terasolunaorg-guideline, which contains all the documentation for the TERASOLUNA Server Framework.

Table VII
NUMBER OF PAIRS OF BUILDS FOR THE TEN CATEGORIES OF FAILURES CONSIDERED IN THIS PAPER.

Failure type         Duplicate Commit                                Unique Commit
                     Java           Python         Total            Java           Python         All
Test failure         932 (54.86%)   906 (72.48%)   1,838 (62.33%)   575 (57.62%)   487 (63.33%)   1,062 (60.1%)
Checkstyle           320 (18.83%)   8 (0.64%)      328 (11.12%)     149 (14.93%)   5 (0.65%)      154 (8.72%)
Compilation error    263 (15.48%)   33 (2.64%)     296 (10.04%)     167 (16.73%)   16 (2.08%)     183 (10.36%)
Doc generation       0              171 (13.68%)   171 (5.8%)       0              170 (22.11%)   170 (9.62%)
Missing license      21 (1.24%)     0              21 (0.71%)       13 (1.3%)      0              13 (0.74%)
Dependency error     20 (1.18%)     0              20 (0.68%)       12 (1.2%)      0              12 (0.68%)
API Regression       4 (0.24%)      0              4 (0.14%)        2 (0.2%)       0              2 (0.11%)
Unable to clone      3 (0.18%)      0              3 (0.1%)         3 (0.3%)       0              3 (0.17%)
Missing main file    1 (0.06%)      0              1 (0.03%)        1 (0.1%)       0              1 (0.06%)
Unknown              132 (7.77%)    129 (10.32%)   261 (8.85%)      74 (7.41%)     89 (11.57%)    163 (9.22%)

Answer to RQ1. What are the main characteristics of BugSwarm's pairs regarding FL and APR requirements? BugSwarm is reported to have 3,091 pairs of builds; however, 142 of them are not reproduced five times, which contradicts BugSwarm's paper. Consequently, we considered 2,949 pairs of builds from 156 projects. 63.3% of those builds contain a unique commit id, and they are on average 2.56 years old. 58% of the builds have been fixed within an hour. The human patches change on average 7.71 files, with a similar amount of added and removed lines. Moreover, 73.63% (1,301) of the builds change at least one non-source file. The most frequent cause of failure is a failing test case, which represents 62% of the cases, followed by checkstyle errors and compilation errors. Considering those numbers, APR and FL techniques need to be able to localize and handle multi-location faults and to support multiple file types in order to target most of the BugSwarm builds.

D. RQ2. BugSwarm Execution and Storage Cost

In this second research question, we analyze the usage cost of BugSwarm, with a specific focus on execution time and storage.

First of all, we present the workflow of BugSwarm usage (a minimal driver for steps 3 to 6 is sketched after this list).
1) The first step is to list the available pairs of builds of BugSwarm to get the Docker tag ID. This first step requires an access token for BugSwarm's API.
2) Select the builds to execute, for example, a build from a specific project that fails due to a checkstyle error.
3) Download and extract the Docker image. This step is handled directly by Docker.
4) Set up the experiment. This step is project dependent. BugSwarm provides a folder that is shared with the Docker image (~/bugswarm-sandbox/ on the host is mapped to /bugswarm-sandbox/ inside the Docker image) and that is used to copy files and tools from the local machine to the Docker image. This is the main infrastructure for running an experiment on BugSwarm.
5) Start the Docker image. BugSwarm provides two different execution modes. The first mode is interactive: it creates an ssh connection between the host and the Docker image, through which one can interact with the builds using a command line interface. The second mode allows providing a command line that will be automatically executed when the image is started. The second execution mode is more appropriate for a large-scale execution.
6) The final step is the execution of the experiment itself.
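As a rough illustration of the non-interactive mode (steps 3 to 6), the driver below launches the `bugswarm run --image-tag ... --pipe-stdin` invocation shown in footnote 3 for a list of artifacts and stores each output. The image tags, the piped experiment command and the timeout are placeholders; only the CLI invocation itself is taken from this paper.

```python
import subprocess
from pathlib import Path

# Placeholder list of artifacts; in practice it would come from the API query of Section III-B.
IMAGE_TAGS = ["ansible-ansible-79500861"]

# Placeholder experiment: compare the failed and passed versions inside the image,
# mirroring the command given in footnote 3.
EXPERIMENT = "cd; cd build; diff -r failed passed"

def run_experiment(image_tag: str, out_dir: Path, timeout: int = 20 * 60) -> None:
    """Run one experiment in non-interactive mode and save its combined output."""
    result = subprocess.run(
        ["bugswarm", "run", "--image-tag", image_tag, "--pipe-stdin"],
        input=EXPERIMENT,
        capture_output=True,
        text=True,
        timeout=timeout,  # e.g., the 20-minute budget assumed in RQ2
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{image_tag}.log").write_text(result.stdout + result.stderr)

if __name__ == "__main__":
    for tag in IMAGE_TAGS:
        run_experiment(tag, Path("results"))
```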
Based on the described workflow, we now present our analysis of BugSwarm usage in terms of execution time and storage. We identify step three as the step that impacts the execution time and the storage required by the experiment the most. Table VIII presents the size of the BugSwarm Docker images.

Table VIII
METRICS OF BUGSWARM DOWNLOADING AND STORAGE COST.

Metrics (sizes in GB)                   Java        Python      All
BugSwarm Docker layer size              5,107       3,813       8,921
BugSwarm unique Docker layer size       1,327       919         2,246
Avg. size                               3.01        3.05        3.03
Download all layers (80 Mbits/s)        6d, 7.8h    4d, 17.3h   11d, 1.16h
Download unique layers (80 Mbits/s)     1d, 15.4h   1d, 3.3h    2d, 18.8h
The first line shows the total amount of data that has to be downloaded. According to our observations, the ratio between download size and disk storage is 2.48x, and it drops to 0.41x when considering the duplicated layers. For example, the image scikit-learn-scikit-learn-83097609 requires downloading 3.40 GB and takes 7.06 GB of disk space if stored alone, but only 1.394 GB if the shared layers are already downloaded. Based on this observation, we estimate the total disk space required at 3,680.45 GB. Note that this ratio between download size and disk storage has been computed on OSX with Docker 18.09.2; the ratio can differ on other operating systems and Docker versions.

The second line of Table VIII presents the total amount of data to download for BugSwarm if all the Docker layers are conserved. Each Docker image is divided into different layers; the layers are shared between the different images and consequently reduce the total amount of data that needs to be downloaded. Unfortunately, we observe that above 350 GB of Docker images, Docker slows the computer down dramatically, and the images have to be removed at frequent intervals to keep the computer responsive. However, this increases the total amount of data to download and decompress, and therefore the execution time.

The third line presents the average image size; it shows that on average the Python images are slightly bigger than the Java ones. The fourth and fifth lines contain, respectively, the amount of time needed to download all the layers and only the unique layers with a stable connection of 80 Mbits/s. With this connection, it takes between 2 and 11 days to download BugSwarm, depending on the number of Docker layers that need to be downloaded.

Based on those numbers, we estimate the cost of downloading BugSwarm on an Amazon Cloud instance to be 45.24 USD. This estimation has been computed by selecting the cheapest virtual machine with 16 GB of RAM.⁴ This machine costs 0.1664 USD per hour, so renting the machine costs 0.1664 USD/h × 2d 18.8h = 11.11 USD. We now estimate the cost of the storage. Storage on AWS costs 0.10 USD per GB per month.⁵ There are 8,921.94 GB × 0.41 = 3,680.45 GB to store, which costs 368.05 USD per month, or 0.51 USD per hour. Consequently, the storage costs 0.51 USD/h × 2d 18.8h = 34.13 USD for the time needed to download BugSwarm completely. Fortunately, there is no cost for downloading data from the internet (the Docker images).

⁴ Amazon EC2 on-demand pricing: https://aws.amazon.com/en/ec2/pricing/on-demand/, visited 24 April 2019.
⁵ Amazon EBS pricing: https://aws.amazon.com/en/ebs/pricing/, visited 24 April 2019.

Those costs do not consider the time required to decompress the images nor the execution of an experiment. If we consider a 20-minute experiment per build, it would result in a total of 983 h of execution, which represents an additional cost of (0.51 USD/h + 0.1664 USD/h) × 983 h = 666.06 USD. Consequently, we estimate the starting cost of using BugSwarm on AWS at 11.11 USD + 34.13 USD + 666.06 USD = 711.30 USD. This is the cost of a single execution, and it does not include the cost of transferring data from AWS to a different machine (such as transferring the final results to a local machine).

These costs and the execution time can be significantly reduced by selecting the pairs of builds to execute. This question of build selection is discussed in the following research question.

Answer to RQ2. What is the execution and storage cost of BugSwarm? The estimated cost of downloading BugSwarm's pairs of builds on an Amazon Cloud instance is 45.24 USD, plus an estimated additional cost of 711.30 USD to run an experiment of 20 minutes on each build. This cost is estimated by considering the 3,680.45 GB of storage required by BugSwarm and the 2 days and 18.80 hours needed to download it. This cost is mostly due to the Docker images that improve the reproducibility of the pairs of builds. We consider it a reasonable overhead if access to servers such as AWS is possible, but it is impractical for consumer-grade hardware.
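The cost estimate above is simple arithmetic over the numbers of Table VIII and the AWS prices from footnotes 4 and 5. The short script below reproduces it so that the assumptions (download time, hourly machine price, storage price, 20-minute experiment budget) can be changed in one place; the constants are the ones reported in this section, not live AWS prices, and the results are approximations of the rounded figures given in the text.

```python
# Reproduction of the RQ2 cost estimate (constants taken from this section).
DOWNLOAD_HOURS = 2 * 24 + 18.8           # 2 days and 18.8 hours to download the unique layers
MACHINE_USD_PER_HOUR = 0.1664            # cheapest 16 GB RAM instance (footnote 4)
STORAGE_USD_PER_GB_MONTH = 0.10          # EBS price (footnote 5)
TOTAL_LAYER_GB = 8_921.94                # all Docker layers (Table VIII)
DISK_RATIO = 0.41                        # observed download-size to disk-size ratio
BUILDS = 2_949                           # pairs of builds reproduced five times
EXPERIMENT_MINUTES = 20                  # assumed budget per build

storage_gb = TOTAL_LAYER_GB * DISK_RATIO                             # ~3,680.45 GB
storage_usd_per_hour = storage_gb * STORAGE_USD_PER_GB_MONTH / 730   # ~0.51 USD/h, assuming ~730 h/month

download_machine_cost = MACHINE_USD_PER_HOUR * DOWNLOAD_HOURS        # ~11.11 USD
download_storage_cost = storage_usd_per_hour * DOWNLOAD_HOURS        # ~34 USD

experiment_hours = BUILDS * EXPERIMENT_MINUTES / 60                  # ~983 h
experiment_cost = (MACHINE_USD_PER_HOUR + storage_usd_per_hour) * experiment_hours  # ~666 USD

total = download_machine_cost + download_storage_cost + experiment_cost
print(f"storage: {storage_gb:.2f} GB, total starting cost: {total:.2f} USD")  # ~711 USD
```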
E. RQ3. BugSwarm for APR and FL

In this research question, we look at the usage of BugSwarm for the specific fields of automatic program repair and fault localization.

Based on the requirements presented in Section II-B, we identify eight filters to select the pairs of builds that are potentially compatible with automatic program repair and fault localization. Figure 5 presents the different filters and their impact on the number of pairs of builds; a sketch of the filtering pipeline is given after the list.

Figure 5. The filters that select the builds for automatic program repair and fault localization, narrowing BugSwarm from 3,091 pairs of builds down to 163 pairs for APR and FL.

1) Build reproduced five times. This filter is presented in BugSwarm's paper [4] to ensure that the pairs of builds are reproducible.
2) Available Docker image. We identify eight Docker images that are missing. We remove them since they cannot be used.
3) Non-empty diff. We only consider the pairs of builds whose diff is not empty, since empty diffs can only stem from configuration changes; the current literature on automatic program repair and fault localization does not target this problem.
4) Unique commit. 1,179 pairs of builds have a failing commit that is already present in BugSwarm. We remove those pairs of builds since all current approaches work on the source code: the same bug would appear in different pairs of builds, which can bias the experiments.
5) Unique diff. This filter is similar to the previous one but verifies that the diff of each pair of builds is unique.
6) Test-case failure. The current approaches for automatic program repair and fault localization rely on a failing test case to expose the bug; without a failing test case, those approaches cannot be executed.
7) Only source files changed. This filter removes all the pairs of builds that modify files that are not source code (.py or .java). We apply this filter since, to our knowledge, no approach is able to handle non-source-code files.
8) No test changed. The final filter removes the pairs of builds that modify a test case. We remove those builds since the modification of a test case changes the oracle, and therefore the buggy version of the application contains either an invalid or an out-of-date oracle.
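The selection described above is essentially a sequence of predicates applied to the metadata we collected for each pair of builds. The sketch below shows one way to express it; the field names on the `Pair` record are hypothetical stand-ins for our collected metrics, the test-file check is a simple path heuristic, and the filter order follows the list above.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    # Hypothetical metadata record for one pair of builds.
    image_tag: str
    reproduce_successes: int
    docker_image_available: bool
    diff: str                 # unified diff between failing and passing builds
    failing_commit: str
    failure_type: str         # e.g., "Test failure", "Checkstyle", ...
    changed_files: list[str]

def select_for_apr_fl(pairs: list[Pair]) -> list[Pair]:
    seen_commits, seen_diffs, kept = set(), set(), []
    for p in pairs:
        if p.reproduce_successes < 5:                 # 1) reproduced five times
            continue
        if not p.docker_image_available:              # 2) available Docker image
            continue
        if not p.diff.strip():                        # 3) non-empty diff
            continue
        if p.failing_commit in seen_commits:          # 4) unique commit (keep first occurrence)
            continue
        if p.diff in seen_diffs:                      # 5) unique diff (keep first occurrence)
            continue
        if p.failure_type != "Test failure":          # 6) test-case failure
            continue
        if not all(f.endswith((".java", ".py")) for f in p.changed_files):  # 7) only source files
            continue
        # 8) no test changed; heuristic: a path containing "test" is treated as a test file.
        if any("test" in f.lower() for f in p.changed_files):
            continue
        seen_commits.add(p.failing_commit)
        seen_diffs.add(p.diff)
        kept.append(p)
    return kept
```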
After applying the outlined filters on BugSwarm, we are reduced down to 163 pairs of builds: 65 for Java and 98 for Python. Intriguingly, 154 of the 163 pairs of builds have been fixed in less than 24 hours (94% of which have been fixed within one hour).

We then manually categorize those 163 pairs of builds into three categories of patches: 1) Bug fix is a patch that we identify as a bug fix, 2) Non-bug fix is a patch that we identify as not being a bug fix, and 3) Unknown is a patch that we did not succeed in categorizing due to a lack of domain knowledge. Figure 6 presents the results of our manual analysis. It shows that we identify 112 patches that fix a bug, 50 for Java and 62 for Python. 40 pairs of builds have been identified as not fixing a bug, and 11 others as unknown.

Figure 6. Type of patches for the pairs of builds compatible with APR and FL: Bug fix: 50 Java + 62 Python = 112 (68.7%); Non-bug fix: 12 Java + 28 Python = 40 (24.5%); Unknown: 3 Java + 8 Python = 11 (6.7%).

We provide the complete list of builds in our repository [11] and a website [10] that allows browsing, filtering and searching BugSwarm's pairs of builds. The website illustrates the impact of the different builds nicely; we recommend the reader check it to get a perspective on BugSwarm's content.

Answer to RQ3. Which pairs of builds meet the requirements of Automatic Program Repair (APR) and Fault Localization (FL) techniques? We design eight filters to select the pairs of builds based on the requirements of APR and FL (see Section II-B). We identify 146 compatible pairs of builds, 81 for Java and 66 for Python, which represent 4.72% of the BugSwarm benchmark. On those 146 pairs of builds, we manually identify 99 bug fixes (3.2%): 48 for Java (2.62%) and 51 for Python (4.03%).

IV. DISCUSSION

A. Failing builds vs. bugs

BugSwarm is a benchmark of pairs of builds; however, its name, BugSwarm, is misleading. It leads one to think that it is a benchmark of bugs. Only by analyzing the content of a benchmark is it possible to realize its true nature. Indeed, we showed in RQ1 (see Section III-C) that there are ten different reasons producing failing builds, such as test failures, checkstyle errors and compilation errors. Only the test failures expose bugs; the other types of failures do not produce incorrect behavior and therefore should not be considered as bugs. For example, the build checkstyle-checkstyle-77722344 fails because of a checkstyle error, Redundant 'public' modifier, as visible in its execution log: https://travis-ci.org/checkstyle/checkstyle/jobs/77722344; SonarSource-sonar-java-74910602 fails because it is unable to clone one of its submodules: https://travis-ci.org/SonarSource/sonar-java/jobs/74910602.

The confusion between BugSwarm's name and its actual content can lead to invalid recommendations from reviewers or, much worse, to incorrect analyses in empirical evaluations that use BugSwarm blindly. We would like to stress the importance of the name, especially for artifacts that are designed to be used by other researchers.

B. Comparison between BugSwarm and existing benchmarks for APR and FL

There is a growing number of benchmarks for automatic program repair and fault localization, for example, Defects4J [5], Bears [2], IntroClass [6], ManyBugs [6], IntroClassJava [7], Bugs.jar [8], QuixBugs [9] and BUGSJS [1]. BugSwarm shares with Bears [2] the modus operandi used to collect the data in the benchmark: they both use TravisCI builds as source.

BugSwarm and Bears are the most similar benchmarks; both of them use TravisCI to collect builds. Bears [2] focuses on bugs by reproducing the builds and manually analyzing the human patches, whereas BugSwarm focuses on reproducible pairs of builds in a more generic way. BugSwarm's infrastructure is unique: it is the only benchmark that encapsulates each artifact in a Docker image.
This infrastructure comes with the advantage of improved reproducibility. However, it will still suffer from unreproducible builds due to invalid dependencies: for example, a snapshot version can be updated with breaking changes that will break the compilation of a BugSwarm artifact. Indeed, the Docker image does not include the dependencies of the application, which can lead to missing or invalid dependencies. This problem is the most frequent source of unreproducible builds according to BugSwarm's authors. Docker images also come with disadvantages such as a considerable size and execution overhead, as shown in RQ2 (see Section III-D). For the readers' reference, the repository [11] that contains all the sources of all the builds is less than 2 GB, compared to the 3,680.45 GB of BugSwarm.

The BugSwarm diffs between the failing and passing versions are also different from the existing benchmarks. Sobreira et al. [13] observe that the average patch size in Defects4J is 4 lines. Madeiral et al. [2] report a patch size of 8 lines in Bears. In BugSwarm, we observe an average patch size of 1,026 lines. This difference in size highlights the difference in nature of BugSwarm compared to the other benchmarks.

To sum up, the difference between BugSwarm and the literature is threefold:
• a benchmark of reproducible builds that can support new research on build repair;
• diffs that are much bigger than in the literature;
• a novel infrastructure to store and interact with bugs.

C. Lessons learned

Our analysis of BugSwarm allowed us to understand what constitutes a reasonable benchmark suited to fault localization and automatic program repair. In this section, we discuss several recommendations on how to build and make available benchmarks of bugs. We also draw some conclusions on how to improve BugSwarm.
1) [Target] The first recommendation is to think of the requirements of the research fields that the benchmark targets. In this case, APR and FL require bugs to be exposed by a failing test case, as well as metadata about each bug such as the location of the sources and/or the binaries. A benchmark that fails to provide such information is ill-suited to fault localization and automatic program repair;
2) [Exploration] The second recommendation is to provide the diff of each artifact. The diff can easily be understood by humans and gives a quick understanding of the content of the artifact. We also recommend providing a selection of artifacts for each research field if not all the artifacts are compatible with that field. This ensures that the same selection is used across papers and guarantees a fair comparison between approaches;
3) [Dependencies] The next recommendation is to improve the reproducibility of the benchmark by including the dependencies of the artifacts, since missing dependencies are the main reason for unreproducibility (as pointed out by the authors of BugSwarm);
4) [Access] Provide full access to the benchmark and its metadata. Benchmarks are by nature artifacts that should be used in different research works to compare against other related solutions. Furthermore, offer the tools used to create the benchmark as open source; this is required to ensure the representativeness of the benchmark;
5) [Documentation] Make sure to have proper documentation for the benchmark that explains how to check out one single bug but also how to execute a large-scale experiment on it. Avoid "coming soon" messages, or at least provide an email address alongside such a message;
6) [Versioning] When a benchmark evolves, it is important to version it, to be able to always point to the precise version that was used in a specific experiment. Indeed, without a version number it is difficult for authors to refer to the version of the benchmark they used, and therefore readers cannot know which artifacts have been used. We recommend that a new version be created each time a new artifact is added to the benchmark, as is done in the Bears benchmark [2].

Table IX puts the lessons learned from this section into perspective with nine other existing benchmarks of bugs. We observe that no benchmark meets all our recommendations. However, Defects4J [5] and Bears [2] already meet 5/6 of our recommendations.

Table IX
RECOMMENDATIONS VS. BENCHMARKS (✓: SATISFIED; ∼: PARTIALLY SATISFIED). The benchmarks compared are BugSwarm, Bears, Defects4J, Bugs.jar, BUGSJS, iBugs, IntroClass, IntroClassJava, ManyBugs and QuixBugs, evaluated against the six lessons learned: Target, Exploration, Dependencies, Access, Documentation and Versioning.

D. Threats to Validity

As with any implementation, the scripts that we use to collect BugSwarm's builds and the metrics are potentially not free of bugs. A bug might impact the results we reported in Section III. However, the scripts and the raw output are open source and publicly available for other researchers and potential users to check the validity of the results.

Moreover, BugSwarm could also be impacted by potential bugs, and the benchmark itself can be updated in the future. Therefore, the observed results can differ in the future. However, we provide all the scripts required to redo this analysis and update it if needed. Moreover, the website and the repository can be updated to take into account the changes in BugSwarm.

This analysis focuses on the current requirements of automatic program repair and fault localization techniques.
Those requirements can evolve with time, and therefore the bug selection presented in RQ3 (see Section III-E) may become inadequate in the future.

E. Discussion with BugSwarm's Authors

We communicated our results to BugSwarm's authors to get their feedback and opinion. It engendered a really interesting discussion about their vision and future directions for the benchmark. On the one hand, their feedback gave us the opportunity to improve and clarify the paper. On the other hand, our work allowed them to identify and fix issues in BugSwarm such as duplicate artifact ids, an inconsistent number of pairs of builds, and unavailable Docker images. We would like to thank them for their responsiveness and feedback.

V. RELATED WORKS

A. Benchmarks

We first present the benchmarks for automatic program repair and fault localization from the literature. The literature contains several benchmarks of Java bugs. Defects4J [5] is the most used benchmark for automatic program repair and fault localization. It contains 395 minimized bugs from six widely used open-source Java projects and has been created by mining the Apache issue tracker. Bugs.jar [8] contains 1,158 bugs from eight Apache projects; it was created using the same strategy as Defects4J. IntroClassJava [7] contains 297 bugs from six different student projects; it is a version of the bugs from the C benchmark IntroClass [6] transpiled to Java. Bears [2] contains 251 bugs from 72 different GitHub projects; it was created by mining TravisCI builds. Finally, iBugs [14] contains 390 Java bugs.

The literature also contains benchmarks of C programs: SIR [15] is a benchmark of seeded faults from nine small to medium-scale programs in the C language; ManyBugs [6] contains 185 bugs from nine open-source C programs; IntroClass [6] contains 572 bugs from six student programs.

We decided to analyze BugSwarm instead of the other benchmarks of the literature since BugSwarm is a new benchmark that will enjoy good visibility through the ICSE conference, and also because its creation process and its content are different from other benchmarks.

B. Benchmark Analysis

The literature contains some studies of existing benchmarks of bugs, covering defect characteristics such as defect importance, complexity, independence, test effectiveness, and characteristics of the human-written patch. Sobreira et al. [13] also analyze Defects4J but focus on the identification of repair actions and repair patterns. Madeiral et al. [16] automate the extraction of repair actions and repair patterns from diffs. Wang et al. [17] present a study that analyzes the impact of ignoring the project Mockito in empirical evaluations that use Defects4J. They show that automatic program repair techniques have poorer performance on Mockito and that ignoring it can introduce misleading results. This highlights the importance of the selection of the artifacts from benchmarks.

Some benchmarks also include an analysis of their content. iBugs [14], Codeflaws [18] and Bears [2] contain bugs annotated with size and syntactic properties of their patches. ManyBugs [6] analyses the patches and annotates the bugs when functions, loops, conditionals and function calls were added or when function signatures were changed. Those analyses are comparable to the analysis included in the BugSwarm paper, even if the results of the analyses differ, as presented in Section IV-B.

This paper presents an analysis using different metrics such as fix time, size of the benchmark, execution time, cost, and failure type. Moreover, it also includes an analysis of the content of the benchmark with regard to the research fields that it targets.

VI. CONCLUSION

This paper presents an analysis of the BugSwarm benchmark. We start off by analyzing the pairs of builds, the human patches, and the failure reasons. We observed that 142 pairs of builds do not match BugSwarm's inclusion criteria (a pair has to be reproduced five times), 1,182 pairs of builds contain duplicate commits, and 8 pairs of builds are no longer available. The human patches modify on average 1,026 lines in 7.71 files. The failing builds fail in 62.32% of the cases due to a test, in 11.12% due to a checkstyle error and in 10.03% due to a compilation error.

We then analyzed the overhead introduced by the BugSwarm infrastructure compared to a traditional repository. We estimate the overhead at 45.24 USD, 2 days and 18.8 hours, and 3,680.45 GB. This is the cost of improved reproducibility.

Finally, we analyzed BugSwarm with a view to using it for automatic program repair and fault localization. We identify six requirements that are needed in order to be able to use the pairs of builds in such domains: availability, uniqueness of the commit, uniqueness of the diff, source-code-based changes, a test failure, and an unmodified test suite. We manually identified 163 potentially relevant pairs of builds, and we determined that 112 of them are bug fixes, 40 non-bug fixes and 11 unknowns.

BugSwarm has been presented as a benchmark of pairs of builds for automatic program repair and fault localization, but only 112/3,091 (3.6%) are compatible with the current automatic program repair and fault localization approaches, a number that falls short of other related benchmarks. Moreover, BugSwarm's name is confusing: it makes one think that the benchmark contains bugs, but it is a benchmark of builds, which is conceptually rather different.

During our study, we have collected a number of findings on the requirements that make a benchmark well suited for fault localization and program repair. We have discussed these in detail, paving the way for upcoming benchmark sets.

ACKNOWLEDGMENTS

This material is based upon work supported by Fundação para a Ciência e a Tecnologia (FCT), with the reference PTDC/CCI-COM/29300/2017.
REFERENCES

[1] P. Gyimesi, B. Vancsics, A. Stocco, D. Mazinanian, A. Beszédes, R. Ferenc, and A. Mesbah, "BugsJS: A benchmark of JavaScript bugs," in Proceedings of the 12th International Conference on Software Testing, Verification, and Validation (ICST), to appear, 2019.
[2] F. Madeiral, S. Urli, M. Maia, and M. Monperrus, "Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies," in Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '19). Hangzhou, China: IEEE, 2019, pp. 1–11, to appear. [Online]. Available: https://arxiv.org/abs/1901.06024
[3] A. Majd, M. Vahidi-Asl, A. Khalilian, A. Baraani-Dastjerdi, and B. Zamani, "Code4Bench: A multidimensional benchmark of Codeforces data for different program analysis techniques," Journal of Computer Languages, 2019.
[4] N. Dmeiri, D. A. Tomassi, Y. Wang, A. Bhowmick, Y.-C. Liu, P. Devanbu, B. Vasilescu, and C. Rubio-González, "BugSwarm: Mining and continuously growing a dataset of reproducible failures and fixes," in Proceedings of the 41st International Conference on Software Engineering, 2019.
[5] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs," in Proceedings of the 23rd International Symposium on Software Testing and Analysis (ISSTA '14). New York, NY, USA: ACM, 2014, pp. 437–440. [Online]. Available: http://doi.acm.org/10.1145/2610384.2628055
[6] C. Le Goues, N. Holtschulte, E. K. Smith, Y. Brun, P. Devanbu, S. Forrest, and W. Weimer, "The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs," IEEE Transactions on Software Engineering, vol. 41, no. 12, pp. 1236–1256, Dec. 2015.
[7] T. Durieux and M. Monperrus, "IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs," University of Lille, Tech. Rep. #hal-01272126, 2016.
[8] R. K. Saha, Y. Lyu, W. Lam, H. Yoshida, and M. R. Prasad, "Bugs.jar: A Large-scale, Diverse Dataset of Real-world Java Bugs," in Proceedings of the 15th International Conference on Mining Software Repositories (MSR '18). New York, NY, USA: ACM, 2018, pp. 10–13. [Online]. Available: http://doi.acm.org/10.1145/3196398.3196473
[9] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama, "QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge," in Proceedings of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH Companion 2017). New York, NY, USA: ACM, 2017, pp. 55–56. [Online]. Available: http://doi.acm.org/10.1145/3135932.3135941
[10] Anonymous, "BugSwarm browsing website," https://tqrg.github.io/BugSwarm/, 2019.
[11] Anonymous, "BugSwarm artifacts and scripts," https://github.com/TQRG/BugSwarm, 2019.
[12] H. Valdivia Garcia and E. Shihab, "Characterizing and predicting blocking bugs in open source projects," in MSR '14. ACM, 2014, pp. 72–81.
[13] V. Sobreira, T. Durieux, F. Madeiral, M. Monperrus, and M. de Almeida Maia, "Dissection of a bug dataset: Anatomy of 395 patches from Defects4J," in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2018, pp. 130–140.
[14] V. Dallmeier and T. Zimmermann, "Extraction of Bug Localization Benchmarks from History," in Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE '07). New York, NY, USA: ACM, 2007, pp. 433–436. [Online]. Available: http://doi.acm.org/10.1145/1321631.1321702
[15] H. Do, S. Elbaum, and G. Rothermel, "Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact," Empirical Software Engineering, vol. 10, no. 4, pp. 405–435, 2005.
[16] F. Madeiral, T. Durieux, V. Sobreira, and M. Maia, "Towards an automated approach for bug fix pattern detection," arXiv preprint arXiv:1807.11286, 2018.
[17] S. Wang, M. Wen, X. Mao, and D. Yang, "Attention please: Consider Mockito when evaluating newly proposed automated program repair techniques," in Proceedings of the Evaluation and Assessment on Software Engineering. ACM, 2019, pp. 260–266.
[18] S. H. Tan, J. Yi, Yulis, S. Mechtaev, and A. Roychoudhury, "Codeflaws: A Programming Competition Benchmark for Evaluating Automated Program Repair Tools," in Proceedings of the 39th International Conference on Software Engineering Companion (ICSE-C '17). Piscataway, NJ, USA: IEEE Press, 2017, pp. 180–182. [Online]. Available: https://doi.org/10.1109/ICSE-C.2017.76