BugSwarm Benchmark Analysis
Abstract—Benchmarks play an essential role in evaluating the efficiency and effectiveness of solutions to automate several phases of the software development lifecycle. Moreover, if well designed, they also serve us well as an important artifact to [...]

Out of the 101,265 pairs of builds they considered, BugSwarm's authors ended up with 3,091: 1,827 pairs of Java builds and 1,264 pairs of Python builds that are between one and four years old. Those builds are extracted from 108 Java projects and 52 Python projects.
Table V
NUMBER OF MODIFIED, ADDED AND REMOVED FILES IN THE HUMAN PATCHES, CONSIDERING UNIQUE COMMITS. THIS SHOWS THAT APR NEEDS MULTI-LOCATION REPAIR ABILITY TO TARGET BUGSWARM.
[Table body not recovered in this extraction.]

Table VI
NUMBER OF MODIFIED, ADDED AND REMOVED LINES IN THE HUMAN PATCHES, CONSIDERING UNIQUE COMMITS ONLY.

Metric                              Java      Python    All
# Added lines in modified files     431,880   182,707   614,587
# Removed lines in modified files   410,086   326,831   736,917
# Added lines in added files        102,097   121,607   223,704
# Removed lines in removed files    235,225   2,551     237,776
Avg. patch size                     1,182     824       1,026

Figure 3. Amount of time between the buggy commit and the fixing commit for each pair of builds. Legend: 'h' means hour, 'w' week and 'm' month. This figure indicates that 86.59% of the builds have been fixed in less than one day, which is much faster than an average bug fix time. [Bar chart over the buckets <1h, <6h, <12h, <24h, <36h, <48h, <1w, <1m and >1m, split by Java and Python; the <1h bucket alone contains 1,024 builds (58%).]
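The statistics in Table VI and Figure 3 can be recomputed from the Git history of each pair of builds. The following sketch illustrates one way to do it; it is not the authors' script, and the local-clone layout, the commit-SHA arguments and the bucket limits are assumptions.

import subprocess
from datetime import datetime

def commit_date(repo, sha):
    # Committer date of a commit in a local clone, as a timezone-aware datetime.
    out = subprocess.check_output(
        ["git", "-C", repo, "show", "-s", "--format=%cI", sha], text=True).strip()
    return datetime.fromisoformat(out)

def line_counts(repo, failing_sha, passing_sha):
    # Total added/removed lines between the failing and the passing commit (cf. Table VI).
    out = subprocess.check_output(
        ["git", "-C", repo, "diff", "--numstat", failing_sha, passing_sha], text=True)
    added = removed = 0
    for line in out.splitlines():
        a, r, _path = line.split("\t", 2)
        if a != "-":  # '-' marks binary files in --numstat output
            added += int(a)
            removed += int(r)
    return added, removed

def fix_time_bucket(repo, failing_sha, passing_sha):
    # Delay between the buggy commit and the fixing commit, bucketed as in Figure 3.
    delta = commit_date(repo, passing_sha) - commit_date(repo, failing_sha)
    hours = delta.total_seconds() / 3600
    for label, limit in [("<1h", 1), ("<6h", 6), ("<12h", 12), ("<24h", 24),
                         ("<36h", 36), ("<48h", 48), ("<1w", 24 * 7), ("<1m", 24 * 30)]:
        if hours < limit:
            return label
    return ">1m"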
This type of fix does not repair the regression in the application but keeps the build status green. This is the case, for example, for the build petergeneric-stdlib-160464757. Figure 3 shows the distribution of the fixing time. It shows that 58% (1,024/1,767) of the builds are fixed within one hour and that 86.59% of the builds are fixed in less than one day. Only 54 builds are fixed after more than one week.

Finally, the failing executions finish faster than the passing builds: on average, they take one minute and ten seconds (22.5%) less to execute. This indicates that the changes between the buggy and the passing builds are substantial; they significantly impact the execution time of the failing builds. Therefore, BugSwarm is a challenging target for APR and FL.

We now focus our analysis on the diff itself. Table V and Table VI present, respectively, the total number of changed files and the total number of changed lines in the BugSwarm benchmark, without considering duplicate commits. As expected, it is mostly existing files that are modified in BugSwarm; only a small number of files are added or removed. Figure 4 details this analysis by listing the top 10 modified file types. It shows that the main source files are indeed the most frequently modified, with 6,975 files for Java and 2,037 files for Python. It also highlights that the Java projects tend to modify more files than the Python projects.

Table VI shows that the numbers of added and removed lines in existing files are similar in total but differ per language: Java diffs contain more added lines than removed ones, and it is the opposite for Python. This is both good and bad news for automatic program repair and fault localization.

Furthermore, we observe that the diffs inside the Docker images are not always identical to the original diff generated by GitHub between the failing and passing commits. For example, for the build ansible-ansible-79500861, GitHub sees 494 changes in the file CHANGELOG.md^2 but none of them is visible inside the BugSwarm image.^3 We did not manage to find an explanation for this difference.

^2 GitHub diff for the ansible-ansible-79500861 builds: https://github.com/

3) Reasons for Failures: For the final angle, we analyze the reasons for build failures. In this study, we analyze the failing execution log to extract the reason for each failure. We identify nine different reasons, plus an Unknown category, presented in Table VII: 1) Test failure: the build finishes with a failing test or a test in error. 2) Checkstyle: the build fails because of a Checkstyle check. 3) Compilation error: the code leads to a compilation error or an invalid-syntax exception. 4) Doc generation: the build stops because an error is detected in the documentation. 5) Missing license: some files of the build contain an invalid or missing license header. 6) Dependency error: the build does not succeed in downloading one or several dependencies. 7) Regression detection: the build introduces regressions in the API compared to the previous version. 8) Unable to clone: a submodule could not be cloned during the build. 9) Missing main file: the main file needed to execute the build is missing or invalid. 10) Unknown: all the builds that we could not categorize into one of the nine previous reasons. Based on this categorization, we observe that test failures are by far the most common reason for failure, with 1,838 builds failing for this reason, followed by checkstyle errors and compilation errors.
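The classification above is derived from the failing execution logs, but the concrete matching rules are not given in this excerpt. A minimal sketch of such a log classifier is shown below; the categories come from the list above, while the regular expressions are illustrative assumptions rather than the patterns actually used by the authors.

import re

# Categories follow the paper's list; the patterns are illustrative assumptions.
FAILURE_PATTERNS = [
    ("Test failure",         r"There are test failures|Tests run: .* Failures: [1-9]|FAILED \("),
    ("Checkstyle",           r"Checkstyle|checkstyle"),
    ("Compilation error",    r"COMPILATION ERROR|SyntaxError: invalid syntax|error: cannot find symbol"),
    ("Doc generation",       r"javadoc: error|Documentation build failed"),
    ("Missing license",      r"missing license header|license check failed"),
    ("Dependency error",     r"Could not resolve dependencies|No matching distribution found"),
    ("Regression detection", r"binary incompatibilit|API regression"),
    ("Unable to clone",      r"Failed to clone|fatal: unable to access"),
    ("Missing main file",    r"No such file or directory: .*(setup\.py|pom\.xml)"),
]

def classify_failure(log_text):
    # Return the first matching category for a failing build log, or "Unknown".
    for category, pattern in FAILURE_PATTERNS:
        if re.search(pattern, log_text):
            return category
    return "Unknown"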
[Table excerpt; column headers not recovered] Download unique layers (80 Mbit/s): 1d, 15.4h; 1d, 3.3h; 2d, 18.8h.
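The download-time figures quoted in the excerpt above and in the conclusion (e.g., 2d, 18.8h at 80 Mbit/s) follow directly from a data volume and a bandwidth. A minimal sketch of that arithmetic is shown below; the 2,400 GB of unique layers in the example is a hypothetical value, and decimal units (1 GB = 8,000 Mbit) and a sustained transfer rate are assumed.

def download_time(size_gb, bandwidth_mbit_s):
    # Time to transfer `size_gb` gigabytes at `bandwidth_mbit_s` Mbit/s, formatted as "Xd, Y.Yh".
    seconds = size_gb * 8_000 / bandwidth_mbit_s
    days, rest = divmod(seconds, 86_400)
    return f"{int(days)}d, {rest / 3_600:.1f}h"

# A hypothetical 2,400 GB of unique layers at 80 Mbit/s -> "2d, 18.7h";
# the full 3,680.45 GB at the same bandwidth -> "4d, 6.2h".
print(download_time(2400, 80))
print(download_time(3680.45, 80))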
The Docker-based distribution of the builds comes with the advantage of an improved reproducibility. However, it will still suffer from unreproducible builds due to invalid dependencies: for example, a snapshot version can be updated with breaking changes that will break the compilation of a BugSwarm artifact. Indeed, the Docker images do not include the dependencies of the application, which can lead to missing or invalid dependencies. This problem is the most frequent source of unreproducibility of builds according to BugSwarm's authors. Docker images also come with disadvantages such as a considerable size and execution overhead, as shown in RQ2 (see Section III-D). For the readers' reference, the repository [11] that contains all the sources of all the builds is less than 2 GB, compared to 3,680.45 GB for BugSwarm.

The BugSwarm diffs between the failing and the passing versions are also different from those of existing benchmarks. Sobreira et al. [13] observe that the average patch size in Defects4J is 4 lines. Madeiral et al. [2] report a patch size of 8 lines in Bears. In BugSwarm, we observe an average patch size of 1,026 lines. This difference in size highlights the difference in nature between BugSwarm and the other benchmarks.

To sum up, the difference between BugSwarm and the literature is threefold:
• A benchmark of reproducible builds that can support new research on build repair;
• Diffs that are much bigger than those in the literature;
• A novel infrastructure to store and interact with the bugs.

C. Lessons learned

Our analysis of BugSwarm allowed us to understand what constitutes a reasonable benchmark that is suited to fault localization and automatic program repair. In this section, we discuss several recommendations on how to build and make available benchmarks of bugs. We also draw some conclusions on how to improve BugSwarm.
1) [Target] The first recommendation is to think of the requirements of the research fields that the benchmark targets. In this case, APR and FL require bugs to be exposed by a failing test case, as well as metadata about each bug such as the location of the sources and/or the binaries. A benchmark that fails to provide such information is ill-suited to fault localization and automatic program repair;
2) [Exploration] Provide a selection of artifacts for each research field if not all the artifacts are compatible with that research field. This ensures that the same selection is used across papers and guarantees a fair comparison between approaches;
3) [Dependencies] The next recommendation is to improve the reproducibility of the benchmark by including the dependencies of your artifacts, since missing dependencies are the main reason for unreproducibility (as pointed out by the authors of BugSwarm);
4) [Access] Provide full access to your benchmark and metadata. Benchmarks are by nature artifacts that should be used in different research works to compare against other related solutions. Furthermore, offer the tools used to create the benchmark as open source: this is required to ensure the representativeness of the benchmark;
5) [Documentation] Make sure to have proper documentation for your benchmark that explains how to check out one single bug but also how to execute a large-scale experiment on it. Avoid "coming soon" messages, or at least provide an email address alongside such a message;
6) [Versioning] When a benchmark evolves, it is important to version it, to be able to always point to the exact version that was used in a specific experiment. Indeed, without a version number it is difficult for authors to state which version of the benchmark they used, and therefore readers cannot know which artifacts have been used. We recommend that each time a new artifact is added to the benchmark, a new version is created, as it is done in the Bears benchmark [2].

Table IX
[Comparison of the lessons learned (Target, Exploration, Dependencies, Access, Documentation, Versioning) across BugSwarm, Bears, Bugs.jar, BUGSJS, Defects4J, iBugs, IntroClass, IntroClassJava, ManyBugs and QuixBugs; the individual check marks could not be recovered from this extraction.]

Table IX puts into perspective the lessons learned from this section with nine other existing benchmarks of bugs. We observe that no benchmark meets all our recommendations. However, Defects4J [5] and Bears [2] already meet 5/6 of our recommendations.

D. Threats to Validity

As with any implementation, the scripts that we use to collect BugSwarm's builds and to compute the metrics are potentially not free of bugs. A bug might impact the results we reported in Section III. However, the scripts and the raw output are open source and publicly available for other researchers and potential users to check the validity of the results.

Moreover, BugSwarm could also be impacted by potential bugs, and the benchmark itself can be updated in the future. Therefore, the observed results can differ in the future. However, we provide all the scripts that are required to redo this analysis and to update it if needed. Moreover, the website and the repository can be updated to take into account the changes in BugSwarm.

This analysis focuses on the current requirements of automatic program repair and fault localization techniques. Those requirements can evolve with time, and therefore the bug selection presented in RQ3 (see Section III-E) can be inadequate in the future.
E. Discussion with BugSwarm's Authors

We communicated our results to BugSwarm's authors to get their feedback and opinion on them. This engendered a really interesting discussion about their vision and future directions for the benchmark. On the one hand, their feedback gave us the opportunity to improve and clarify the paper. On the other hand, our work allowed them to identify and fix issues in BugSwarm, such as duplicate artifact ids, an inconsistent number of pairs of builds, and unavailable Docker images. We would like to thank them for their responsiveness and feedback.

V. RELATED WORKS

A. Benchmarks

We first present the benchmarks for automatic program repair and fault localization from the literature. The literature contains several benchmarks of Java bugs. Defects4J [5] is the most used benchmark for automatic program repair and fault localization. It contains 395 minimized bugs from six widely used open-source Java projects and has been created by mining the Apache issue tracker. Bugs.jar [8] contains 1,158 bugs from eight Apache projects; it was created using the same strategy as Defects4J. IntroClassJava [7] contains 297 bugs from six different student projects; it is a version of the bugs from the C benchmark IntroClass [6] transpiled to Java. Bears [2] contains 251 bugs from 72 different GitHub projects; it was created by mining Travis CI builds. Finally, iBugs [14] contains 390 Java bugs.

The literature also contains benchmarks of C programs: SIR [15] is a benchmark of seeded faults from nine small to medium-scale C programs. ManyBugs [6] contains 185 bugs from nine open-source C programs. IntroClass [6] contains 572 bugs from six student programs.

We decided to analyze BugSwarm instead of the other benchmarks of the literature because BugSwarm is a new benchmark that will enjoy good visibility through the ICSE conference, and also because its creation process and its content are different from those of the other benchmarks.

B. Benchmark Analysis

The literature contains some studies on existing benchmarks of bugs, covering defect characteristics such as defect importance, complexity, independence, test effectiveness, and characteristics of the human-written patch. Sobreira et al. [13] also analyze Defects4J but focus on the identification of repair actions and repair patterns. Madeiral et al. [16] automate the extraction of repair actions and repair patterns from diffs. Wang et al. [17] present a study that analyzes the impact of ignoring the project Mockito in empirical evaluations that use Defects4J. They show that automatic program repair techniques have poorer performance on Mockito and that ignoring it can introduce misleading results. This highlights the importance of the selection of the artifacts from benchmarks.

Some benchmarks also include an analysis of their content. iBugs [14], Codeflaws [18] and Bears [2] contain bugs annotated with size and syntactic properties of their patches. ManyBugs [6] analyses the patches and annotates the bugs when functions, loops, conditionals or function calls were added, or when function signatures are changed. Those analyses are comparable to the analysis included in the BugSwarm paper, even if the results of the analyses are different, as presented in Section IV-B.

This paper presents an analysis using different metrics such as fix time, size of the benchmark, execution time, cost, or failure type. Moreover, it also includes an analysis of the content of the benchmark with regard to the research fields that it targets.

VI. CONCLUSION

This paper presents an analysis of the BugSwarm benchmark. We start off by analyzing the pairs of builds, the human patches, and the failure reasons. We observed that 142 pairs of builds do not match BugSwarm's inclusion criteria (each pair has to be reproduced five times), 1,182 pairs of builds contain duplicate commits, and 8 pairs of builds are no longer available. The human patches modify on average 1,026 lines in 7.71 files. The failing builds fail in 62.32% of the cases due to a test failure, in 11.12% due to a checkstyle error, and in 10.03% due to a compilation error.

We then analyzed the overhead introduced by the BugSwarm infrastructure compared to a traditional repository. We estimate the overhead as 45.24 USD, 2d, 18.8h and 3,680.45 GB. This is the cost of an improved reproducibility.

Finally, we analyzed BugSwarm with a view to using it for automatic program repair and fault localization. We identify six requirements that are needed in order to be able to use the pairs of builds in such domains: availability, uniqueness of the commit, uniqueness of the diff, source-code based, test failure, and test suite not modified. We manually identified 163 potentially relevant pairs of builds, and we determined that 112 of them are bug fixes, 40 are non-bug fixes and 11 are unknown.

BugSwarm has been presented as a benchmark of pairs of builds for automatic program repair and fault localization, but only 112/3,091 (3.6%) of its pairs are compatible with current automatic program repair and fault localization approaches, a number that falls short of other related benchmarks. Moreover, BugSwarm's name is confusing: it makes one think that the benchmark contains bugs, whereas it is a benchmark of builds, which is conceptually rather different.

During our study, we have collected a number of findings on the requirements that make a benchmark well suited for fault localization and program repair. We have discussed them in detail, paving the way for upcoming benchmark sets.

ACKNOWLEDGMENTS

This material is based upon work supported by Fundação para a Ciência e a Tecnologia (FCT), with the reference PTDC/CCI-COM/29300/2017.
REFERENCES

[1] P. Gyimesi, B. Vancsics, A. Stocco, D. Mazinanian, A. Beszédes, R. Ferenc, and A. Mesbah, "BugsJS: A benchmark of JavaScript bugs," in Proceedings of the 12th International Conference on Software Testing, Verification, and Validation (ICST), 2019, to appear.
[2] F. Madeiral, S. Urli, M. Maia, and M. Monperrus, "Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies," in Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '19). Hangzhou, China: IEEE, 2019, pp. 1–11, to appear. [Online]. Available: https://arxiv.org/abs/1901.06024
[3] A. Majd, M. Vahidi-Asl, A. Khalilian, A. Baraani-Dastjerdi, and B. Zamani, "Code4Bench: A multidimensional benchmark of Codeforces data for different program analysis techniques," Journal of Computer Languages, 2019.
[4] N. Dmeiri, D. A. Tomassi, Y. Wang, A. Bhowmick, Y.-C. Liu, P. Devanbu, B. Vasilescu, and C. Rubio-González, "BugSwarm: Mining and continuously growing a dataset of reproducible failures and fixes," in Proceedings of the 41st International Conference on Software Engineering, 2019.
[5] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs," in Proceedings of the 23rd International Symposium on Software Testing and Analysis (ISSTA '14). New York, NY, USA: ACM, 2014, pp. 437–440. [Online]. Available: http://doi.acm.org/10.1145/2610384.2628055
[6] C. Le Goues, N. Holtschulte, E. K. Smith, Y. Brun, P. Devanbu, S. Forrest, and W. Weimer, "The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs," IEEE Transactions on Software Engineering, vol. 41, no. 12, pp. 1236–1256, Dec. 2015.
[7] T. Durieux and M. Monperrus, "IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs," University of Lille, Tech. Rep. #hal-01272126, 2016.
[8] R. K. Saha, Y. Lyu, W. Lam, H. Yoshida, and M. R. Prasad, "Bugs.jar: A Large-scale, Diverse Dataset of Real-world Java Bugs," in Proceedings of the 15th International Conference on Mining Software Repositories (MSR '18). New York, NY, USA: ACM, 2018, pp. 10–13. [Online]. Available: http://doi.acm.org/10.1145/3196398.3196473
[9] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama, "QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge," in Proceedings of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH Companion 2017). New York, NY, USA: ACM, 2017, pp. 55–56. [Online]. Available: http://doi.acm.org/10.1145/3135932.3135941
[10] Anonymous, "BugSwarm browsing website," https://tqrg.github.io/BugSwarm/, 2019.
[11] Anonymous, "BugSwarm artifacts and scripts," https://github.com/TQRG/BugSwarm, 2019.
[12] H. Valdivia Garcia and E. Shihab, "Characterizing and predicting blocking bugs in open source projects," in MSR '14. ACM, 2014, pp. 72–81.
[13] V. Sobreira, T. Durieux, F. Madeiral, M. Monperrus, and M. de Almeida Maia, "Dissection of a bug dataset: Anatomy of 395 patches from Defects4J," in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2018, pp. 130–140.
[14] V. Dallmeier and T. Zimmermann, "Extraction of Bug Localization Benchmarks from History," in Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE '07). New York, NY, USA: ACM, 2007, pp. 433–436. [Online]. Available: http://doi.acm.org/10.1145/1321631.1321702
[15] H. Do, S. Elbaum, and G. Rothermel, "Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact," Empirical Software Engineering, vol. 10, no. 4, pp. 405–435, 2005.
[16] F. Madeiral, T. Durieux, V. Sobreira, and M. Maia, "Towards an automated approach for bug fix pattern detection," arXiv preprint arXiv:1807.11286, 2018.
[17] S. Wang, M. Wen, X. Mao, and D. Yang, "Attention please: Consider Mockito when evaluating newly proposed automated program repair techniques," in Proceedings of the Evaluation and Assessment on Software Engineering. ACM, 2019, pp. 260–266.
[18] S. H. Tan, J. Yi, Yulis, S. Mechtaev, and A. Roychoudhury, "Codeflaws: A Programming Competition Benchmark for Evaluating Automated Program Repair Tools," in Proceedings of the 39th International Conference on Software Engineering Companion (ICSE-C '17). Piscataway, NJ, USA: IEEE Press, 2017, pp. 180–182. [Online]. Available: https://doi.org/10.1109/ICSE-C.2017.76