Deep Learning to Find Bugs

Michael Pradel and Koushik Sen

Technical Report TUD-CS-2017-0295

TU Darmstadt, Department of Computer Science


November, 2017
Deep Learning to Find Bugs

Michael Pradel
Department of Computer Science, TU Darmstadt, Germany

Koushik Sen
EECS Department, University of California, Berkeley, USA

Abstract

Automated bug detection, e.g., through pattern-based static analysis, is an increasingly popular technique to find programming errors and other code quality issues. Traditionally, bug detectors are program analyses that are manually written and carefully tuned by an analysis expert. Unfortunately, the huge number of possible bug patterns makes it difficult to cover more than a small fraction of all bugs. This paper presents a new approach toward creating bug detectors. The basic idea is to replace manually writing a program analysis with training a machine learning model that distinguishes buggy from non-buggy code. To address the challenge that effective learning requires both positive and negative training examples, we use simple code transformations that create likely incorrect code from existing code examples. We present a general framework, called DeepBugs, that extracts positive training examples from a code corpus, leverages simple program transformations to create negative training examples, trains a model to distinguish these two, and then uses the trained model for identifying programming mistakes in previously unseen code. As a proof of concept, we create four bug detectors for JavaScript that find a diverse set of programming mistakes, e.g., accidentally swapped function arguments, incorrect assignments, and incorrect binary operations. To find bugs, the trained models use information that is usually discarded by program analyses, such as identifier names of variables and functions. Applying the approach to a corpus of 150,000 JavaScript files shows that learned bug detectors have a high accuracy, are very efficient, and reveal 132 programming mistakes in real-world code.

1 Introduction

Automated bug detection techniques are widely used by developers and have received significant attention from researchers. One of the most popular techniques is lightweight, lint-style, static analysis that searches for instances of known bug patterns. Typically, such analyses are created as part of a framework that supports an extensible set of bug patterns, or rules, such as Google Error Prone [1], FindBugs [18], or lgtm.¹ Each of these frameworks contains at least several dozens, sometimes even hundreds, of bug detectors, i.e., individual analyses targeted at a specific bug pattern. The term "bug" here refers to a wide spectrum of problems that developers should address, including incorrect program behavior and other code quality issues, e.g., related to readability and maintainability.

¹ [Link]

Even though various bug detectors have been created and are widely used, there still remain numerous bugs that slip through all available checks. We hypothesize that one important reason for missing so many bugs is that manually creating a bug detector is non-trivial. A particular challenge is that each bug detector must be carefully tuned and extended with heuristics to be practical. For example, a recently deployed bug detector that is now part of the Google Error Prone framework comes with various heuristics to increase the number of detected bugs and to decrease the number of false positives [37].

We propose to address the challenge of creating bug detectors through machine learning. Even though machine learning has helped address a variety of software development tasks [8, 13, 14, 36] and even though buggy code stands out compared to non-buggy code [33], the problem of learning-based bug detection remains an open challenge. A key reason for the current lack of learning-based bug detectors is that effective machine learning requires large amounts of training data. To train a model that effectively identifies buggy code, learning techniques require many examples of both correct and incorrect code – typically thousands or even millions of examples. However, most available code is correct, or at least it is unknown exactly which parts of it are incorrect. As a result, existing bug detectors that infer information from existing code learn only from correct examples [16] and then flag any code that is unusual as possibly incorrect, or search for inconsistencies within a single program [11]. Unfortunately, to be practical, these approaches rely on built-in domain knowledge and various carefully tuned heuristics.

In this paper, we address the problem of automatically detecting buggy code with a technique that learns from both correct and buggy code. We present a general framework, called DeepBugs, that extracts positive training examples from a code corpus, applies a simple transformation to also create large amounts of negative training examples, trains a model to distinguish these two, and finally uses the trained model for identifying mistakes in previously unseen code.

The key idea that enables DeepBugs to learn an effective model is to synthesize negative training examples by seeding bugs into existing, presumably correct code. To create negative examples for a particular bug pattern, all that is required is a simple code transformation that creates instances of the bug pattern. For example, to detect bugs caused by accidentally swapping arguments passed to a function, the
transformation swaps arguments in existing code, which is likely to yield a bug.

To instantiate the framework, we focus on name-based bug detectors, i.e., a class of bug detectors that exploit implicit semantic information provided through identifier names. For example, consider a function call setSize(height, width) that is incorrect because the arguments are passed in the wrong order. A name-based bug detector would pinpoint this bug by comparing the names of the arguments height and width to the names of the formal parameters and of arguments passed at other call sites. Such information has been used in manually created analyses that revealed various name-related programming mistakes [23, 31, 37].

However, due to the fuzzy nature of programmer-chosen identifier names, manually creating name-based bug detectors is challenging and heavily relies on well-tuned heuristics. In particular, existing techniques consider lexical similarities between names, e.g., between height and myHeight, but miss semantic similarities, e.g., between height and y_dim. To enable DeepBugs to reason about such semantic similarities, it automatically learns embeddings, i.e., dense vector representations, of identifier names. The embedding assigns similar vectors to semantically similar names, such as height and y_dim, allowing DeepBugs to detect otherwise missed bugs and to avoid otherwise reported spurious warnings.

As a proof of concept, we create four bug detectors for JavaScript that find a diverse set of programming mistakes, e.g., accidentally swapped function arguments, incorrect assignments, and incorrect binary operations. We evaluate DeepBugs and its four instances by learning from a corpus of 100,000 JavaScript files and by searching for mistakes in another 50,000 JavaScript files. In total, the corpus amounts to 68 million lines of code. We find that the learned bug detectors have an accuracy between 84.23% and 94.53%, i.e., they are very effective at distinguishing correct from incorrect code. Manually inspecting a subset of the warnings reported by the bug detectors, we found 132 real-world bugs and code quality problems. Even though we do not perform any manual tuning or filtering of warnings, the bug detectors have a reasonable precision (roughly half of all inspected warnings point to actual problems) that is comparable to manually created bug detectors.

In summary, this paper contributes the following:
• A general framework for learning bug detectors by training a model with both positive and negative examples. (Section 2)
• The observation that simple code transformations applied to existing code yield negative training examples that enable the training of an effective bug detector.
• A novel approach to derive embeddings of identifiers and literals, i.e., distributed vector representations of code elements often ignored by program analyses. The embeddings enable name-based bug detectors to reason about semantic similarities of programmer-chosen identifier names. (Section 3)
• Four name-based bug detectors created with the framework that target a diverse set of programming errors. In contrast to existing name-based bug detectors, they do not rely on manually tuned heuristics. (Section 4)
• Empirical evidence that learned bug detectors have high accuracy, are efficient, and reveal various programming mistakes in real-world code. (Section 6)

Our implementation is available as open-source:
[Link]

2 A Framework for Learning to Find Bugs

This section presents the DeepBugs framework for automatically creating bug detectors via machine learning. The basic idea is to train a classifier to distinguish between code that is an instance of a specific bug pattern and code that does not suffer from this bug pattern. By bug pattern, we informally mean a class of recurring programming errors that are similar because they violate the same rule. For example, accidentally swapping the arguments passed to a function, calling the wrong API method, or using the wrong binary operator are bug patterns. Manually written bug checkers, such as FindBugs or Error Prone, are based on bug patterns, each of which corresponds to a separately implemented analysis.

2.1 Overview

Given a corpus of code, creating and using a bug detector based on DeepBugs consists of several steps. Figure 1 illustrates the process with a simple example.
1. Extract and generate training data from the corpus. This step statically extracts positive code examples from the given corpus and generates negative code examples by applying a simple code transformation. Because we assume that most code in the corpus is correct, each extracted code example is likely to not suffer from the particular bug pattern. To also create negative training examples, DeepBugs applies simple code transformations that are likely to introduce a bug. (Step 1 in Figure 1.)
2. Represent code as vectors. This step translates each code example into a vector. To preserve semantic information conveyed by identifier names, we use a learned embedding of identifiers to compute the vector representation. (Step 2 in Figure 1.)
3. Train a model to distinguish correct and incorrect examples. Given two sets of code examples that contain positive and negative examples, respectively, this step trains a machine learning model to distinguish between the two. (Step 3 in Figure 1.)
4. Predict bugs in previously unseen code. This step applies the classifier obtained in the previous step to predict whether a previously unseen piece of code suffers from the bug
[Figure 1 shows the pipeline: (1) a training data generator turns a corpus of code into positive examples, e.g., setSize(width, height), and negative examples, e.g., setSize(height, width); (2) embeddings for identifiers yield vector representations, e.g., (0.37, -0.87, 0.04, ..); (3) a classifier is trained on these vectors. For previously unseen code, e.g., setDim(y_dim, x_dim), (4) embeddings produce vector representations, (5) the learned bug detector is queried, and (6) it predicts likely bugs and code quality problems, e.g., incorrectly ordered arguments.]

Figure 1. Overview of our approach.

pattern. If the learned model classifies the code to be likely incorrect, the approach reports a warning to the developer. (Steps 4 to 6 in Figure 1.)

2.2 Generating Training Data

An important prerequisite for any learning-based approach is a sufficiently large amount of training data. In this work, we formulate the problem of bug detection as a binary classification task addressed via supervised learning. To effectively address this task, our approach relies on training data for both classes, i.e., both examples of correct and incorrect code. As observed by others [9, 28, 35], the huge amounts of existing code provide ample examples of likely correct code. In contrast, it is non-trivial to obtain many examples of code that suffers from a particular bug pattern. One possible approach is to manually or semi-automatically search code repositories and bug trackers for examples of bugs that match a given bug pattern. However, scaling this approach to thousands or even millions of examples, as required for advanced machine learning, is extremely difficult.

Instead of relying on manual effort for creating training data, this work generates training data fully automatically from a given corpus of code. The key idea is to apply a simple code transformation τ that transforms likely correct code extracted from the corpus into likely incorrect code. Section 4 presents implementations of τ that apply simple AST-based code transformations.

Definition 2.1 (Training data generator). Let C ⊆ L be a set of code in a programming language L. Given a piece of code c ∈ C, a training data generator G : c → (C_pos, C_neg) creates two sets of code C_pos ⊆ C and C_neg ⊆ L, which contain positive and negative training examples, respectively. The negative examples are created by applying a transformation τ : C → C to each positive example:

  C_neg = { c_neg | c_neg = τ(c_pos), ∀ c_pos ∈ C_pos }

There are various ways to implement a training data generator. For example, suppose the bugs of interest are accidentally swapped arguments of function calls. A training data generator for this bug pattern gathers positive examples by extracting all function calls that have at least two arguments, and negative examples by permuting the order of the arguments of these calls. Under the assumption that the given code is mostly correct, the unmodified calls are likely correct, whereas changing the order of arguments is likely to yield an incorrect call.

2.3 Training and Querying a Bug Detector

Given positive and negative training examples, our approach learns a bug detector that distinguishes between them. Machine learning-based reasoning about programs relies on a representation of code that is suitable for learning.

Definition 2.2 (Code representation). Given a piece of code c ∈ C, its code representation v ∈ Rⁿ is an n-dimensional real-valued vector.

As a valid alternative to a vector-based code representation, one could feed a graph representation of code into a suitable machine learning model, such as (gated) graph neural networks [21, 38] or recursive neural networks [39]. To obtain the vector-based code representation, we use a graph representation of code (ASTs, details in Section 3).

Based on the vector representation of code, a bug detector is a model that distinguishes between vectors that correspond to correct and incorrect code examples, respectively.

Definition 2.3 (Bug detector). A bug detector D is a binary classifier D : C → [0, 1] that predicts the probability that a piece of code c ∈ C is an instance of a particular bug pattern.

Training a bug detector consists of two steps. At first, DeepBugs computes for each positive example c_pos ∈ C_pos its vector representation v_pos ∈ Rⁿ, which yields a set V_pos of vectors. Likewise, the approach computes the set V_neg from the negative examples c_neg ∈ C_neg. Then, we train the bug detector D in a supervised manner by providing two kinds of input-output pairs: (v_pos, 0) and (v_neg, 1). That is, the model is trained to predict that positive code examples are correct and that negative code examples are incorrect. In principle, the bug detector can be implemented by any classification technique. We use a feedforward neural network with a single hidden layer and a single-element output layer that represents the probability computed by D.
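The training setup of Sections 2.2 and 2.3 can be sketched end to end. The snippet below is an illustrative toy, not the paper's implementation: calls are plain tuples rather than ASTs, the hand-made one-hot featurization stands in for the learned embeddings of Section 3, and a single sigmoid unit (logistic regression trained by gradient descent) stands in for the feedforward network with one hidden layer.

```python
import math

# Toy sketch of Definitions 2.1-2.3. A call is a (callee, arg1, arg2)
# tuple; the transformation tau swaps the arguments to seed a likely bug.

def tau(call):
    callee, a, b = call
    return (callee, b, a)

def featurize(call, vocab):
    # Stand-in for the embedding-based code representation: concatenated
    # one-hot encodings of callee and arguments over a tiny vocabulary.
    vec = []
    for token in call:
        vec += [1.0 if token == v else 0.0 for v in vocab]
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_detector(v_pos, v_neg, epochs=300, lr=0.5):
    # Supervised training on pairs (v_pos, 0) and (v_neg, 1).
    w, b = [0.0] * len(v_pos[0]), 0.0
    for _ in range(epochs):
        for v, y in [(v, 0.0) for v in v_pos] + [(v, 1.0) for v in v_neg]:
            p = sigmoid(sum(wi * vi for wi, vi in zip(w, v)) + b)
            w = [wi - lr * (p - y) * vi for wi, vi in zip(w, v)]
            b -= lr * (p - y)
    return w, b

def detect(model, v):
    w, b = model
    return sigmoid(sum(wi * vi for wi, vi in zip(w, v)) + b)

vocab = ["setSize", "width", "height"]
c_pos = [("setSize", "width", "height")]   # extracted from the corpus
c_neg = [tau(c) for c in c_pos]            # seeded bugs
model = train_detector([featurize(c, vocab) for c in c_pos],
                       [featurize(c, vocab) for c in c_neg])
# The swapped call now receives a high predicted probability of being buggy.
buggy = detect(model, featurize(("setSize", "height", "width"), vocab))
```

Querying the trained detector on previously unseen vectors works the same way as on the training vectors, which is exactly how the framework is used on new code in Section 2.3.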
Given a sufficiently large set of training data, the bug detector will generalize beyond the training examples, and one can query it with previously unseen code. To this end, DeepBugs extracts pieces of code C_new in the same way as extracting the positive training data. For example, for a bug detector that identifies swapped function arguments, the approach extracts all function calls including their unmodified arguments. Next, DeepBugs computes the vector representation of each example c_new ∈ C_new, which yields a set V_new. Finally, we query the trained bug detector D with every v_new ∈ V_new and obtain for each piece of code a prediction whether it is incorrect. To report warnings about bugs to the developer, DeepBugs ranks warnings by the predicted probability in descending order. In addition, one can control the overall number of warnings by omitting all warnings with a probability below a configurable threshold.

3 Vector Representations for Identifiers and Literals

The general framework from Section 2 is applicable to various kinds of bug detectors. In the remainder of this paper, we focus on name-based bug detectors, which reason about the semantics of code based on the implicit information provided by developer-chosen identifier names. Name-based bug detection has been shown to be effective in detecting otherwise missed bugs [23, 31] and has recently been adopted by a major software company [37]. For example, the name-based bug detector in [37] detects accidentally swapped function arguments by comparing the identifier names of actual arguments at call sites to the formal parameter names at the corresponding function definition. Besides showing the power of name-based bug detection, prior work also shows that manually developing an effective analysis requires various manually developed and fine-tuned heuristics to reduce false positives and to increase the number of detected bugs.

The main challenge for reasoning about identifiers is that understanding natural language information is non-trivial for computers. Our goal is to distinguish semantically similar identifiers from dissimilar ones. In addition to identifiers, we also consider literals in code, such as true and 23, because they also convey relevant semantic information that can help to detect bugs. To simplify the presentation, we say "identifier" to denote both identifiers and literals. For example, seq and list are similar because both are likely to refer to ordered data structures. Likewise, true and 1 are similar (in JavaScript, at least) because both evaluate to true when being used in a conditional. In contrast, height and width are semantically dissimilar because they refer to opposing concepts. As illustrated by these examples, semantic similarity does not always correspond to lexical similarity, as considered by prior work [23, 31, 37], and may even cross type boundaries. To enable a machine learning-based bug detector to reason about tokens, we require a representation of identifiers that preserves semantic similarities.

DeepBugs reasons about identifiers by automatically learning a vector representation for each identifier based on a corpus of code. The vector representation, also called embedding, assigns to each identifier a real-valued vector in a k-dimensional space. Let I be the set of all identifiers in a code corpus. An embedding is a function E : I → Rᵏ. A naïve representation is a local, or one-hot, encoding, where k = |I| and where each vector returned by E contains only zeros except for a single element that is set to one and that represents the specific token. Such a local representation fails to provide two important properties. First, to enable efficient learning, we require an embedding that stores many identifiers in relatively short vectors. Second, to enable DeepBugs to generalize across non-identical but semantically similar identifiers, we require an embedding that assigns a similar vector to semantically similar identifiers.

Instead of a local embedding, we use a distributed embedding, where the information about an identifier in I is distributed across all elements of the vector returned by E. Our distributed embedding is inspired by word embeddings for natural languages, specifically by Word2Vec [25]. The basic idea of Word2Vec is that the meaning of a word can be derived from the various contexts in which this word is used. In natural languages, the context of an occurrence of a word w in a sequence of words is the window of words preceding and succeeding w. An obvious way to adapt this idea to source code would be to view code as a sequence of tokens and to define the context of the occurrence of an identifier as its immediately preceding and succeeding tokens. However, we observe that the surrounding tokens often contain purely syntactic artifacts that are irrelevant to the semantic meaning of an identifier. Moreover, viewing code as a sequence of tokens discards richer, readily available representations of code.

3.1 AST-based Context

We define the context of an identifier based on the AST of the code. The basic idea is to consider the nodes surrounding an identifier node n as its context. The idea is based on the observation that the surrounding nodes often contain useful context information, e.g., about the syntactic structure that n is part of and about other identifiers that n relates to. Formally, we define the context as follows.

Definition 3.1 (AST context). Given a node n in an abstract syntax tree, let p and g be the parent and grand-parent of n, respectively, and let S, U, C, N be the sets of siblings, uncles, cousins, and nephews of n, respectively. Furthermore, let p_pos and g_pos denote the relative position of n in the sequence of children of n's parent and grand-parent, respectively. The AST context of n is a tuple (p, p_pos, g, g_pos, S, U, C, N).

In essence, the AST context contains all nodes surrounding the node n, along with information about the relative positioning of n.
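The context of Definition 3.1 can be computed by a straightforward tree walk. The following is an illustrative sketch over toy [label, children] nodes, not the paper's implementation; it reproduces the context of setTimeout from Figure 2.

```python
# Illustrative sketch of Definition 3.1 (AST context): given a toy AST of
# [label, children] lists, compute (p, p_pos, g, g_pos, S, U, C, N) for a
# node n. Helper names are assumptions, not DeepBugs' actual code.

def ast_context(root, n):
    parent_of = {}
    def index_parents(node):
        for child in node[1]:
            parent_of[id(child)] = node
            index_parents(child)
    index_parents(root)
    p = parent_of[id(n)]
    g = parent_of.get(id(p))
    p_pos = next(i + 1 for i, c in enumerate(p[1]) if c is n)  # 1-based
    g_pos = next(i + 1 for i, c in enumerate(g[1]) if c is p) if g else 0
    siblings = [c for c in p[1] if c is not n]
    uncles = [c for c in g[1] if c is not p] if g else []
    cousins = [cc for u in uncles for cc in u[1]]
    nephews = [sc for s in siblings for sc in s[1]]
    return p, p_pos, g, g_pos, siblings, uncles, cousins, nephews

# The example from Figure 2: window.setTimeout(callback, 1000);
set_timeout = ["ID:setTimeout", []]
member = ["MemberExpr", [["ID:window", []], set_timeout]]
call = ["CallExpr", [member, ["ID:callback", []], ["LIT:1000", []]]]
body = ["Body", [call]]

p, p_pos, g, g_pos, S, U, C, N = ast_context(body, set_timeout)
# As in Figure 2c: p = MemberExpr, p_pos = 2, g = CallExpr, g_pos = 1,
# S = {ID:window}, U = {ID:callback, LIT:1000}, C = N = {}
```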
We omit the children of n in the definition because identifiers are leaf nodes in common AST representations.

For example, consider the JavaScript code and its corresponding AST in Figure 2. The table in Figure 2c shows the context extracted for the identifier setTimeout.

(a) JavaScript code:

  ...
  window.setTimeout(callback, 1000);
  ...

(b) Abstract syntax tree (slightly simplified): Body contains a CallExpr whose children are a MemberExpr (with children Identifier: window and Identifier: setTimeout), Identifier: callback, and Literal: 1000. Nodes with dotted boxes are in the context for the node with the red, solid border (setTimeout).

(c) AST context for setTimeout:

  Part of context                  Value(s)
  Parent p                         MemberExpr
  Position p_pos in parent         2
  Grand-parent g                   CallExpr
  Position g_pos in grand-parent   1
  Siblings S                       { ID:window }
  Uncles U                         { ID:callback, LIT:1000 }
  Cousins C                        {}
  Nephews N                        {}

(d) AST context vector:

  0..1..0 | 2 | 0..1..0 | 1 | 0..1..0 | 0..1..1..0 | 0..0 | 0..0

Figure 2. AST context for identifier setTimeout.

Each value in the context is a string value. To compute this value from an AST node n, we use a function str so that str(n) is
• the identifier name if n is an identifier node,
• the string representation of the literal if n is a literal node, and
• the type of the non-terminal of the abstract grammar that n represents, for all other nodes.

3.2 Learning Embeddings for Identifiers and Literals

To learn embeddings for identifiers n from a corpus of code, our approach proceeds in three steps. At first, it extracts the context c(n) of each occurrence of an identifier in the corpus. Then, it trains a neural network to predict c(n) from n. Finally, the approach derives the embedding E(n) from the internal representation that the neural network chooses for n. The following presents these steps in detail.

Given a corpus of code, the approach traverses the ASTs of all code in the corpus. For each identifier node n, we extract the AST context c(n), as defined above. To feed the pair (n, c(n)) into a neural network, we compute vector representations of both n and c(n). For a pair (n, c(n)), the vector representation of n is its one-hot encoding. For the context c(n), we compute the vector as follows.

Definition 3.2 (AST context vector). Given an AST context (p, p_pos, g, g_pos, S, U, C, N), its vector representation is the concatenation of the following vectors:
• A one-hot encoding of p.
• A single-element vector that contains p_pos.
• A one-hot encoding of g.
• A single-element vector that contains g_pos.
• For each X ∈ {S, U, C, N}, a vector obtained by bit-wise adding the one-hot encodings of all elements of X.

For the example in Figure 2, Figure 2d shows the vector representation of the context. The individual subvectors are split by "|" for illustration.

The overall length of the AST context vector is 6 ∗ |Vc| + 2, where |Vc| is the size of the vocabulary of context strings. We summarize the sets of siblings, uncles, cousins, and nephews into four vectors because different nodes may have different, possibly very large, numbers of siblings, etc. An alternative would be to concatenate the one-hot encodings of all siblings, etc. However, the possibly very large number of such nodes would either yield very large context vectors, which is inefficient, or require us to omit nodes beyond some maximum number, which would discard potentially valuable information.

For efficiency during training, we limit the vocabulary Vn of identifiers and literals to 10,000 and the vocabulary Vc of strings occurring in the context to 1,000 by discarding the least frequent strings. To represent strings beyond Vn and Vc, we use a placeholder "unknown".

After computing pairs of vectors (vn, vc) for each occurrence of an identifier in the corpus, the next step is to use these pairs as training data to learn embeddings for identifiers and literals. The basic idea is to train a feed-forward neural network to predict the context of a given identifier. The network consists of three layers:
1. An input layer of length |Vn| that contains the vector vn.
2. A hidden layer of length e ≪ |Vn|, which forces the network to summarize the input layer to length e.
3. An output layer of length 6 ∗ |Vc| + 2 that contains the context vector vc.
The layers are densely connected through a linear and a sigmoid activation function, respectively.
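Definition 3.2 can be made concrete with a small worked example. The sketch below uses a tiny, made-up vocabulary of context strings; it reproduces the vector of Figure 2d and checks the stated length 6 ∗ |Vc| + 2.

```python
# Illustrative sketch of Definition 3.2 (AST context vector) over a tiny,
# made-up vocabulary of context strings. Not the paper's implementation.

vocab = ["MemberExpr", "CallExpr", "ID:window", "ID:callback", "LIT:1000"]

def one_hot(s):
    return [1 if s == v else 0 for v in vocab]

def multi_hot(strings):
    # Summarize a set of nodes in a single vector by bit-wise adding
    # the one-hot encodings of all elements.
    vec = [0] * len(vocab)
    for s in strings:
        vec = [a + b for a, b in zip(vec, one_hot(s))]
    return vec

def context_vector(p, p_pos, g, g_pos, S, U, C, N):
    return (one_hot(p) + [p_pos] + one_hot(g) + [g_pos]
            + multi_hot(S) + multi_hot(U) + multi_hot(C) + multi_hot(N))

# The context of setTimeout from Figure 2c:
v = context_vector("MemberExpr", 2, "CallExpr", 1,
                   ["ID:window"], ["ID:callback", "LIT:1000"], [], [])
assert len(v) == 6 * len(vocab) + 2   # the stated length 6*|Vc| + 2
```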
We train the network using all pairs (vn, vc) with the standard backpropagation algorithm, so that the network becomes more and more accurate at predicting vc for a given vn. We use binary cross-entropy as the loss function and the Adam optimizer [20].

Our approach is related to, but differs in two ways from, the skip-gram variant of the popular Word2Vec embedding for natural languages [25]. First, we exploit a structural representation of source code, an AST, to extract the context of each identifier, instead of simply using the surrounding tokens. Second, our neural network predicts the entire context vector, including all context information, instead of predicting whether a single word occurs in the context of a word, as in [25]. The rationale is to preserve the structural information encoded in the AST context vector, e.g., whether a related identifier is a sibling or an uncle. These design decisions are well-suited for program code because, in contrast to natural languages, it has a precisely defined structural representation.

Since a single identifier may occur in different contexts, the training data may contain pairs (vn, vc1) and (vn, vc2) with vc1 ≠ vc2. Such data forces the network to reconcile different occurrences of the same identifier, and the network will learn to predict a context vector that, on average, gets closest to the expected vectors.

During training, the network learns to efficiently represent identifiers in a vector of length e that summarizes all information required to predict the context of the identifier. Once the network has been trained, we query the network once again for each identifier and extract the value of the hidden layer, which then serves as the embedding for the identifier. The overall result of learning embeddings is a map E : I → Rᵉ that assigns an embedding to each identifier.

4 Example Bug Detectors

To validate our DeepBugs framework, we instantiate it with four bug patterns. They address a diverse set of programming mistakes, including accidentally swapped function arguments, incorrect assignments, and incorrect binary operations. Implementing new bug detectors is straightforward, and we envision future work to create more instances of our framework, e.g., based on bug patterns mined from version histories [10]. Each bug detector consists of two simple ingredients.
• Training data generator. A training data generator traverses the code corpus and extracts positive and negative code examples for the particular bug pattern based on a code transformation. We find a simple AST-based traversal and transformation to be sufficient for all studied bug patterns.
• Code representation. A mapping turns each code example into a vector that the machine learning model learns to classify as either benign or buggy. All bug detectors presented here build on the same embedding of identifier names and literals, allowing us to amortize the one-time effort of learning an embedding across different bug detectors.

Given these two ingredients and a corpus of training code, our framework learns a bug detector that identifies programming mistakes in previously unseen code.

All bug detectors presented in this section share the same technique for extracting names of expressions. Given an AST node n that represents an expression, we extract name(n) as follows:
• If n is an identifier, return its name.
• If n is a literal, return a string representation of its value.
• If n is a this expression, return "this".
• If n is an update expression that increments or decrements x, return name(x).
• If n is a member expression [Link] that accesses a property, return name(prop).
• If n is a member expression base[k] that accesses an array element, return name(base).
• If n is a call expression [Link](..), return name(callee).
• For any other AST node n, do not extract its name.

Table 1 gives examples of names extracted from JavaScript expressions. We use the prefixes "ID:" and "LIT:" to distinguish identifiers and literals.

  Expression        Extracted name
  list              ID:list
  23                LIT:23
  this              LIT:this
  i++               ID:i
  [Link]            ID:prop
  myArray[5]        ID:myArray
  nextElement()     ID:nextElement
  [Link]()[3]       ID:allNames

Table 1. Examples of identifier names and literals extracted for name-based bug detectors.

The extraction technique is similar to that used in manually created name-based bug detectors [23, 31, 37], but omits heuristics that make the extracted name suitable for a lexical comparison of names. For example, existing techniques remove common prefixes, such as get, to increase the lexical similarity between, e.g., getNames and names. Instead, we identify semantic similarities of names through an embedding and by finding related names in similar code examples.
all calls that pass two or more arguments are susceptible to the mistake.

4.1.1 Training Data Generator

To create training examples from given code, we traverse the AST of each file in the code corpus and visit each call site with two or more arguments. For each such call site, the approach extracts the following information:

• The name n_callee of the called function.
• The names n_arg1 and n_arg2 of the first and second argument.
• The name n_base of the base object if the call is a method call, or an empty string otherwise.
• The types t_arg1 and t_arg2 of the first and second argument for arguments that are literals, or empty strings otherwise.
• The names n_param1 and n_param2 of the formal parameters of the called function, or empty strings if unavailable.

All names are extracted using the name function defined above. We resolve function calls heuristically, as sound static call resolution is non-trivial in JavaScript. If either n_callee, n_arg1, or n_arg2 is unavailable, e.g., because the name function cannot extract the name of a complex expression, then the approach ignores the call site.

From the extracted information, the training data generator creates for each call site a positive example c_pos = (n_base, n_callee, n_arg1, n_arg2, t_arg1, t_arg2, n_param1, n_param2) and a negative example c_neg = (n_base, n_callee, n_arg2, n_arg1, t_arg2, t_arg1, n_param1, n_param2). That is, to create the negative example, we simply swap the arguments w.r.t. the order in the original code.

4.1.2 Code representation

To enable DeepBugs to learn from the positive and negative examples, we transform c_pos and c_neg from tuples of strings into vectors. To this end, the approach represents each string in the tuple c_pos or c_neg as a vector. Each name n is represented as E(n), where E is the learned embedding from Section 3, i.e., E(n) is the embedding vector of the name n. To represent type names as vectors, we define a function T that maps each built-in type in JavaScript to a randomly chosen binary vector of length 5. For example, the type "string" may be represented by a vector T(string) = [0, 1, 1, 0, 0], whereas the type "number" may be represented by a vector T(number) = [1, 0, 1, 1, 0]. Finally, based on the vector representation of each element in the tuple c_pos or c_neg, we compute the code representation for c_pos or c_neg as the concatenation of the individual vectors.

4.2 Incorrect Assignments

The next bug detector checks for assignments that are incorrect because the right-hand side is not the expression that the developer intended to assign, e.g., due to a copy-and-paste mistake.

4.2.1 Training Data Generator

To extract training data, we traverse the AST of each file and extract the following information for each assignment:

• The names n_lhs and n_rhs of the left-hand side and the right-hand side of the assignment.
• The type t_rhs of the right-hand side if the assigned value is a literal, or an empty string otherwise.

If either n_lhs or n_rhs cannot be extracted, then we ignore the assignment.

Based on the extracted information, the approach creates a positive and a negative training example for each assignment: The positive example c_pos = (n_lhs, n_rhs, t_rhs) keeps the assignment as it is, whereas the negative example c_neg = (n_lhs, n_rhs', t_rhs') replaces the right-hand side with an alternative, likely incorrect expression. To find this alternative right-hand side, we gather the right-hand sides of all assignments in the same file and randomly select one that differs from the original right-hand side, i.e., n_rhs ≠ n_rhs'. The rationale for picking the alternative right-hand side from the same file is to create realistic negative examples.

4.2.2 Code representation

The vector representations of the training examples are similar to Section 4.1. Given a tuple that represents a training example, we map each string in the tuple to a vector using E (for names) or T (for types) and then concatenate the resulting vectors.

4.3 Wrong Binary Operator

The next two bug detectors address mistakes related to binary operations. First, we consider code that accidentally uses the wrong binary operator, e.g., i <= length instead of i < length.

4.3.1 Training Data Generator

The training data generator traverses the AST of each file in the code corpus and extracts the following information:

• The names n_left and n_right of the left and right operand.
• The operator op of the binary operation.
• The types t_left and t_right of the left and right operand if they are literals, or empty strings otherwise.
• The kinds of AST node k_parent and k_grandparent of the parent and grand-parent nodes of the AST node that represents the binary operation.

We extract the (grand-)parent nodes to provide some context about the binary operation to DeepBugs, e.g., whether the operation is part of a conditional or an assignment. If either n_left or n_right is unavailable, then we ignore the binary operation.

From the extracted information, the approach creates a positive and a negative example: c_pos = (n_left, n_right, op, t_left, t_right, k_parent, k_grandparent) and c_neg = (n_left, n_right, op',
t_left, t_right, k_parent, k_grandparent). The operator op' ≠ op is a randomly selected binary operator different from the original operator. For example, given a binary expression i <= length, the approach may create a negative example i < length or i % length, which is likely to create incorrect code.

4.3.2 Code representation

Similar to the above bug detectors, we create a vector representation of each positive and negative example by mapping each string in the tuple to a vector and by concatenating the resulting vectors. To map a kind of AST node k to a vector, we use a map K that assigns to each kind of AST node in JavaScript a randomly chosen binary vector of length 8.

4.4 Wrong Operand in Binary Operation

The final bug detector addresses code that accidentally uses an incorrect operand in a binary operation. The intuition is that a trained machine learning model can identify whether an operand fits another given operand and a given binary operator. For example, the bug detector identifies an operation height - x that was intended to be height - y.

4.4.1 Training Data Generator

Again, the training data generator extracts the same information as in Section 4.3, and then replaces one of the operands with a randomly selected alternative. That is, the positive example is c_pos = (n_left, n_right, op, t_left, t_right, k_parent, k_grandparent), whereas the negative example is either c_neg = (n_left', n_right, op, t_left', t_right, k_parent, k_grandparent) or c_neg = (n_left, n_right', op, t_left, t_right', k_parent, k_grandparent). The name and type n_left' and t_left' (or n_right' and t_right') differ from those in the positive example. To create negative examples that a programmer might also create by accident, we use alternative operands that occur in the same file as the binary operation. For example, given bits << 2, the approach may transform it into a negative example bits << next, which is likely to yield incorrect code.

4.4.2 Code representation

The vector representation of the positive and negative examples is the same as in Section 4.3.

5 Implementation

The code extraction and generation of training examples is implemented as simple AST traversals based on the Acorn JavaScript parser.² The training data generator writes all extracted data into text files. These files are then read by the implementation of the bug detector, which builds upon the TensorFlow and Keras frameworks for deep learning.³ The large majority of our implementation is in the generic framework, whereas the individual bug detectors are implemented in about 100 lines of code each.

Bug detector          Training examples    Validation examples
Swapped arguments         1,450,932              739,188
Wrong assignment          2,274,256            1,090,452
Wrong bin. operator       4,901,356            2,322,190
Wrong bin. operand        4,899,206            2,321,586

Table 2. Statistics on extraction and generation of training data.

The following provides some details on the neural networks we use. The classifier that represents the bug detector is a feedforward network with an input layer of a size that depends on the code representation provided by the specific bug detector, a single hidden layer of size 200, and an output layer with a single element. We apply a dropout of 0.2 to the input layer and the hidden layer. We use binary cross-entropy as the loss function and train with the RMSprop optimizer for 10 epochs with batch size 100. The network that learns embeddings for identifiers is also a feedforward network, with a single hidden layer that contains the learned embedding of size 200. We use binary cross-entropy as the loss function and train for two epochs with a batch size of 50 using the Adam optimizer.

6 Evaluation

We evaluate DeepBugs by applying it to a large corpus of JavaScript code. Our main research questions are: (i) How effective is the approach at distinguishing correct from incorrect code? (ii) Does the approach find bugs in production JavaScript code? (iii) How long does it take to train a model and, once a model has been trained, to predict bugs? (iv) How useful are the learned embeddings of identifiers compared to a simpler vector representation?

6.1 Experimental Setup

As a corpus of code, we use 150,000 JavaScript files provided by the authors of earlier work.⁴ The corpus contains files collected from various open-source projects and has been cleaned by removing duplicate files. In total, the corpus contains 68.6 million lines of code. We use 100,000 files for training and the remaining 50,000 files for validation. All experiments are performed on a single machine with 48 Intel Xeon E5-2650 CPU cores, 64GB of memory, and a single NVIDIA Tesla P100 GPU.

² [Link]
³ [Link] and [Link]
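As an illustration of the mutation-based generation of negative examples (Sections 4.3 and 4.4), the following minimal Python sketch mutates extracted example tuples. The tuple layout follows the text; the function names and the concrete operator list are our assumptions, not part of DeepBugs:

```python
import random

# Operators considered when seeding "wrong binary operator" bugs.
# The concrete operator set is an assumption for illustration.
BINARY_OPERATORS = ["+", "-", "*", "/", "%", "<", "<=", ">", ">=", "==", "==="]

def negative_wrong_operator(example, rng):
    """Section 4.3: keep both operands, replace op by a different random operator."""
    n_left, n_right, op, t_left, t_right, k_parent, k_grandparent = example
    op_neg = rng.choice([o for o in BINARY_OPERATORS if o != op])
    return (n_left, n_right, op_neg, t_left, t_right, k_parent, k_grandparent)

def negative_wrong_operand(example, operands_in_file, rng):
    """Section 4.4: replace one operand by a (name, type) pair from the same file."""
    n_left, n_right, op, t_left, t_right, k_parent, k_grandparent = example
    candidates = [(n, t) for (n, t) in operands_in_file if n not in (n_left, n_right)]
    n_alt, t_alt = rng.choice(candidates)
    if rng.random() < 0.5:  # randomly pick which side to corrupt
        return (n_alt, n_right, op, t_alt, t_right, k_parent, k_grandparent)
    return (n_left, n_alt, op, t_left, t_alt, k_parent, k_grandparent)

# Positive example for "i <= length" inside a for-loop header:
rng = random.Random(0)
pos = ("ID:i", "ID:length", "<=", "", "", "ForStatement", "BlockStatement")
neg = negative_wrong_operator(pos, rng)  # e.g., turns "i <= length" into "i % length"
```

The positive example is then labeled as correct and its mutated counterpart as buggy (the labeling convention here is ours), and the pair trains the binary classifier described in Section 5.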
                        Embedding
Bug detector            Random      AST-based
Swapped arguments       93.88%      94.53%
Wrong assignment        77.73%      84.23%
Wrong bin. operator     89.15%      91.68%
Wrong bin. operand      84.79%      88.55%

Table 3. Accuracy of bug detectors. The last column shows DeepBugs applied to previously unseen code.

6.2 Extraction and Generation of Training Data

Table 2 summarizes the training and validation data that DeepBugs extracts and generates for the four bug detectors. Each bug detector learns from several million examples, which is sufficient for effective learning. Half of the examples are positive and negative code examples, respectively. Manually creating this amount of training data, including negative examples, would be impractical, showing the benefit of our automated data generation approach.

6.3 Accuracy and Recall of Bug Detectors

To evaluate the effectiveness of bug detectors built with DeepBugs, we conduct two sets of experiments. First, reported in the following, we conduct a large-scale evaluation with thousands of artificially created bugs, which allows us to study the accuracy and the recall of each learned bug detector. Second, reported in Section 6.4, we apply the bug detectors to unmodified real-world code and manually inspect the reported warnings to assess the precision of the learned bug detectors.

The goal of the first set of experiments is to study the accuracy and recall of each bug detector at a large scale. Informally, accuracy here means how many of all the classification decisions that the bug detector makes are correct. Recall here means how many of all bugs in a corpus of code the bug detector finds. To evaluate these metrics, we train each bug detector on the 100,000 training files and then apply it to the 50,000 validation files. For the validation files, we use the training data generator to extract correct code examples C_pos and to artificially create likely incorrect code examples C_neg. We then query the bug detector D with each example c, which yields a probability D(c) that the example is buggy. Finally, we compute the accuracy as follows:

accuracy = ( |{c | c ∈ C_pos ∧ D(c) < 0.5}| + |{c | c ∈ C_neg ∧ D(c) ≥ 0.5}| ) / ( |C_pos| + |C_neg| )

The last column of Table 3 shows the accuracy of the bug detectors. The accuracy ranges between 84.23% and 94.53%, i.e., all bug detectors are highly effective at distinguishing correct from incorrect code examples.

The recall of a bug detector is influenced by how many warnings the detector reports. More warnings are likely to reveal more bugs, but in practice, developers are only willing to inspect some number of warnings. To measure recall, we assume that a developer inspects all warnings where the probability D(c) is above some threshold. We model this process by turning D into a boolean function:

D_t(c) = 1 if D(c) > t, and D_t(c) = 0 if D(c) ≤ t

where t is a configurable threshold that controls how many warnings to report. Based on D_t, we compute recall as follows:

recall = |{c | c ∈ C_neg ∧ D_t(c) = 1}| / |C_neg|

Figure 3 shows the recall of the four bug detectors as a function of the threshold for reporting warnings. Each plot shows the results for nine different thresholds: t ∈ {0.1, 0.2, ..., 0.9}. As expected, the recall decreases when the threshold increases, because fewer warnings are reported and therefore some bugs are missed. The results also show that some bug detectors are more likely than others to detect a bug, if a bug is present.

6.4 Warnings in Real-World Code

The experiments reported in Section 6.3 consider the accuracy and recall of the learned bug detectors using artificially created bugs. We are also interested in the effectiveness of DeepBugs in detecting programming mistakes in real-world code. To study that question, we train each bug detector with the 100,000 training files, then apply the trained detector to the unmodified 50,000 validation files, and finally inspect code locations that each bug detector reports as potentially incorrect. For each bug detector, we inspect all warnings reported with threshold t = 0.99, which yields a total of 290 warnings. After inspection, we classify each warning into one of three categories:

• A warning points to a bug if the code is incorrect in the sense that it does not result in the expected runtime behavior.
• A warning points to a code quality problem if the code yields the expected runtime behavior but should nevertheless be changed to be less error-prone. This category includes code that violates widely accepted conventions and programming rules traditionally checked by static linting tools.
• A warning is a false positive in all other cases. If we are unsure about the intended behavior of a particular code location, we conservatively count it as a false positive. We also encountered various code examples with misleading identifier names, which we classify as false positives because the decision whether an identifier is misleading is rather subjective.

⁴ [Link]
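The accuracy and recall definitions above translate directly into code. A minimal sketch over a toy classifier (the function names and data are illustrative only):

```python
def accuracy(d, c_pos, c_neg):
    """Fraction of correct decisions: correct examples get D(c) < 0.5, buggy ones D(c) >= 0.5."""
    correct = sum(1 for c in c_pos if d(c) < 0.5) + sum(1 for c in c_neg if d(c) >= 0.5)
    return correct / (len(c_pos) + len(c_neg))

def warn(d, c, t):
    """D_t(c): report a warning iff the predicted bug probability exceeds threshold t."""
    return 1 if d(c) > t else 0

def recall(d, c_neg, t):
    """Fraction of the seeded bugs in C_neg that are reported at threshold t."""
    return sum(warn(d, c, t) for c in c_neg) / len(c_neg)

# Toy stand-in for a trained detector: a lookup table of bug probabilities.
probs = {"ok1": 0.1, "ok2": 0.4, "bug1": 0.95, "bug2": 0.6, "bug3": 0.3}
d = probs.get
print(accuracy(d, ["ok1", "ok2"], ["bug1", "bug2", "bug3"]))  # 4 of 5 decisions correct
print(recall(d, ["bug1", "bug2", "bug3"], 0.9))               # only bug1 exceeds t = 0.9
```

Raising t trades recall for precision, which is exactly the trade-off visible in Figure 3 and exploited by the t = 0.99 setting used for the manual inspection.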
[Figure 3: four plots of recall vs. the threshold for reporting warnings, one per bug detector (swapped arguments, wrong assignment, wrong binary operator, wrong binary operand), each comparing the AST-based embedding with the random embedding.]

Figure 3. Recall of four bug detectors with different thresholds t for reporting warnings. Each plot contains nine data points obtained with t ∈ {0.1, 0.2, ..., 0.9}. The data labeled "AST embedding" corresponds to the DeepBugs approach.

Bug detector          Reported    Bugs    Code quality problems    False positives
Swapped arguments        178       75            10                      93
Wrong assignment          24        1             1                      22
Wrong bin. operator       50       14            17                      19
Wrong bin. operand        38       10             4                      22
Total                    290      100            32                     156

Table 4. Results of inspecting and classifying warnings in real-world code.

Table 4 summarizes the results of inspecting and classifying warnings. The four bug detectors report a total of 290 warnings. 100 of them point to bugs and 32 point to a code quality problem, i.e., roughly half of all warnings point to an actual problem. Given that the bug detectors are learned automatically and do not filter warnings based on any manually tuned heuristics, these results are very encouraging. Existing manually created bug detectors typically provide comparable true positive rates, but heavily rely on heuristics to filter likely false positives. Many of the detected problems are difficult to detect with a traditional, not name-based analysis, because the programming mistake is obvious only when understanding the semantics of the involved identifiers and literals. We discuss a selection of representative examples in the following.

6.4.1 Examples of Bugs

Buggy call of done. The following code is from Apigee's JavaScript SDK. The done function expects an error followed by a result, but line 6 passes the arguments the other way around.

1 var p = new Promise();
2 if (promises === null || promises.length === 0) {
3   p.done(error, result)
4 } else {
5   promises[0](error, result).then(function(res, err) {
6     p.done(res, err);
7   });
8 }

Buggy call of assertEquals. The following code is from the test suite of the Google Closure library. The arguments of assertEquals are supposed to be the expected and the actual value of a test outcome. Swapping the arguments leads to an incorrect error message when the test fails, which makes debugging unnecessarily hard. Google developers consider this kind of mistake a bug [37].

assertEquals(tree.remove('merry'), null);

Incorrect operand in for-loop. In code from Angular-UI-Router, the developer compares a numeric index to an array, instead of comparing to the length of the array. The bug has been fixed (independently of us) in version 0.2.16 of the code.

for (j = 0; j < param.replace; j++) {
  if (param.replace[j].from === paramVal)
    paramVal = param.replace[j].to;
}

Incorrectly ordered binary operands. The following is from [Link]. The highlighted expression at line 5 is intended to alternate between true and false, but is false for all iterations except i=1 and i=2.

1 for (var i = 0; i < this.NR_OF_MULTIDELAYS; i++) {
2   // Invert the signal of every even multiDelay
3   outputSamples = mixSampleBuffers(outputSamples,
4     this.multiDelays[i].process(filteredSamples),
5     2 % i == 0, this.NR_OF_MULTIDELAYS);
6   /*  ^^^^^^  */
7 }

Buggy call of setTimeout. The following code is from Angular.js and has been fixed (independently of our work) in version 0.9.16 of the code. The first argument of setTimeout is supposed to be a callback function, while the second argument is supposed to be the delay after which to call that function.

browserSingleton.startPoller(100,
  function(delay, fn) {
    setTimeout(delay, fn);
  });
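Warnings like the ones above are computed over extracted name tuples. As a concrete illustration, the following sketch re-implements the name(n) rules from Section 4 over a simplified ESTree-style AST (the dictionary encoding and the ID:/LIT: prefix handling are our assumptions) and applies them to the assertEquals example:

```python
def name(n):
    """Extract name(n) for an expression node, following the rules from Section 4.
    The ID:/LIT: prefixes distinguish identifiers from literals, as in Table 1."""
    if n is None:
        return None
    kind = n["type"]
    if kind == "Identifier":
        return "ID:" + n["name"]
    if kind == "Literal":
        return "LIT:" + str(n["value"])   # str() stands in for a string representation
    if kind == "ThisExpression":
        return "ID:this"
    if kind == "UpdateExpression":        # x++ or x--: name of the updated expression
        return name(n["argument"])
    if kind == "MemberExpression":        # base[k] -> name(base); base.prop -> name(prop)
        return name(n["object"]) if n["computed"] else name(n["property"])
    if kind == "CallExpression":          # base.callee(..) -> name(callee)
        return name(n["callee"])
    return None                           # any other node: no name extracted

# The assertEquals example above, as a (simplified) ESTree-style AST:
call = {
    "type": "CallExpression",
    "callee": {"type": "Identifier", "name": "assertEquals"},
    "arguments": [
        {"type": "CallExpression",
         "callee": {"type": "MemberExpression", "computed": False,
                    "object": {"type": "Identifier", "name": "tree"},
                    "property": {"type": "Identifier", "name": "remove"}},
         "arguments": [{"type": "Literal", "value": "merry"}]},
        {"type": "Literal", "value": None},  # JavaScript null; Python's None stands in
    ],
}
n_callee = name(call["callee"])
n_arg1, n_arg2 = (name(a) for a in call["arguments"])
c_pos = (n_callee, n_arg1, n_arg2)  # ("ID:assertEquals", "ID:remove", "LIT:None")
c_neg = (n_callee, n_arg2, n_arg1)  # seeded bug: swapped arguments
```

Swapping the two argument names, as in c_neg, is exactly the transformation that the swapped-arguments detector learns to flag as buggy.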
                        Training                 Prediction
Bug detector            Extract     Learn        Extract     Predict
Swapped arguments        7:46       20:29         2:56        5:19
Wrong assignment         2:40       22:45         1:29        4:03
Wrong bin. operator      2:44       51:16         1:28       12:16
Wrong bin. operand       2:44       51:13         1:28       10:09

Table 5. Time (min:sec) required for training and using a bug detector across the entire code corpus.

6.4.2 Examples of Code Quality Problems

Error-prone binary operator. The following code from [Link] is correct, but using !==, instead of <, as the termination condition of a for-loop is generally discouraged. The reason is that the loop risks running out of bounds when the counter is incremented by more than one or assigned an out-of-bounds value, e.g., by an accidental assignment in the loop body.

for (var i = 0, len = b.length; i !== len; ++i) {
  ..
}

6.4.3 Examples of False Positives

We discuss some representative examples of false positives. Many of them are related to poor variable names that lead to surprising-looking but correct code. Another common reason is wrapper functions, e.g., [Link], for which our approach extracts a generic name, "max", that does not convey the specific meaning of the value returned by the call expression. We believe that some of these false positives could be avoided by training with an even larger code corpus. Another recurring pattern of false positives reported by the "incorrect assignment" bug detector is code that shifts function arguments at the beginning of an overloaded function:

function DuplexWrapper(options, writable, readable) {
  if (typeof readable === "undefined") {
    readable = writable;
    writable = options;
    options = null;
  }
  ...
}

The code works around the lack of proper function overloading in JavaScript by assigning the ith argument to the (i+1)th argument, which leads to surprising but correct assignments.

6.5 Efficiency

Table 5 shows how long it takes to train a bug detector and to use it to predict bugs in previously unseen code. The training time consists of the time to gather code examples and of the time to train the classifier. The prediction time likewise consists of the time to extract code examples and of the time to query the classifier with each example. Running both training and prediction on all 150,000 files takes between 31 minutes and 68 minutes per bug detector. The average prediction time per JavaScript file is below 20 milliseconds. Even though this efficiency is, in parts, due to parallelization, it shows that once a bug detector is trained, using it on new code takes very little time.

6.6 Usefulness of Embeddings

To evaluate the usefulness of the embeddings that represent identifiers as vectors (Section 3), we compare these embeddings with a baseline vector representation. The baseline assigns to each identifier considered by DeepBugs a unique, randomly chosen binary vector of length e, i.e., the same length as our learned embeddings. We compare the AST-based embeddings (Section 3) with the baseline w.r.t. accuracy and recall. Table 3 shows in the "Random" column what accuracy the bug detectors achieve with the baseline. Compared to this baseline, the AST-based embeddings (last column) yield a more accurate classifier. Figure 3 compares the recall of the bug detectors with the two embeddings. For all bug detectors, the AST-based embeddings increase recall, for some bug detectors by around 10%. The reason is that the AST-based embedding enables the bug detector to reason about semantic similarities between syntactically different code examples, which enables it to learn and predict bugs across similar examples. For example, the bug detector that searches for swapped arguments may learn from examples such as done(error, result) that done(res, err) is likely to be wrong, because error ≈ err and result ≈ res. We conclude from these results that the AST-based embeddings improve the effectiveness of DeepBugs. At the same time, the bug detectors achieve relatively high accuracy and recall even with randomly created embeddings, showing that the overall approach has value even when no learned embeddings are available.

7 Related Work

7.1 Machine Learning and Language Models for Code

The recent successes in machine learning have led to a strong interest in applying machine learning techniques to programs. Existing approaches address a variety of development tasks, including code completion based on probabilistic models of code [9, 34, 36], predicting fixes of compilation errors via neural networks [8, 14], and the generation of inputs for fuzz testing via probabilistic, generative language models [30] or neural networks [7, 12, 24]. Deep neural networks also help recommend API usages [13], detect code clones [44], classify programs into pre-defined categories [26], and adapt copied-and-pasted code to its surrounding code [4]. A recent survey discusses more approaches [3]. All these approaches
exploit the availability of a large number of examples to learn from, e.g., in the form of publicly available code repositories, and the fact that source code has regularities even across projects written by different developers [17]. We also exploit this observation, but for a different task, bug finding, than the above work. Another contribution is to augment the training data provided by the existing code by generating negative training examples through simple code transformations.

Name-based bug detection relates to existing learning-based approaches that consider identifiers. JSNice [35] and JSNaughty [40] address the problem of recovering meaningful identifiers from minified code, using conditional random fields and statistical machine translation, respectively. Another approach summarizes code into descriptive names that can serve, e.g., as method names [5]. The Naturalize tool suggests more natural identifiers based on an n-gram model [2]. Applying such a tool before running our name-based bug detectors is likely to improve their effectiveness.

Word embeddings, e.g., the Word2Vec approach [25], are widely used in natural language processing, which has inspired our embeddings of identifiers. Other recent work has proposed embeddings for source code, e.g., for API methods [29], or for terms that occur both in programs and natural language documentation [45]. Our AST-based embedding is targeted at name-based bug detection. Incorporating another embedding of identifiers into our framework is straightforward.

7.2 Machine Learning and Language Models for Analyzing Bugs

Even though it has been noted that buggy code stands out compared to non-buggy code [33], little work exists on automatically detecting bugs via machine learning. Murali et al. train a recurrent neural network that probabilistically models sequences of API calls and then use it for finding incorrect API usages [27]. In contrast to DeepBugs, their model learns from positive examples only and focuses on bug patterns that can be expressed via probabilistic automata. Bugram detects bugs based on an n-gram model of code. Similar to the above, it learns only from positive examples. Choi et al. train a memory neural network [43] to classify whether code may produce a buffer overrun. Their model learns from positive and negative examples, but the examples are manually created and labeled. Moreover, it is not yet known how to scale their approach to real-world programs. A key insight of our work is that simple code transformations provide many negative examples, which help learn an effective classifier. Finally, Wang et al. use a deep belief network to find a vector representation of ASTs, which is used for defect prediction [41]. Their approach marks entire files as likely to (not) contain a bug. However, in contrast to our and other bug-finding work, their approach does not pinpoint the buggy location.

7.3 Bug Finding

A related line of research is specification mining [6] and the use of mined specifications to detect unusual and possibly incorrect code [22, 32, 42]. In contrast to our work, these approaches learn only from correct examples [16] and then flag any code that is unusual compared to the correct examples as possibly incorrect, or search for inconsistencies within a program [11]. Our work replaces the manual effort of creating and tuning such approaches by learning, and does so effectively by learning from both correct and buggy code examples.

Our name-based bug detectors are motivated by manually created name-based program analyses [23, 31, 37]. The "swapped arguments" bug detector is inspired by the success of a manually developed and tuned name-based bug detector for this kind of bug [37]. For the other three bug detectors, we are not aware of any existing approach that addresses these problems based on identifiers. Our work differs from existing name-based bug detectors (i) by exploiting semantic similarities that may not be obvious to a lexical comparison of identifiers, (ii) by replacing manually tuned heuristics to improve precision and recall with automated learning from examples, and (iii) by applying name-based bug detection to a dynamically typed language.

7.4 Other Related Work

Our idea to create artificial, negative examples to train a binary classifier can be seen as a variant of noise-contrastive estimation [15]. The novelty is to apply this approach to programs and to show that it yields effective bug detectors. The code transformations we use to create negative training examples are strongly related to mutation operators [19]. Based on mutation operators, e.g., mined from version histories [10], it is straightforward to create further bug detectors based on our framework.

8 Conclusions

This paper addresses the problem of finding bugs by training a model that distinguishes correct from incorrect code. The key insight that enables our work is to automatically create large amounts of training data through simple code transformations that insert likely bugs into supposedly correct code. We present a generic framework for generating training data and for training bug detectors on this data. Applying the framework to name-based bug detection yields automatically learned bug detectors that discover 132 programming mistakes in real-world JavaScript code. In the long term, we envision our work to complement manually designed bug detectors by learning from existing code and by replacing most of the human effort required to create a bug detector with computational effort.
Acknowledgments
This work was supported in part by the German Federal Ministry of Education and Research and by the Hessian Ministry of Science and the Arts within CRISP, by the German Research Foundation within the Emmy Noether project ConcSys, by NSF grants CCF-1409872 and CCF-1423645, and by a gift from Fujitsu Laboratories of America, Inc.

References
[1] Edward Aftandilian, Raluca Sauciuc, Siddharth Priya, and Sundaresan Krishnan. 2012. Building Useful Program Analysis Tools Using an Extensible Java Compiler. In 12th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2012, Riva del Garda, Italy, September 23-24, 2012. 14–23.
[2] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles A. Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16-22, 2014. 281–293.
[3] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2017. A Survey of Machine Learning for Big Code and Naturalness. arXiv:1709.06182 (2017).
[4] Miltiadis Allamanis and Marc Brockschmidt. 2017. SmartPaste: Learning to Adapt Source Code. CoRR abs/1705.07867 (2017).
[5] Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. 2091–2100.
[6] Glenn Ammons, Rastislav Bodík, and James R. Larus. 2002. Mining specifications. In Symposium on Principles of Programming Languages (POPL). ACM, 4–16.
[7] M. Amodio, S. Chaudhuri, and T. Reps. 2017. Neural Attribute Machines for Program Generation. ArXiv e-prints (May 2017). arXiv:1705.09231.
[8] Sahil Bhatia and Rishabh Singh. 2016. Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks. CoRR abs/1603.06129 (2016).
[9] Pavol Bielik, Veselin Raychev, and Martin T. Vechev. 2016. PHOG: Probabilistic Model for Code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. 2933–2942.
[10] David Bingham Brown, Michael Vaughn, Ben Liblit, and Thomas W. Reps. 2017. The care and feeding of wild-caught mutants. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017. 511–522.
[11] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. 2001. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In Symposium on Operating Systems Principles (SOSP). ACM, 57–72.
[12] Patrice Godefroid, Hila Peleg, and Rishabh Singh. 2017. Learn&Fuzz: Machine Learning for Input Fuzzing. CoRR abs/1701.07232 (2017).
[13] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API learning. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016. 631–642. DOI: 10.1145/2950290.2950334.
[14] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning. In AAAI.
[15] Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297–304.
[16] Sudheendra Hangal and Monica S. Lam. 2002. Tracking down software bugs using automatic anomaly detection. In International Conference on Software Engineering (ICSE). ACM, 291–301.
[17] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar T. Devanbu. 2012. On the naturalness of software. In 34th International Conference on Software Engineering, ICSE 2012, June 2-9, 2012, Zurich, Switzerland. 837–847.
[18] David Hovemeyer and William Pugh. 2004. Finding bugs is easy. In Companion to the Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). ACM, 132–136.
[19] Yue Jia and Mark Harman. 2011. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering 37, 5 (2011), 649–678.
[20] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[21] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2015. Gated Graph Sequence Neural Networks. CoRR abs/1511.05493 (2015).
[22] Bin Liang, Pan Bian, Yan Zhang, Wenchang Shi, Wei You, and Yan Cai. 2016. AntMiner: Mining More Bugs by Reducing Noise Interference. In ICSE.
[23] Hui Liu, Qiurong Liu, Cristian-Alexandru Staicu, Michael Pradel, and Yue Luo. 2016. Nomen Est Omen: Exploring and Exploiting Similarities between Argument and Parameter Names. In International Conference on Software Engineering (ICSE). 1063–1073.
[24] Peng Liu, Xiangyu Zhang, Marco Pistoia, Yunhui Zheng, Manoel Marques, and Lingfei Zeng. 2017. Automatic Text Input Generation for Mobile Testing. In ICSE.
[25] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States. 3111–3119.
[26] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. 1287–1293.
[27] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. 2017. Finding Likely Errors with Bayesian Specifications. CoRR abs/1703.01370 (2017).
[28] Anh Tuan Nguyen and Tien N. Nguyen. 2015. Graph-Based Statistical Language Model for Code. In 37th IEEE/ACM International Conference on Software Engineering, ICSE 2015, Florence, Italy, May 16-24, 2015, Volume 1. 858–868.
[29] Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan, and Tien N. Nguyen. 2017. Exploring API embedding for API usages and applications. In Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017. 438–449.
[30] Jibesh Patra and Michael Pradel. 2016. Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data. Technical Report TUD-CS-2016-14664. TU Darmstadt.
[31] Michael Pradel and Thomas R. Gross. 2011. Detecting anomalies in the order of equally-typed method arguments. In International Symposium on Software Testing and Analysis (ISSTA). 232–242.
[32] Michael Pradel, Ciera Jaspan, Jonathan Aldrich, and Thomas R. Gross. 2012. Statically Checking API Protocol Conformance with Mined Multi-Object Specifications. In International Conference on Software Engineering (ICSE). 925–935.
[33] Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, and Premkumar T. Devanbu. 2016. On the "naturalness" of buggy code. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016. 428–439.
[34] Veselin Raychev, Pavol Bielik, and Martin Vechev. 2016. Probabilistic Model for Code with Decision Trees. In OOPSLA.
[35] Veselin Raychev, Martin T. Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In Principles of Programming Languages (POPL). 111–124.
[36] Veselin Raychev, Martin T. Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom, June 09-11, 2014. 44.
[37] Andrew Rice, Edward Aftandilian, Ciera Jaspan, Emily Johnston, Michael Pradel, and Yulissa Arroyo-Paredes. 2017. Detecting Argument Selection Defects. In OOPSLA.
[38] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2009), 61–80.
[39] Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011. 129–136.
[40] Bogdan Vasilescu, Casey Casalnuovo, and Premkumar T. Devanbu. 2017. Recovering clear, natural identifiers from obfuscated JS names. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017. 683–693.
[41] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016. 297–308.
[42] Andrzej Wasylkowski and Andreas Zeller. 2009. Mining Temporal Specifications from Object Usage. In International Conference on Automated Software Engineering (ASE). IEEE, 295–306.
[43] Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory Networks. CoRR abs/1410.3916 (2014).
[44] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, Singapore, September 3-7, 2016. 87–98.
[45] Xin Ye, Hui Shen, Xiao Ma, Razvan C. Bunescu, and Chang Liu. 2016. From word embeddings to document similarities for improved information retrieval in software engineering. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016. 404–415.