2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN)

A Study of the Internal and External Effects of Concurrency Bugs

Pedro Fonseca, Cheng Li, Vishal Singhal*, and Rodrigo Rodrigues


Max Planck Institute for Software Systems (MPI-SWS)

* Currently a student at BITS Pilani, India. Work done during his internship at MPI-SWS.

Abstract

Concurrent programming is increasingly important for achieving performance gains in the multi-core era, but it is also a difficult and error-prone task. Concurrency bugs are particularly difficult to avoid and diagnose, and therefore in order to improve methods for handling such bugs, we need a better understanding of their characteristics. In this paper we present a study of concurrency bugs in MySQL, a widely used database server. While previous studies of real-world concurrency bugs exist, they have centered their attention on the causes of these bugs. In this paper we provide a complementary focus on their effects, which is important for understanding how to detect or tolerate such bugs at run-time. Our study uncovered several interesting facts, such as the existence of a significant number of latent concurrency bugs, which silently corrupt data structures and are exposed to the user potentially much later. We also highlight several implications of our findings for the design of reliable concurrent systems.

1 Introduction

We are witnessing an unprecedented rise in the parallelism of computer systems. The number of cores in commodity processors has been steadily increasing. Today, dual-core and quad-core processors are commonplace; Intel has recently announced an 8-core processor [3], and specialized CPUs with even more cores are currently being manufactured [1,2]. However, increasing the number of processors is not the way by which CPU manufacturers have traditionally increased the performance of their hardware. Clock speeds no longer increase at a significant rate; as a result, software no longer automatically runs significantly faster as new chips are deployed. Consequently, for software to extract performance gains out of the extra processing capacity, programmers will have to design their software in a more parallel way.

However, parallel programming is challenging and often error prone. It is difficult enough for programmers to reason about all the possible inputs and the flow of execution in single-threaded applications; reasoning about all the different thread interleavings that can occur in concurrent programs, combined with all possible inputs, is even more difficult. Additionally, the non-determinism that is inherent to concurrency bugs, which are only triggered under certain thread interleavings, makes it difficult to reproduce, identify, analyze, or correct such programming mistakes. But at the same time this non-determinism can be essential in handling them (e.g., using fault detection, fault tolerance, or fault recovery) since it enables exploring redundant, diverse executions using different thread interleavings.

To improve methods for addressing concurrency bugs, it is important to have a thorough understanding of the characteristics of these bugs. While a few studies of concurrency bugs exist [11,14,22], they either focus on artificially injected bugs, or, in the few cases where real applications were studied, they mostly focus on the causes of these bugs, and limit the study of their effects to whether they cause deadlocks or not. Such studies are useful for determining what kinds of programming mistakes are typical of such applications, and can drive the design of program analysis tools for finding these bugs [27].

However, understanding the effects of concurrency bugs is important for a different set of reasons than why it is interesting to study their causes. Analyzing the effects allows us to assess how efficiently existing detection approaches handle these bugs. And, more importantly, it can serve as a guide for further development not only of tools and methodologies that detect, but also of tools and methodologies designed to tolerate and recover from the faults and errors caused by such bugs. To give a simple example, it is important to understand how often concurrency bugs cause failure modes where the server returns incorrect replies (i.e., a Byzantine failure), in order to gauge the effectiveness of using multi-threaded replicas to ensure fault diversity in a Byzantine-fault-tolerant replication scheme [10].

In this paper we provide the complementary angle of studying the effects of concurrency bugs that affect parallel applications. In particular, we exhaustively study real

978-1-4244-7501-8/10/$26.00 ©201O IEEE 221 DSN 2010: Fonseca et al.


Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE OESTE DO PARANA. Downloaded on January 17,2024 at 14:03:42 UTC from IEEE Xplore. Restrictions apply.

concurrency bugs that were found in MySQL [5], a mature, widely-used database server application.

Our study produced several interesting findings. First, we found a non-negligible number of latent concurrency bugs. Latent concurrency bugs, when triggered, do not become immediately visible to users. Instead, these concurrency bugs first silently corrupt internal data structures, and only potentially much later cause an application failure to become externally visible^1. Latent concurrency bugs have been anecdotally reported [13], but we are the first to study their extent, and their internal and external effects in detail.

^1 The term latent bug is used in other papers [8,18,20] with an unrelated meaning - that of a bug that went undetected by the programmer.

A second finding is related to bugs that cause the application to fail in ways other than silently crashing. We characterize Byzantine failures that are caused by concurrency bugs. Some of our findings were surprising, like the fact that these bugs cause subtle changes in the output that would be difficult to find using existing run-time monitoring tools, or the fact that there exists a strong correlation between bugs that cause Byzantine failures and latent bugs.

Our findings have implications for the design of tools and methodologies that address concurrency bugs. For the convenience of the reader we present a summary of our main findings together with their implications in Table 1.

The remainder of the paper is organized as follows. In Section 2 we describe our methodology. We then present an overview of the MySQL application in Section 3. The results of our study are presented in Section 4 and in Section 5 we discuss their implications. We survey related work in Section 6 and we conclude in Section 7.

Evolution of concurrency bugs
  Finding: According to the opening dates of our sampled bugs, the proportion of fixed bugs that involved concurrency more than doubled over the last 6 years.
  Implication: This shows the increasing need for new tools and methodologies to handle concurrency bugs.

External effects of concurrency bugs
  Finding: We found slightly more non-deadlock bugs (63%) than deadlock bugs (40%).
  Implication: Having good tools to handle deadlock bugs is not enough - we also need to handle non-deadlock bugs.
  Finding: We found a significant fraction of semantic/Byzantine bugs (15%).
  Implication: Techniques for Byzantine fault tolerance can potentially handle a considerable fraction of concurrency bugs.

Immediacy of effects
  Finding: Latent concurrency bugs were also found in significant numbers (15%).
  Implication: Tools and methodologies such as proactive recovery can be leveraged to mask errors caused by a significant number of concurrency bugs.
  Finding: Of the latent concurrency bugs analyzed, 92% were semantic bugs and conversely 92% of the semantic bugs were also latent bugs.
  Implication: Given the high correlation between these classes of bugs, techniques that handle one class should also handle the other.

Semantic concurrency bugs
  Finding: The vast majority of semantic bugs (92%) generated subtle violations of application semantics.
  Implication: Run-time monitoring tools will have to devise complex application-specific checks to detect the presence of semantic bugs.

Internal data structures
  Finding: Most of the examined latent bugs (92%) corrupted multiple data structures.
  Implication: Techniques that detect inconsistencies among data structures could be used to detect latent bugs. Analyzing data structures individually might not suffice.

Severity and fixing complexity of bugs
  Finding: Latent bugs were found to be slightly more severe than non-latent bugs.
  Implication: Latent bugs are an important threat to software reliability and, therefore, latent bugs should also be addressed.
  Finding: Latent bugs were found to be easier to fix than non-latent bugs.
  Implication: Further studies should be performed to analyze the reasons for this difference.

Table 1. Main findings of this study and their implications. The methodology for collecting the data presented here is described in Section 2 and the results are explained in detail in Section 4.

2 Methodology

In this section we present the methodology that we adopted to find and analyze concurrency bugs. Our methodology is similar to one used in previous work [22].

2.1 Choice of concurrent application

We selected MySQL as the target of our study for three main reasons. First, it is a widely deployed database. Databases are a critical component of the IT infrastructure of many corporations, and MySQL represents a substantial share of that market (about 1/3 of deployed database systems [4]). This implies that there is market pressure for a quality development and maintenance process, so this is an instance of well-maintained software where finding and eliminating bugs matters. Second, it is an open source application with a well-maintained bug report database. Having access to the source code and the bug logs is necessary for an in-depth analysis. Finally, it is a highly concurrent application with rich semantics, and it has a large code base. These characteristics make MySQL representative of some of the biggest challenges that we will be facing as complex applications become more and more concurrent.

In Section 3 we provide some brief background on MySQL, which will help in better understanding our results.

2.2 Concurrency bug selection

The MySQL versions that are affected by the bugs that were reported in the bug report database range from version 3.x to 6.x and the oldest bug reports date back to 2003.

The MySQL bug report database contains a very large number of bugs. Therefore, to make the task feasible, we automatically filtered bugs that are not likely to be relevant by performing a search query on the bug report database. Our search query filtered bugs based on (1) the keywords contained in the bug description, (2) the status of the bug and (3) the bug category.

We searched the MySQL bug report database for bugs that contained keywords commonly associated with concurrency bugs. Such keywords included the following terms: lock, acquire, compete, atomic, concurrency, synchronization, etc. In addition to this we searched for bugs whose status was closed (i.e., bugs that are no longer under analysis by the developers/debuggers). It would have been interesting to also consider bugs with other statuses (such as won't fix and can't repeat) but these bug reports are not likely to have detailed discussions and, more importantly, in general they won't contain patches. Without reasonably complete bug reports it would not be possible to thoroughly understand the bugs they report.

Next, to exclude bugs from stand-alone utilities that are unrelated to the multi-threaded server, our search query also limited the search to bugs that were related to MySQL Server, including those that were within the Storage Engines category [26].

Finally, we randomly sampled a subset of the bugs that matched our search query and manually analyzed them. The manual inspection revealed that some of the bugs that matched the search query were not concurrency bugs (defined in Section 3) and so we also excluded them. In addition, we excluded bugs for which the bug log did not contain enough information to analyze them. After filtering, we obtained a final set with 80 concurrency bugs that were analyzed, a number that is very close (or even superior) to the number of bugs analyzed in previous studies [11,22]. Table 2 shows the bug count across the different stages of the bug selection process.

Phase                                  Number of bugs
Total MySQL server closed bugs         12.5k
Concurrency related keyword matches    583
Sampled bugs                           347
Concurrency bugs analyzed              80

Table 2. Bug counts for different stages of the analysis.

Note that this selection process has two main limitations. First, the search query can miss some actual concurrency bugs. However, a concurrency bug report that does not contain any of the main keywords associated with concurrency is also more likely to be incomplete and therefore more difficult to successfully analyze. Second, concurrency bugs are likely to be underreported, which would explain why out of a total of about 12.5k bugs in the bug database we only found 80 concurrency bugs.

2.3 Manual analysis of bug reports

We manually analyzed the bug reports of the sampled list of bugs, focusing on trying to understand the effects of the bugs. We analyzed the bugs using information contained in the bug reports (including the patches), as well as the source code of the application.

Bug reports contain several types of information that are useful for filtering out non-concurrency bugs, and for understanding their characteristics. In particular, bug reports contain not only the description of the bug, but also discussion among the developers and debuggers about how to diagnose and solve the problem. The information contained in these discussions is often important to understand the bugs, in particular to determine whether they are concurrency bugs, and to understand their effects. Typically the bug report will also include the patch, and even the method to reproduce the bug; sometimes more than one patch attempt is made before developers agree on a definitive patch. Bug reports also include additional fields such as the perceived severity, the status, and the software version affected.

We used all these types of information contained in bug reports to gain an understanding of how bugs are triggered


and, when they are, what their effects are^2. In addition, some of this information was also used to estimate the complexity of fixing concurrency bugs and their severity.

^2 The raw data gathered from this manual analysis can be found at http://www.mpi-sws.org/~pfonseca/dsn2010-bug-study.tgz

3 MySQL

In this section we provide a brief overview of the characteristics of MySQL that are relevant for this study.

3.1 Internal structure

MySQL is a complex code base where the state of the server is spread across multiple data structures that are stored both in memory and persistently. Here we describe some of the main data structures that will be referred to in later sections.

An important class of stored structures are data files that contain the contents of different tables stored in the database. In addition, a series of files containing the indexes of the tables are also maintained, which allow for fast lookups by the contents of certain columns of the tables.

Another important persistent structure is the binary log (referred to as the binlog structure), which is used for two different purposes. The log is mainly useful when primary-backup replication of the database is used, in which the primary replica writes to the binlog the statements corresponding to all client requests that modify the database state. The backup replicas then sequentially re-execute the statements contained in the binlog. The other use of the binlog is related to other recovery operations such as restoring database state from a backup file, in which case some events that were logged after the backup operation must be re-executed.

Finally, MySQL contains a series of caches that speed up access to persistent structures or processing of requests. For instance, a table cache holds the descriptors of recently accessed tables, while a query cache holds the results of recently executed queries.

3.2 Concurrent programming

The use of concurrency in MySQL is typical of a server application. Clients issue several requests to the database server, which are grouped into sessions (called connections). Each connection is handled by a separate thread on the server side, and different threads contend for access to many shared data structures, such as the ones we mentioned above. To synchronize access to these structures, threads mostly resort to locks but also use condition variables.

Despite the existence of recent proposals for other types of synchronization primitives such as transactional memory [17], there is value in studying and improving the methods that address the problems with lock-based synchronization. This is not only because we still run many applications that use locks, which will benefit from being made more robust for years to come, but also because the vision behind such proposals is not to entirely replace locks, but instead to use these new primitives in smaller sections of the code where the possible performance impact would be lower.

3.3 Request vs. transaction concurrency

To correctly understand the meaning of concurrency bugs, the distinction between request-level and transaction-level concurrency needs to be clear. In a database system, client operations are logically grouped into transactions, each of which consists of a sequence of requests (e.g., requests to begin a transaction, read or write to the database, and commit or abort the transaction). There is often some confusion between the notion of concurrent transactions and concurrent requests, and about which kinds of concurrency bugs we are interested in.

We will only analyze bugs that are triggered by concurrent individual requests, since these are the ones that reflect the traditional concurrency problems that arise in parallel programs. Bugs that are triggered by concurrent transactions but can be reproduced deterministically by a given sequence of requests are not considered concurrency bugs in this study.

Thus we define a concurrency bug as one where the application deviates from the intended behavior, given a certain pattern of inputs, but it must be the case that the bug is only manifested under specific thread interleavings. This definition is general enough to include both safety problems (e.g., server crash or issuing wrong replies) and liveness problems (e.g., deadlocks or even performance bugs).

4 Results

In this section we present the results of our analysis of the 80 concurrency bugs that we found in the MySQL bug database. A summary of these results and their main implications is also presented in Table 1.

4.1 Evolution of concurrency bugs

We investigated the proportion of concurrency bugs present in the bug database and how this proportion evolves. We were interested in knowing whether concurrency bugs are becoming more prevalent. To determine this, we identified the opening and closing year of the concurrency bugs that we analyzed as well as of all closed bugs within the


[Figure 1: plot of the number of concurrency bugs and their proportion, per year, 2003 to 2009; x-axis: Time (year).]

Figure 1. Evolution of bugs (by open date).

[Figure 2: plot of the number of concurrency bugs and their proportion, per year, 2003 to 2009; x-axis: Time (year).]

Figure 2. Evolution of bugs (by close date).

MySQL server category. To obtain the set containing all bugs we excluded the keyword part of the search together with the sampling phase explained in Section 2. For each year we counted the number of concurrency bugs and their proportion (compared with generic bugs). We looked at both the opening date and closing date because programmers typically require a significant amount of time (i.e., many months) to solve the bugs under analysis. The results are presented in Figures 1 and 2. From these results we can see that there has been a trend of increasing number and proportion of concurrency bugs over the years. However, this trend does not seem to be very prominent.

The data that we collected does not allow us to determine the causes underlying this finding; however, we can think of two possible reasons for this slight increase. One possible explanation is that the advent of multi-core hardware causes users and developers to stumble upon these bugs more often than they used to in the past. Another explanation that we cannot rule out is that developers, while trying to further parallelize the code, actually increase the number of concurrency bugs that they introduce.

Of the concurrency bugs that we sampled, the oldest concurrency bug was opened on March 2nd, 2003, while the youngest was closed on September 16th, 2009. Therefore, to make the comparison fair, we excluded the bugs that were outside this range from the list of generic bugs used to compute the proportions.

To interpret these results it should also be taken into consideration that, as we show in Section 4.7, the time it takes to close a concurrency bug can be quite long (e.g., some bugs took more than a year to fix). This explains why the absolute number of bugs opened in the last year is low: many concurrency bugs potentially discovered in 2009 have not yet been fixed, which means they are not yet closed and were therefore not accounted for in this study.

4.2 External effects

We analyzed the concurrency bugs with respect to the external effects that are exposed to the clients, and divided these effects into six categories. The results are presented in Table 3. Note that the sum of all occurrences is larger than the total number of bugs because some bugs fit into more than one category.

We can see that there are slightly more bugs that cause non-deadlock conditions (63%) than deadlock conditions (40%), and among the non-deadlock bugs the most prevalent consequences are either causing the server to crash (28%) or providing the wrong results to the user, which we term semantic bugs (15%).

Semantic bugs are Byzantine failures, where the application provides the user with a result that violates the intended semantics of the application. This is an interesting class of bugs since masking their effects requires sophisticated (and possibly expensive) techniques such as Byzantine-fault-tolerant replication [10] or run-time verification of the behavior of the application against a specification of the system [30]. We discuss these bugs in more detail in Section 4.4.

The high percentage of deadlock bugs that we encountered leads us to believe that, despite significant research to address deadlock bugs, in practice this class of bugs still constitutes a significant problem for the robustness of software. The percentage of deadlock bugs that our study found is in line with results from other studies [22].

The remaining three classes of external effects were slightly less prevalent. These are error messages (9%), which we distinguish from the class of semantic bugs, despite the fact that when error messages are provided to the user an unexpected result is also returned. We distinguish error bugs from semantic bugs by the fact that an error is detected by the server and therefore is explicitly flagged in the reply to the client request, and can be handled by the client application appropriately. For instance, in one bug (bug #42519) when a restore operation is performed concurrently with an insert operation a generic error message is returned to the user. We also found a number of bugs (8%) in which client requests hang (the client does not receive a reply), which differs from a deadlock situation where one thread or a series of threads are waiting in a circular dependency. Typically these are caused by a thread that fails to release a certain lock, causing another thread that tries to

External effect         Number of bugs
Crash                   22
Deadlock                32
Error                   7
Hang                    6
Performance             5
Semantic (Byzantine)    12

Table 3. External effects of concurrency bugs.

External effect         Number of bugs
Crash                   1
Deadlock                0
Error                   0
Hang                    0
Performance             1
Semantic (Byzantine)    11

Table 4. Effects of latent concurrency bugs.
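As a quick arithmetic check, the percentages quoted in the text can be reproduced from the raw counts in Tables 3 and 4. The short Python sketch below assumes a total of 80 analyzed bugs (Table 2) and 12 latent bugs (15% of 80); the helper name `pct` and the round-half-up rule are illustrative choices, not taken from the study.

```python
# Counts copied from Table 3 (external effects) and Table 4 (latent bugs only).
TOTAL_BUGS = 80  # concurrency bugs analyzed, per Table 2

external_effects = {
    "Crash": 22, "Deadlock": 32, "Error": 7,
    "Hang": 6, "Performance": 5, "Semantic (Byzantine)": 12,
}
latent_effects = {
    "Crash": 1, "Deadlock": 0, "Error": 0,
    "Hang": 0, "Performance": 1, "Semantic (Byzantine)": 11,
}


def pct(count: int, total: int) -> int:
    """Percentage of count in total, rounded half up with integer arithmetic."""
    return (200 * count + total) // (2 * total)


# Share of the 80 bugs showing each external effect.
shares = {effect: pct(n, TOTAL_BUGS) for effect, n in external_effects.items()}

# Assumption: 15% of the 80 bugs are latent, i.e. 12 bugs.
LATENT_BUGS = 12
both = latent_effects["Semantic (Byzantine)"]  # bugs that are latent and semantic
latent_that_are_semantic = pct(both, LATENT_BUGS)
semantic_that_are_latent = pct(both, external_effects["Semantic (Byzantine)"])

if __name__ == "__main__":
    for effect, share in shares.items():
        print(f"{effect:22s} {share:3d}%")
    # The category counts add up to more than 80 because a bug can have
    # more than one external effect.
    print("category assignments:", sum(external_effects.values()))
    print("latent bugs that are semantic:", latent_that_are_semantic, "%")
    print("semantic bugs that are latent:", semantic_that_are_latent, "%")
```

The category counts sum to 84 rather than 80, which is consistent with the remark that some bugs fit into more than one category.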

acquire it to wait forever. Finally, we found a few (6%) concurrency bugs that caused performance degradation (e.g., memory leaks that increase the number of page faults the server incurs).

4.3 Latent bugs

Next we analyzed whether the bugs caused latent errors or not. We define a latent bug as one where the (concurrent) requests that cause the erroneous state to occur differ from the request (or requests) that cause the external effects of the bug to be exposed to the clients (i.e., the violation to the application's specification). In other words, latent bugs cause internal data structures to be silently corrupted (i.e., an error) but do not immediately cause a wrong output (i.e., a failure). A failure is only triggered by a subsequent request that may not have to run concurrently with any other requests.

We found that a relevant fraction of concurrency bugs in our study were latent (15% versus 85% non-latent bugs). This result was somewhat surprising and has an interesting implication. The fraction is large enough that we believe there is value in developing tools that try to recover the internal state of the concurrent application. Performing such a recovery could prevent concurrency bugs from affecting the correct behavior of the application, even after the concurrent requests that cause the error have already been executed and the application state is corrupt.

We also analyzed how latent bugs were categorized according to the previous analysis of their external effects. The results in Table 4 show a very high correlation between latent and semantic bugs: 92% of the latent bugs manifest themselves by returning wrong results to the client, and conversely also 92% of the semantic bugs are latent. (The fact that these values are exactly the same is only a consequence of the relatively small sample size.)

We see two possible consequences of the high correlation between latent and semantic bugs. On the one hand, methods to address the problems caused by latent bugs will have to take into account that they manifest themselves through violations of the application semantics (rather than crashing or halting), which raises the bar for detecting when a latent error is activated and becomes a failure. On the other hand, this opens an opportunity for the methods that handle non-crash faults to try to heal the state of the application in the background instead of masking the effects of these faults in the foreground. For instance, rather than tolerating semantic errors using Byzantine Fault Tolerance (BFT) replication, where the output of each request is voted upon, one might be able to get similar results by having a foreground replica that issues the reply, and a background replica that checks and recovers the service state.

A concrete example of a latent bug will help the reader understand some of the typical patterns surrounding bugs that are both latent and semantic. Bug #14262 involved concurrent requests updating both the contents of the database (e.g., table contents) and the binlog structure. This bug is caused by the code not enforcing the same order for concurrent requests that update both the table contents and the binlog. Thus, when a specific set of statements is sent to the primary replica, the primary replica updates the table data by executing the statements in one order but, depending on the exact interleaving of threads, may write those statements to the binlog in the reverse order. The result of this bug to the client is only visible after a fault of the primary replica occurs (or when clients otherwise contact the backup replicas). In this case, one of the backups will take over with a state that diverges from the previously observed state (in that it reflects a different sequence for transaction execution) and subsequent results will be incoherent with those that were previously returned.

In the remainder of this section we will analyze semantic and latent bugs in more detail. The reason for our focus is twofold. First, we found these bugs to have a relevant (and perhaps unexpected) prevalence. Second, and more importantly, although existing tools are very effective at handling application crashes (e.g., Rx [28]) and deadlocks (e.g., Dimmunix [19]), they are not so effective at handling the remaining, more subtle types of failures. Thus, there is a research opportunity for improving methods that address this type of concurrency bug.
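The ordering pattern described for bug #14262 can be illustrated with a small, deterministic Python sketch. It is a toy model under stated assumptions, not MySQL code: the statement texts, the single-column table, and the names `replay`, `run_primary`, and `structures_agree` are all hypothetical, and the thread interleaving is modeled simply by passing two different orders rather than by running real threads.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
# Toy statement: SQL-like text plus a pure function on the table state.
Statement = Tuple[str, Callable[[State], State]]

# Two statements that do not commute, so their order is observable.
ADD_TEN: Statement = ("UPDATE t SET x = x + 10", lambda t: {**t, "x": t["x"] + 10})
DOUBLE: Statement = ("UPDATE t SET x = x * 2", lambda t: {**t, "x": t["x"] * 2})


def replay(statements: List[Statement], initial: State) -> State:
    """Apply statements one after another, as a backup replaying a log would."""
    state = dict(initial)
    for _text, apply in statements:
        state = apply(state)
    return state


def run_primary(
    apply_order: List[Statement], log_order: List[Statement], initial: State
) -> Tuple[State, List[Statement]]:
    """Model a primary where table updates and log appends are ordered separately.

    apply_order is the order in which requests modify the table; log_order is
    the order in which they reach the log. If one lock covered both steps,
    the two orders would always be identical.
    """
    table = replay(apply_order, initial)
    binlog = list(log_order)
    return table, binlog


def structures_agree(table: State, binlog: List[Statement], initial: State) -> bool:
    """Cross-structure check: does replaying the log reproduce the table?"""
    return replay(binlog, initial) == table


if __name__ == "__main__":
    initial = {"x": 1}
    # Unlucky interleaving: table sees ADD_TEN then DOUBLE, log sees the reverse.
    table, binlog = run_primary([ADD_TEN, DOUBLE], [DOUBLE, ADD_TEN], initial)
    print("primary table:", table)
    print("backup after failover:", replay(binlog, initial))
    print("structures agree:", structures_agree(table, binlog, initial))
```

In this model clients of the primary never see a wrong reply; the divergence only surfaces once a backup that replayed the log takes over, which matches the latent and semantic pattern described above. A check that compares the two structures against each other, in the spirit of `structures_agree`, can flag the corruption earlier, whereas inspecting the table or the log in isolation reveals nothing wrong.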


4.4 Characteristics of semantic bugs

We further analyzed the incorrect outputs returned by semantic bugs in order to determine how difficult it is to detect them, e.g., using a run-time monitoring tool [30], which would avoid the use of more expensive techniques such as BFT replication [10].

Out of all the semantic bugs, we found only one to have a self-inconsistent output, meaning that the buggy output clearly deviated from the expected reply. In this particular bug, the wrong reply returned to clients contains information about the contents of a certain table, but at the same time the reply also contains information that indicates that the table does not exist in the database.

None of the remaining bugs were self-inconsistent, implying that there are limited benefits from detection techniques that try to validate the correctness of the application by analyzing the replies.

We further analyzed these results and categorized the output of semantic bugs into two groups. Some of the bugs did not fit into either of these groups.

The first group, containing 58% of these bugs, corresponds to outputs that reflect an ordering of previously executed transactions that is inconsistent with the ordering that was implied in previous replies. The latent bug we described before where binlog entries were logged in the wrong order is an example of such a bug: after the primary becomes faulty, the output of the system reflects the order in which transactions were recorded in the binlog, which differs from the order in which they had been originally executed.

The second group, containing 25% of the bugs, corresponds to violations of transactional semantics, in particular of the isolation property of the transactions. This means that transaction A could see the intermediate effects of a concurrent transaction B (e.g., some of the updates made by transaction B, but not all of them).

Finally, 17% of the semantic bugs did not fall into either of the previous two categories.

4.5 Internal effects of latent bugs

We also analyzed the set of latent bugs in more detail. In our analysis, we paid close attention to how the internal state was being corrupted, so that we could gain better understanding of the kinds of techniques that can be useful for detecting the errors before they are exposed to the user and for recovering the internal state of the application.

First, we determined whether each bug corrupted a single high-level data structure, or modified two or more data structures in an inconsistent way (leaving them in an incorrect state relative to each other). Only 8% of the latent concurrency bugs involve a single data structure, and the remaining 92% involve inconsistency between separate structures.

Next we analyzed whether the data structures involved are persistent structures stored on disk or volatile structures kept in memory. Table 5 shows that the three most affected data structures are persistent, namely the files that contain the database contents, the respective indices, and the aforementioned binlog file. We also found a large number of bugs involving caches that are only stored in main memory. Note, however, that these results do not allow us to draw conclusions about the probability that accesses to these data structures trigger bugs, given that we do not know how often different structures are accessed (and also we cannot claim that we have a perfectly representative sample of the existing bugs).

    Data structure     Number of bugs   Persistent?
    Data file          11               Yes
    Index file         9                Yes
    Definition file    8                Yes
    Query cache        7                No
    Key cache          6                No*
    Binlog             5                Yes

Table 5. Most frequent data structures involved in latent bugs. * The contents of the cache can also be written back to disk.

Note that the numbers in Table 5 do not add up to the total number of latent bugs because certain bugs affected more than one data structure, as explained before.

4.6 Recovering from latent errors

We looked at the ability of the application to recover from latent bugs after they have caused an error (i.e., corrupted the internal state). The recovery mechanisms we consider in this section are relatively simple ones: we identified the latent errors that can be recovered by a server restart or other simple mechanisms (e.g., reloading indexes) that do not require writing extensive recovery-specific code. We present the results in Table 6. Note that some bugs allow more than one simple recovery mechanism.

                                    Number of bugs
    No simple recovery mechanism    8
    Allow for simple recovery:      4
      Server restart                4
      Other mechanisms              3

Table 6. Recovery mechanisms for latent concurrency bugs.

We found that in one third of the cases it is possible to use simple mechanisms to recover latent errors such that they go completely unnoticed by users. This increases the chances of adopting proactive recovery techniques.

4.7 Severity and fixing complexity

Finally, we compared concurrency bugs belonging to different categories with respect to their severity and to the complexity of fixing them, according to the bug report fields that specify these properties. Additionally, we also compared non-latent bugs against latent bugs with respect to these two properties.

The average severity of bugs is compared in Table 7. The results show that latent bugs were considered to be slightly more severe on average than non-latent bugs. In the ranking of severity by external effects, crash bugs were found to be the most severe while, as expected, performance bugs were found to be the least severe.

    Bug immediacy   Severity
    Latent          2
    Non-latent      2.2

    Bug category    Severity
    Deadlock        2.3
    Crash           1.7
    Error           2.4
    Hang            2
    Performance     3
    Semantic        2.2

Table 7. Average severity of concurrency bugs according to their immediacy and category. Maximum severity is rated as 1 (i.e., critical bug) while minimum severity is rated as 5.

For the complexity of fixing concurrency bugs we used four metrics that we extracted from the bug reports: time to fix the bug, number of patching attempts, number of files changed in the final patch, and the number of comments exchanged in the bug reports. Although none of these metrics is perfect, in combination they help us estimate the complexity of fixing these bugs. We present a comparison of the four complexity metrics in Table 8. Since some of these fields contain significant outliers, in addition to presenting the average for all four metrics we also present the median.

    Bug immediacy   Time      Patches   Files   Disc.
    Latent          114/79    3.8/2     2.3/1   10.4/7.5
    Non-latent      137/90    2.7/2     3.9/1   11.6/9

    Bug category    Time      Patches   Files   Disc.
    Deadlock        125/90    1.9/2     1.5/1   9.3/9
    Crash           128/83    3.5/2     7.7/3   12.9/11
    Error           150/94    3.0/2     4.4/4   17.0/11
    Hang            210/116   4.5/2     3.8/2   13.2/11
    Performance     125/92    1.4/2.5   1.8/2   8.2/6
    Semantic        108/67    3.8/2     2.2/1   10.5/8

Table 8. Complexity of fixing concurrency bugs according to their immediacy and category. For each class of bugs we present the average/median for each of the four metrics: time in days, number of patches, number of files in the patches and the number of comments in the discussion.

Our analysis of the fixing complexity revealed a surprising result: non-latent bugs were found to be more complex to fix than latent bugs in all metrics except for the number of patches. We do not have a clear explanation, so we defer study of the reasons for this to future work.

5 Discussion and limitations

One of the results of our study is that the percentage of concurrency bugs present in the bug database is low. This is not very surprising, since it has long been believed that concurrency bugs are underrepresented. The fact that concurrency bugs are hard to observe and reproduce (in fact they are commonly referred to as Heisenbugs [15]) is likely to contribute to their underrepresentation in bug databases for three main reasons. First, when users are faced with the bug a single time they may not even be sure that it is a problem with the software and might not report it at all. Second, even when users are able to reproduce bugs on their machines, it might not be possible to reproduce the bug in the developer's environment due to small differences in the environments. Third, even if developers manage to reproduce the bug, they might not be able to systematically reproduce it using traditional debugging methods, since some debugging tools and methods might interfere with the reproducibility of the bug.

In this work we focused our attention on concurrency bugs found in the MySQL application. A previous paper compared concurrency and non-concurrency bugs of three different database systems including MySQL [32]. It concluded that the three different database systems exhibited a very similar proportion of crash vs. non-crash faults (i.e., a bit over half of the bugs led to non-crash faults in each database system). While not conclusive, this leads us to believe that the bug patterns we found in MySQL might also apply to other database systems. More analyses are required to confirm whether this is in fact the case.

On the other hand, it seems less likely that these results can be generalized to arbitrary multi-threaded applications.


Applications can be very different (e.g., some have graphical user interfaces while others do not, some applications use the client-server model while others do not). As an example, from the data collected in another study [22] that compared different applications, about half of the deadlocks found in MySQL involved only 1 resource while almost all of the deadlocks found in Mozilla involved 2 or more resources. Given the very different characteristics of applications, we believe that the conclusions that we present here are unlikely to be generalizable to arbitrary multi-threaded applications.

The number of bugs analyzed in this study is comparable to the number of bugs analyzed in other related studies [11, 22, 32]. However, it is worth noting that our results could potentially suffer from two sources of bias. First, our sample, in absolute terms, is small. Obviously, this limits the confidence in the results, but at the same time it is a limitation that is difficult to overcome due to the time required to gather the data and the amount of data available. (This is a limitation shared by previous studies.) Second, we only analyzed bugs that were documented and fixed. This means we did not account for bugs that were not fixed (or even found), nor bugs that were fixed but not documented. We believe that these biases are very difficult to overcome given the nature of bugs in general but specifically given the nature of concurrency bugs. Nevertheless, more studies are desirable to improve our understanding of concurrency bugs.

6 Related Work

Given the importance of software reliability and the prevalence of bugs in software in general, many studies about bugs have previously been undertaken.

There is a large body of literature about the propagation [33] and even prediction [24] of bugs in source code. Some of these studies use the revision control system to understand the behavior of programmers and its effects on software reliability (e.g., which components or source code files are most prone to errors). This work is complementary to the work presented in this paper, which is focused on a specific class of bugs (i.e., concurrency bugs) and on understanding their consequences.

In a previous paper, researchers analyzed the consequences of bugs for three different database systems [32]. However the authors did not distinguish between concurrency and non-concurrency bugs, and only evaluated whether they caused crash or Byzantine faults (since that paper was focused on presenting a replication architecture, instead of being focused on studying bugs). In contrast, we provide a detailed analysis of the effects of the bugs and we focus on concurrency bugs.

Chandra et al. [11] looked at bug databases of three open-source applications (including MySQL) but the focus of their work was quite different from ours. They analyzed all bugs (among which only 12 were concurrent) and focused exclusively on determining whether generic recovery techniques such as process pairs would be effective in tolerating them. In their case, concurrency bugs were only one possible type of bug that fell into the category for which such techniques are effective. In contrast, we focus on a more narrow class of bugs by limiting ourselves to concurrency bugs, but provide a broader analysis taking into consideration several characteristics of these bugs.

Farchi et al. analyzed concurrency bugs, but by artificially creating them [14]. The methodology adopted by the study was to ask programmers to write programs containing concurrency bugs, which arguably may not lead to bugs that are representative of real world problems. In contrast, we analyze a database of bugs in a widely used, well-maintained application.

Recently Lu et al. [22] studied real concurrency bugs that were found in four open source applications. Using the respective bug report databases, the authors analyzed a total of 105 concurrency bugs. Their study focused on several aspects of the causes of concurrency bugs, and the study of their effects was limited to determining whether they caused deadlocks or not. We build on this study, in particular by using a very similar methodology for deciding which bugs to analyze, but provide a complementary angle by studying the effects of concurrency bugs (e.g., whether concurrency bugs are latent or not, or what type of failures they cause).

There also exist various studies of bug characteristics in software systems focusing on several aspects of generic bugs [12, 16, 21, 25, 31]. In contrast, our study focuses specifically on concurrency bugs, which are more challenging to analyze.

Recently Sahoo et al. have been trying to understand the reproducibility of bugs [29]. While the main focus of their study was not concurrency bugs, the authors distinguished concurrency bugs from non-concurrency bugs when trying to characterize their reproducibility.

Finally, there exist many proposals for handling concurrency bugs. These represent not only different techniques, but also very different approaches to improving software reliability. They include approaches to avoid bugs [17], to find bugs [13], to mask bugs [32] and even to recover from bugs [9]. Because concurrency bugs, in addition to being dependent on the input, are also dependent on the interleaving chosen by the operating system, there are approaches that specifically handle concurrency bugs by artificially disturbing [6], controlling [23] or limiting [7] thread interleavings. Our work is complementary in that it has the potential to guide and motivate the development of these kinds of techniques and approaches.
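
The dependence of a concurrency bug on the interleaving can be made concrete with a small sketch. The code below is an illustration only: it is not MySQL code and is not taken from any of the studied bug reports. It assumes a toy in-memory table with two rows, and the names `transfer`, `audit` and `run` are hypothetical. Generators stand in for transactions so that the schedule can be replayed deterministically, in the spirit of tools that control thread interleavings.

```python
def transfer(db, amount):
    """Transaction B: move `amount` from row x to row y in two updates."""
    db["x"] -= amount
    yield  # the scheduler may switch to another transaction here
    db["y"] += amount

def audit(db, seen):
    """Transaction A: read both rows and record their sum."""
    seen.append(db["x"] + db["y"])
    yield

def run(schedule):
    """Replay one interleaving; each letter runs the next step of A or B."""
    db = {"x": 50, "y": 50}
    seen = []
    txns = {"A": audit(db, seen), "B": transfer(db, 10)}
    for name in schedule:
        next(txns[name], None)  # None: ignore a transaction that has finished
    return seen[0]

for schedule in ("ABB", "BBA", "BAB"):
    print(schedule, run(schedule))
```

With the same input, the two serial schedules (`ABB` and `BBA`) make transaction A observe a total of 100, whereas the schedule `BAB` lets A observe 90, i.e., only one of the two updates made by B. This is the kind of isolation violation described in Section 4.4, and it appears only under one particular interleaving.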


7 Conclusion

Concurrency bugs pose a challenge in the development of reliable applications. Concurrency bugs are a type of bug that is likely to become more and more prevalent in the development life cycle as applications become more concurrent to take advantage of parallelism in the hardware.

To gain a better understanding of this problem, we presented a study of concurrency bugs in MySQL. In contrast to previous studies, we focused on the effects of concurrency bugs rather than on their causes.

Studying how bugs manifest enabled us to produce some interesting findings, such as a high prevalence of latent bugs that silently corrupt data structures but may take longer to become externally visible, and a strong correlation between latent bugs and bugs that cause Byzantine failures.

We hope that our study can open interesting avenues for future research. In particular, we intend to develop tools that address the issue of latent bugs from two different angles. First, we need to develop better ways to find these bugs during the course of testing. We intend to develop better tools for catching the subtle corruption of internal state caused by the kinds of bugs we analyzed. Second, latent bugs provide an interesting opportunity to develop techniques that detect them and heal the service state before the buggy output is seen by clients.

Acknowledgments

We are grateful for the feedback provided by the anonymous reviewers. Pedro Fonseca was supported by a grant provided by FCT.

References

[1] Azul Systems - Industry's Leading Azul Compute Appliances. http://www.azulsystems.com/products/compute-appliance.htm.
[2] GeForce GTX 295. http://www.nvidia.com/object/product_geforce_gtx_295_us.html.
[3] Intel Previews Intel Xeon 'Nehalem-EX' Processor. http://www.intel.com/pressroom/archive/releases/2009/20090526comp.htm.
[4] MySQL :: Market Share. http://www.mysql.com/why-mysql/marketshare/.
[5] MySQL :: The world's most popular open source database. http://www.mysql.com.
[6] Y. Ben-Asher, Y. Eytani, E. Farchi, and S. Ur. Producing scheduling that causes concurrent programs to fail. In Proc. of Parallel and Distributed Systems: Testing and Debugging (PADTAD), 2006.
[7] R. L. Bocchino, V. S. Adve, S. V. Adve, and M. Snir. Parallel programming must be deterministic by default. In Proc. of Workshop on Hot Topics in Parallelism (HotPar), 2009.
[8] Y. Brun and M. D. Ernst. Finding latent code errors via machine learning over program executions. In Proc. of International Conference on Software Engineering (ICSE), 2004.
[9] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. Microreboot - A technique for cheap recovery. In Proc. of Operating System Design and Implementation (OSDI), 2004.
[10] M. Castro and B. Liskov. Practical byzantine fault tolerance. In Proc. of Operating System Design and Implementation (OSDI), 1999.
[11] S. Chandra and P. M. Chen. Whither generic recovery from application faults? A fault study using open-source software. In Proc. of International Conference on Dependable Systems and Networks, 2000.
[12] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An empirical study of operating systems errors. In Proc. of Symposium on Operating System Principles (SOSP), 2001.
[13] D. Engler and K. Ashcraft. RacerX: Effective, static detection of race conditions and deadlocks. SIGOPS Operating Systems Review, 37(5):237-252, 2003.
[14] E. Farchi, Y. Nir, and S. Ur. Concurrent bug patterns and how to test them. In International Parallel and Distributed Processing Symposium (IPDPS), 2003.
[15] J. Gray. Why do computers stop and what can be done about it? In Proceedings of Reliability in Distributed Software and Database Systems, 1986.
[16] W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z. Yang. Characterization of linux kernel behavior under errors. In Proc. of International Conference on Dependable Systems and Networks, 2003.
[17] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. SIGARCH Computer Architecture News, 21(2):289-300, 1993.
[18] D. Hovemeyer and W. Pugh. Finding bugs is easy. SIGPLAN Notices, 39(12):92-106, 2004.
[19] H. Jula, D. Tralamazza, C. Zamfir, and G. Candea. Deadlock immunity: Enabling systems to defend against deadlocks. In Proc. of Operating System Design and Implementation (OSDI), 2008.
[20] T. Kelly, Y. Wang, S. Lafortune, and S. Mahlke. Eliminating concurrency bugs with control engineering. IEEE Computer, 99(1), 2009.
[21] Z. Li, L. Tan, X. Wang, S. Lu, Y. Zhou, and C. Zhai. Have things changed now?: An empirical study of bug characteristics in modern open source software. In Proc. of Architectural and System Support for Improving Software Dependability (ASID), 2006.
[22] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes: A comprehensive study on real world concurrency bug characteristics. SIGARCH Computer Architecture News, 36(1):329-339, 2008.
[23] M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and I. Neamtiu. Finding and reproducing heisenbugs in concurrent programs. In Proc. of Operating System Design and Implementation (OSDI), 2008.
[24] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller. Predicting vulnerable software components. In Proc. of Conference on Computer and communications security (CCS), 2007.
[25] T. Ostrand, E. Weyuker, and R. Bell. Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering (TSE), 31(4):340-355, April 2005.
[26] S. Pachev. Understanding MySQL internals. O'Reilly Media, Inc., 2007.
[27] S. Park, S. Lu, and Y. Zhou. CTrigger: Exposing atomicity violation bugs from their hiding places. In Proc. of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2009.
[28] F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou. Rx: Treating bugs as allergies - A safe method to survive software failures. In Proc. of Symposium on Operating System Principles (SOSP), 2005.
[29] S. K. Sahoo, J. Criswell, and V. S. Adve. An empirical study of reported bugs in server software with implications for automated bug diagnosis. Tech. Report 2142/13697, University of Illinois, 2009.
[30] B. Schroeder. On-line monitoring: A tutorial. IEEE Computer, 28(6):72-78, Jun 1995.
[31] M. Sullivan and R. Chillarege. A comparison of software defects in database management systems and operating systems. In Proc. of International Symposium on Fault-Tolerant Computing (FTCS), 1992.
[32] B. Vandiver, H. Balakrishnan, B. Liskov, and S. Madden. Tolerating byzantine faults in transaction processing systems using commit barrier scheduling. In Proc. of Symposium on Operating System Principles (SOSP), 2007.
[33] L. Voinea and A. Telea. How do changes in buggy mozilla files propagate? In Proc. of Symposium on Software Visualization (SoftVis), 2006.
