Notes on Theory of Distributed Systems
James Aspnes
2024-12-17 13:29
Table of contents ii
Preface xxiv
Syllabus xxvi
1 Introduction 1
1.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
I Message passing 7
2 Model 8
2.1 Basic message-passing model . . . . . . . . . . . . . . . . . . 8
2.1.1 Formal details . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Network structure . . . . . . . . . . . . . . . . . . . . 10
2.2 Asynchronous systems . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Example: client-server computing . . . . . . . . . . . . 11
2.3 Synchronous systems . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Drawing message-passing executions . . . . . . . . . . . . . . 12
2.5 Complexity measures . . . . . . . . . . . . . . . . . . . . . . . 13
5 Leader election 32
5.1 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Leader election in rings . . . . . . . . . . . . . . . . . . . . . 34
5.2.1 The Le Lann-Chang-Roberts algorithm . . . . . . . . 35
5.2.1.1 Performance . . . . . . . . . . . . . . . . . . 36
5.2.2 The Hirschberg-Sinclair algorithm . . . . . . . . . . . 36
5.2.3 Peterson’s algorithm for the unidirectional ring . . . . 37
5.2.4 A simple randomized O(n log n)-message algorithm . . 38
5.3 Leader election in general networks . . . . . . . . . . . . . . . 40
5.4 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4.1 Lower bound on asynchronous message complexity . . 41
5.4.2 Lower bound for comparison-based protocols . . . . . 42
7 Synchronizers 56
7.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.1 The alpha synchronizer . . . . . . . . . . . . . . . . . 58
7.2.2 The beta synchronizer . . . . . . . . . . . . . . . . . . 58
8 Coordinated attack 64
8.1 Formal description . . . . . . . . . . . . . . . . . . . . . . . . 64
8.2 Impossibility proof . . . . . . . . . . . . . . . . . . . . . . . . 65
8.3 Randomized coordinated attack . . . . . . . . . . . . . . . . . 67
8.3.1 An algorithm . . . . . . . . . . . . . . . . . . . . . . . 67
8.3.2 Why it works . . . . . . . . . . . . . . . . . . . . . . . 68
8.3.3 Almost-matching lower bound . . . . . . . . . . . . . . 69
9 Synchronous agreement 70
9.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . 70
9.2 Solution using flooding . . . . . . . . . . . . . . . . . . . . . . 71
9.2.1 Authenticated version . . . . . . . . . . . . . . . . . . 72
9.3 Lower bound on rounds . . . . . . . . . . . . . . . . . . . . . 73
9.4 Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
10 Byzantine agreement 76
10.1 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
10.1.1 Minimum number of rounds . . . . . . . . . . . . . . . 76
10.1.2 Minimum number of processes . . . . . . . . . . . . . 76
10.1.3 Minimum connectivity . . . . . . . . . . . . . . . . . . 78
10.1.4 Weak Byzantine agreement . . . . . . . . . . . . . . . 79
10.2 Upper bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
10.2.1 Exponential information gathering gets n = 3f + 1 . . 81
10.2.1.1 Proof of correctness . . . . . . . . . . . . . . 83
10.2.2 Phase king gets constant-size messages . . . . . . . . . 84
10.2.2.1 The algorithm . . . . . . . . . . . . . . . . . 84
10.2.2.2 Proof of correctness . . . . . . . . . . . . . . 86
10.2.2.3 Performance of phase king . . . . . . . . . . 87
12 Paxos 93
12.1 The Paxos algorithm . . . . . . . . . . . . . . . . . . . . . . . 93
12.2 Informal analysis: how information flows between rounds . . 97
12.3 Example execution . . . . . . . . . . . . . . . . . . . . . . . . 97
12.4 Safety properties . . . . . . . . . . . . . . . . . . . . . . . . . 99
12.5 Learning the results . . . . . . . . . . . . . . . . . . . . . . . 100
12.6 Liveness properties . . . . . . . . . . . . . . . . . . . . . . . . 100
12.7 Replicated state machines and multi-Paxos . . . . . . . . . . 101
15 Blockchains 122
15.1 Sybil attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
15.1.1 Resource-based defenses . . . . . . . . . . . . . . . . . 124
15.1.2 Limitations of resource-based defenses . . . . . . . . . 125
15.1.3 Alternative defenses . . . . . . . . . . . . . . . . . . . 126
15.2 Bitcoin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
15.2.1 Obtaining eventual consistency . . . . . . . . . . . . . 128
15.2.2 Does Bitcoin disprove the folk theorem? . . . . . . . . 131
16 Model 134
16.1 Atomic registers . . . . . . . . . . . . . . . . . . . . . . . . . 134
16.2 Single-writer versus multi-writer registers . . . . . . . . . . . 135
16.3 Fairness and crashes . . . . . . . . . . . . . . . . . . . . . . . 136
16.4 Concurrent executions . . . . . . . . . . . . . . . . . . . . . . 136
16.5 Consistency properties . . . . . . . . . . . . . . . . . . . . . . 137
16.6 Complexity measures . . . . . . . . . . . . . . . . . . . . . . . 139
16.7 Fancier registers . . . . . . . . . . . . . . . . . . . . . . . . . 141
23 Common2 225
23.1 Test-and-set and swap for two processes . . . . . . . . . . . . 226
23.2 Building n-process TAS from 2-process TAS . . . . . . . . . . 226
23.3 Obstruction-free swap from test-and-set . . . . . . . . . . . . 228
23.4 Wait-free swap from test-and-set . . . . . . . . . . . . . . . . 230
23.5 Implementations using stronger base objects . . . . . . . . . . 233
25 Renaming 252
25.1 Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
25.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
25.3 Order-preserving renaming . . . . . . . . . . . . . . . . . . . 254
25.4 Deterministic renaming . . . . . . . . . . . . . . . . . . . . . 254
25.4.1 Wait-free renaming with 2n − 1 names . . . . . . . . . 255
25.4.2 Long-lived renaming . . . . . . . . . . . . . . . . . . . 256
25.4.3 Renaming without snapshots . . . . . . . . . . . . . . 257
25.4.3.1 Splitters . . . . . . . . . . . . . . . . . . . . . 257
25.4.3.2 Splitters in a grid . . . . . . . . . . . . . . . 258
25.4.4 Getting to 2n − 1 names in polynomial space . . . . . 260
25.4.5 Renaming with test-and-set . . . . . . . . . . . . . . . 261
25.5 Randomized renaming . . . . . . . . . . . . . . . . . . . . . . 261
25.5.1 Randomized splitters . . . . . . . . . . . . . . . . . . . 262
25.5.2 Randomized test-and-set plus sampling . . . . . . . . 262
25.5.3 Renaming with sorting networks . . . . . . . . . . . . 263
25.5.3.1 Sorting networks . . . . . . . . . . . . . . . . 263
25.5.3.2 Renaming networks . . . . . . . . . . . . . . 264
25.5.4 Randomized loose renaming . . . . . . . . . . . . . . . 266
27 Obstruction-freedom 275
27.1 Why build obstruction-free algorithms? . . . . . . . . . . . . 276
27.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
27.2.1 Lock-free implementations . . . . . . . . . . . . . . . . 276
27.2.2 Double-collect snapshots . . . . . . . . . . . . . . . . . 276
27.2.3 Software transactional memory . . . . . . . . . . . . . 277
27.2.4 Obstruction-free test-and-set . . . . . . . . . . . . . . 277
27.2.5 An obstruction-free deque . . . . . . . . . . . . . . . . 279
28 BG simulation 294
28.1 High-level strategy . . . . . . . . . . . . . . . . . . . . . . . . 295
28.2 Safe agreement . . . . . . . . . . . . . . . . . . . . . . . . . . 295
28.3 The basic simulation algorithm . . . . . . . . . . . . . . . . . 297
28.4 Effect of failures . . . . . . . . . . . . . . . . . . . . . . . . . 298
28.5 Inputs and outputs . . . . . . . . . . . . . . . . . . . . . . . . 298
28.6 Correctness of the simulation . . . . . . . . . . . . . . . . . . 299
28.7 BG simulation and consensus . . . . . . . . . . . . . . . . . . 300
31 Overview 326
32 Self-stabilization 327
32.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
32.2 Token ring circulation . . . . . . . . . . . . . . . . . . . . . . 329
32.3 Synchronizers . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
32.4 Spanning trees . . . . . . . . . . . . . . . . . . . . . . . . . . 334
32.5 Self-stabilization and local algorithms . . . . . . . . . . . . . 335
36 Beeping 361
36.1 Interval coloring . . . . . . . . . . . . . . . . . . . . . . . . . 362
36.1.1 Estimating the degree . . . . . . . . . . . . . . . . . . 363
36.1.2 Picking slots . . . . . . . . . . . . . . . . . . . . . . . 363
36.1.3 Detecting collisions . . . . . . . . . . . . . . . . . . . . 363
36.2 Maximal independent set . . . . . . . . . . . . . . . . . . . . 364
36.2.1 Lower bound . . . . . . . . . . . . . . . . . . . . . . . 364
36.2.2 Upper bound with known bound on n . . . . . . . . . 366
Appendix 370
A Assignments 370
A.1 Assignment 1: due Thursday 2025-01-23, at 23:59 Eastern US
time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
A.2 Assignment 2: due Thursday 2025-02-06, at 23:59 Eastern US
time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
A.3 Assignment 3: due Thursday 2025-02-20, at 23:59 Eastern US
time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
A.4 Assignment 4: due Thursday 2025-03-06, at 23:59 Eastern US
time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
A.5 Assignment 5: due Thursday 2025-04-03, at 23:59 Eastern US
time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
A.6 Assignment 6: due Thursday 2025-04-17, at 23:59 Eastern US
time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Bibliography 514
Index 542
List of Figures
22.1 Snapshot from max arrays; taken from [AACHE15, Fig. 2] . . 223
List of Tables
List of Algorithms
12.1 Paxos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
G.2 Two-process consensus using the object from Problem G.5.1 . 469
G.3 Implementation of a rotate register . . . . . . . . . . . . . . . 472
G.4 Randomized two-process test-and-set for G.6.2 . . . . . . . . . 473
G.5 Mutex using a swap object and register . . . . . . . . . . . . . 478
These are notes for the Yale course CPSC 465/565 Theory of Distributed
Systems. This document also incorporates the lecture schedule and assign-
ments, as well as some sample assignments from previous semesters. Because
this is a work in progress, it will be updated frequently over the course of
the semester.
The most recent version of these notes will be available at https://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf. More stable archival versions may be found at https://arxiv.org/abs/2001.04235.
Not all topics in the notes will be covered during a particular semester.
Some chapters have not been updated and are marked as possibly out of
date.
Much of the structure of the course follows Attiya and Welch’s Dis-
tributed Computing [AW04], with some topics based on Lynch’s Distributed
Algorithms [Lyn96] and additional readings from the research literature. In
most cases you’ll find these materials contain much more detail than what is
presented here, so it may be better to consider this document a supplement
to them than to treat it as your primary source of information.
Acknowledgments
Many parts of these notes were improved by feedback from students taking
various versions of this course, as well as others who have kindly pointed
out errors in the notes after reading them online. Many of these suggestions,
sadly, went unrecorded, so I must apologize to the many students who
should be thanked here but whose names I didn’t keep track of in the past.
However, I can thank Mike Marmar and Hao Pan in particular for suggesting
improvements to some of the posted solutions, Guy Laden for suggesting
corrections to Figure 12.1, and Ali Mamdouh for pointing out an error in
the original presentation of Algorithm 5.2.
Syllabus
Description
Models of asynchronous distributed computing systems. Fundamental con-
cepts of concurrency and synchronization, communication, reliability, topo-
logical and geometric constraints, time and space complexity, and distributed
algorithms.
Meeting times
Lectures are Mondays and Wednesdays, from 14:30 to 15:45 in SSS 114.
Staff
The instructor for the course is James Aspnes. Office: AKW 401. Email:
[email protected]. URL: https://www.cs.yale.edu/homes/aspnes/.
The teaching fellows are:
• Weijie Wang [email protected].
Office hours for all course staff can be found in the calendar on James
Aspnes’s web page.
Textbook
The primary course textbook is the lecture notes.
You may also find it helpful to look at the textbook on which the notes
were originally based:
Hagit Attiya and Jennifer Welch, Distributed Computing: Fundamentals,
Simulations, and Advanced Topics, second edition. Wiley, 2004. QA76.9.D5
A75X 2004 (LC). ISBN 0471453242.
On-line version: https://dx.doi.org/10.1002/0471478210. (This may
not work outside Yale.)
Errata: http://www.cs.technion.ac.il/~hagit/DC/2nd-errata.html.
Course requirements
If you are taking the class as CPSC 465: Six graded assignments (100% of
the semester grade).
If you are taking the class as CPSC 565: Six graded assignments (85% of
the semester grade), plus a brief presentation (15%).
Each presentation will be a short description of the main results in a
relevant paper chosen in consultation with the instructor, and (circumstances
permitting) will be done live during one of the last few lecture slots. If
numbers and time permit, it may be possible to negotiate doing a presentation
even if you are taking the class as CPSC 465.
Late assignments
Late assignments will not be accepted without a Dean’s Excuse.
As always, the future is uncertain, so you should take parts of the schedule
that haven’t happened yet with a grain of salt. Unless otherwise specified,
readings refer to chapters or sections in the course notes.
Chapter 1

Introduction
automaton-based models.
What this means is that we model each process in the system as an
automaton that has some sort of local state, and model local computation
as a transition rule that tells us how to update this state in response to
various events. Depending on what kind of system we are modeling, these events might correspond to local computation, to delivery of a message by a network, to carrying out some operation on a shared memory, or even to something like a chemical reaction between two molecules. The transition rule for a
system specifies how the states of all processes involved in the event are
updated, based on their previous states. We can think of the transition
rule as an arbitrary mathematical function (or relation if the processes are
nondeterministic); this corresponds in programming terms to implementing
local computation by processes as a gigantic table lookup.
Obviously this is not how we program systems in practice. But what this
approach does is allow us to abstract away completely from how individual
processes work, and emphasize how all of the processes interact with each
other. This can lead to odd results: for example, it’s perfectly consistent
with this model for some process to be able to solve the halting problem, or
carry out arbitrarily complex calculations between receiving a message and
sending its response. A partial justification for this assumption is that in
practice, the multi-millisecond latencies in even reasonably fast networks are
eons in terms of local computation. And as with any assumption, we can
always modify it if it gets us into trouble.
1.1 Models
The global state consisting of all process states is called a configuration,
and we think of the system as a whole as passing from one global state
or configuration to another in response to each event. When this occurs
the processes participating in the event update their states, and the other
processes do nothing. This does not model concurrency directly; instead,
we interleave potentially concurrent events in some arbitrary way. The
advantage of this interleaving approach is that it gives us essentially the
same behavior as we would get if we modeled simultaneous events explicitly,
but still allows us to consider only one event at a time and use induction to
prove various properties of the sequence of configurations we might reach.
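To make this concrete, here is a small illustrative Python sketch (mine, not the notes'): a configuration is a tuple of process states, and an event applies a transition rule to the states of the processes participating in it.

# A minimal sketch of the automaton model: a configuration is a tuple of
# per-process states, and an event is a transition rule applied to the
# states of the participating processes. All names are illustrative.

def apply_event(config, participants, transition):
    """Apply one event to a configuration (a tuple of process states)."""
    new = list(config)
    updated = transition(tuple(config[p] for p in participants))
    for p, s in zip(participants, updated):
        new[p] = s
    return tuple(new)

# Example event: process 0 hands a token to process 1.
pass_token = lambda states: (states[0] - 1, states[1] + 1)

C0 = (1, 0, 0)                        # initial configuration
C1 = apply_event(C0, (0, 1), pass_token)
print(C0, "->", C1)                   # (1, 0, 0) -> (0, 1, 0)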
We will often use lowercase Greek letters for individual events or sequences of events. Configurations are typically written as capital Latin letters (often C). An execution of a schedule is an alternating sequence of configurations and events.
In the simplest case, the objects are simple memory cells supporting
read and write operations. These are called atomic registers. But
in general, the objects could be more complex hardware primitives
like compare-and-swap (§19.2.3), load-linked/store-conditional
(§19.2.3), atomic queues, or even more exotic objects from the seldom-
visited theoretical depths.
Practical shared-memory systems may be implemented as distributed
shared-memory (Chapter 17) on top of a message-passing system.
This gives an alternative approach to designing message-passing systems
if it turns out that shared memory is easier to use for a particular
problem.
Like message-passing systems, shared-memory systems must also deal
with issues of asynchrony and failures, both in the processes and in the
shared objects.
Realistic shared-memory systems have additional complications, in that modern CPUs allow out-of-order execution in the absence of special (and expensive) operations called fences or memory barriers [AG95].
We will effectively be assuming that our shared-memory code is liberally
sprinkled with these operations so that nothing surprising happens,
but this is not always true of real production code, and indeed there is
work in the theory of distributed computing literature on algorithms
that don’t require unlimited use of memory barriers.
We’ll see many of these at some point in this course, and examine which
of them can simulate each other under various conditions.
1.2 Properties
Properties we might want to prove about a system include:
• Safety properties, of the form “nothing bad ever happens” or, more
precisely, “there are no bad reachable configurations.” These include
things like “at most one of the traffic lights at the intersection of Busy
Road and Main Street is ever green” or “every value read from a counter
equals the number of preceding increment operations.” Such properties
are typically proved using an invariant, a property of configurations that is true initially and that is preserved by all transitions (this is essentially a disguised induction proof).
There are some basic proof techniques that we will see over and over
again in distributed computing.
For lower bound and impossibility proofs, the main tool is the indistinguishability argument. Here we construct two (or more) executions in which some process has the same input and thus behaves the same way, regardless of what algorithm it is running. This exploitation of a process's ignorance is what makes impossibility results possible in distributed computing despite being notoriously difficult in most areas of computer science.²
For safety properties, statements that some bad outcome never occurs,
the main proof technique is to construct an invariant. An invariant is
essentially an induction hypothesis on reachable configurations of the system;
an invariant proof shows that the invariant holds in all initial configurations,
and that if it holds in some configuration, it holds in any configuration that
is reachable in one step.
Induction is also useful for proving termination and liveness properties,
statements that some good outcome occurs after a bounded amount of time.
Here we typically structure the induction hypothesis as a progress measure,
where we argue that each time unit causes the progress measure to advance
by some predictable amount, and that when the progress measure reaches a
particular value, our desired outcome is achieved.
²An exception might be lower bounds for data structures, which also rely on a process's ignorance.
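As a toy illustration of the invariant method (my example, with made-up names), the following Python sketch enumerates every configuration reachable from the initial one in a three-process token ring and checks the safety property that exactly one token exists:

# Proving a safety property by checking an invariant on all reachable
# configurations of a tiny system: 3 processes in a ring pass one token;
# the invariant is that exactly one process holds the token.
from collections import deque

N = 3
initial = (1, 0, 0)                   # process 0 holds the token

def successors(config):
    i = config.index(1)               # the holder may pass the token clockwise
    nxt = list(config)
    nxt[i], nxt[(i + 1) % N] = 0, 1
    yield tuple(nxt)

invariant = lambda c: sum(c) == 1

seen, frontier = {initial}, deque([initial])
while frontier:
    c = frontier.popleft()
    assert invariant(c), f"invariant violated in {c}"
    for s in successors(c):
        if s not in seen:
            seen.add(s)
            frontier.append(s)

print("invariant holds in all", len(seen), "reachable configurations")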
Part I
Message passing
Chapter 2
Model
each triple (C_i, φ_{i+1}, C_{i+1}) is consistent with the transition rules for the event φ_{i+1}, and the last element of the sequence (if any) is a configuration. If the first configuration C_0 is an initial configuration of the system, we have an execution. A schedule is an execution with the configurations removed.
1 initially do
2 send request to server
3 upon receiving response do
4 update state
The interpretation of Algorithm 2.1 is that the client sends request (by
adding it to its outbuf) in its very first computation event (after which it does
nothing). The interpretation of Algorithm 2.2 is that in any computation
event where the server observes request in its inbuf, it sends response.
We want to claim that the client eventually receives response in any
admissible execution. To prove this, observe that:
1. After finitely many steps, the client carries out a computation event.
This computation event puts request in the message buffer between the
client and server.
2. After finitely many more steps, a delivery event occurs that delivers
request to the server. This causes the server to send response.
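One way to see this argument play out is to simulate an admissible execution directly. In the illustrative Python sketch below (mine, not the notes' code), a scheduler picks arbitrary delivery events, and the client still ends up updated:

# Simulation sketch of the client-server protocol: a fair but otherwise
# arbitrary scheduler interleaves delivery events; the client eventually
# processes the server's response no matter the order chosen.
import random

channel = {"cs": ["request"], "sc": []}    # client already sent request
client_state = "waiting"

steps = 0
while client_state != "updated":
    steps += 1
    # the scheduler delivers from any nonempty channel; random choice is
    # fair with probability 1, so every pending message gets through
    name = random.choice([k for k, v in channel.items() if v])
    msg = channel[name].pop(0)
    if name == "cs" and msg == "request":      # server's computation event
        channel["sc"].append("response")
    elif name == "sc" and msg == "response":   # client's computation event
        client_state = "updated"

print("client updated after", steps, "delivery events")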
[Figure 2.1: an asynchronous execution; processes p1, p2, p3, time increasing to the right.]
Pictures like Figure 2.1 can be helpful for illustrating the various con-
straints we might put on message delivery. In Figure 2.1, the system is
completely asynchronous: messages can be delivered in any order, even if
sent between the same processes. If we run the same protocol under stronger
assumptions, we will get different communication patterns.
For example, Figure 2.2 shows an execution that is still asynchronous but
that assumes FIFO (first-in first-out) channels. A FIFO channel from some
process p to another process q guarantees that q receives messages in the
same order that p sends them (this can be simulated by a non-FIFO channel
by adding a sequence number to each message, and queuing messages at
the receiver until all previous messages have been processed).
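The sequence-number construction is easy to write out explicitly. Here is an illustrative Python sketch of the receiver's side (names are made up):

# Simulating a FIFO channel over out-of-order delivery: the sender tags
# messages with consecutive sequence numbers; the receiver buffers early
# arrivals and releases messages only in sequence order.

class FifoReceiver:
    def __init__(self):
        self.next_seq = 0
        self.buffer = {}              # seq -> payload, for early arrivals

    def on_deliver(self, seq, payload):
        """Called when the underlying (non-FIFO) channel delivers."""
        self.buffer[seq] = payload
        released = []
        while self.next_seq in self.buffer:
            released.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return released               # messages passed up, in FIFO order

r = FifoReceiver()
print(r.on_deliver(1, "b"))           # [] -- must wait for seq 0
print(r.on_deliver(0, "a"))           # ['a', 'b']
print(r.on_deliver(2, "c"))           # ['c']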
If we go as far as to assume synchrony, we get the execution in Figure 2.3.
Now all messages take exactly one time unit to arrive, and computation
events follow each other in lockstep.
[Figure 2.2: an asynchronous execution with FIFO channels; processes p1, p2, p3, time increasing to the right.]
[Figure 2.3: a synchronous execution; processes p1, p2, p3, time increasing to the right.]
[Figure 2.4: an asynchronous execution with events annotated by assigned times; processes p1, p2, p3, time increasing to the right.]
time t, then the time for the delivery of m from i to j is no greater than t + 1,
and (c) any computation step is assigned a time no later than the previous
event at the same process (or 0 if the process has no previous events). This
is consistent with an assumption that message propagation takes at most 1
time unit and that local computation takes 0 time units.
Another way to look at this is that it is a definition of a time unit in terms
of maximum message delay together with an assumption that message delays
dominate the cost of the computation. This last assumption is pretty much
always true for real-world networks with any non-trivial physical separation
between components, thanks to speed of light limitations.
An example of an execution annotated with times in this way is given in
Figure 2.4.
The time complexity of a protocol (that terminates) is the time of the
last event at any process.
Note that step complexity, the number of computation events involving either a particular process (individual step complexity) or all processes (total step complexity), is not useful in the asynchronous
model, because a process may be scheduled to carry out arbitrarily many
computation steps without any of its incoming or outgoing messages being
delivered, which probably means that it won’t be making any progress. These
complexity measures will be more useful when we look at shared-memory
models (Part II).
For a protocol that terminates, the message complexity is the total
number of messages sent. We can also look at message length in bits, total
bits sent, and so on, if these are useful for distinguishing our new improved
protocol from last year’s model.
For synchronous systems, time complexity becomes just the number of
rounds until a protocol finishes. Message complexity is still only loosely
connected to time complexity; for example, there are synchronous leader
3.1 Flooding
Flooding is about the simplest of all distributed algorithms. It’s dumb and
expensive, but easy to implement, and gives you both a broadcast mechanism
and a way to build rooted spanning trees.
We’ll give a fairly simple presentation of flooding roughly following
Chapter 2 of [AW04]. For more recent work on flooding see [HT23].
Chapter 3

Broadcast and convergecast
1 initially do
2 if pid = root then
3 seen-message ← true
4 send M to all neighbors
5 else
6 seen-message ← false
7 upon receiving M do
8 if seen-message = false then
9 seen-message ← true
10 send M to all neighbors
Note that the time complexity proof also demonstrates correctness: every
process receives M at least once.
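Here is an illustrative Python simulation of flooding (my code, on an arbitrary small graph with adversarial, here random, delivery order). It also counts messages: each process sends M across each incident edge at most once, for at most 2|E| messages in total.

# Simulation sketch of flooding: random delivery order, message count.
import random

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
root = 0

seen = {v: v == root for v in graph}
in_transit = [(root, v) for v in graph[root]]   # (sender, recipient) pairs
messages = len(in_transit)

while in_transit:
    u, v = in_transit.pop(random.randrange(len(in_transit)))
    if not seen[v]:
        seen[v] = True
        new = [(v, w) for w in graph[v]]
        in_transit.extend(new)
        messages += len(new)

assert all(seen.values())                       # every process got M
print("total messages:", messages)              # here 2|E| = 10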
As written, this is a one-shot algorithm: you can’t broadcast a second
message even if you wanted to. The obvious fix is for each process to
remember which messages it has seen and only forward the new ones (which
costs memory) and/or to add a time-to-live (TTL) field on each message
that drops by one each time it is forwarded (which may cost extra messages
and possibly prevents complete broadcast if the initial TTL is too small).
The latter method is what was used for searching in Gnutella (http://en.wikipedia.org/wiki/Gnutella), an early peer-to-peer system. An interesting property of Gnutella was that since the application of flooding was to search for huge (multiple MiB) files using tiny (≈100 byte) query messages, the actual bit complexity of the flooding algorithm was not especially large relative to the bit complexity of sending any file that was found.
We can optimize the algorithm slightly by not sending M back to the
node it came from; this will slightly reduce the message complexity in many
cases but makes the proof a sentence or two longer. (It’s all a question of
what you want to optimize.)
1 initially do
2 if pid = root then
3 parent ← root
4 send M to all neighbors
5 else
6 parent ← ⊥
We can easily prove that Algorithm 3.2 has the same termination proper-
ties as Algorithm 3.1 by observing that if we map parent to seen-message by
the rule ⊥ → false, anything else → true, then we have the same algorithm.
We would like one additional property, which is that when the algorithm
quiesces (has no outstanding messages), the set of parent pointers form a
rooted spanning tree. For this we use induction on time:
Lemma 3.1.2. At any time during the execution of Algorithm 3.2, the
following invariant holds:
Proof. We have to show that the invariant is true initially, and that any
event preserves the invariant. We’ll assume that all events are delivery events
for a single message, since we can have the algorithm treat a multi-message
delivery event as a sequence of single-message delivery events.
We’ll treat the initial configuration as the result of the root setting its
parent to itself and sending messages to all its neighbors. It’s not hard to
verify that the invariant holds in the resulting configuration.
For a delivery event, let v receive M from u. There are two cases: if
v.parent is already non-null, the only state change is that M is no longer in
transit, so we don’t care about u.parent any more. If v.parent is null, then
1. v.parent is set to u. This triggers the first case of the invariant. From the induction hypothesis we have that u.parent ≠ ⊥ and that there exists a path from u to the root. Then v.parent.parent = u.parent ≠ ⊥, and the path v → u → root gives the path from v.
At the end of the algorithm, the invariant shows that every process has a
path to the root, i.e., that the graph represented by the parent pointers is
connected. Since this graph has exactly |V | − 1 edges (if we don’t count the
self-loop at the root), it’s a tree.
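The conclusion of this argument is easy to test. The illustrative sketch below (my code) runs the parent-pointer flooding under a random delivery order and then checks that following parent pointers from any node reaches the root:

# Sketch of flooding with parent pointers (Algorithm 3.2): after
# quiescence, verify the pointers form a spanning tree rooted at root.
import random

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
root = 0

parent = {v: None for v in graph}
parent[root] = root
in_transit = [(root, v) for v in graph[root]]

while in_transit:
    u, v = in_transit.pop(random.randrange(len(in_transit)))
    if parent[v] is None:
        parent[v] = u
        in_transit.extend((v, w) for w in graph[v])

for v in graph:                       # every node reaches the root
    w, hops = v, 0
    while w != root:
        w, hops = parent[w], hops + 1
        assert hops <= len(graph)     # would cycle forever otherwise
print("parent pointers:", parent)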
Though we get a spanning tree at the end, we may not get a very good
spanning tree. For example, suppose our friend the adversary picks some
Hamiltonian path through the network and delivers messages along this
path very quickly while delaying all other messages for the full allowed 1
time unit. Then the resulting spanning tree will have depth |V | − 1, which
might be much worse than D. If we want the shallowest possible spanning
tree, we need to do something more sophisticated: see the discussion of
distributed breadth-first search in Chapter 4. However, we may be
happy with the tree we get from simple flooding: if the message delay on
each link is consistent, then it’s not hard to prove that we in fact get a
shortest-path tree. As a special case, flooding always produces a BFS tree in
the synchronous model.
Note also that while the algorithm works in a directed graph, the parent
pointers may not be very useful if links aren’t two-way.
asynchronous, this requires each neighbor to inform the node both whether it
is a child (using an ack message) and when it is not (using a nack message);
only upon receiving one or the other of these messages will the node know
that it’s not going to receive the other.
The modified code is given in Algorithm 3.3.
1 initially do
2 nonChildren ← ∅
3 if pid = root then
4 parent ← root
5 children ← {root}
6 send M to all neighbors
7 else
8 parent ← ⊥
9 children ← ∅
If we take an execution of Algorithm 3.3 and remove all the ack and nack
messages, we get an execution of Algorithm 3.2. So all of the properties that
we proved for Algorithm 3.2 continue to hold.
For the improved algorithm, we'd like to show that when the algorithm quiesces, every node p_i has a list of all the nodes p_j for which p_j.parent = p_i in p_i.children, and a list of all the neighbors p_j for which p_j.parent ≠ p_i in p_i.nonChildren.
We can do this by showing a mix of safety and liveness properties:
Since we assume that each p_i knows which nodes are its neighbors, we can use the property that p_i.children ∪ p_i.nonChildren includes all neighbors as a kind of local termination test. This can be handy if we want to use flooding as the first step in some larger protocol.
3.2 Convergecast
A convergecast is the inverse of broadcast: instead of a message propagating
down from a single root to all nodes, data is collected from outlying nodes
to the root. Typically some function is applied to the incoming data at
each node to summarize it, with the goal being that eventually the root
obtains this function of all the data in the entire system. (Examples would
be counting all the nodes or taking an average of input values at all the
nodes.)
A basic convergecast algorithm is given in Algorithm 3.4; it propagates
information up through a previously-computed spanning tree.
The details of what is being computed depend on the choice of f :
• If input = 1 for all nodes and f is sum, then we count the number of
nodes in the system.
• If input is arbitrary and f is sum, then we get a total of all the input
values.
• Combining the above lets us compute averages, by dividing the total
of all the inputs by the node count.
1 initially do
2 if I am a leaf then
3 send input to parent
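In the full algorithm, an internal node waits for values from all of its children, combines them (together with its own input) using f, and forwards the result to its parent. Here is a compact illustrative Python sketch, with ordinary recursion standing in for the message schedule and f taken to be summation, so the root computes the total of all inputs:

# Convergecast over a known rooted spanning tree: leaves send inputs up;
# each internal node combines its children's values and its own input
# with f and forwards the result; the root ends with f of everything.

children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}   # tree, root = 0
inputs = {0: 5, 1: 1, 2: 2, 3: 3, 4: 4}

def convergecast(v, f=sum):
    vals = [convergecast(c, f) for c in children[v]]
    return f(vals + [inputs[v]])

print("root's total:", convergecast(0))   # 15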
1 initially do
2 children ← ∅
3 nonChildren ← ∅
4 if pid = root then
5 parent ← root
6 send init to all neighbors
7 else
8 parent ← ⊥
By carefully arranging for the p_i p_{i+1} links to run much faster than the p_0 p_i links, the adversary can make flooding build a tree that consists of a single path p_0 p_1 p_2 … p_{n−1}, even though the diameter of the network is only 2. While it only takes 2 time units to build this tree (because every node is only one hop away from the initiator), when we run convergecast we suddenly find that the previously-speedy links are now running only at the guaranteed ≤ 1 time unit per hop rate, meaning that convergecast takes n − 1 time.
This may be less of an issue in real networks, where the latency of links
may be more uniform over time, meaning that a deep tree of fast links is
still likely to be fast when we reach the convergecast step. But in the worst
case we will need to be more clever about building the tree. We show how
to do this in Chapter 4.
Chapter 4
Distributed breadth-first
search
1 initially do
2 if pid = initiator then
3 distance ← 0
4 send distance to all neighbors
5 else
6 distance ← ∞
then it sets it to one more than the length of some path from initiator to p′, which is the length of that same path extended by adding the pp′ edge.
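A direct simulation makes the termination claim concrete. In the illustrative Python sketch below (my code; the receive rule, adopt d + 1 whenever it beats the current estimate and re-send, follows the discussion above), delivery order is random, yet the estimates only decrease and settle at the true distances:

# Asynchronous BFS by distance flooding: a node hearing d with
# d + 1 < (its current estimate) adopts d + 1 and re-sends. At
# quiescence every estimate equals the true distance from the initiator.
import random

graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # a 4-cycle
dist = {v: float("inf") for v in graph}
dist[0] = 0
in_transit = [(v, 0) for v in graph[0]]                # (recipient, d) pairs

while in_transit:
    v, d = in_transit.pop(random.randrange(len(in_transit)))
    if d + 1 < dist[v]:
        dist[v] = d + 1
        in_transit.extend((w, dist[v]) for w in graph[v])

print(dist)                                            # {0: 0, 1: 1, 2: 2, 3: 1}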
The initiator sends exactly(0) to all neighbors at the start of the protocol (these are the only messages the initiator sends).
My distance will be the unique distance that I am allowed to send in exactly(d) messages.¹ Note that this algorithm terminates in the sense that every node learns its distance at some finite time.

¹In an earlier version of these notes, these messages were called distance(d) and not-distance(d); the more self-explanatory exactly and more-than terminology is taken from [BDLP08].
If you read the discussion of synchronizers in Chapter 7, this algorithm
essentially corresponds to building the alpha synchronizer into the syn-
chronous BFS algorithm, just as the layered model builds in the beta
synchronizer. See [AW04, §11.3.2] for a discussion of BFS using synchro-
nizers. The original approach of applying synchronizers to get BFS is due to
Awerbuch [Awe85].
We now show correctness. Under the assumption that local computation takes zero time and message delivery takes at most 1 time unit, we'll show that if d(initiator, p) = d, then (a) p sends more-than(d′) for any d′ < d by time d′; (b) p sends exactly(d) by time d; (c) p never sends more-than(d′) for any d′ ≥ d; and (d) p never sends exactly(d′) for any d′ ≠ d. For parts (c) and (d) we use induction on d′; for (a) and (b), induction on time. This is not terribly surprising: (c) and (d) are safety properties, so we don't need to talk about time. But (a) and (b) are liveness properties, so time comes in.
Let's start with (c) and (d). The base case is that the initiator never sends any more-than messages at all, and so never sends more-than(0), and any non-initiator never sends exactly(0). For larger d′, observe that if a non-initiator p sends more-than(d′) for d′ ≥ d, it must first have received more-than(d′ − 1) from all neighbors, including some neighbor p′ at distance d − 1. But the induction hypothesis tells us that p′ can't send more-than(d′ − 1) for d′ − 1 ≥ d − 1. Similarly, to send exactly(d′) for d′ < d, p must first have received exactly(d′ − 1) from some neighbor p′, but again p′ must be at distance at least d − 1 from the initiator and so can't send this message either. In the other direction, to send exactly(d′) for d′ > d, p must first receive more-than(d′ − 2) from its closer neighbor p′; but then d′ − 2 ≥ d − 1, so more-than(d′ − 2) is not sent by p′.
Now for (a) and (b). The base case is that the initiator sends exactly(0) to all nodes at time 0, giving (b), and there is no more-than(d′) with d′ < 0 for it to send, giving (a) vacuously; and any non-initiator sends more-than(0) immediately. At time t + 1, we have that (a) more-than(t) was sent by any node at distance t + 1 or greater by time t and (b) exactly(t) was sent by any node at distance t by time t; so any node at distance t + 2 or greater sends more-than(t + 1) no later than time t + 1 (because it has already received more-than(t) from all its neighbors), and any node at distance t + 1 sends exactly(t + 1) no later than time t + 1 (because it received all the preconditions for doing so by this time).
Message complexity: A node at distance d sends more-than(d′) for all 0 < d′ < d, plus exactly(d), and no other messages. So we have message complexity bounded by |E| · D in the worst case. Note that this gives a bound of O(DE), which is slightly worse than the O(E + DV) bound for the layered algorithm.
Time complexity: It’s immediate from (a) and (b) that all messages that
are sent are sent by time D, and indeed that any node p learns its distance
at time d(initiator, p). So we have optimal time complexity, at the cost of
higher message complexity. I don’t know if this trade-off is necessary, or if a
more sophisticated algorithm could optimize both.
Our time proof assumes that messages don’t pile up on edges, or that
such pile-ups don’t affect delivery time (this is the default assumption used
in [AW04]). A more sophisticated proof could remove this assumption.
One downside of this algorithm is that it has to be started simultaneously
Chapter 5

Leader election
5.1 Symmetry
A system exhibits symmetry if we can permute the nodes without changing
the behavior of the system. More formally, we can define a symmetry
as an equivalence relation on processes, where we have the additional
properties that all processes in the same equivalence class run the same code;
and whenever p is equivalent to p′, each neighbor q of p is equivalent to a corresponding neighbor q′ of p′.
An example of a network with a lot of symmetries would be an anony-
mous ring, which is a network in the form of a cycle (the ring part) in
which every process runs the same code (the anonymous part). In this case
all nodes are equivalent. If we have a line, then we might or might not have
any non-trivial symmetries: if each node has a sense of direction that tells
it which neighbor is to the left and which is to the right, then we can identify
each node uniquely by its distance from the left edge. But if the nodes don’t
have a sense of direction, we can flip the line over and pair up nodes that
map to each other.¹
Symmetries are convenient for proving impossibility results, as observed by Angluin [Ang80]. The underlying theme is that without some mechanism for symmetry breaking, a message-passing system cannot escape from a symmetric initial configuration.

¹Typically, this does not mean that the nodes can't tell their neighbors apart. But it does mean that if we swap the labels for all the neighbors (corresponding to flipping the entire line from left to right), we get the same executions.
1 initially do
2 leader ← 0
3 maxId ← idi
4 send idi to clockwise neighbor
5 upon receiving j do
6 if j = idi then
7 leader ← 1
8 if j > maxId then
9 maxId ← j
10 send j to clockwise neighbor
eventually have its value forwarded through all of the other processes, causing it to eventually set its leader bit to 1.
Looking closely at this intuition, we see that (a) is a safety property and (b) a liveness property. So we obtain a proof of correctness by converting (a) into an invariant that for each p_i ≠ p_max, id_i is never sent by any process in the range p_max … p_{i−1}; and converting (b) into an induction argument that each process p_{max+j} sends id_max to p_{max+j+1} no later than time j. Because the code only has a process p_i set leader to 1 if it receives id_i from p_{i−1}, the invariant tells us that no p_i ≠ p_max becomes the leader, while the induction argument tells us that eventually p_max does.
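The LCR protocol is also easy to simulate in synchronous rounds. In the illustrative Python sketch below (my code), each position forwards the largest ID it has seen clockwise, and after n rounds exactly the process with the maximum ID has received its own ID back:

# Synchronous simulation of LCR on a unidirectional ring.
import random

ids = random.sample(range(1000), 8)       # distinct IDs, clockwise order
n = len(ids)
max_id = list(ids)
outgoing = list(ids)                      # round 1: everyone sends own ID
leader = None

for _ in range(n):
    incoming = [outgoing[(i - 1) % n] for i in range(n)]   # one hop clockwise
    outgoing = [None] * n
    for i, j in enumerate(incoming):
        if j is None:
            continue                      # predecessor sent nothing
        if j == ids[i]:
            leader = ids[i]               # my ID survived a full circuit
        elif j > max_id[i]:
            max_id[i] = j
            outgoing[i] = j               # forward the larger ID

print("leader:", leader, "max id:", max(ids))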
5.2.1.1 Performance
It’s immediate from the correctness proof that the protocol elects a leader
within at most n time in the asynchronous model or exactly n rounds in a
synchronous model.
To bound message traffic, observe that each process sends at most one copy of each of the n process IDs, for a total of O(n²) messages. This is a tight bound, since if the IDs are in decreasing order n, n − 1, n − 2, … 1, then no messages get eaten until they hit n.
There is a subtlety with the termination guarantee: at the moment the unique leader p_max sets its leader bit, the other processes all have maxId = id_max, but they don't actually know that they have the correct leader ID, since there is no information available locally at a non-leader process that allows it to detect that there can't be some larger ID out there that just hasn't reached it yet. As with all leader election algorithms, we can have the leader confirm its election with an additional broadcast protocol, which in this case raises the time complexity from n to 2n (still O(n)) and adds an extra n messages (still O(n²) in total).
since the reverse edges are needed to send back responses to probes.
To specify the protocol, it may help to think of messages as mobile agents
and the state of each process as being of the form (local-state, {agents I’m carrying}).
Then the sending rule for a process becomes ship any agents in whatever
direction they want to go and the transition rule is accept any incoming
agents and update their state in terms of their own internal transition rules.
An agent state for HS will be something like (original-sender, direction, hop-count, max-seen), where direction is R or L depending on which way the agent is going, hop-count in phase k is initially 2^k when the agent is sent and drops by 1 each time the agent moves, and max-seen is the biggest ID of any node the agent has visited. An agent turns around (switches direction) when hop-count reaches 0.
To prove this works, we can mostly ignore the early phases (though we
have to show that the max-id node doesn’t drop out early, which is not too
hard). The last phase involves any surviving node probing all the way around
the ring, so it will declare itself leader only when it receives its own agent
from the left. That exactly one node does so is immediate from the same
argument for LCR.
Complexity analysis is mildly painful but basically comes down to the fact that any node that sends a message 2^k hops had to be a winner in phase k − 1, which means that it is the largest of some group of 2^{k−1} IDs. Thus the 2^k-hop senders are spaced at least 2^{k−1} apart, and there are at most n/2^{k−1} of them. Summing up over all ⌈lg n⌉ phases, we get ∑_{k=0}^{⌈lg n⌉} 2^k · n/2^{k−1} = O(n log n) messages and ∑_{k=0}^{⌈lg n⌉} 2^k = O(n) time.
1 procedure candidate()
2 phase ← 0
3 current ← pid
4 while true do
5 send probe(phase, current)
6 wait for probe(phase, x)
7 id2 ← x
8 send probe(phase + 1/2, id2 )
9 wait for probe(phase + 1/2, x)
10 id3 ← x
11 if id2 = current then
12 I am the leader!
13 return
14 else if id2 > current and id2 > id3 then
15 current ← id2
16 phase ← phase + 1
17 else
18 switch to relay()
19 procedure relay()
20 upon receiving probe(p, i) do
21 send probe(p, i)
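Since only the active processes matter, Peterson's algorithm can be simulated phase by phase on the ring of surviving current values. In this illustrative Python sketch (my code), id2 and id3 are the current values of a process's two nearest active predecessors, matching the probes in the pseudocode above:

# Phase-by-phase simulation of Peterson's unidirectional-ring algorithm.
# An active process survives a phase iff id2 > max(current, id3), in
# which case it adopts id2; the last survivor's value is the maximum ID.
import random

ids = random.sample(range(1000), 16)      # candidate IDs, in ring order
current = list(ids)
phase = 0
while len(current) > 1:
    k = len(current)
    id2 = [current[(i - 1) % k] for i in range(k)]   # nearest predecessor
    id3 = [current[(i - 2) % k] for i in range(k)]   # second nearest
    current = [id2[i] for i in range(k)
               if id2[i] > current[i] and id2[i] > id3[i]]
    phase += 1

print("leader value:", current[0], "= max id:", max(ids), "phases:", phase)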
assumption, but requires that the algorithm can’t do anything to IDs but
copy and compare them.
of each equivalence class after k active rounds, giving Ω(n/k) messages per
active round, which sums to Ω(n log n).
For non-comparison-based protocols we can still prove Ω(n log n) messages
for time-bounded protocols, but it requires techniques from Ramsey theory,
the branch of combinatorics that studies when large enough structures inevitably contain substructures with certain properties.⁵ Here “time-bounded”
means that the running time can’t depend on the size of the ID space. See
[AW04, §3.4.2] or [Lyn96, §3.7] for the textbook version, or [FL87, §7] for
the original result.
The intuition is that for any fixed protocol, if the ID space is large
enough, then there exists a subset of the ID space where the protocol
acts like a comparison-based protocol. So the existence of an O(f (n))-
message time-bounded protocol implies the existence of an O(f (n))-message
comparison-based protocol, and from the previous lower bound we know
f (n) is Ω(n log n). Note that time-boundedness is necessary: we can’t prove
⁵The classic example is Ramsey's Theorem, which says that if you color the edges of a complete graph red or blue, while trying to avoid having any subset of k vertices with all edges between them the same color, you will no longer be able to avoid this once the graph is large enough (for any fixed k). See [GRS90] for much more on the subject of Ramsey theory.
Chapter 6

Causal ordering and logical clocks
there exists an execution indistinguishable from the real one that contains
this configuration. Causal ordering is the tool that lets us argue that this
hypothetical execution exists.
1. All pairs (e, e′) where e precedes e′ in S and e and e′ are events of the same process.

2. All pairs (e, e′) where e is a send event and e′ is the receive event for the same message.

3. All pairs (e, e′) where there exists a third event e″ such that e ⇒_S e″ and e″ ⇒_S e′. (In other words, we take the transitive closure of the relation defined by the previous two cases.)

It is not terribly hard to show that this gives a partial order; the main observation is that if e ⇒_S e′, then e precedes e′ in S. So ⇒_S is a subset of the total order <_S given by the order of events in S.
1. S′ is a causal shuffle of S.
1. e and e′ are events of the same process p and e <_S e′. But then e <_{S′} e′ because S|p = S′|p.
In both cases, we are using the fact that if I tell you ⇒_S, then you know everything there is to know about the order of events in S that you can deduce from reports from each process together with the fact that messages don't travel back in time.
In the case that we want to use this information inside an algorithm, we run into the issue that ⇒_S is a pretty big relation (Θ(|S|²) bits with a naive encoding), and it seems to require global knowledge of <_S to compute. So we can ask if there is some simpler, easily computable description that works almost as well. This is where logical clocks come in.
Proof. Let e <_L e′ if e has a lower clock value than e′. If e and e′ are two events of the same process, then e <_L e′. If e and e′ are send and receive events of the same message, then again e <_L e′. So for any events e, e′, if e ⇒_S e′, then e <_L e′. Now apply Lemma 6.1.1.
Theorem 6.2.3. Fix a schedule S; then for any e, e′, VC(e) < VC(e′) if and only if e ⇒_S e′.
Proof. We'll start by showing that for any event e at a process p, the value of VC(e)_q for any q ≠ p is equal to the max VC(e′)_q for any event e′ of q such that e′ ⇒_S e, or 0 if there is no such e′.
The proof is by induction on the schedule so far.
If e is a local event or a send event, then there is either no preceding event at the same process (and thus no event e′ of q with e′ ⇒_S e) and VC(e)_q = 0 as required; or there is some preceding event e″ of p. Since e″ is the only immediate predecessor of e in ⇒_S, if there is an event e′ of q maximizing VC(e′)_q such that e′ ⇒_S e, then e′ ⇒_S e″, and so VC(e)_q = VC(e″)_q = VC(e′)_q as required.
Alternatively, if e is a receive event, then there is at most one immediately preceding event e₁ of the same process and a send event e₂ of the same message, such that VC(e)_q = max(VC(e₁)_q, VC(e₂)_q). Since any event e′ of q with e′ ⇒_S e has either e′ ⇒_S e₁ or e′ ⇒_S e₂, we can apply the induction hypothesis to both e₁ and e₂ and then observe that VC(e)_q = max(VC(e₁)_q, VC(e₂)_q) satisfies the requirements of the induction hypothesis.
Given this characterization of VC(e)_q, the if part follows immediately from the update rules for the vector clock. For events e ⇒_S e′ of the same process, observe that both update rules strictly increase that process's clock, so VC(e) < VC(e′). Similarly, the update rule for receiving a message implies that VC(e) < VC(e′) when e and e′ are matching send and receive events, with the minor issue that we do need to use the observation above to verify that VC(e)_p < VC(e′)_p for the receiver p.
For the only if part, suppose e does not happen-before e′. Then e and e′ are events of distinct processes p and p′. For VC(e) < VC(e′) to hold, we must have VC(e)_p ≤ VC(e′)_p; but as shown above, this can occur only if e ⇒_S e′.
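The update rules behind this proof are short in code. Here is a minimal illustrative Python sketch of vector clocks (my code): every event increments the process's own component, and a receive first takes a componentwise max with the message's timestamp.

# Minimal vector clocks: VC(e) < VC(e') (componentwise <=, strict
# somewhere) holds exactly when e happens-before e'.

N = 3
clock = [[0] * N for _ in range(N)]       # clock[p] = process p's vector

def local_event(p):
    clock[p][p] += 1
    return tuple(clock[p])                # this event's timestamp

def send_event(p):
    return local_event(p)                 # the message carries the timestamp

def recv_event(p, msg_ts):
    clock[p] = [max(a, b) for a, b in zip(clock[p], msg_ts)]
    return local_event(p)

e1 = send_event(0)                        # p0 sends to p1
e2 = recv_event(1, e1)                    # p1 receives the message
e3 = local_event(2)                       # concurrent event at p2

less = lambda a, b: all(x <= y for x, y in zip(a, b)) and a != b
print(less(e1, e2), less(e1, e3), less(e3, e1))   # True False False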
holds, we will eventually start the snapshot protocol after it holds and obtain
a configuration (which again may not correspond to any global configuration
that actually occurs) in which P holds.
Chapter 7
Synchronizers
7.1 Definitions
Formally, a synchronizer sits between the underlying network and the pro-
cesses and does one of two things:
• A global synchronizer guarantees that no process receives a message
from round r until all processes have sent their messages for round r.
7.2 Implementations
Here we describe several implementations of synchronizers. All of them give
at least local synchrony. One of them, the beta synchronizer (§7.2.2), also
gives global synchrony.
The names were chosen by their inventor, Baruch Awerbuch [Awe85].
The main difference between them is the mechanism used to determine when
round-r messages have been delivered.
In the alpha synchronizer, every node sends a message to every neigh-
bor in every round (possibly a dummy message if the underlying protocol
doesn’t send a message); this allows the receiver to detect when it’s gotten
all its round-r messages (because it expects to get a message from every
neighbor) but may produce huge blow-ups in message complexity in a dense
graph.
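Here is an illustrative Python sketch (my code) of the alpha synchronizer's bookkeeping at a single node: it pads each round with dummy messages and advances only after hearing a round-r message from every neighbor. (A full implementation would also re-check already-buffered future rounds when advancing.)

# One node's alpha-synchronizer logic: advance to simulated round r + 1
# only after receiving a (possibly dummy) round-r message from every
# neighbor.

class AlphaNode:
    def __init__(self, neighbors):
        self.neighbors = set(neighbors)
        self.round = 0
        self.heard = {}                   # round -> neighbors heard from
        self.inbox = {}                   # round -> real payloads received

    def outgoing(self, payloads):
        """Round-r messages to send: real ones, dummies for everyone else."""
        msgs = {q: ("dummy", self.round) for q in self.neighbors}
        msgs.update({q: (m, self.round) for q, m in payloads.items()})
        return msgs

    def on_receive(self, sender, msg):
        body, r = msg
        self.heard.setdefault(r, set()).add(sender)
        if body != "dummy":
            self.inbox.setdefault(r, []).append(body)
        if r == self.round and self.heard[r] == self.neighbors:
            self.round += 1               # round r is complete
            return self.inbox.get(r, [])  # deliver round-r messages upward
        return None

node = AlphaNode(neighbors={1, 2})
print(node.on_receive(1, ("dummy", 0)))   # None: still waiting for 2
print(node.on_receive(2, ("hello", 0)))   # ['hello']: round 0 complete
print(node.round)                         # 1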
• When the root of a tree gets all acks and OK, it sends ready to the
roots of all adjacent trees (and itself). Two trees are adjacent if any of
their members are adjacent.
• When the root collects ready from itself and all adjacent roots, it
broadcasts go through its own tree.
trees in the forest toward the alpha or beta ends of the spectrum, e.g., if
the whole graph is a clique (and we didn’t worry about contention issues),
we might as well just use beta and get O(1) time blowup and O(n) added
messages.
7.3 Applications
See [AW04, §11.3.2] or [Lyn96, §16.5]. The one we have seen is distributed
breadth-first search, where the two asynchronous algorithms we described in
Chapter 4 were essentially the synchronous algorithms with the beta and
alpha synchronizers embedded in them. But what synchronizers give us in
general is the ability to forget about problems resulting from asynchrony
provided we can assume no failures (which may be a very strong assumption)
and are willing to accept a bit of overhead.
We’ll see more examples of this trick of showing that a particular simula-
tion is impossible because it would allow us to violate impossibility results
later, especially when we start looking at the strength of shared-memory
objects in Chapter 19.
this execution into two segments: an initial segment γ that includes all
rounds with special actions, and a suffix δ that includes any extra rounds
where the algorithm is still floundering around. We will mostly ignore δ, but
we have to leave it in to allow for the possibility that whatever is happening
there is important for the algorithm to work (say, to detect termination).
We now want to perform a causal shuffle on γ that leaves it with only
s − 1 sessions. Because causal shuffles don’t affect time complexity, this will
give us a new bad execution γ 0 δ that has only s − 1 sessions despite taking
(s − 1)D time.
The first step is to chop γ into s − 1 segments γ₁, γ₂, …, γ_{s−1} of at most D rounds each. Because a message sent in round i is not delivered until
round i + 1, if we have a chain of k messages, each of which triggers the next,
then if the first message is sent in round i, the last message is not delivered
until round i + k. If the chain has length D, its events (including the initial
send and the final delivery) span D + 1 rounds i, i + 1, . . . , i + D. In this
case the initial send and final delivery are necessarily in different segments
γi and γi+1 .
Now pick processes p and q at distance D from each other. Then any chain of messages starting at p within some segment reaches q after the end of the segment. It follows that for any events e_p of p and e_q of q in the same segment γ_i, e_p ⇏_{γδ} e_q. So there exists a causal shuffle of γ_i that puts all events of p after all events of q.¹ By a symmetrical argument, we can similarly put all events of q in a segment after all events of p in the same segment. In both cases the resulting schedule is indistinguishable by all processes from the original.

¹Proof: Because e_p ⇏_{γδ} e_q, we can add e_q < e_p for all events e_q and e_p in γ_i and still have a partial order consistent with ⇒_{γδ}. Now apply topological sort to get the shuffle.
So now we apply these shuffles to each of the segments γ_i in alternating order: p goes first in the odd-numbered segments and q goes first in the even-numbered segments. Let's write the shuffled version of γ_i as α_i β_i for odd i and β_i α_i for even i; in each case, α_i contains only events of p and other processes that aren't q, and β_i contains only events of q and other processes that aren't p.
When we put these alternating shuffles together, we get an execution that looks like this example with s − 1 = 4:

α₁β₁ β₂α₂ α₃β₃ β₄α₄ δ
Now let’s count sessions. Since a session includes special actions by both
There is one such point for each of our original s − 1 intervals, so we get
at most s − 1 sessions.
This means that any algorithm that runs in time (s − 1)D in the worst
case (here, the original synchronous execution) can’t guarantee to give s
sessions in all cases (it fails in the shuffled asynchronous execution). Note
that this is not quite the same as saying that any execution with at least s
sessions must take (s − 1)D time. Instead, we’ve shown that algorithm that
guarantees we get at least s sessions sometimes takes more than (s − 1)D
time, even though it might sometimes use less time if it gets lucky.
Chapter 8
Coordinated attack
Validity If all processes have the same input x, and no messages are lost,
all processes produce output x. (If processes start with different inputs
or one or more messages are lost, processes can output 0 or 1 as long
as they all agree.)
Sadly, there is no protocol that satisfies all three conditions. We show
this in the next section.
agreement task, where every process must output the same value. Then since p_i outputs the same value in A_i and A_{i+1}, every process outputs the same value in A_i and A_{i+1}. By induction on k, every process outputs the same value in A and B, even though A and B may be very different executions.
This gives us a tool for proving impossibility results for agreement: show
that there is a path of indistinguishable executions between two executions
that are supposed to produce different output. Another way to picture this:
consider a graph whose nodes are all possible executions with an edge between
any two indistinguishable executions; then the set of output-0 executions
can’t be adjacent to the set of output-1 executions. If we prove the graph is
connected, we prove the output is the same for all executions.
For coordinated attack, we will show that no protocol satisfies all of
agreement, validity, and termination using an indistinguishability argument.
The key idea is to construct a path between the all-0-input and all-1-input
executions with no message loss via intermediate executions that are indis-
tinguishable to at least one process.
Let's start with A = A₀ being an execution in which all inputs are 1 and all messages are delivered. We'll build executions A₁, A₂, etc., by pruning messages. Consider A_i and let m be some message that is delivered in the last round in which any message is delivered. Construct A_{i+1} by not delivering m. Observe that while A_i is distinguishable from A_{i+1} by the recipient of m, on the assumption that n ≥ 2 there is some other process that can't tell whether m was delivered or not (the recipient can't let that other process know, because none of its subsequent messages are delivered in either execution). Continue until we reach an execution A_k in which all inputs are 1 and no messages are sent. Next, let A_{k+1} through A_{k+n} be obtained by changing one input at a time from 1 to 0; each such execution is indistinguishable from its predecessor by any process whose input didn't change. Finally, construct A_{k+n} through A_{k+n+k′} by adding back messages in the reverse of the process used for A₀ through A_k; note that this might not result in exactly k new messages, because the number of messages might depend on the inputs. This gets us to an execution A_{k+n+k′} in which all processes have input 0 and no messages are lost. If agreement holds, then the indistinguishability of adjacent executions to some process means that the common output in A₀ is the same as in A_{k+n+k′}. But validity requires that A₀ outputs 1 and A_{k+n+k′} outputs 0: so either agreement or validity is violated in some execution.
8.3.1 An algorithm
Here’s an algorithm that gives ε = 1/r. (See [Lyn96, §5.2.2] for details,
or [VL92] for the original version.) A simplifying assumption is that the
network is complete, although a strongly-connected network with r greater
than or equal to the diameter also works.
– Process 1 chooses a random key value uniformly in the range [1, r].
– This key is distributed along with leveli [1], so that every process
with leveli [1] ≥ 0 knows the key.
– A process decides 1 at round r if and only if it knows the key,
its information level is greater than or equal to the key, and all
inputs are 1.
• Note that in the preceding, the key value didn’t figure in; so
everybody’s level at round r is independent of the key.
• So now we have that each level_i [i] at round r is in {ℓ, ℓ + 1}, where ℓ
is some fixed value uncorrelated with the key. The only way to get
some process to decide 1 while others decide 0 is if ℓ + 1 ≥ key but
ℓ < key. (If ℓ = 0, a process at this level doesn’t know key, but it
can still reason that 0 < key since key is in [1, r].) This can only
occur if key = ℓ + 1, which occurs with probability at most 1/r since
key was chosen uniformly.
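As a numeric check of this argument, here is a small Python sketch (parameter
names invented, and the all-inputs-1 condition elided): disagreement requires
key = ℓ + 1, which a uniform key hits with probability exactly 1/r.

    import random

    def disagreement_prob(r, l, trials=100_000):
        # Levels at round r are l and l + 1, with l independent of the key.
        bad = 0
        for _ in range(trials):
            key = random.randint(1, r)    # process 1's uniform key in [1, r]
            decisions = {1 if level >= key else 0 for level in (l, l + 1)}
            bad += (len(decisions) == 2)  # some decide 1 while others decide 0
        return bad / trials

    print(disagreement_prob(r=10, l=4))   # about 1/10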
Chapter 9
Synchronous agreement
Validity If all processes start with the same input, all non-faulty processes
decide it.
For lower bounds, we’ll replace validity with non-triviality (often called
validity in the literature):
Non-triviality follows from validity but doesn’t imply validity; for example,
a non-trivial algorithm might have the property that if all non-faulty processes
start with the same input, they all decide something else.
In §9.2, we’ll show that a simple algorithm gives agreement, termination,
and validity with f failures using f + 1 rounds. We’ll then show in §9.3 that
non-triviality, agreement, and termination imply that f + 1 rounds is the
best possible. In Chapter 10, we’ll show that agreement is still possible
in f + 1 rounds even if faulty processes can send arbitrary messages instead
of just crashing, but only if the number of faulty processes is strictly less
than n/3.
Lemma 9.2.1. After f + 1 rounds, all non-faulty processes have the same
set.
Proof. Let S_i^r be the set stored by process i after r rounds. What we’ll really
show is that if there are no failures in round k, then S_i^r = S_j^r = S_i^{k+1} for all
i, j, and r > k. To show this, observe that no faults in round k means that
all processes that are still alive at the start of round k send their message
to all other processes. Let L be the set of live processes in round k. At the
end of round k, for i in L we have S_i^{k+1} = ⋃_{j∈L} S_j^k = S. Now we’ll consider
The first and last step apply the induction hypothesis; the middle one yields
indistinguishable executions since only p0 can tell the difference between m
arriving or not and its lips are sealed.
We’ve shown that we can remove one message through a sequence of
executions where each pair of adjacent executions is indistinguishable to
some process. Now paste together n − 1 such sequences (one per message)
to prove the lemma.
The rest of the proof: Crash some process fully in round 0 and then
change its input. Repeat until all inputs are changed.
9.4 Variants
So far we have described binary consensus, since all inputs are 0 or 1. We
can also allow larger input sets. With crash failures, this allows a stronger
validity condition: the output must be equal to some non-faulty process’s
input. It’s not hard to see that Dolev-Strong (§9.2) gives this stronger
condition.
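To make the flooding algorithm concrete, here is a hedged Python sketch of one
process; exchange is an assumed synchronous-round helper (not from the text)
that broadcasts a set and returns the sets received this round. With inputs
from an ordered set, deciding min(S) gives the stronger validity condition
just described.

    def flooding_consensus(my_input, f, exchange):
        S = {my_input}
        for _ in range(f + 1):      # f + 1 rounds survive f crash failures
            for T in exchange(S):   # sets received from other processes;
                S |= T              # a crashing sender may reach only some
        return min(S)               # same rule applied to the same set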
Chapter 10
Byzantine agreement
[Figure 10.1: the six-process ring A0 B0 C0 A1 B1 C1 , and three-process executions in which the rest of the ring is simulated by a single Byzantine process Č.]
This soap opera is not a real proof, since we haven’t actually shown that B can say
exactly the right thing to keep A and C from guessing that B is evil.
Here is a real proof, which works by explicitly showing how to construct
a bad execution for any given algorithm.1 Consider an artificial execution
where (non-Byzantine) A, B, and C are duplicated and then placed in a
ring A0 B0 C0 A1 B1 C1 , where the digits indicate inputs. We’ll still keep the
same code for n = 3 on each process, but when A0 tries to send a message
to what it thinks of as just C we’ll send it to C1 while messages from B0
will instead go to C0 . For any adjacent pair of processes (e.g. A0 and B0 ),
the behavior of the rest of the ring could be simulated by a single Byzantine
process (Č), so each process in the 6-process ring behaves just as it does in
some 3-process execution with 1 Byzantine process. It follows that all of the
processes terminate and decide in the unholy 6-process Frankenexecution2
the same value that they would in the corresponding 3-process Byzantine
execution. So what do they decide?
Given two processes with the same input, say, A0 and B0 , the giant
execution is indistinguishable from an A0 B0 Č execution where Č is Byzantine
(see Figure 10.1). Validity says A0 and B0 must both decide 0. Since this
works for any pair of processes with the same input, we have each process
deciding its input. But now consider the execution of C0 A1 B̌, where B̌ is
Byzantine. In the big execution, we just proved that C0 decides 0 and A1
decides 1, but since the C0 A1 B̌ execution is indistinguishable from the big
execution to C0 and A1 , they do the same thing here and violate agreement.
1. The presentation here is based on [AW04, §5.2.3]. The original impossibility result
is due to Pease, Shostak, and Lamport [PSL80]. This particular proof is due to Fischer,
Lynch, and Merritt [FLM86].
2. Not a real word.
[Figure: the duplicated ring A0 B0 C0 D0 A1 B1 C1 D1 , and a smaller execution in which the rest of the ring is simulated by a single Byzantine process Č.]
pair: since agreement holds in all Byzantine executions, each adjacent pair
decides the same value in the big execution and so either everybody decides
0 or everybody decides 1 in the big execution.
Now we’ll show that means that validity is violated in some no-failures
3-process execution. We’ll extract this execution by looking at the execution
of processes A0,r/2 B0,r/2 C0,r/2 . The argument is that up to round r, any
input-0 process that is at least r steps in the ring away from the nearest
1-input process acts like the corresponding process in the all-0 no-failures
3-process execution. Since A0,r/2 is 3r/2 > r hops away from A1,r/2 and
similarly for C0,r/2 , our 3 stooges all decide 0 by validity. But now repeat
the same argument for A1,r/2 B1,r/2 C1,r/2 and get 3 new stooges that all
decide 1. This means that somewhere in between we have two adjacent
processes where one decides 0 and one decides 1, violating agreement in the
corresponding 3-process execution where the rest of the ring is replaced by a
single Byzantine process. This concludes the proof.
This result is a little surprising: we might expect that weak Byzantine
agreement could be solved by allowing a process to return a default value if
it notices anything that might hint at a fault somewhere. But this would
allow a Byzantine process to create disagreement by revealing its bad behavior
to just one other process in the very last round of an execution otherwise
headed for agreement on the non-default value. The chosen victim decides the
default value, but since it’s the last round, nobody else finds out. Even if the
algorithm is doing something more sophisticated, examining the 6r-process
execution will tell the Byzantine process exactly when and how to start
acting badly.
The idea of the algorithm is that in each phase, everybody announces their
current preference (initially the inputs). If the majority of these preferences
is large enough (e.g., all inputs are the same), everybody adopts the majority
preference. Otherwise everybody adopts the preference of the phase king.
The majority rule means that once the processes agree, they continue to
agree despite bad phase kings. The phase king rule allows a good phase king
to end disagreement. By choosing a different king in each phase, after f + 1
phases, some king must be good. This intuitive description is justified below.
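Here is a hedged Python sketch of one process in a simple phase-king variant
(two rounds per phase, tolerating f < n/4 faults; the send_all/recv_all round
helpers and the exact threshold are assumptions, not taken from the text
above):

    def phase_king(my_id, n, f, my_input, send_all, recv_all):
        # Processes are numbered 0..n-1; the king of phase p is process p.
        pref = my_input
        for phase in range(1, f + 2):            # f + 1 phases, distinct kings
            send_all(pref)                       # round 1: announce preferences
            votes = recv_all()                   # n values; faulty ones arbitrary
            maj = max((0, 1), key=votes.count)   # majority value...
            mult = votes.count(maj)              # ...and its multiplicity
            if my_id == phase:                   # round 2: king broadcasts majority
                send_all(maj)
            king_val = recv_all()[phase]         # what this phase's king told me
            # A majority too large to be faked by f faults overrides the king.
            pref = maj if mult > n // 2 + f else king_val
        return pref

With n > 4f, once all non-faulty processes agree, each of them counts at
least n − f > n/2 + f matching votes, so bad kings can never break agreement;
and a phase with a non-faulty king ends disagreement, so one of the f + 1
phases fixes everything.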
Chapter 11
Impossibility of
asynchronous agreement
There’s an easy argument that says that you can’t do most things in an
asynchronous message-passing system with n/2 crash failures: partition the
processes into two subsets S and T of size n/2 each, and allow no messages
between the two sides of the partition for some long period of time. Since
the processes in each side can’t distinguish between the other side being
slow and being dead, eventually each side has to take action on its own. For
many problems, we can show that this leads to a bad configuration. For
example, for agreement, we can supply each side of the partition with a
different common input value, forcing disagreement because of validity. We
can then satisfy the fairness condition that says all messages are eventually
delivered by delivering the delayed messages across the partition, but it’s
too late for the protocol.
The Fischer-Lynch-Paterson (FLP) result [FLP85] says something much
stronger: you can’t do agreement in an asynchronous message-passing system
if even one crash failure is allowed.1 After its initial publication, it was quickly
generalized to other models including asynchronous shared memory [LAA87],
and indeed the presentation of the result in [Lyn96, §12.2] is given for shared-
memory first, with the original result appearing in [Lyn96, §17.2.3] as a
corollary of the ability of message passing to simulate shared memory. In
these notes, I’ll present the original result; the dependence on the model is
1. Unless you augment the basic model in some way, say by adding randomization
(Chapter 24) or failure detectors (Chapter 13).
surprisingly limited, and so most of the proof is the same for both shared
memory (even strong versions of shared memory that support operations
like atomic snapshots2 ) and message passing.
Section 5.3 of [AW04] gives a very different version of the proof, where
it is shown first for two processes in shared memory, then generalized to n
processes in shared memory by adapting the classic Borowsky-Gafni simu-
lation [BG93] to show that two processes with one failure can simulate n
processes with one failure. This is worth looking at (it’s an excellent example
of the power of simulation arguments, and BG simulation is useful in many
other contexts) but we will stick with the original argument, which is simpler.
We will look at this again when we consider BG simulation in Chapter 28.
11.1 Agreement
Usual rules: agreement (all non-faulty processes decide the same value),
termination (all non-faulty processes eventually decide some value), valid-
ity (for each possible decision value, there is an execution in which that value
is chosen). Validity can be tinkered with without affecting the proof much.
To keep things simple, we assume the only two decision values are 0 and
1.
11.2 Failures
A failure is an internal action after which all send operations are disabled.
The adversary is allowed one failure per execution. Effectively, this means
that any group of n − 1 processes must eventually decide without waiting
for the n-th, because it might have failed.
11.3 Steps
The FLP paper uses a notion of steps that is slightly different from the
send and receive actions of the asynchronous message-passing model we’ve
been using. Essentially a step consists of receiving zero or more messages
followed by doing a finite number of sends. To fit it into the model we’ve been
using, we’ll define a step as either a pair (p, m), where p receives message
m and performs zero or more sends in response, or (p, ⊥), where p receives
nothing and performs zero or more sends. We assume that the processes are
2. Chapter 20.
deterministic, so the messages sent (if any) are determined by p’s previous
state and the message received. Note that these steps do not correspond
precisely to delivery and send events or even pairs of delivery and send events,
because what message gets sent in response to a particular delivery may
change as the result of delivering some other message; but this won’t affect
the proof.
The fairness condition essentially says that if (p, m) or (p, ⊥) is continu-
ously enabled it eventually happens. Since messages are not lost, once (p, m)
is enabled in some configuration C, it is enabled in all successor configurations
until it occurs; similarly (p, ⊥) is always enabled. So to ensure fairness, we
have to ensure that any non-faulty process eventually performs any enabled
step.
Comment on notation: I like writing the new configuration reached by
applying a step e to C like this: Ce. The FLP paper uses e(C).
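A hedged Python toy of this formalism (all names invented): a configuration
holds the local states plus the in-flight messages, and applying a step (p, m)
or (p, ⊥) produces the configuration written Ce above.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Config:
        states: tuple        # states[p] = local state of process p
        buffer: frozenset    # in-flight (destination, message) pairs

    def apply_step(C, p, m, delta):
        # delta(state, msg) -> (new_state, sends); processes are deterministic.
        new_state, sends = delta(C.states[p], m)  # m is None for a (p, ⊥) step
        states = list(C.states)
        states[p] = new_state
        buf = set(C.buffer)
        if m is not None:
            buf.discard((p, m))                   # consume the delivered message
        buf.update(sends)                         # add whatever p sends in response
        return Config(tuple(states), frozenset(buf))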
2. Now suppose e and e′ are steps of the same process p. Again we let both
go through in either order. It is not the case now that Dee′ = De′e,
since p knows which step happened first (and may have sent messages
telling the other processes). But now we consider some finite sequence
of steps e1 e2 . . . ek in which no message sent by p is delivered and some
process decides in Dee1 . . . ek (this occurs since the other processes
can’t distinguish Dee′ from the configuration in which p died in D, and
so have to decide without waiting for messages from p). This execution
fragment is indistinguishable to all processes except p from De′ee1 . . . ek ,
so the deciding process decides the same value i in both executions.
But Dee′ is 0-valent and De′e is 1-valent, giving a contradiction.
It follows that our assumption was false, and there is some reachable
bivalent configuration C′e.
Now to construct a fair execution that never decides, we start with a
bivalent configuration, choose the oldest enabled action and use the above
to make it happen while staying in a bivalent configuration, and repeat.
Chapter 12
Paxos
Implementing these rules requires only that each accepter track r_ack , the
highest number of any proposal for which it sent an ack, and ⟨v, r_v⟩, the last
proposal that it accepted. Pseudocode showing the behavior of proposers and
accepters in the core Paxos protocol is given in Algorithm 12.1.
Note that acceptance is a purely local phenomenon; additional messages
are needed to detect which if any proposals have been accepted by a majority
of accepters. Typically this involves a fourth round, where accepters send
accepted(r, v) to all learners.
There is no requirement that only a single proposal is sent out (indeed,
if proposers can fail we will need to send out more to jump-start the proto-
col). The protocol guarantees agreement and validity no matter how many
proposers there are and no matter how often they start.
1 procedure Propose(r, v)
// Issue proposal number r with value v
// Assumes r is unique
2 send prepare(r, v) to all accepters
3 wait to receive ack(r, v′, r_v′ ) from a majority of accepters
4 if some v′ is not ⊥ then
5 v ← the v′ with maximum r_v′
6 send accept!(r, v) to all accepters
7 procedure accepter()
8 initially do
9 r_ack ← −∞
10 v←⊥
11 r_v ← −∞
12 upon receiving prepare(r) from p do
13 if r > max(r_ack , r_v ) then
// Respond to proposal
14 send ack(r, v, r_v ) to p
15 r_ack ← r
[Figure: an example execution of Paxos with proposers p1 , p2 , p3 and accepters a1 , a2 , a3 , showing prepare, ack, accept!, accepted, and nack messages as proposals 1, 2, and 3 interleave; later proposals adopt the previously accepted value 1.]
1. Any ack(r′, v′, r_v′ ) message received by p_r′ has r_v′ < r′. Proof: Immediate
from the code.
These two properties together imply that p_r′ receives at least one ack(r′, v″, r″)
with r ≤ r″ < r′ and no such messages with r″ < r. So the maximum proposal
number it sees is r″ where r ≤ r″ < r′. By the induction hypothesis,
the corresponding value is v. It follows that p_r′ also chooses v.
failure detector. There are other still weaker failure detectors that can also
be used to solve consensus. We will discuss failure detectors in detail in
Chapter 13.
Since implementing this kind of leader election allows us to solve consensus,
the FLP result (Chapter 11) implies that we can’t build it using only the tools
available in the asynchronous message-passing model. In practice, detecting
failures and electing a non-faulty leader involves using lots of timeouts. An
example of a Paxos-like protocol that does this is the Raft protocol of Ongaro
and Ousterhout [OO14], which may be the most commonly implemented
protocol in this family.
Chapter 13
Failure detectors
Note that “strong” and “weak” mean different things for accuracy vs
completeness: for accuracy, we are quantifying over suspects, and for com-
pleteness, we are quantifying over suspectors. Even a weakly-accurate failure
detector guarantees that all processes trust the one visibly good process.
1 initially do
2 suspects ← ∅
3 while true do
4 Let S be the set of all processes my weak detector suspects.
5 Send S to all processes.
6 upon receiving S from q do
7 suspects ← (suspects ∪ S) \ {q}
Algorithm 13.1: Boosting completeness
It’s not hard to see that this boosts completeness: if p crashes, somebody’s
weak detector eventually suspects it, this process tells everybody else, and p
never contradicts it. So eventually everybody suspects p.
What is slightly trickier is showing that it preserves accuracy. The
essential idea is this: if there is some good-guy process p that everybody trusts
forever (as in weak accuracy), then nobody ever reports p as suspect—this
also covers strong accuracy since the only difference is that now every non-
faulty process falls into this category. For eventual weak accuracy, wait
for everybody to stop suspecting p, wait for every message ratting out p
to be delivered, and then wait for p to send a message to everybody. Now
everybody trusts p, and nobody ever suspects p again. Eventual strong
accuracy is again similar.
This will justify ignoring the weakly-complete classes.
Jumping to the punch line: P can simulate any of the others, S and
♦P can both simulate ♦S but can’t simulate P or each other, and ♦S can’t
simulate any of the others (See Figure 13.1—we’ll prove all of this later.)
Thus ♦S is the weakest class of failure detectors in this list. However, ♦S is
strong enough to solve consensus, and in fact any failure detector (whatever
[Diagram: P at the top, with S and ♦P below it and ♦S at the bottom.]
Figure 13.1: Partial order of failure detector classes. Higher classes can
simulate lower classes but not vice versa.
1 procedure broadcast(m)
2 send m to all processes.
3 upon receiving m do
4 if I haven’t seen m before then
5 send m to all processes
6 deliver m to myself
1 preference ← input
2 timestamp ← 0
3 for round ← 1 . . . ∞ do
4 Send ⟨round, preference, timestamp⟩ to coordinator
5 if I am the coordinator then
6 Wait to receive ⟨round, preference, timestamp⟩ from a majority of
processes.
7 Set preference to the value with the largest timestamp.
8 Send ⟨round, preference⟩ to all processes.
9 Wait to receive ⟨round, preference′⟩ from coordinator or to suspect
coordinator.
10 if I received ⟨round, preference′⟩ then
11 preference ← preference′
12 timestamp ← round
13 Send ack(round) to coordinator.
14 else
15 Send nack(round) to coordinator.
16 if I am the coordinator then
17 Wait to receive ack(round) or nack(round) from a majority of
processes.
18 if I received no nack(round) messages then
19 Broadcast preference using reliable broadcast.
Processes that decide stop participating in the protocol; but because any non-faulty process
retransmits the decision value in the reliable broadcast, if a process is waiting
for a response from a non-faulty process that already terminated, eventually
it will get the reliable broadcast instead and terminate itself. In Phase 3,
a process might get stuck waiting for a dead coordinator, but the strong
completeness of ♦S means that it suspects the dead coordinator eventually
and escapes. So at worst we do finitely many rounds.
Now suppose that after some time t there is a process c that is never
suspected by any process. Then in the next round in which c is the coordi-
nator, in Phase 3 all surviving processes wait for c and respond with ack, c
decides on the current estimate, and triggers the reliable broadcast protocol
to ensure everybody else decides on the same value. Since reliable broadcast
guarantees that everybody receives the message, everybody decides this value
or some value previously broadcast—but in either case everybody decides.
Agreement is the tricky part. It’s possible that two coordinators both
initiate a reliable broadcast and some processes choose the value from the first
and some the value from the second. But in this case the first coordinator
collected acks from a majority of processes in some round r, and all subsequent
coordinators collected estimates from an overlapping majority of processes in
some round r0 > r. By applying the same induction argument as for Paxos,
we get that all subsequent coordinators choose the same estimate as the first
coordinator, and so we get agreement.
the classes. What is trickier is to show that this structure doesn’t collapse:
♦P can’t simulate S, S can’t simulate ♦P , and ♦S can’t simulate any of the
other classes.
First let’s observe that ♦P can’t simulate S: if it could, we would get a
consensus protocol for f ≥ n/2 failures, which we can’t do. It follows that
♦P also can’t simulate P (because P can simulate S).
To show that S can’t simulate ♦P , choose some non-faulty victim process
v and consider an execution in which S periodically suspects v (which it
is allowed to do as long as there is some other non-faulty process it never
suspects). If the ♦P -simulator ever responds to this by refusing to suspect v,
there is an execution in which v really is dead, and the simulator violates
strong completeness. But if not, we violate eventual strong accuracy. Note
that this also implies S can’t simulate P , since P can simulate ♦P . It also
shows that ♦S can’t simulate either of ♦P or P .
We are left with showing ♦S can’t simulate S. Consider a system where
p’s ♦S detector suspects q but not p from the start of the execution. Run p
until p’s S-simulator gives up and suspects q, which it must do eventually by
strong completeness, since this run is indistinguishable from one in which q
is faulty. Then wake up q and crash p. Since q is the only non-faulty process,
and the alleged S-simulator suspected it, we’ve violated weak accuracy.
Chapter 14
Quorum systems
14.1 Basics
In the past few chapters, we’ve seen many protocols that depend on the
fact that if I talk to more than n/2 processes and you talk to more than
n/2 processes, the two groups overlap. This is a special case of a quorum
system, a family of subsets of the set of processes with the property that
any two subsets in the family overlap. By choosing an appropriate family, we
may be able to achieve lower load on each system member, higher availability,
defense against Byzantine faults, etc.
The exciting thing from a theoretical perspective is that these turn
a systems problem into a combinatorial problem: this means we can ask
combinatorialists how to solve it.
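As a tiny illustration of the definition (a sketch with invented names;
majorities are the running example from the previous chapters):

    from itertools import combinations

    def is_quorum_system(quorums):
        # Every pair of quorums must share at least one process.
        return all(q1 & q2 for q1, q2 in combinations(quorums, 2))

    n = 5
    majorities = [frozenset(c) for c in combinations(range(n), n // 2 + 1)]
    print(is_quorum_system(majorities))   # True: any two majorities overlap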
• Dynamic quorum systems: get more than half of the most recent copy.
14.3 Goals
• Minimize quorum size.
Naor and Wool [NW98] describe trade-offs between these goals (some of
these were previously known, see the paper for citations):
• load ≥ max(c/n, 1/c) where c is the minimum quorum size. The first
case is obvious: if every access hits c nodes, spreading them out as
evenly as possible still hits each node c/n of the time. The second is
trickier: Naor and Wool prove it using LP duality, but the argument
essentially says that if we have some quorum Q of size c, then since
every other quorum Q0 intersects Q in at least one place, we can show
that every Q0 adds at least 1 unit of load in total to the c members of
Q. So if we pick a random quorum Q0 , the average load added to all of
Q is at least 1, so the average load added to some particular element
of Q is at least 1/|Q| = 1/c. Combining the two cases, we can’t hope
to get load better than 1/√n, and to get this load we need quorums of
size at least √n.
Figure 14.1: Figure 2 from [NW98]. Solid lines are G(3); dashed lines are
G∗ (3).
G(d) grid and one from the G∗ (d) grid (the star indicates that G∗ (d) is the
dual graph of G(d)). A quorum consists of a set of servers that produce an
LR path in G(d) and a TB path in G∗ (d). Quorums intersect, because any
LR path in G(d) must cross some TB path in G∗ (d) at some server (in fact,
each pair of quorums intersects in at least two places). The total number of
elements is n = (d + 1)² and the minimum size of a quorum is 2d + 1 = Θ(√n).
The symmetry of the mesh gives that there exists a LR path in the
mesh if and only if there does not exist a TB path in its complement, the
graph that has an edge only if the mesh doesn’t. For a mesh with failure
probability p < 1/2, the complement is a mesh with failure probability
q = 1 − p > 1/2. Using results in percolation theory, it can be shown that for
failure probability q > 1/2, the probability that there exists a left-to-right
path is exponentially small in d (formally, for each p there is a constant φ(p)
such that Pr[∃LR path] ≤ exp(−φ(p)d)). It follows that the failure probability
of this system is exponentially small for any fixed p < 1/2.
See the paper [NW98] for more details.
14.6.1 Example
Let a quorum be any set of size k√n for some k, and let all quorums be
chosen uniformly at random. Pick some quorum Q1 ; what is the probability
that a random Q2 does not intersect Q1 ? Imagine we choose the elements
of Q2 one at a time. The chance that the first element x1 of Q2 misses Q1
is exactly (n − k√n)/n = 1 − k/√n, and conditioning on x1 through x_{i−1}
missing Q1 the probability that x_i also misses it is (n − k√n − i + 1)/(n −
i + 1) ≤ (n − k√n)/n = 1 − k/√n. So taking the product over all i gives
Pr[all miss Q1 ] ≤ (1 − k/√n)^{k√n} ≤ exp(−(k/√n) · k√n) = exp(−k²). So by
setting k = Θ(√(ln(1/ε))), we can get our desired ε-intersecting system.
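A quick numeric sanity check of the exp(−k²) bound (hypothetical parameters;
this is a simulation sketch, not part of the construction):

    import math
    import random

    def miss_prob(n, k, trials=10_000):
        size = int(k * math.sqrt(n))
        q1 = set(random.sample(range(n), size))
        misses = sum(
            not q1.intersection(random.sample(range(n), size))
            for _ in range(trials)
        )
        return misses / trials

    n, k = 400, 1.5
    print(miss_prob(n, k), math.exp(-k * k))   # empirical rate vs. exp(−k²)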
14.6.2 Performance
Failure probabilities, if naively defined, can be made arbitrarily small: add
low-probability singleton quorums that are hardly ever picked unless massive
failures occur. But the resulting system is still ε-intersecting.
One way to look at this is that it points out a flaw in the ε-intersecting
definition: ε-intersecting quorums may cease to be ε-intersecting conditioned
on a particular failure pattern (e.g., when all the non-singleton quorums are
knocked out by massive failures). But Malkhi et al. [MRWW01] address the
problem in a different way, by considering only survival of high quality
quorums, where a particular quorum Q is δ-high-quality if Pr[Q1 ∩ Q2 =
∅ | Q1 = Q] ≤ δ and high quality if it’s √ε-high-quality. It’s not hard to show
that a random quorum is δ-high-quality with probability at least 1 − ε/δ, so a
high quality quorum is one that fails to intersect a random quorum with
probability at most √ε and a high quality quorum is picked with probability
at least 1 − √ε.
We can also consider load; Malkhi et al. [MRWW01] show that essentially
the same bounds on load for strict quorum systems also hold for ε-intersecting
quorum systems: load(S) ≥ max(E(|Q|)/n, (1 − √ε)²/E(|Q|)), where E(|Q|)
is the expected size of a quorum. The left-hand branch of the max is just
the average load applied to a uniformly-chosen server. For the right-hand
side, pick some high quality quorum Q′ with size at most E(|Q|)/(1 − √ε)
and consider the load applied to its most loaded member by its nonempty
intersection (which occurs with probability at least 1 − √ε) with a random
quorum.
Chapter 15
Blockchains
system; instead, we are worrying about the case where a bad router can
claim to have 10,000,000 bad nodes behind it but these nodes are simulated
by only a small number of machines.
An alternative approach is to detect Sybil attacks based on the structure of social networks. The idea is that a
social network graph with many Sybil nodes is likely to decompose into a
subnetwork consisting mostly of legitimate nodes and a subnetwork consisting
mostly of counterfeit nodes, with the majority of links between nodes within
each subnetwork and few links between legitimate nodes and counterfeit
nodes. This approach is pretty clever, and subsequent work explored in
depth efficient algorithms for separating these two subnetworks, but it causes
trouble for users that wish to disconnect their activities from their social-
network identity, and more practically is trivially defeated if the faulty
processes can amass enough bogus social network accounts that they are no
longer an obvious disconnected minority.
15.2 Bitcoin
Since proof-of-work is too expensive, and other approaches are easily defeated,
what do we do if we really want to solve consensus in an open system? It turns
out we bite the bullet and accept the huge cost of proof-of-work. This was the
approach taken by the pseudonymous person or persons Satoshi Nakamoto
in Bitcoin [Nak08]. This system evades some of the issues in the folk theorem
by (a) convincing lots of non-faulty processes to join by including a lottery
awarding tokens to participants and (b) relying on the would-be faulty
processes not to be coordinated enough or have enough available processing
power relative to the huge horde of non-faulty lottery-ticket buyers to target
a specific round of the protocol.
Bitcoin is an implementation of a cryptocurrency, a mechanism for
exchanging cryptographic tokens between users that can be used analogously
to standard currencies. To make all transfers visible and thus prevent double-
spending, it implements what is now usually called a distributed ledger
consisting of a chain (sequence) of blocks, each of which contains a set of
transactions that record transfers of tokens between participants. Participants
are identified by cryptographic keys, and a transaction must be signed by
the sender of the tokens to be valid.
A cryptographic hash of the entire ledger is updated with the addition
of each block, to prevent tampering and to construct the key for the proof-
of-work puzzle used to select the next block to be added. This technique,
which gave rise to the name blockchain for systems of this kind, was
originally developed by Haber and Stornetta [HS91], without the proof-of-
work consensus algorithm, as a tool for making it difficult to backdate digital
documents by storing their hashes in a centrally-maintained sequence of
signed blocks of this type whose full hash is published from time to time in
a difficult-to-corrupt location. (Haber and Stornetta’s company Surety uses
a weekly classified advertisement in the New York Times.)
Bitcoin takes this idea and adds a proof-of-work based consensus protocol
on top, while including side payments to reward participation in the protocol.
The rule for the consensus protocol is that every interested process tries to
extend the current chain as best it can, but only a process that provably solves
a cryptographic puzzle can do so. So the first process to solve the puzzle
wins, and if the majority of the computation power belongs to non-faulty
processes, this process is likely to be non-faulty. In the case of a tie (possibly
created by faulty processes that refuse to admit defeat), the longer chain wins.
In this way the computationally-strong majority eventually overcomes the
computationally-weak minority, since even if the minority gets lucky a few
times they are unlikely to win the race against the more powerful faction.
To analyze this, let’s assume a synchronous message-passing system
where messages are distributed through an anonymous broadcast channel.
Synchrony is obtained by assuming roughly-synchronized clocks and setting
a very long timeout of 10 minutes for each round. Because the identities of
processes are not relevant to the protocol, there is no need to identify the
sender of a message, although the proof-of-work mechanism used to select
blocks also has the useful side effect of limiting propagation of spam updates.
In distributed computing terms, Bitcoin implements a replicated state
machine, using a probabilistic version of consensus to choose between pos-
sible extensions. Using randomization evades the Dolev-Strong [DS83] and
FLP [FLP85] lower bounds, because the bad executions constructed in
these bounds are either (a) highly improbable or (b) require the adversary
to predict the future (we’ll come back to this idea in Chapter 24). The
Nakamoto paper does not reference the distributed computing literature,
and its definition of consensus deviates substantially from the traditional
termination-validity-agreement framework of Pease et al. [PSL80]. Instead
of guaranteeing termination and validity, the protocol attempts to provide
an eventual consistency where over time, the copies of the state machine
continuously converge to agreeing on an initial prefix of the operation history
that includes all but a few recently-added blocks.
What follows is a heuristic analysis that is still pretty sloppy. For a more
serious analysis, see [GKL15], which influenced some of the less suspicious
parts of the discussion below.
Our model is already strong enough to trivially guarantee agreement
in each round: since every non-faulty process sees the same chains in the
broadcast channel, it’s enough to discard any invalid chains (which we will
define soon), and apply some consistent tie-breaking rule to choose among
the remaining valid chains. So the goal of the consensus step will be to
guarantee eventual consistency between rounds, which we will take to
mean that any block buried deep enough in the chain Cr for round r also
appears in any chain Cr′ for r′ > r.
The mechanism for doing this is to generate each Cr+1 as an extension
of Cr . To construct an extension, a process i that wishes to add block xi
must first solve a hash puzzle by finding some y such that h(Cr , xi , y) ≤ D,
where h is a hash function that is sufficiently cryptographically secure that
we can pretend it’s a random function, and D is a difficulty parameter that
can be tuned to adjust the likelihood of finding a solution within the time
bounds associated with the round. If successful, the process can propose
an extension Cr ⟨xi , y⟩ that is valid if it satisfies both application-specific
requirements like xi doesn’t include transactions that spend money the
spender doesn’t have after Cr , and protocol-specific requirements like Cr
is valid and h(Cr , xi , y) ≤ D. These conditions are easily checked by any
process.
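Here is a hedged miniature of the proof-of-work step in Python, standing in
for h with SHA-256 and using invented parameters; a real implementation
would differ in encoding and difficulty management:

    import hashlib
    import itertools

    DIFFICULTY = 2**240          # D: tune so solutions are rare but findable

    def h(chain, block, y):
        # Pretend-random function of (C_r, x_i, y), via a cryptographic hash.
        data = repr((chain, block, y)).encode()
        return int.from_bytes(hashlib.sha256(data).digest(), "big")

    def solve_puzzle(chain, block, max_tries=1_000_000):
        for y in itertools.count():
            if y >= max_tries:
                return None                   # no luck this round
            if h(chain, block, y) <= DIFFICULTY:
                return y                      # proof of work found

    y = solve_puzzle(("genesis",), "tx-set-1")
    print(y is not None)   # the extension ⟨x_i, y⟩ is easily checked by anyone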
For the tie-breaking rule, we will favor longer chains over shorter ones,
and otherwise break ties consistently. As noted previously, consistent tie-
breaking means all non-faulty processes adopt the same value Cr for each
r. To replace a buried block, the faulty processes will need to supply an
alternative chain that wins the tie-breaking rule by being at least as long as
the chain built by the non-faulty processes.
The resulting protocol is given in Algorithm 15.1.
The main issue with this protocol is that if the faulty processes get lucky,
they can construct a chain that is longer than the chain of the non-faulty
processes, and use this to hijack the protocol. We’d like to show that when
this happens, the bad chain shares all but a small suffix with the good chain
it displaces. If we are willing to cut a few corners in the argument, this
comes down to demonstrating that the faulty processes can’t win the race to
extend their evil chain past the non-faulty processes’ preferred chain over long
sequences of rounds. We will consider the specific case where the non-faulty
and faulty processes both start off with some common Cr = Čr , and over
the next m rounds the non-faulty processes extend Cr as best they can using
Algorithm 15.1 while the faulty processes extend Čr in secret. The faulty
processes win if the resulting Čr+m is longer than the non-faulty processes’
Cr+m . (There is a lot of unjustified simplification sneaking in here. For a
much more sophisticated argument that doesn’t cheat, see [GKL15].)
For each process i, let pi be the expected number of puzzle solutions it
finds in a single round. If i is non-faulty, this is just the probability that it
finds a solution, since non-faulty processes stop after finding one solution. If
i is faulty, i can generate more than one solution, which might make pi a bit
larger than it would be for a non-faulty process with the same computational
power. If pi is very small in either case the difference will be slight.
To simplify things, we’ll assume that the set of processes and their pi
values are fixed over time. Let α be the sum of pi over all the non-faulty
processes, and β the sum of pi over all the faulty processes. These give the
expected number of solutions obtained in one round by all non-faulty or
faulty processes respectively.
Inclusion-exclusion says that the probability that the non-faulty processes
solve at least one puzzle in a given round is at least Σ_i p_i − Σ_{i≠j} p_i p_j ≥ α − α².
Letting X_i be the indicator for the event that the non-faulty processes add a
new block in round r + i, they add at least an expected E[ΣX_i] ≥ m(α − α²)
blocks in m rounds. We can similarly argue that the faulty processes add
at most an expected E[ΣY_i] = mβ blocks in m rounds, where Y_i is the
indicator variable for success of the i-th puzzle attempt by a faulty process.
In both cases we are looking at a sum of 0–1 random variables with known
mean, so Chernoff bounds apply and we get, for any δ,

Pr[ΣX_i ≤ (1 − δ)m(α − α²)] ≤ exp(−δ²m(α − α²)/2)
Pr[ΣY_i ≥ (1 + δ)mβ] ≤ exp(−δ²mβ/2).

Setting δ = 1/3, and supposing that m(α − α²) = k and mβ = k/2 (so the
faulty processes find blocks at half the rate of the non-faulty ones), we get

Pr[ΣX_i ≤ (2/3)k] ≤ exp(−k/18)
Pr[ΣY_i ≥ (2/3)k] ≤ exp(−k/36).
At the same time, Bitcoin is still absurdly costly, and the guarantees it
provides are not as strong as can be obtained by running iterated Byzantine
agreement on a small number of semi-trusted parties. This may be why more
recent systems have been moving away from proof-of-work, and suggests that
Bitcoin’s unusual status as the first widely-used blockchain may, in the long
run, not save it from being outcompeted by better systems.
Perhaps the way to think about the enormous cost of proof-of-work based
systems is that they are paying a price of anarchy [KP09] for avoiding any
kind of centralized management in the form of a privileged set of servers.
Unfortunately, much of this cost appears to be unavoidable without such
management [PS18].
Part II
Shared memory
Chapter 16
Model
1 leftIsDone ← read(leftDone)
2 rightIsDone ← read(rightDone)
3 write(done, leftIsDone ∧ rightIsDone)
to arbitrate which of two near-simultaneous writes gets in last and thus leaves
the long-term value), although it’s also common to assume multi-writer
multi-reader registers, which if not otherwise available can be built from
single-writer multi-reader registers using atomic snapshot (see Chapter 20).
Less common are single-writer single-reader registers, which act much
like message-passing channels except that the receiver has to make an explicit
effort to pick up its mail.
Time Assume that no process takes more than 1 time unit between opera-
tions (but some fast processes may take less). Assign the first operation
in the schedule time 1 and each subsequent operation the largest time
consistent with the bound. The time of the last operation is the time
complexity. This is also known as the big-step or round measure
because the time increases by 1 precisely when every non-faulty process
has taken at least one step, and a minimum interval during which this
occurs counts as a big step or a round.
Total work The total work or total step complexity is just the length
of the schedule, i.e., the number of operations. This doesn’t consider
how the work is divided among the processes, e.g., an O(n2 ) total
work protocol might dump all O(n2 ) operations on a single process
and leave the rest with almost nothing to do. There is usually not
much of a direct correspondence between total work and time. For
example, any algorithm that involves busy-waiting—where a process
repeatedly reads a register until it changes—may have unbounded total
work (because the busy-waiter might spin very fast) even though it
runs in bounded time (because the register gets written to as soon as
some slower process gets around to it). However, it is trivially the case
that the time complexity is never greater than the total work.
Space Just how big are those registers anyway? Much of the work in this
area assumes they are very big.2 But we can ask for the maximum
number of bits in any one register (width) or the total size (bit
complexity) or number (space complexity) of all registers, and will
2. A typical justification for this assumption is that an arbitrarily-large register can be
simulated by a smaller register that holds pointers to single-use collections of registers
holding the actual values. But even using this technique there are problems for which
individual registers of unbounded size are necessary [DFF+ 23].
Sticky bits (or sticky registers) With a sticky bit or sticky regis-
ter [Plo89], once the initial empty value is overwritten, all further
writes fail. The writer is not notified that the write fails, but may
be able to detect this fact by reading the register in a subsequent
operation.
Bank accounts Replace the write operation with deposit, which adds a
non-negative amount to the state, and withdraw, which subtracts a
non-negative amount from the state provided the result would not go
below 0; otherwise, it has no effect.
These solve problems that are hard for ordinary read/write registers under
bad conditions. Note that they all have to return something in response to
an invocation.
There are also blocking objects like locks or semaphores, but these don’t
fit into the RMW framework.
We can also consider generic read-modify-write registers that can compute
arbitrary functions (passed as an argument to the read-modify-write opera-
tion) in the modify step. Here we typically assume that the read-modify-write
operation returns the old value of the register. Generic read-modify-write
registers are not commonly found in hardware but can be easily simulated
(in the absence of failures) using mutual exclusion.3
3. See Chapter 18.
Chapter 17
Distributed shared memory
will hold a pair (value, timestamp) where timestamps are (unbounded) integer
values. Initially, everybody starts with (⊥, 0). A process updates its copy
with new values (v, t) upon receiving write(v, t) from any other process p,
provided t is greater than the process’s current timestamp. It then responds
to p with ack(v, t), whether or not it updated its local copy. A process will
also respond to a message read(u) with a response ack(value, timestamp, u);
here u is a nonce3 used to distinguish between different read operations so
that a process can’t be confused by out-of-date acknowledgments.
To write a value, the writer increments its timestamp, updates its value
and sends write(value, timestamp) to all other processes. The write operation
terminates when the writer has received acknowledgments containing the
new timestamp value from a majority of processes.
To read a value, a reader does two steps:
1. It sends read(u) to all processes, where u is a fresh nonce, and waits to
receive ack(v, t, u) from a majority of the processes. It takes as (v, t) the
returned pair with the largest timestamp t.
2. It then sends write(v, t) to all processes, and waits for response ack(v, t)
from a majority of the processes. Only then does it return.
(Any extra messages, messages with the wrong nonce, etc., are discarded.)
Both reads and writes cost Θ(n) messages (Θ(1) per process).
Intuition: Nobody can return from a write or a read until they are sure
that subsequent reads will return the same (or a later) value. A process
can only be sure of this if it knows that the values collected by a read will
include at least one copy of the value written or read. But since majorities
overlap, if a majority of the processes have a current copy of v, then the
majority read quorum will include it. Sending write(v, t) to all processes
and waiting for acknowledgments from a majority is just a way of ensuring
that a majority do in fact have timestamps that are at least t.
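Here is a hedged Python sketch of this logic, simulating the replicas locally
(names invented; a real deployment replaces the direct calls with write/read/ack
messages over the network):

    import random

    class Replica:
        def __init__(self):
            self.value, self.t = None, 0
        def on_write(self, v, t):
            if t > self.t:                   # keep only newer timestamps
                self.value, self.t = v, t
            return ("ack", v, t)             # ack whether or not we updated
        def on_read(self, u):
            return (self.value, self.t, u)   # echo the nonce u

    def abd_write(replicas, writer, v):
        writer["t"] += 1
        quorum = random.sample(replicas, len(replicas) // 2 + 1)
        for r in quorum:
            r.on_write(v, writer["t"])       # wait for a majority of acks

    def abd_read(replicas, u):
        quorum = random.sample(replicas, len(replicas) // 2 + 1)
        v, t, _ = max((r.on_read(u) for r in quorum), key=lambda a: a[1])
        for r in quorum:
            r.on_write(v, t)                 # write-back phase before returning
        return v

    replicas = [Replica() for _ in range(5)]
    writer = {"t": 0}
    abd_write(replicas, writer, "hello")
    print(abd_read(replicas, 42))            # "hello": the quorums overlap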
If we omit the write stage of a read operation, we may violate lineariz-
ability. An example would be a situation where two values (1 and 2, say),
have been written to exactly one process each, with the rest still holding the
initial value ⊥. A reader that observes 1 and (n − 1)/2 copies of ⊥ will return
1, while a reader that observes 2 and (n − 1)/2 copies of ⊥ will return 2. In
the absence of the write stage, we could have an arbitrarily long sequence
3. A nonce is any value that is guaranteed to be used at most once (the term originally
comes from cryptography, which in turn got it from linguistics). In practice, a reader will
most likely generate a nonce by combining its process ID with a local timestamp.
4. none of the other cases applies, and we feel like putting π1 first.
The intent is that we pick some total ordering that is consistent with both
<T and the timestamp ordering (with writes before reads when timestamps
are equal). To make this work we have to show (a) that these two orderings
are in fact consistent, and (b) that the resulting ordering produces values
consistent with an atomic register: in particular, that each read returns the
value of the last preceding write.
Part (b) is easy: since timestamps only increase in response to writes,
each write is followed by precisely those reads with the same timestamp,
which are precisely those that returned the value written.
For part (a), suppose that π1 <T π2 . The first case is when π2 is a read.
Then before the end of π1 , a set S of more than n/2 processes send the π1
process an ack(v1, t1 ) message. Since local timestamps only increase, from
this point on any ack(v2 , t2 , u) message sent by a process in S has t2 ≥ t1 .
Let S′ be the set of processes sending ack(v2 , t2 , u) messages processed by
π2 . Since |S| > n/2 and |S′| > n/2, we have that S ∩ S′ is nonempty and so S′
3. Send write(v, t) to all processes, and wait for a response ack(v, t) from
a majority of processes.
This increases the cost of a write by a constant factor, but in the end we
still have only a linear number of messages. The proof of linearizability is
essentially the same as for the single-writer algorithm, except now we must
consider the case of two write operations by different processes. Here we have
that if π1 <T π2 , then π1 gets acknowledgments of its write with timestamp
t1 from a majority of processes before π2 starts its initial phase to compute
count. Since π2 waits for acknowledgments from a majority of processes as
well, these majorities overlap, so π2 ’s timestamp t2 must exceed t1 . So the
linearization ordering previously defined still works.
are able to show that read operations can skip the reliable broadcast and
still run in O(n) messages. The details are messy enough that we will not
attempt to reproduce them here; see the cited paper if you are interested.
Chapter 18
Mutual exclusion
crashing without releasing their locks, and with the data structure in some
broken, half-updated state.1
18.2 Goals
(See also [AW04, §4.2], [Lyn96, §10.2].)
Core mutual exclusion requirements:
Mutual exclusion At most one process is in the critical state at a time.
No deadlock (progress) If there is at least one process in a trying state,
then eventually some process enters a critical state; similarly for exiting
and remainder states.
Note that the protocol is not required to guarantee that processes leave
the critical or remainder state, but we generally have to insist that the
processes at least leave the critical state on their own to make progress.
An additional useful property (not satisfied by all mutual exclusion
protocols; see [Lyn96, §10.4]):
1 oldValue ← read(bit)
2 write(bit, 1)
3 return oldValue
Typically there is also a second reset operation for setting the bit back
to zero. For some implementations, this reset operation may only be used
safely by the last process to get 0 from the test-and-set bit.
Because a test-and-set operation is atomic, if two processes both try to
perform test-and-set on the same bit, only one of them will see a return value
of 0. This is not true if each process simply executes the above code on a
stock atomic register: there is an execution in which both processes read
0, then both write 1, then both return 0 to whatever called the non-atomic
test-and-set subroutine.
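The race is easy to reproduce (a timing-dependent sketch, so it may take
several runs; all names invented):

    import sys
    import threading

    sys.setswitchinterval(1e-6)   # encourage switches between bytecodes

    bit = 0
    winners = []

    def fake_tas(tid, barrier):
        global bit
        barrier.wait()            # line both threads up at the race
        old = bit                 # read...
        bit = 1                   # ...then write: the pair is not atomic
        if old == 0:
            winners.append(tid)

    barrier = threading.Barrier(2)
    threads = [threading.Thread(target=fake_tas, args=(i, barrier)) for i in (0, 1)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(winners)                # can print [0, 1]: both "won" the lock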
Test-and-set provides a trivial implementation of mutual exclusion, shown
in Algorithm 18.1.
1 while true do
// trying
2 while TAS(lock) = 1 do nothing
// critical
3 (do critical section stuff)
// exiting
4 reset(lock)
// remainder
5 (do remainder stuff)
Algorithm 18.1: Mutual exclusion using test-and-set
It is easy to see that this code provides mutual exclusion, as once one
process gets a 0 out of lock, no other can escape the inner while loop until
that process calls the reset operation in its exiting state. It also provides
progress (assuming the lock is initially set to 0); the only part of the code
that is not straight-line code (which gets executed eventually by the fairness
condition) is the inner loop, and if lock is 0, some process escapes it, while if
lock is 1, some process is in the region between the TAS call and the reset
call, and so it eventually gets to reset and lets the next process in (or itself,
if it is very fast).
The algorithm does not provide lockout-freedom: nothing prevents a
single fast process from scooping up the lock bit every time it goes through
the outer loop, while the other processes ineffectually grab at it just after it
is taken away. Lockout-freedom requires a more sophisticated turn-taking
strategy.
1 while true do
// trying
2 enq(q, myId)
3 while peek(q) ≠ myId do nothing
// critical
4 (do critical section stuff)
// exiting
5 deq(q)
// remainder
6 (do remainder stuff)
Algorithm 18.2: Mutual exclusion using a queue
Here the proof of mutual exclusion is that only the process whose ID is at
the head of the queue can enter its critical section. Formally, we maintain an
invariant that any process whose program counter is between the inner while
loop and the call to deq(q) must be at the head of the queue; this invariant
is easy to show because a process can’t leave the while loop unless the test
fails (i.e., it is already at the head of the queue), no enq operation changes
the head value (if the queue is nonempty), and the deq operation (which
does change the head value) can only be executed by a process already at
the head (from the invariant).
1 while true do
// trying
2 position ← RMW(V, ⟨V.first, V.last + 1⟩)
// enqueue
3 while RMW(V, V ).first 6= position.last do
4 nothing
// critical
5 (do critical section stuff)
// exiting
6 RMW(V, ⟨V.first + 1, V.last⟩)
// dequeue
// remainder
7 (do remainder stuff)
Algorithm 18.3: Mutual exclusion using read-modify-write
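Algorithm 18.3 is what is now usually called a ticket lock. Here is a hedged
Python rendering (the atomicity of next() here leans on the CPython GIL;
real code would use an atomic fetch-and-increment):

    import itertools

    class TicketLock:
        def __init__(self):
            self._next = itertools.count()   # V.last: next ticket to hand out
            self._serving = 0                # V.first: ticket currently served

        def acquire(self):
            my_ticket = next(self._next)     # RMW: read V.last and increment
            while self._serving != my_ticket:
                pass                         # spin until our number comes up

        def release(self):
            self._serving += 1               # V.first + 1: admit the next ticket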
1 procedure RMW(f )
2 Enter critical section.
3 q←r
4 r ← f (q)
5 Leave critical section.
6 return q
Algorithm 18.4: Building a concurrent RMW object using mutex
shared data:
1 waiting, initially arbitrary
2 present[i] for i ∈ {0, 1}, initially 0
3 Code for process i:
4 while true do
// trying
5 present[i] ← 1
6 waiting ← i
7 while true do
8 if present[¬i] = 0 then
9 break
10 if waiting 6= i then
11 break
// critical
12 (do critical section stuff)
// exiting
13 present[i] ← 0
// remainder
14 (do remainder stuff)
Algorithm 18.5: Peterson’s mutual exclusion algorithm for two pro-
cesses
Here is an example execution in which the present bits decide the winner:
1. p0 sets present[0] ← 1
2. p0 sets waiting ← 0
3. p0 reads present[1] = 0 and enters critical section
4. p1 sets present[1] ← 1
5. p1 sets waiting ← 1
6. p1 reads present[0] = 1 and waiting = 1 and loops
7. p0 sets present[0] ← 0
8. p1 reads present[0] = 0 and enters critical section
The idea is that if I see a 0 in your present variable, I know that you
aren’t playing, and can just go in.
Here’s a more interleaved execution where the waiting variable decides
the winner:
1. p0 sets present[0] ← 1
2. p0 sets waiting ← 0
3. p1 sets present[1] ← 1
4. p1 sets waiting ← 1
5. p0 reads present[1] = 1
6. p1 reads present[0] = 1
7. p0 reads waiting = 1 and enters critical section
8. p1 reads present[0] = 1 and waiting = 1 and loops
9. p0 sets present[0] ← 0
10. p1 reads present[0] = 0 and enters critical section
Note that it’s the process that set the waiting variable last (and thus sees
its own value) that stalls. This is necessary because the earlier process might
long since have entered the critical section.
Sadly, examples are not proofs, so to show that this works in general,
we need to formally verify each of mutual exclusion and lockout-freedom.
Mutual exclusion is a safety property, so we expect to prove it using invariants.
The proof in [Lyn96] is based on translating the pseudocode directly into
automata (including explicit program counter variables); we’ll do essentially
the same proof but without doing the full translation to automata. Below,
we write that pi is at line k if the operation in line k is enabled but has
not occurred yet.
Lemma 18.5.2. If pi is at Line 12, and p¬i is at Line 8, 10, or 12, then
waiting = ¬i.
Proof. We’ll do the case i = 0; the other case is symmetric. The proof is by
induction on the schedule. We need to check that any event that makes the
left-hand side of the invariant true or the right-hand side false also makes
the whole invariant true. The relevant events are:
shared data:
1 atomic register race, big enough to hold an ID, initially ⊥
2 atomic register door, big enough to hold a bit, initially open
3 procedure splitter(id)
4 race ← id
5 if door = closed then
6 return right
7 door ← closed
8 if race = id then
9 return stop
10 else
11 return down
arrives at a splitter, then (a) at least one process returns right or stop; and
(b) at least one process returns down or stop; (c) at most one process returns
stop; and (d) any process that runs by itself returns stop. The first two
properties will be useful when we consider the problem of renaming in
Chapter 25; we will prove them there. The last two properties are what we
want for mutual exclusion.
The names of the variables race and door follow the presentation in
[AW04, §4.4.5]; Moir and Anderson [MA95], following Lamport [Lam87],
call these X and Y . As in [MA95], we separate out the right and down
outcomes—even though they are equivalent for mutex—because we will need
them later for other applications.
The intuition behind Algorithm 18.6 is that setting door to closed closes
the door to new entrants, and the last entrant to write its ID to race wins
(it’s a slow race), assuming nobody else writes race and messes things up.
The added cost of the splitter is always O(1), since there are no loops.
To reset the splitter, write open to door. This allows new processes to
enter the splitter and possibly return stop.
Lemma 18.5.3. After each time that door is set to open, at most one process
running Algorithm 18.6 returns stop.
Proof. To simplify the argument, we assume that each process calls splitter
at most once.
Let t be some time at which door is set to open (−∞ in the case of the
initial value). Let St be the set of processes that read open from door after
time t and before the next time at which some process writes closed to door,
and that later return stop by reaching Line 9.
Then every process in St reads door before any process in St writes door.
It follows that every process in St writes race before any process in St reads
race. If some process p is not the last process in St to write race, it will not
see its own ID, and will not return stop. But only one process can be the
last process in St to write race.3
Proof. Follows from examining a solo execution: the process sets race to id,
reads open from door, then reads id from race. This causes it to return stop
as claimed.
shared data:
1 choosing[i], an atomic bit for each i, initially 0
2 number[i], an unbounded atomic register, initially 0
3 Code for process i:
4 while true do
// trying
5 choosing[i] ← 1
6 number[i] ← 1 + max_{j≠i} number[j]
7 choosing[i] ← 0
8 for j ≠ i do
9 loop until choosing[j] = 0
10 loop until number[j] = 0 or ⟨number[i], i⟩ < ⟨number[j], j⟩
// critical
11 (do critical section stuff)
// exiting
12 number[i] ← 0
// remainder
13 (do remainder stuff)
Algorithm 18.7: Lamport’s Bakery algorithm
Note that several of these lines are actually loops; this is obvious for
Lines 9 and 10, but is also true for Line 6, which includes an implicit loop to
read all n − 1 values of number[j].
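For illustration, here is a hedged Python rendering of Algorithm 18.7
(module-level lists stand in for the shared registers, all names are our own,
and busy-waiting like this in Python is purely expository):

N = 4                    # number of processes (illustration only)
choosing = [0] * N       # choosing[i], an atomic bit per process
number = [0] * N         # number[i], unbounded ticket registers

def lock(i):
    # trying: pick a ticket larger than everybody else's
    choosing[i] = 1
    number[i] = 1 + max(number[j] for j in range(N) if j != i)
    choosing[i] = 0
    for j in range(N):
        if j == i:
            continue
        while choosing[j] == 1:
            pass          # wait for j to finish choosing
        # wait until j is out, or my (ticket, id) pair is smaller
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass

def unlock(i):
    number[i] = 0         # exiting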
Intuition for mutual exclusion is that if you have a lower number than
I do, then I block waiting for you; for lockout-freedom, eventually I have
the smallest number. (There are some additional complications involving
the choosing bits that we are sweeping under the rug here.) For a real proof
described in §18.5.1.2 works here too. The result is O(log n) RMRs per
critical section access, but only in the CC model.
 1 C[side(i)] ← i
 2 T ← i
 3 P[i] ← 0
 4 rival ← C[¬side(i)]
 5 if rival ≠ ⊥ and T = i then
 6     if P[rival] = 0 then
 7         P[rival] ← 1
 8     while P[i] = 0 do spin
 9     if T = i then
10         while P[i] ≤ 1 do spin
then run pk until it covers one more register. If we let p1 . . . pk−1 go, they
overwrite anything pk wrote. Unfortunately, they may not come back to
covering the same registers as before if we rerun the induction hypothesis
(and in particular might cover the same register that pk does). So we have
to look for a particular configuration C1 that not only covers k − 1 registers
but also has an extension that covers the same k − 1 registers.
Here’s how we find it: Start in C. Run the induction hypothesis to get
C1 ; here there is a set W1 of k − 1 registers covered in C1 . Now let processes
p1 through pk−1 do their pending writes, then each enter the critical section,
leave it, and finish, and rerun the induction hypothesis to get to a state C2 ,
indistinguishable from an idle configuration by pk and up, in which k − 1
registers in W2 are covered. Repeat to get sets W3, W4, etc. Since this
sequence is unbounded, and there are only $\binom{r}{k-1}$ distinct sets of registers to
cover (where r is the number of registers), eventually we have Wi = Wj for
some i ≠ j. The configurations Ci and Cj are now our desired configurations
covering the same k − 1 registers.
Now that we have Ci and Cj , we run until we get to Ci . We now run pk
until it is about to write some register not covered by Ci (it must do so, or
otherwise we can wipe out all of its writes while it’s in the critical section and
then go on to violate mutual exclusion). Then we let the rest of p1 through
pk−1 do all their writes (which immediately destroys any evidence that pk
ran at all) and run the execution that gets them to Cj . We now have k − 1
registers covered by p1 through pk−1 and a k-th register covered by pk , in a
configuration that is indistinguishable from idle: this proves the induction
step.
The final result follows from the fact that when k = n we cover n registers;
this implies that there are at least n registers to cover.
It’s worth noting that the execution constructed in this proof might be
very, very long. It’s not clear what happens if we consider executions in
which, say, the critical section is only entered a polynomial number of times.
If we are willing to accept a small probability of failure over polynomially-
many entries, there is a randomized mutual exclusion protocol that uses
O(log n) space [AHTW18], at the cost of O(n) amortized RMR complexity
in the cache-coherent model. It is still open whether it is possible to reduce
the space complexity below O(n) for polynomial-length executions without
allowing for a small probability of failure or without having such high RMR
complexity.
Chapter 19

The wait-free hierarchy
19.1.1 Robustness
Whether or not the resulting hierarchy is in fact robust for arbitrary de-
terministic objects is still open, but Ruppert [Rup00] subsequently showed
that it is robust for RMW registers and objects with a read operation that
returns the current state, and there is a paper by Borowsky, Gafni, and
Afek [BGA94] that sketches a proof based on a topological characterization
of computability⁴ that h^r_m is robust for deterministic objects that don’t

² The existence of such objects was eventually demonstrated by Afek, Ellen, and
Gafni [AEG16].
³ The r in h^r_m stands for the registers, the m for having many objects of the given type.
Jayanti [Jay97] also defines a hierarchy h^r_1 where you only get finitely many objects. The
h stands for “hierarchy,” or, more specifically, h(T) stands for the level of the hierarchy at
which T appears [Jay11].
⁴ See Chapter 29.
19.1.2 Initialization
Another useful result from the Borowsky et al. paper [BGA94] mentioned
above is that the consensus number is not generally dependent on what
assumptions we make about the initial state of the objects. Specifically,
[BGA94, Lemma 3.2] states that as long as there is some sequence of oper-
ations that takes an object from a fixed initial state to a desirable initial
state for consensus, then we can safely assume that the object is in the
desirable state. The core idea of the proof is that each process can initialize
its own copy of the object and then announce that it is ready; each process
will then participate in a sequence of consensus protocols using the objects
that they observe are ready, with the output of each protocol used as the
input to the next. Because the first object Si to be announced as initialized
will be visible to all processes, they will all do consensus using Si . Any
subsequent protocols that may be used by only a subset of the processes will
not change the common agreed output from the Si protocol.⁶ This justifies
our assumption that objects can be initialized to any desired value.
⁵ Ruppert’s paper is particularly handy because it gives an algorithm for computing
the consensus number of the objects it considers. However, for infinite-state objects, this
requires solving the halting problem (as previously shown by Jayanti and Toueg [JT92]).
⁶ The result in the paper is stated for a consensus protocol that uses a single copy of the
object, but it generalizes in the obvious way to those that use multiple copies of the object.
• x and y are both reads. Then x and y commute: Cxy = Cyx, and we
get a contradiction.
• x is a read and y is a write. Then py can’t tell the difference between
Cxy and Cy, so we get the same decision value for both, again
contradicting that Cx is 0-valent and Cy is 1-valent.

• x and y are both writes. Now py can’t tell the difference between Cxy
and Cy, so we get the same decision value for both, again contradicting
that Cx is 0-valent and Cy is 1-valent.
To solve 2-process consensus, have each process pi write its preferred value to a register ri, then execute the non-trivial RMW
operation on the RMW object initialized to v. The first process to execute
its operation sees v and decides its own value. The second process sees f (v)
and decides the first process’s value (which it reads from the register).⁷ It
follows that a non-trivial RMW object has consensus number at least 2.
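Here is a Python sketch of this construction, using test-and-set as the
non-trivial RMW operation (so v = 0 and f(v) = 1); the class names are ours,
and the Lock only simulates the atomicity of the RMW step:

import threading

class TestAndSet:
    """Non-trivial RMW object: f(0) = 1 differs from 0."""
    def __init__(self):
        self._lock = threading.Lock()
        self._bit = 0
    def test_and_set(self):
        with self._lock:           # simulated atomicity of the RMW step
            old, self._bit = self._bit, 1
            return old

class TwoProcessConsensus:
    def __init__(self):
        self.rmw = TestAndSet()
        self.pref = [None, None]   # register r_i for each process i
    def decide(self, i, v):
        self.pref[i] = v           # write preference to r_i first
        if self.rmw.test_and_set() == 0:
            return v               # saw initial state: decide own value
        return self.pref[1 - i]    # saw f(v): decide the other's value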
In many cases, this is all we get. Suppose that the operations of some
RMW type T are non-interfering in a way analogous to the previous definition,
where now we say that x and y commute if they leave the object in the same
state (regardless of what values are returned) and that y overwrites x if the
object is always in the same state after both y and xy (again regardless
of what is returned). The two processes px and py that carry out x and y
know what happened, but a third process pz doesn’t. So if we run pz to
completion we get the same decision value after both Cx and Cy, which
means that Cx and Cy can’t be 0-valent and 1-valent. It follows that no
collection of RMW registers with interfering operations can solve 3-process
consensus, and thus all such objects have consensus number 2. Examples
of these objects include test-and-set bits, fetch-and-add registers, and
swap registers that support an operation swap that writes a new value and
returns the previous value.
There are some other objects with consensus number 2 that don’t fit this
pattern. Define a wait-free queue as an object with enqueue and dequeue
operations (like normal queues), where dequeue returns ⊥ if the queue is
empty (instead of blocking). To solve 2-process consensus with a wait-free
queue, initialize the queue with a single value (it doesn’t matter what the
value is). We can then treat the queue as a non-trivial RMW register where
a process wins if it successfully dequeues the initial value and loses if it gets
empty.⁸
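The same reduction in Python (names are ours; queue.Queue’s get_nowait
supplies the return-⊥-on-empty behavior):

from queue import Queue, Empty

class QueueConsensus:
    """2-process consensus from a queue holding one initial token."""
    def __init__(self):
        self.q = Queue()
        self.q.put('token')        # the initial value; contents don't matter
        self.pref = [None, None]   # single-writer preference registers
    def decide(self, i, v):
        self.pref[i] = v
        try:
            self.q.get_nowait()    # dequeued the token: I went first
            return v
        except Empty:              # queue was empty: the other process won
            return self.pref[1 - i]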
However, enqueue operations are non-interfering: if px enqueues vx and
py enqueues vy , then any third process can detect which happened first;
similarly we can distinguish enq(x)deq() from deq()enq(x). So to show we
can’t do three-process consensus we do something sneakier: given a bivalent
state C with allegedly 0- and 1-valent successors Cenq(x) and Cenq(y),
⁷ The extra registers are just implementing the standard construction of multivalued
consensus from id-consensus; see §19.1.3.
⁸ But wait! What if the queue starts empty?
This turns out to be a surprisingly annoying problem, and was one of the motivating
examples for h^r_m as opposed to Herlihy’s vaguer initial definition.
With one empty queue and nothing else, Jayanti and Toueg [JT92, Theorem 7] show that
there is no solution to consensus for two processes. This is also true for stacks (Theorem 8
from the same paper). But adding a register (Theorem 9) lets you do it. A second empty
queue also works.
Algorithm 19.2 requires 2-register writes, and will give us a protocol for 2
processes (since the reader above has to participate somewhere to make the
first case work). For m processes, we can do the same thing with m-register
writes. We have a register rpq = rqp for each pair of distinct processes p
and q, plus a register rpp for each p; this gives a total of $\binom{m}{2} + m = O(m^2)$
registers. All registers are initialized to ⊥. Process p then writes its initial
preference to some single-writer register pref p and then simultaneously writes
p to rpq for all q (including rpp ). It then attempts to figure out the first
writer by applying the above test for each q to rpq (standing in for rshared ),
rpp (r1 ) and rqq (r2 ). If it won against all the other processes, it decides its
own value. If not, it repeats the test recursively for some p′ that beat it until
¹¹ The main issue is that processes can only read the registers one at a time. An
alternative to running Algorithm 19.2 is to use a double-collect snapshot (see §20.1) to
simulate reading all three registers at once. However, this might require as many as twelve
read operations, since a process doing a snapshot has to re-read all three registers if any of
them change.
 1 v1 ← r1
 2 v2 ← r2
 3 if v1 = v2 = ⊥ then
 4     return no winner
 5 if v1 = 1 and v2 = ⊥ then
       // p1 went first
 6     return 1
   // read r1 again
 7 v1′ ← r1
 8 if v2 = 2 and v1′ = ⊥ then
       // p2 went first
 9     return 2
   // both p1 and p2 wrote
10 if rshared = 1 then
11     return 2
12 else
13     return 1
Algorithm 19.2: Determining the winner of a race between 2-register
writes. The assumption is that p1 and p2 each wrote their own IDs
to ri and rshared simultaneously. This code can be executed by any
process (including but not limited to p1 or p2) to determine which of
these 2-register writes happened first.
it finds a process that beat everybody, and returns its value. So m-register
writes solve m-process wait-free consensus.
A further tweak gets 2m−2: run two copies of an (m−1)-process protocol
using separate arrays of registers to decide a winner for each group. Then add
a second phase where processes contend across the groups. This involves each
process p from group 1 writing the winning ID for its group simultaneously
into sp and spq for each q in the other group. The first process to do this will
be the only process that wins against every process in the other group, so
we can pick a winning group by looking for some such process. We can then
return the input value for whichever process won within the winning group.
One thing to note about the second phase is that, unlike mutex, we can’t
just have the winners of the two groups fight each other, since this would
not give the wait-free property for non-winners. Instead, we have to allow a
non-winner p to pick up the slack for a slow winner and fight on behalf of
the entire group. This requires an m-process write operation to write sp and
all spq at once.
Now suppose we have 2m − 1 processes. The first part says that each of
the pending operations (x, y, all of the zi ) writes to 1 single-writer register
and at least k two-writer registers where k is the number of processes leading
to a different univalent value. This gives k + 1 total registers simultaneously
written by this operation. Now observe that with 2m − 1 processes, there is
some set of m processes whose operations all lead to a b-valent state; so
for any process to get to a (¬b)-valent state, it must write m + 1 registers
simultaneously. It follows that with only m simultaneous writes we can only
do (2m − 2)-consensus.
Curiously, we can see the last bivalent configuration in the algorithm
given earlier: as long as we have not had any process contend with the
processes in the other group, it is still possible for the winner of either group
to win the overall protocol. If we run each process until it is about to do its
final m-register write, we get exactly the situation where the processes in one
group give exactly m − 1 pending writes that lead to 0-valent configurations
and the processes in the other group give exactly m − 1 pending writes that
lead to 1-valent configurations, with all of these pending writes overlapping
in exactly the way required by the impossibility argument. In principle this
happens for any consensus implementation that is subject to this kind of
bivalence argument, but it is nice to see the structure of the upper bound
and lower bound matching up so directly in this case.
its initial state, since we know that before x or y the configuration is still
bivalent.)
So the m-process consensus object has consensus number m. This shows
that hrm is nonempty at each level.
A natural question at this point is whether the inability of m-process
consensus objects to solve (m + 1)-process consensus implies robustness of the
hierarchy. One might consider the following argument: given any object at
level m, we can simulate it with an m-process consensus object, and since we
can’t combine m-process consensus objects to boost the consensus number,
we can’t combine any objects they can simulate either. The problem here is
that while m-process consensus objects can simulate any object in a system
with m processes (see below), it may be that some objects can do more in a
system with m + 1 processes while still not solving (m + 1)-process consensus.
A simple way to see this would be to imagine a variant of the m-process
consensus object that doesn’t fail completely after m operations; for example,
it might return one of the first two inputs given to it instead of ⊥. This
doesn’t help with solving consensus, but it might (or might not) make it too
powerful to implement using standard m-process consensus objects.
An m-process consensus object is arguably a very artificial way to populate
all levels of the consensus hierarchy. Mostefaoui et al. [MPR18] proposed
m-sliding window registers as a “natural” class of objects that has this
property. An m-sliding window register RWm possesses a write operation
and a read operation that returns the last m values written to the register
in the order they were written.
It’s easy to solve m-process consensus using this object. We assume that
the initial state of the register does not contain any process IDs, and have
each contending process write its ID to the register. The first writer wins.
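A Python sketch of this protocol (all names are ours; the Lock simulates
the atomicity of the register’s operations, and we assume the register starts
empty, with no IDs in it):

import threading

class SlidingWindowRegister:
    """m-sliding window register RW_m: read returns the last m values
    written, in the order they were written."""
    def __init__(self, m):
        self.m = m
        self._lock = threading.Lock()   # simulated atomicity
        self._log = []
    def write(self, v):
        with self._lock:
            self._log.append(v)
    def read(self):
        with self._lock:
            return list(self._log[-self.m:])

def consensus(reg, prefs, i, v):
    """Process i's side of m-process consensus from one RW_m object;
    prefs is an array of single-writer preference registers."""
    prefs[i] = v
    reg.write(i)              # contend by writing my ID
    first = reg.read()[0]     # with at most m writers total, the first
                              # ID written is still inside the window
    return prefs[first]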
The proof that an m-sliding window register can’t solve consensus for
m + 1 processes is similar to that for m-process consensus objects. Given
a system consisting of read-write registers and RWm objects, choosing the
bivalent successor of any configuration either works forever or eventually
reaches a configuration C with only univalent successors. By the usual
argument, the m + 1 pending operations in C must all be operations on the
same m-sliding window register.
We can easily show that none of these operations can be read operations.
Suppose x is a read operation such that Cx is b-valent, and let y be any
operation such that Cy is ¬b-valent. Then Cxy and Cy are indistinguishable
to the n − 1 processes that do not execute x, giving a contradiction.
Now let x and y be write operations where Cx is 0-valent and Cy is
1-valent. Let z1 , . . . , zm−1 be the remaining operations enabled in C. Then
CHAPTER 19. THE WAIT-FREE HIERARCHY 186
Cxyz1 . . . zm−1 and Cyz1 . . . zm−1 apply the same last m writes to the sliding
window register, leaving the resulting configurations indistinguishable to all
processes if the process carrying out x takes no more steps.
Mostefaoui et al. observe that taking this argument to the limit shows that
an unbounded distributed ledger has infinite consensus number, which is not
entirely surprising given that such an object is equivalent to fetch-and-cons
(§19.2.3).
 1 procedure apply(π)
       // announce my intended operation
 2     op[i] ← π
 3     while true do
           // find a recent round
 4         r ← max_j round[j]
           // obtain the history as of that round
 5         if hr = ⊥ then
 6             hr ← consensus(c[r], ⊥)
 7         if π ∈ hr then
 8             return value π returns in hr
           // else attempt to advance
 9         h′ ← hr
10         for each j do
11             if op[j] ∉ h′ then
12                 append op[j] to h′
Chapter 20

Atomic snapshots
We’ve seen in the previous chapter that there are a lot of things we can’t
make wait-free with just registers. But there are a lot of things we can.
Atomic snapshots are a tool that lets us do a lot of these things easily.
An atomic snapshot object acts like a collection of n single-writer
multi-reader atomic registers with a special snapshot operation that returns
(what appears to be) the state of all n registers at the same time. This
is easy without failures: we simply lock the whole register file, read them
all, and unlock them to let all the starving writers in. But it gets harder if
we want a protocol that is wait-free, where any process can finish its own
snapshot or write even if all the others lock up.
We’ll give the usual sketchy description of a couple of snapshot algo-
rithms. More details on early snapshot results can be found in [AW04, §10.3]
or [Lyn96, §13.3]. There is also a reasonably recent survey by Fich on upper
and lower bounds for the problem [Fic05].
If a process performs two collects in a row and both return identical values,
then we know that these values were present in the registers at some time in
between the two collects. This gives us a very simple algorithm for snapshot.
Unfortunately, it doesn’t terminate if there are a lot of writers around. So
we need some way to slow the writers down, or at least get them to do
snapshots for us.
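In Python, the double-collect idea looks something like this sketch (registers
are modeled as one-element lists so that individual reads and writes are
atomic in CPython; the function names are ours):

def collect(registers):
    """Read each register once, in order."""
    return [r[0] for r in registers]

def snapshot(registers):
    """Double collect: retry until two consecutive collects agree.
    Correct when it returns, but not wait-free: a steady stream of
    writers can keep it from ever terminating."""
    old = collect(registers)
    while True:
        new = collect(registers)
        if new == old:
            return new     # these values coexisted between the collects
        old = new

This exhibits exactly the non-termination with many writers that the text
warns about; the Afek et al. algorithm fixes it by making writers perform
embedded scans.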
20.2.1 Linearizability
We now need to argue that the snapshot vectors returned by the Afek et al.
algorithm really work, that is, that between each matching invoke-snapshot
and respond-snapshot there was some actual time where the registers in the
array contained precisely the values returned in the respond-snapshot action.
We do so by assigning a linearization point to each snapshot vector, a time
at which it appears in the registers (which for correctness of the protocol had
better lie within the interval between the snapshot invocation and response).
For snapshots obtained through case (a), take any time between the two
collects. For snapshots obtained through case (b), take the linearization point
already assigned to the snapshot vector provided by the third write. In the
latter case we argue by induction on termination times that the linearization
point lies inside the snapshot’s interval.
Note that this means that all snapshots were ultimately collected by two
successive collects returning identical values, since any case-(b) snapshot
sits on top of a finite regression of case-(b) snapshots that must end with a
case-(a) snapshot. This means that any snapshot corresponds to an actual
global state of the registers at some point in the execution, which is not true
of all snapshot algorithms. It also means that we can replace the registers in
the snapshot array with other objects that allow us to detect updates (say,
counters or max registers) and still get snapshots.
In an actual execution, the fact that we are waiting for double collects
with no intervening updates means that if there are many writers, eventually
all of them will stall waiting for a case-(a) snapshot to complete. So that
snapshot will complete because all the writers are stuck. In a sense, requiring
writers to do snapshots first almost gives us a form of locking, but without
the vulnerability to failures of a real lock.
two of my collects, I may notice none of them because all the sequence
numbers wrapped around all the way. But we can augment mod-m sequence
numbers with a second handshaking mechanism that detects when a large
enough number of snapshots have occurred; this acts like the guard bit on
an automobile odometer, that signals when the odometer has overflowed,
to prevent odometer fraud by just running the odometer forward an extra
million miles or so.
The result is the full version of Afek et al. [AAD+ 93]. (Our presentation
here follows [AW04, 10.3].) The key mechanism for detecting odometer fraud
is a handshake, a pair of single-writer bits used by two processes to signal
each other that they have done something. Call the processes S (for same)
and D (for different), and suppose we have handshake bits hS and hD. We
then provide operations tryHandshake (signal that something is happening)
and checkHandshake (check if something happened) for each process; these
operations are asymmetric. Following the presentation in [AW04], the code
looks something like the sketch below, with S trying to make the bits equal
and D trying to make them unequal:
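S’s operations:
 1 procedure tryHandshake()
 2     hS ← hD               // make the bits equal
 3 procedure checkHandshake()
 4     return hS ≠ hD        // true if D changed its bit since S's last try
D’s operations:
 5 procedure tryHandshake()
 6     hD ← ¬hS              // make the bits unequal
 7 procedure checkHandshake()
 8     return hD = hS        // true if S changed its bit since D's last try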
1. The toggle bit for some process q is unchanged between the two snapshots
taken by p. Since the bit is toggled with each update, this means
that an even number of updates to q’s segment occurred during the
interval between p’s writes. If this even number is 0, we are happy: no
updates means no call to tryHandshake by q, which means we don’t
see any change in q’s segment, which is good, because there wasn’t any.
If this even number is 2 or more, then we observe that each of these
events precedes the following one:
It follows that q both reads and writes the handshake bits in between
p’s calls to tryHandshake and checkHandshake, so p correctly sees
that q has updated its segment.
2. The toggle bit for q has changed. Then q did an odd number of updates
(i.e., at least one), and p correctly detects this fact.
What does p do with this information? Each time it sees that q has done
a scan, it updates a count for q. If the count reaches 3, then p can determine
that q’s last scanned value is from a scan that is contained completely within
the time interval of p’s scan. Either this is a direct scan, where q actually
performs two collects with no changes between them, or it’s an indirect
scan, where q got its value from some other scan completely contained within
q’s scan. In the first case p is immediately happy; in the second, we observe
that this other scan is also contained within the interval of p’s scan, and so
(after chasing down a chain of at most n − 1 indirect scans) we eventually
reach a direct scan contained within it that provided the actual value. In
either case p returns the value of a pair of adjacent collects with no changes
between them that occurred during the execution of its scan operation, which
gives us linearizability.
 1 procedure scan()
 2     for attempt ← 1 to 2 do
 3         Ri ← r ← max(R1, . . . , Rn, Ri + 1)
 4         collect ← read(S1 . . . Sn)
 5         view ← LAr(collect)
           // max computation requires a collect
 6         if max(R1, . . . , Rn) ≤ Ri then
 7             Vir ← view
 8             return Vir
1. All views returned by the scan operation are comparable; that is, there
exists a total order on the set of views (which can be extended to a
total order on scan operations by breaking ties using the execution
order).

2. Each view returned by a scan includes any update that completes
before the scan starts.

3. The total order on views respects the execution order: if π1 and π2 are
scan operations that return v1 and v2, then π1 <S π2 implies v1 ≤ v2.
(This gives us linearization.)
Let’s start with comparability. First observe that any view returned
is either a direct view (obtained from LAr ) or an indirect view (obtained
from Vjr for some other process j). In the latter case, following the chain of
indirect views eventually reaches some direct view. So all views returned for
a given round are ultimately outputs of LAr and thus satisfy comparability.
But what happens with views from different rounds? The lattice-
agreement objects only operate within each round, so we need to ensure that
any view returned in round r is included in any subsequent rounds. This is
where checking round numbers after calling LAr comes in.
Suppose some process i returns a direct view; that is, it sees no higher
round number in either its first attempt or its second attempt. Then at
the time it starts checking the round number in Line 6, no process has yet
written a round number higher than the round number of i’s view (otherwise
i would have seen it). So no process with a higher round number has yet
executed the corresponding collect operation. When such a process does
so, it obtains values that are at least as large as those fed into LAr , and i’s
round-r view is less than or equal to the vector of these values by upward
validity of LAr, and thus less than or equal to the vector of values returned
by LAr′ for r′ > r, by downward validity of LAr′. So we have comparability
of all direct views, which implies comparability of all indirect views as well.
To show that each view returned by a scan includes any preceding update,
we observe that either a process returns its first-try scan (which includes
the update by downward validity) or it returns the results of a scan in the
second-try round (which includes the update by downward validity in the
later round, since any collect in the second-try round starts after the update
occurs). So no updates are missed.
Now let’s consider two scan operations π1 and π2 where π1 precedes π2
in the execution. We want to show that, for the views v1 and v2 that these
scans return, v1 ≤ v2 . Pick some time between when π1 finishes and π2
starts, and let s be the contents of the registers at this time. Then v1 ≤ s by
upward validity, since any input fed to a lattice agreement object before π1
finishes was collected from a register whose value was no greater than it is in
s. Similarly, s ≤ v2 by downward validity, because v2 is at least as large as
the collect value read by π2 , and this is at least as large as s. So v1 ≤ s ≤ v2 .
ensure that out-of-date smaller sets don’t overwrite larger ones at any node,
and the cost of using this data structure and carrying out the double-collect
snapshot at a node with m leaves below it is shown to be O(m). So the total
cost of a snapshot is O(n + n/2 + n/4 + · · · + 1) = O(n), giving the linear time
bound.
Let’s now look at the details of this protocol. There are two main
components: the Union algorithm used to compute a new value for each
node of the tree, and the ReadSet and WriteSet operations used to store the
data in the node. These are both rather specialized algorithms and depend
on the details of the other, so it is not trivial to describe them in isolation
from each other; but with a little effort we can describe exactly what each
component demands from the other, and show that it gets it.
The Union algorithm does the usual two-collects-without-change trick to
get the values of the children and then stores the result. In slightly more
detail:
3. If the values obtained are the same in both collects, call WriteSet on
the current node to store the union of the two sets and proceed to the
parent node. Otherwise repeat the preceding step.
 1 procedure WriteSet(S)
 2     for i ← |S| down to 1 do
 3         a[i] ← S

 4 procedure ReadSet()
       // update p to last nonempty position
 5     while true do
 6         s ← a[p]
 7         if p = m or a[p + 1] = ∅ then
 8             break
 9         else
10             p ← p + 1
11     return s
Algorithm 20.4: Increasing set data structure
Naively, one might think that we could just write directly to a[|S|] and
skip the previous ones, but this makes it harder for a reader to detect that
a[|S|] is occupied. By writing all the previous registers, we make it easy to
tell if there is a set of size |S| or bigger in the sequence, and so a reader can
start at the beginning and scan forward until it reaches an empty register,
secure in the knowledge that no larger value has been written.⁴ Since we
want to guarantee that no reader ever spends more than O(m) operations
on an array of m registers (even if it does multiple calls to ReadSet), we also
have it remember the last location read in each call to ReadSet and start
there again on its next call. For WriteSet, because we only call it once, we
don’t have to be so clever, and can just have it write all |S| ≤ m registers.

⁴ This trick of reading in one direction and writing in another dates back to a paper by
Lamport from 1977 [Lam77].
We need to show linearizability. We’ll do so by assigning a specific
linearization point to each high-level operation. Linearize each call to ReadSet
at the last time that it reads a[p]. Linearize each call to WriteSet(S) at the
first time at which a[|S|] = S and a[i] ≠ ∅ for every i < |S| (in other words,
at the first time that some reader might be able to find and return S); if
there is no such time, linearize the call at the time at which it returns. Since
every linearization point is inside its call’s interval, this gives a linearization
that is consistent with the actual execution. But we have to argue that it
is also consistent with a sequential execution, which means that we need
to show that every ReadSet operation returns the largest set among those
whose corresponding WriteSet operations are linearized earlier.
Let R be a call to ReadSet and W a call to WriteSet(S). If R returns S,
then at the time that R reads S from a[|S|], we have that (a) every register
a[i] with i < |S| is non-empty (otherwise R would have stopped earlier), and
(b) |S| = m or a[|S| + 1] = ∅ (as otherwise R would have kept going after
later reading a[|S| + 1]). From the rule for when WriteSet calls are linearized,
we see that the linearization point of W precedes this time and that the
linearization point of any call to WriteSet with a larger set follows it. So
the return value of R is consistent.
The payoff: unless we do more updates than snapshots, don’t want to
assume multi-writer registers, are worried about unbounded space, have a
beef with huge registers, or care about constant factors, it costs no more
time to do a snapshot than a collect. So in theory we can get away with
assuming snapshots pretty much wherever we need them.
algorithm for a single scanner (i.e., only one process can do snapshots) in
which each updater maintains two copies of its segment, a high copy (that
may be more recent than the current scan) and a low copy (that is guaranteed
to be no more recent than the current scan). The idea is that when a scan is
in progress, updaters ensure that the values in memory at the start of the
scan are not overwritten before the scan is completed, by copying them to
the low registers, while the high registers allow new values to be written
without waiting for the scan to complete. Unbounded sequence numbers,
generated by the scanner, are used to tell which values are recent or not.
As long as there is only one scanner, nothing needs to be done to ensure
that all scans are consistent, and indeed the single-scanner algorithm can be
implemented using only atomic registers. But extending the algorithm to
multiple scanners is tricky. A simple approach would be to keep a separate
low register for each concurrent scan—however, this would require up to n
low registers and greatly increase the cost of an update. Instead, the authors
devise a mechanism, called a coordinated collect, that allows the scanners
collectively to implement a sequence of virtual scans that do not overlap.
Each virtual scan is implemented using the single-scanner algorithm, with its
output written to a common view array that is protected from inconsistent
updates using LL/SC operations (CAS also works). A scanner participates
in virtual scans until it obtains a virtual scan that is useful to it (this means
that the virtual scan has to take place entirely within the interval of the
process’s actual scan operation); the simplest way to arrange this is to have
each scanner perform two virtual scans and return the value obtained by the
second one.
The paper puts a fair bit of work into ensuring that only O(n) view
arrays are needed, which requires handling some extra special cases where
particularly slow processes don’t manage to grab a view before it is reallocated
for a later virtual scan. We avoid this complication by simply assuming an
unbounded collection of view arrays; see the paper for how to do this right.
A more recent paper by Fatourou and Kallimanis [FK07] gives improved
time and space complexity using the same basic technique.
LL/SC.
A call to scan copies the first of memory[j].high or memory[j].low that
has a sequence number less than the current sequence number. Pseudocode
is given as Algorithm 20.5.
 1 procedure scan()
 2     currSeq ← currSeq + 1
 3     for j ← 0 to n − 1 do
 4         h ← memory[j].high
 5         if h.seq < currSeq then
 6             view[j] ← h.value
 7         else
 8             view[j] ← memory[j].low.value
Algorithm 20.5: Single-scanner snapshot: scan

 1 procedure update(value)
 2     seq ← currSeq
 3     h ← memory[i].high
 4     if h.seq ≠ seq then
 5         memory[i].low ← h
 6     memory[i].high ← (value, seq)
Algorithm 20.6: Single-scanner snapshot: update
read it), and won’t get it from memory[i].low either (because the value
that is in memory[i].high will have seq < currSeq, and so S will take
that instead).
20.5 Applications
Here we describe a few things we can do with snapshots.
20.5.2 Counters
Given atomic snapshots, it’s easy to build a counter (supporting increment,
decrement, and read operations); or, in more generality, a generalized counter
(supporting increments by arbitrary amounts); or, in even more generality,
an object supporting any collection of commutative and associative update
operations (as long as these operations don’t return anything). The idea
is that each process stores in its segment the total of all operations it has
performed so far, and a read operation is implemented using a snapshot
followed by summing the results. This is a case where it is reasonable
to consider multi-writer registers in building the snapshot implementation,
because there is not necessarily any circularity in doing so.
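A Python sketch of this construction (all names are ours, and the lock-based
ToySnapshot is only a stand-in; a real implementation would plug in one of
the wait-free snapshot algorithms from this chapter):

import threading

class ToySnapshot:
    """Lock-based stand-in for an atomic snapshot object; not wait-free."""
    def __init__(self, n):
        self._lock = threading.Lock()
        self._seg = [0] * n
    def update(self, i, v):
        with self._lock:
            self._seg[i] = v
    def scan(self):
        with self._lock:
            return list(self._seg)

class SnapshotCounter:
    """Generalized counter: segment i holds the running total of
    process i's updates; a read sums a snapshot of the segments."""
    def __init__(self, n):
        self.snap = ToySnapshot(n)
        self.local = [0] * n          # process i's own running total
    def add(self, i, delta):          # increment or decrement by delta
        self.local[i] += delta        # single-writer: no race on local[i]
        self.snap.update(i, self.local[i])
    def read(self):
        return sum(self.snap.scan())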
Chapter 21

Lower bounds on perturbable objects
A historyless object is one where each operation either never changes the
object’s state or always sets it to a value that depends only on the operation
(like a write, but it may also return some information about the old
state). The point of historyless objects is that covering arguments work for
them: if there is a process with a pending update operation on some object,
the adversary can use it at any time to wipe out the state of the object and
hide any previous operations from any process except the updater (who, in
a typical covering argument, is quietly killed to keep it from telling anybody
what it saw).
Atomic registers are a common example of a historyless object: the read
never changes the state, and the write always replaces it. Swap objects
(with a swap operation that writes a new state while returning the old state)
are the canonical example, since they can implement any other historyless
object (and even have consensus number 2, showing that even extra consensus
power doesn’t necessarily help here). Test-and-sets (which are basically one-
bit swap objects where you can only swap in 1) are also historyless. In
contrast, anything that looks like a counter or similar object where the new
state is a combination of the old state and the operation is not historyless.
This is important because many of these objects turn out to be perturbable,
and if they were also historyless, we’d get a contradiction.
Below is a sketch of the proof. See the original paper [JTT00] for more
details.
The basic idea is to build a sequence of executions of the form Λk Σk π,
where Λk is a preamble consisting of various complete update operations and
k incomplete update operations by processes p1 through pn−1 , Σk delivers k
delayed writes from the incomplete operations in Λk, and π is an operation
by pn that returns some information about the object that is affected by
previous operations. To make our life easier, we’ll assume that π performs
only read steps.¹
We’ll expand Λk Σk to Λk+1 Σk+1 by inserting new operations in between
Λk and Σk , and argue that because those operations can change the value
returned by π, one of them must write an object not covered in Σk , which
will (after some more work) allow us to cover yet another object.
In order for these covered objects to keep accumulating, the reader has
to keep looking at them. To a first approximation, this means that we want
¹ The idea is that if π does anything else, then the return values of other steps can
be simulated by doing a read in place of the first step and using the property of being
historyless to compute the return values of subsequent steps. There is still a possible
objection that we might have some historyless objects that don’t even provide read steps.
The easiest way to work around this is to assume that our objects do in fact provide a read
step, because taking the read step away isn’t going to make implementing the candidate
perturbable object any easier.
• For a max register, let γ include a bigger write than all the others.
Chapter 22

Restricted-use objects
 1 procedure read(r)
 2     if switch = 0 then
 3         return 0 : read(left)
 4     else
 5         return 1 : read(right)
The intuition is that the max register is really a big tree of switch
variables, and we store a particular bit-vector in the max register by setting
to 1 the switches needed to make read follow the path corresponding to
that bit-vector. The procedure for writing 0x tests switch first, because once
switch gets set to 1, any 0x values are smaller than the largest value, and we
don’t want them getting written to left where they might confuse particularly
slow readers into returning a value we can’t linearize. The procedure for
writing 1x sets switch second, because (a) it doesn’t need to test switch, since
1x always beats 0x, and (b) it’s not safe to send a reader down into right
until some value has actually been written there.
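The same switch-based tree, sketched in Python over integer values
0 . . . m − 1 rather than bit-vectors (the framing and names are ours;
attribute reads and writes stand in for the atomic registers):

class MaxRegister:
    """Bounded max register for values 0..m-1, built as a tree of
    switch bits as described above."""
    def __init__(self, m):
        self.m = m
        if m > 1:
            self.switch = 0
            self.half = m // 2
            self.left = MaxRegister(self.half)       # values 0..half-1
            self.right = MaxRegister(m - self.half)  # values half..m-1

    def write(self, v):
        if self.m <= 1:
            return                   # only one possible value: nothing to do
        if v < self.half:
            if self.switch == 0:     # test switch first: stale small values
                self.left.write(v)   # must not confuse slow readers
        else:
            self.right.write(v - self.half)  # write right before opening
            self.switch = 1                  # the path to it

    def read(self):
        if self.m <= 1:
            return 0
        if self.switch == 0:
            return self.left.read()
        return self.half + self.right.read()

Each operation touches one switch per level of the tree, matching the
one-operation-per-bit cost noted below.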
It’s easy to see that read and write operations both require exactly
one operation per bit of the value read or written. To show that we get
linearizability, we give an explicit linearization ordering (see the paper for a
full proof that this works):
(a) Within this pile, we sort operations using the linearization ordering
for left.
(a) Within this pile, operations that touch right are ordered using
the linearization ordering for right. Operations that don’t (which
are the “do nothing” writes for 0x values) are placed consistently
with the actual execution order.
CHAPTER 22. RESTRICTED-USE OBJECTS 216
To show that this gives a valid linearization, we have to argue first that
any read operation returns the largest earlier write argument and that we
don’t put any non-concurrent operations out of order.
For the first part, any read in the 0 pile returns 0 : read(left), and
read(left) returns (assuming left is a linearizable max register) the largest
value previously written to left, which will be the largest value linearized
before the read, or the all-0 vector if there is no such value. In either case
we are happy. Any read in the 1 pile returns 1 : read(right). Here we have
to guard against the possibility of getting an all-0 vector from read(right)
if no write operations linearize before the read. But any write operation
that writes 1x doesn’t set switch to 1 until after it writes to right, so no read
operation ever starts read(right) until after at least one write to right has
completed, implying that that write to right linearizes before the read from
right. So in all cases the second-pile operations linearize as well.
Here is the snapshot-based method: if each process writes its own contri-
bution to the max register to a single-writer register, then we can read the
max register by taking a snapshot and returning the maximum value. (It is
not hard to show that this is linearizable.) This gives an unbounded max
register with read and write cost O(n). So by choosing this in preference
to the balanced tree when m is large, the cost of either operation on a max
register is min(⌈lg m⌉, O(n)).
We can combine this with the unbalanced tree by terminating the right
path with a snapshot-based max register. This gives a cost for reads and
writes of values v of O(min(log v, n)).
We’ve shown the recurrence T(m, n) ≥ min_t (max(T(t, n), T(m − t, n))) + 1,
with base cases T(1, n) = 0 and T(m, 1) = 0. The solution to this recurrence
is exactly min(⌈lg m⌉, n − 1), which is the same, except for a constant factor
on n, as the upper bound we got by choosing between a balanced tree for
small m and a snapshot for m ≥ 2^{n−1}. For small m, the recursive split we
get is also the same as in the tree-based algorithm: call the register r switch
and you can extract a tree from whatever algorithm somebody gives you. So
this says that the tree-based algorithm is (up to choice of the tree) essentially
the unique optimal bounded max register implementation for m ≤ 2^{n−1}.
It is also possible to show lower bounds on randomized implementations
of max registers and other restricted-use objects. See [AACH12, ACAH16,
HK14] for examples.
and a[1] as the tail. We’ll show how to construct such an object recursively
from smaller objects of the same type, analogous to the construction of an
m-valued max register (which we can think of as an m × 1 max array). The
idea is to split head into two pieces left and right as before, while representing
tail as a master copy stored in a max register at the top of the tree plus
cached copies at every internal node. These cached copies are updated by
readers at times carefully chosen to ensure linearizability.
The base of the construction is an ℓ-valued max register r, used directly
as a 1 × ℓ max array; this is the case where the head component is trivial and
we only need to store a.tail = r. Here calling write(a[0], v) does nothing,
while write(a[1], v) maps to write(r, v), and read(a) returns ⟨0, read(r)⟩.
For larger values of k, paste a kleft × ℓ max array left and a kright × ℓ max
array right together to get a (kleft + kright) × ℓ max array. This construction
uses a switch variable as in the basic construction, along with an ℓ-valued
max register tail that is used to store the value of a[1].
Calls to write(a[0], v) and read(a) follow the structure of the correspond-
ing operations for a simple max register, with some extra work in read to
make sure that the value in tail propagates into left and right as needed to
ensure the correct value is returned.
A call to write(a[1], v) writes tail directly, and then calls
read(a) to propagate the new value as well.¹
Pseudocode is given in Algorithm 22.3.
The individual step complexity of each operation is easily computed.
Assuming a balanced tree, write(a[0], v) takes exactly ⌈lg k⌉ steps, while
write(a[1], v) costs exactly ⌈lg ℓ⌉ steps plus the cost of read(a). Read
operations are more complicated. In the worst case, we have two reads of
a.tail and a write to a.right[1] at each level, plus up to two operations on
a.switch, for a total cost of at most (3⌈lg k⌉ − 1)(⌈lg ℓ⌉ + 2) = O(log k log ℓ)
steps. This dominates other costs in write(a[1], v), so the asymptotic cost
of both write and read operations is O(log k log ℓ).
In the special case where k = ℓ, both writes and reads have their step
complexities squared compared to a single-component k-valued max register.
22.6.1 Linearizability
In broad outline, the proof of linearizability follows the proof for a simple
max register. But as with snapshots, we have to show that the ordering of
¹ This call to read(a) was omitted in the original published version of the algorithm
[AACHE15], but was added in an erratum by the authors [AACHE18]. Without it,
the implementation can violate linearizability in some executions.
 1 procedure write(a[i], v)
 2     if i = 0 then
 3         if v < kleft then
 4             if a.switch = 0 then
 5                 write(a.left[0], v)
 6         else
 7             write(a.right[0], v − kleft)
 8             a.switch ← 1
 9     else
10         write(a.tail, v)
11         read(a)

12 procedure read(a)
13     x ← read(a.tail)
14     if a.switch = 0 then
15         write(a.left[1], x)
16         return read(a.left)
17     else
18         x ← read(a.tail)
19         write(a.right[1], x)
20         return ⟨kleft, 0⟩ + read(a.right)
Algorithm 22.3: Max array implementation
Figure 22.1: Snapshot from max arrays; taken from [AACHE15, Fig. 2]
Chapter 23

Common2
 1 procedure TAS2()
 2     if Consensus2(myId) = myId then
 3         return 0
 4     else
 5         return 1
Once we have test-and-set for two processes, we can easily get one-shot
swap for two processes. The trick is that a one-shot swap object always
returns ⊥ to the first process to access it and returns the other process’s value
to the second process. We can distinguish these two roles using test-and-set
and add a register to send the value across. Pseudocode is in Algorithm 23.2.
 1 procedure swap(v)
 2     a[myId] ← v
 3     if TAS2() = 0 then
 4         return ⊥
 5     else
 6         return a[¬myId]
Algorithm 23.2: One-shot swap for two processes
objects may only work for two specific processes). A process drops out if
it ever sees a 1. We can easily show that at most one process leaves each
subtree with all zeros, including the whole tree itself.
Unfortunately, this process does not give a linearizable test-and-set object.
It is possible that p1 loses early to p2 , but then p3 starts (elsewhere in the
tree) after p1 finishes, and races to the top, beating out p2 . To avoid this,
we can follow [AWW93] and add a gate bit that locks out latecomers.¹
The resulting construction looks something like Algorithm 23.3. This
gives a slightly different interface from straight TAS; instead of returning 0
for winning and 1 for losing, the algorithm returns ⊥ if you win and the id
of some process that beats you if you lose.² It’s not hard to see that this
gives a linearizable test-and-set after translating the values back to 0 and 1
(the trick for linearizability is that any process that wins saw an empty gate,
and so started before any other process finished). It also sorts the processes
into a rooted tree, with each process linearizing after its parent (this latter
claim is a little trickier, but basically comes down to a loser linearizing after
the process that defeated it either on gate or on one of the TAS2 objects).
This algorithm is kind of expensive: the losers that drop out early are
relatively lucky, but the winning process has to win a TAS2 against everybody,
for a total of Θ(n) TAS operations. We can reduce the cost to O(log n) if
our TAS2 objects allow arbitrary processes to execute them. This is done,
for example, in the RatRace test-and-set implementation of Alistarh et
al. [AAG+ 10], using a randomized implementation of TAS2 due to Tromp
and Vitányi [TV02] (see §25.5.2).
¹ The original version of this trick is from an earlier paper [AGTV92], where the gate
bit is implemented as an array of single-writer registers.
² Note that this process may also be a loser, just one that made it further up the tree
than you did. We can’t expect to learn the ID of the ultimate winner, because that would
solve n-process consensus.
 1 procedure compete(i)
       // check the gate
 2     if gate ≠ ⊥ then
 3         return gate
 4     gate ← i
       // Do tournament, returning id of whoever I lose to
 5     node ← leaf for i
 6     while node ≠ root do
 7         for each j whose leaf is below sibling of node do
 8             if TAS2(t[i, j]) = 1 then
 9                 return j
10         node ← node.parent
       // I win!
11     return ⊥
Algorithm 23.3: Tournament algorithm with gate
 1 procedure swap(v)
 2     i ← 0
 3     while true do
           // Look for a starting point
 4         while TAS(si) = 1 do
 5             i ← i + 1
 6         vi ← v
           // Check if we’ve been blocked
 7         if TAS(ti) = 0 then
               // We win, find our predecessor
 8             for j ← i − 1 down to 0 do
 9                 if TAS(tj) = 1 then
                       // Use this value
10                     return vj
and a max register accessed that keeps track of the largest position accessed
so far.
AMW implement accessed using a snapshot, which we will do as well to
avoid complications from trying to build a max register out of an infinitely
deep tree.⁴ Note that AMW don’t call this data structure a max register,
but we will, because we like max registers.
Code for the swap procedure is given in Algorithm 23.5.
To show Algorithm 23.5 works, we need the following technical lemma,
which, among other things, implies that node 1 − 2^{−depth} is always available
to be captured by the process at depth depth. This is essentially just a
restatement of Lemma 1 from [AMW11].
Lemma 23.4.1. For any x = k/2^q, where k is odd, no process attempts to
capture any y ∈ [x, x + 1/2^q) before some process writes x to accessed.
Proof. Suppose that the lemma fails, let y = ℓ/2^r be the first node captured
in violation of the lemma, and let x = k/2^q be such that y ∈ [x, x + 1/2^q)
but x has not been written to accessed when y is captured. Let p be the
process that captures y.
Now consider y′ = x − 1/2^r, the last node to the left of x at the same
depth as y. Why didn’t p capture y′?
One possibility is that some other process p′ blocked y′ during its return
phase. This p′ must have captured a node z > y′. If z > y, then p′ would
have blocked y first, preventing p from capturing it. So y′ < z < y.
The other possibility is that p never tried to capture y′, because some
other process p′ wrote some value z > y′ to accessed first. This value z must
also be less than y (or else p would not have tried to capture y).
In both cases, there is a process p′ that captures a value z with y′ < z < y,
before p captures y and thus before anybody writes x to accessed.
Since y′ < x and y′ < z, either y′ < z < x or y′ < x < z. In the first case,
z ∈ [y′, y′ + 1/2^r) is captured before y′ is written to accessed. In the second
case z ∈ [x, x + 1/2^q) is captured before x is written to accessed. Either
way, y is not the first capture to violate the lemma, contradicting our initial
assumption.
Using Lemma 23.4.1, it is straightforward to show that Algorithm 23.5 is
wait-free. If I get q for my value of depth, then no process will attempt to
⁴ The issue is not so much that we can’t store arbitrary dyadics, since we can encode them
using an order-preserving prefix-free code, but that, without some sort of helping mechanism,
a read running concurrently with endlessly increasing writes (e.g., 1/2, 3/4, 7/8, . . . ) might
not be wait-free. Plus as soon as the denominator exceeds 2^n, which happens after only n
calls to swap, O(n)-step snapshots are cheaper anyway.
 1 procedure swap(v)
       // Pick a new row just for me
 2     depth ← fetchAndIncrement(maxDepth)
       // Capture phase
 3     repeat
           // Pick leftmost node in my row greater than accessed
 4         cap ← min { x | x = k/2^depth for odd k, x > accessed }
           // Post my value
 5         reg[cap] ← v
           // Try to capture the test-and-set
 6         win ← TAS(tst[cap]) = 0
 7         writeMax(accessed, cap)
 8     until win
       // Return phase
       // Max depth reached by anybody left of cap
 9     maxPreviousDepth ← read(maxDepth)
10     ret ← cap
       // Block previous nodes until we find one we can take
11     repeat
12         ret ← max { x = k/2^q | q ≤ maxPreviousDepth, k odd, x < ret }
13         if ret < 0 then
14             return ⊥
15     until TAS(tst[ret]) = 1
16     return reg[ret]
Algorithm 23.5: Wait-free swap from test-and-set [AMW11]
of Jayanti [Jay98].
The lower bound applies a fortiori to the case where we don’t have
LL/SC or CAS and have to rely on 2-process consensus objects. But it’s not
out of the question that there is a matching upper bound in this case.
Chapter 24

Randomized consensus and test-and-set
24.2 History
The use of randomization to solve consensus in an asynchronous system
with crash failures was proposed by Ben-Or [BO83] for a message-passing
model. Chor, Israeli, and Li [CIL94] gave the first wait-free consensus
protocol for a shared-memory system, which assumed a particular kind of
weak adversary. Abrahamson [Abr88] defined strong and weak adversaries
and gave the first wait-free consensus protocol for a strong adversary; its
expected step complexity was $\Theta(2^{n^2})$. After failing to show that exponential
time was necessary, Aspnes and Herlihy [AH90a] showed how to do consensus
in O(n4 ) total step complexity, a value that was soon reduced to O(n2 log n)
by Bracha and Rachman [BR91]. This remained the best known bound for
the strong-adversary model until Attiya and Censor [AC08] showed matching
Θ(n2 ) upper and lower bounds on total step complexity. A later paper by
Aspnes and Censor [AC09] showed that it was also possible to get an O(n)
bound on individual step complexity.
For weak adversaries, the best known upper bound on individual step
complexity was O(log n) for a long time [Cha96, Aum97, Asp12b], with
an O(n) bound on total step complexity for some models [Asp12b]. More
recent work has lowered the individual step complexity bound to O(log log n),
under the assumption of an oblivious adversary [Asp12a]. No non-trivial
lower bound on expected individual step complexity is known, although
there is a known lower bound on the distribution of the individual step
complexity [ACH10].
In the following sections, we will concentrate on the more recent weak-
adversary algorithms. These have the advantage of being fast enough that
one might reasonably consider using them in practice, assuming that the
weak-adversary assumption does not create trouble, and they also require
less probabilistic machinery to analyze than the strong-adversary algorithms.
 1 preference ← input
 2 for r ← 1 . . . ∞ do
 3     (b, preference) ← AdoptCommit(AC[r], preference)
 4     if b = commit then
 5         return preference
 6     else
 7         do something to generate a new preference
The idea is that the adopt-commit takes care of ensuring that once
somebody returns a value (after receiving commit), everybody else who
doesn’t return adopts the same value (follows from coherence). Conversely,
if everybody already has the same value, everybody returns it (follows from
convergence). The only missing piece is the part where we try to shake all
the processes into agreement. For this we need a separate object called a
conciliator.
24.3.2 Conciliators
Conciliators are a weakened version of randomized consensus that replace
agreement with probabilistic agreement: the processes can disagree some-
times, but must agree with constant probability despite interference by the
adversary. An algorithm that satisfies termination, validity, and probabilistic
agreement is called a conciliator.¹
¹ Warning: This name has not really caught on in the general theory-of-distributed-
computing community, and so far only appears in papers that have a particular researcher
as a co-author [Asp12a, AE11, Asp12b, AACV17]. Unfortunately, there doesn’t seem to
be a better name for the same object that has caught on. So we are stuck with it for now.
shared data:
    binary registers r0 and r1, initially 0;
    weak shared coin sharedCoin
 1 procedure coinConciliator(v)
 2     rv ← 1
 3     if r¬v = 1 then
 4         return sharedCoin()
 5     else
 6         return v
This still leaves the problem of how to build a shared coin. In the
message-passing literature, the usual approach is to use cryptography,² but
because we are assuming an arbitrarily powerful adversary, we can’t use
cryptography.
If we don’t care how small δ gets, we could just have each process flip its
own local coin and hope that they all come up the same. (This is more or
less what was done by Abrahamson [Abr88].) But that might take a while. If
we aren’t willing to wait exponentially long, a better approach is to combine
many individual local coins using some sort of voting.
A version of this approach, based on a random walk, was used by Aspnes
and Herlihy [AH90a] to get consensus in (bad) polynomial expected time
against an adaptive adversary. A better version was developed by Bracha
² For example, Canetti and Rabin [CR93] solved Byzantine agreement in O(1) time by
building a shared coin on top of secret sharing.
value will agree with the first process; the only way that this can’t happen is
if some process writes a different value to the register before it notices the
first write.
The random choice of whether to write the register or not avoids this
problem. The idea is that even though the adversary can schedule a write at
a particular time, because it’s oblivious, it won’t be able to tell if the process
wrote (or was about to write) or did a no-op instead.
The basic version of this algorithm, due to Chor, Israeli, and Li [CIL94],
uses a fixed probability 1/(2n) of writing to the register. So once some process
writes to the register, the chance that any of the remaining n − 1 processes
write to it before noticing that it's non-null is at most (n − 1)/(2n) < 1/2. It's also
not hard to see that this algorithm uses O(n) total operations, although it
may be that one single process running by itself has to go through the loop
2n times before it finally writes the register and escapes.
Using increasing probabilities avoids this problem, because any process
that executes the main loop dlg ne + 1 times will write the register. This
establishes the O(log n) per-process bound on operations. At the same time,
an O(n) bound on total operations still holds, since each write has at least
a 1/(2n) chance of succeeding. The price we pay for the improvement is that
we increase the chance that an initial value written to the register gets
overwritten by some high-probability write. But the intuition is that the
probabilities can’t grow too much, because the probability that I write on
my next write is close to the sum of the probabilities that I wrote on my
previous writes—suggesting that if I have a high probability of writing next
time, I should have done a write already.
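Here is a rough Python simulation of the increasing-probabilities scheme under a uniformly random schedule, which stands in for the oblivious adversary; the structure follows the description above, but details of the actual algorithm may differ. Each scheduled event is a single shared-memory operation, so a decided write can still race with another process's write.

import random

def conciliator_run(inputs, seed=0):
    rng = random.Random(seed)
    n = len(inputs)
    reg = None                      # shared multi-writer register
    p = [1.0 / (2 * n)] * n         # per-process write probability
    pending = [False] * n           # True: coin said write, not yet written
    out = [None] * n
    live = list(range(n))
    while live:
        i = rng.choice(live)        # random (oblivious) schedule
        if pending[i]:              # perform the delayed write
            reg = inputs[i]
            out[i] = inputs[i]
            live.remove(i)
        elif reg is not None:       # read a non-null value: adopt it
            out[i] = reg
            live.remove(i)
        elif rng.random() < p[i]:   # read null; coin says write next step
            pending[i] = True
        else:                       # read null; double p and loop
            p[i] = min(1.0, 2 * p[i])
    return out

Disagreement occurs exactly when some process commits to writing, another write lands first, and the first process writes anyway without re-reading; running many seeds shows agreement with constant probability, matching the analysis below.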
Formalizing this intuition requires a little bit of work. Fix the schedule,
and let pi be the probability that the i-th write operation in this schedule
succeeds. Let t be the least value for which ∑_{i=1}^{t} p_i ≥ 1/4. We're going to
argue that with constant probability one of the first t writes succeeds, and
that the next n − 1 writes by different processes all fail.
The probability that none of the first t writes succeed is

∏_{i=1}^{t} (1 − p_i) ≤ ∏_{i=1}^{t} e^{−p_i} = exp(−∑_{i=1}^{t} p_i) ≤ e^{−1/4}.
Now observe that if some process p writes at or before the t-th write,
then any process q with a pending write either did no writes previously, or
its last write was among the first t − 1 writes, whose probabilities sum to
less than 1/4. In either case, q has at most a ∑_{i∈S_q} p_i + 1/(2n) chance of writing on
its pending attempt, where S_q is the set of indices in 1 . . . t − 1 where q
previously attempted to write.
Summing up these probabilities over all processes gives a total of
(n − 1)/(2n) + ∑_q ∑_{i∈S_q} p_i ≤ 1/2 + 1/4 = 3/4. So with probability at least
e^{−1/4}(1 − 3/4) = e^{−1/4}/4, we get agreement.
24.7 Sifters
A faster conciliator can be obtained using a sifter, which is a mechanism for
rapidly discarding processes using randomization [AA11] while keeping at
least one process around. The simplest sifter has each process either write a
register (with low probability) or read it (with high probability); all writers
and all readers that see ⊥ continue to the next stage of the protocol, while
all readers who see a non-null value drop out. If the probability of writing
is tuned carefully, this will reduce n processes to at most 2√n processes on
average; by iterating this mechanism, the expected number of remaining
processes can be reduced to 1 + ε after O(log log n + log(1/ε)) phases.
As with previous implementations of test-and-set (see Algorithm 23.3),
it’s often helpful to have a sifter return not only that a process lost but which
process it lost to. This gives the implementation shown in Algorithm 24.5.
To use a sifter effectively, p should be tuned to match the number of
processes that are likely to use it. This is because of the following lemma:
1 procedure sifter(p, r)
2 with probability p do
3 r ← id
4 return ⊥
5 else
6 return r
Lemma 24.7.1. Fix p, and let X processes execute a sifter with parameter
p. Let Y be the number of processes for which the sifter returns ⊥. Then

E[Y | X] ≤ pX + 1/p.    (24.7.1)
Proof. Let X be the number of survivors after ⌈lg lg n⌉ + ⌈log_{4/3}(7/ε)⌉ rounds
of sifters, with probabilities tuned as described above. We've shown that
E[X] ≤ 1 + ε, so E[X − 1] ≤ ε. Since X − 1 ≥ 0, from Markov's inequality
we have Pr[X ≥ 2] = Pr[X − 1 ≥ 1] ≤ E[X − 1]/1 ≤ ε.
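Choosing p to balance the two terms in (24.7.1) gives p = 1/√X and an expected 2√X survivors, which is where the 2√n figure above comes from. A quick Python simulation of one sifter round under a uniformly random arrival order (my own toy scheduler, not the adversary from the analysis) confirms the bound:

import random
from math import sqrt

def sift_once(ids, p, rng):
    reg = None
    survivors = []
    for i in rng.sample(ids, len(ids)):   # random arrival order
        if rng.random() < p:
            reg = i                       # writers always survive
            survivors.append(i)
        elif reg is None:
            survivors.append(i)           # readers that see null survive
    return survivors                      # readers that saw a value lost

def average_survivors(n, trials=1000, seed=1):
    rng = random.Random(seed)
    p = 1 / sqrt(n)                       # balances pX + 1/p at 2*sqrt(X)
    total = sum(len(sift_once(list(range(n)), p, rng))
                for _ in range(trials))
    return total / trials                 # empirically close to 2*sqrt(n)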
1 if gate ≠ ⊥ then
2 return 1
3 else
4 gate ← myId
5 for i ← 1 . . . ⌈log log n⌉ + ⌈log_{4/3}(7 log n)⌉ do
6 with probability min(1/2, 2^{1−2^{−i+1}} n^{−2^{−i}}) do
7 ri ← myId
8 else
9 w ← ri
10 if w ≠ ⊥ then
11 return 1
Each process writes down at the start of the protocol all of the coin-flips it
intends to use to decide whether to read or write at each round of sifting.
Together with its input, these coin-flips make up the process's persona.
In analyzing the progress of the sifter, we count surviving personae (with
multiple copies of the same persona counting as one) instead of surviving
processes.
Pseudocode for this algorithm is given in Algorithm 24.7. Note that the
loop body is essentially the same as the code in Algorithm 24.5, except that
the random choice is replaced by a lookup in persona.chooseWrite.
To show that this works, we need to argue that having multiple copies
of a persona around doesn’t change the behavior of the sifter. In each
round, we will call the first process with a given persona p to access ri
the representative of p, and argue that a persona survives round i in
this algorithm precisely when its representative would survive round i in
a corresponding test-and-set sifter with the schedule restricted only to the
representatives.
There are three cases:
1. The representative of p writes. Then at least one copy of p survives.
2. The representative of p reads a null value. Again at least one copy of
1 procedure conciliator(input)
2 Let R = ⌈log log n⌉ + ⌈log_{4/3}(7/ε)⌉
3 Let chooseWrite be a vector of R independent random Boolean
variables with Pr[chooseWrite[i] = 1] = pi , where
pi = 2^{1−2^{−i+1}} n^{−2^{−i}} for i ≤ ⌈log log n⌉ and pi = 1/2 for larger i.
4 persona ← hinput, chooseWrite, myIdi
5 for i ← 1 . . . R do
6 if persona.chooseWrite[i] = 1 then
7 ri ← persona
8 else
9 v ← ri
10 if v 6= ⊥ then
11 persona ← v
12 return persona.input
Algorithm 24.7: Sifting conciliator (from [Asp12a])
p survives.
3. The representative of p reads a non-null value. Then no copy of p
survives: all copies of p make the same choice to read in this round (the
coin-flips are part of the persona), and any copy that reads after the
representative also sees a non-null value, so every copy adopts some
other persona.
From the preceding analysis for test-and-set, we have that after O(log log n +
log(1/ε)) rounds with appropriate probabilities of writing, at most 1 + ε values
survive on average. This gives a probability of at most ε of disagreement. By
alternating these conciliators with adopt-commit objects, we get agreement
in O(log log n + log m/ log log m) expected time, where m is the number of
possible input values.
I don’t think the O(log log n) part of this expression is optimal, but I
don’t know how to do better.
A nice feature of this sifter is that (if you don't care about space) it doesn't have any parameters that
require tuning to n: this means that exactly the same structure can be used
in each round. An unfortunate feature is that it's not possible to guarantee
that every process that leaves learns the identity of a process that stays: this
means that it can't be adapted into a consensus protocol using the persona trick
described in §24.7.2.
Pseudocode is given in Algorithm 24.8. In this simplified version, we
assume an infinitely long array A[1 . . . ], so that we don’t need to worry
about n. Truncating the array at log n also works, but the analysis requires
handling the last position as a special case, which I am too lazy to do here.
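Since the pseudocode for Algorithm 24.8 is not reproduced here, the following Python sketch is a reconstruction of its general shape from the proof below, and may differ from the actual algorithm in details: each process picks a level r geometrically (so Pr[r = i] = 2^{−i}), writes A[r] ← 1, and survives if it then reads A[r + 1] as still 0. The schedule lets each process read immediately after writing, which the proof identifies as the adversary's best strategy.

import random

def geometric_sifter(n, seed=0):
    rng = random.Random(seed)
    A = {}                              # models the infinite array A[1...]
    levels = []
    for _ in range(n):
        r = 1
        while rng.random() < 0.5:       # climb with probability 1/2
            r += 1
        levels.append(r)                # Pr[r = i] = 2**-i
    survivors = []
    for i in rng.sample(range(n), n):   # oblivious order of accesses
        A[levels[i]] = 1                # write A[r] <- 1
        if not A.get(levels[i] + 1):    # survive iff A[r+1] still 0
            survivors.append(i)
    return survivors                    # never empty: max-level writers survive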
Proof. For the first part, observe that any process that picks the largest
value of r among all processes will survive; since the number of processes is
finite, there is at least one such survivor.
For the second part, let Xi be the number of survivors with r = i. Then
E [Xi ] is bounded by n · 2−i , since no process survives with r = i without
first choosing r = i. But we can also argue that E [Xi ] ≤ 3 for any value of
n, by considering the sequence of write operations in the execution.
Because the adversary is oblivious, the location of these writes is uncor-
related with their ordering. If we assume that the adversary is trying to
maximize the number of survivors, its best strategy is to allow each process
to read immediately after writing, as delaying this read can only increase the
probability that A[r + 1] is nonzero. So in computing Xi , we are counting
the number of writes to A[i] before the first write to A[i + 1]. Let’s ignore
all writes to other registers; then the j-th write to either of A[i] or A[i + 1]
has a conditional probability of 2/3 of landing on A[i] and 1/3 on A[i + 1].
because once n · 2^{−i} drops below 3, the remaining terms form a geometric
series.
Chapter 25

Renaming
25.1 Renaming
In the renaming problem, we have n processes, each of which starts with a name
from some huge namespace, and we'd like to assign them each unique names
from a much smaller namespace. The main application is allowing us to run
algorithms that assume that the processes are given contiguous numbers,
e.g., the various collect or atomic snapshot algorithms in which each process
is assigned a unique register and we have to read all of the registers. With
renaming, instead of reading a huge pile of registers in order to find the few
that are actually used, we can map the processes down to a much smaller
set.
Formally, we have a decision problem where each process has input xi
(its original name) and output yi , with the requirements:
Uniqueness If pi ≠ pj , then yi ≠ yj .
Anonymity The code executed by any process depends only on its input
xi : for any execution of processes p1 . . . pn with inputs x1 . . . xn , and
any permutation π of [1 . . . n], there is a corresponding execution of
processes pπ(1) . . . pπ(n) with inputs x1 . . . xn in which pπ(i) performs
exactly the same operations as pi and obtains the same output yi .
25.2 Performance
Conventions on counting processes:
1 procedure getName()
2 s←1
3 while true do
4 a[i] ← s
5 view ← snapshot(a)
6 if view[j] = s for some j ≠ i then
7 r ← |{j : view[j] ≠ ⊥ ∧ j ≤ i}|
8 s ← r-th positive integer not in
{view[j] : j ≠ i ∧ view[j] ≠ ⊥}
9 else
10 return s
The array a holds proposed names for each process (indexed by the
original names), or ⊥ for processes that have not proposed a name yet. If a
process proposes a name and finds that no other process has proposed the
same name, it takes it; otherwise it chooses a new name by first computing
its rank r among the active processes and then choosing the r-th smallest
name that hasn’t been proposed by another process. Because the rank is at
most n and there are at most n − 1 names proposed by the other processes,
this always gives proposed names in the range [1 . . . 2n − 1]. But it remains
to show that the algorithm satisfies uniqueness and termination.
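The name-choosing rule in the middle of the loop is easy to get wrong, so here is a small Python rendering of it; next_proposal is a helper name invented for this sketch. Given a snapshot view indexed by original names, it computes the rank r and then the r-th smallest name not proposed by anyone else.

def next_proposal(view, i):
    # rank among processes with non-null proposals and original name <= i
    rank = sum(1 for j, v in enumerate(view) if v is not None and j <= i)
    taken = {v for j, v in enumerate(view) if j != i and v is not None}
    s = 0
    while rank:                    # find the rank-th free positive integer
        s += 1
        if s not in taken:
            rank -= 1
    return s

# Example: process i = 2 sees proposals [1, 1, 1, None]; its rank is 3,
# and the 3rd positive integer not in {1} is 4.
assert next_proposal([1, 1, 1, None], 2) == 4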
For uniqueness, consider two processes with original names i and j. Suppose
that i and j both decide on s. Then i sees a view in which a[i] = s and
a[j] ≠ s, after which it no longer updates a[i]. Similarly, j sees a view in
which a[j] = s and a[i] ≠ s, after which it no longer updates a[j]. If i's view
is obtained first, then j can't see a[i] ≠ s, but the same holds if j's view is
obtained first. So in either case we get a contradiction, proving uniqueness.
Termination is a bit trickier. Here we argue that no process can run
forever without picking a name, by showing that if we have a set of processes
that are doing this, the one with smallest original name eventually picks a
name. More formally, call a process trying if it runs for infinitely many steps
without choosing a name. Then in any execution with at least one trying
process, eventually we reach a configuration where all processes have either
finished or are trying. In some subsequent configuration, all the processes
have written to the a array at least once; from this point on, the set of non-
null positions in a—and thus the rank each process computes for itself—is
stable.
Starting from some such stable configuration, look at the trying process
i with the smallest original name, and suppose it has rank r. Let F =
{z1 < z2 . . . } be the set of “free names” that are not proposed in a by any
of the finished processes. Observe that no trying process j ≠ i ever proposes
a name in {z1 . . . zr }, because any such process has rank greater than r.
This leaves zr open for i to claim, provided the other names in {z1 . . . zr }
eventually become free. But this will happen, because only trying processes
may have proposed these names (early on in the execution, when the finished
processes hadn’t finished yet), and the trying processes eventually propose
new names that are not in this range. So eventually process i proposes zr ,
sees no conflict, and finishes, contradicting the assumption that it is trying.
Note that we haven’t proved any complexity bounds on this algorithm at
all, but we know that the snapshot alone takes at least Ω(N ) time and space.
Brodksy et al. [BEW11] cite a paper of Bar-Noy and Dolev [BND89] as giving
a shared-memory version of [ABND+ 90] with complexity O(n · 4n ); they also
give algorithms and pointers to algorithms with much better complexity.
1 procedure releaseName()
2 a[i] ← ⊥
Algorithm 25.2: Releasing a name
Some process always makes progress in getName. It may be, however, that there
is some process that never successfully obtains a name, because it keeps
getting stepped on by other processes zipping in and out of getName and
releaseName.
25.4.3.1 Splitters
The Moir-Anderson renaming protocol uses a network of splitters, which
we last saw providing a fast path for mutual exclusion in §18.5.2. Each
splitter is a widget, built from a pair of atomic registers, that assigns to
each process that arrives at it the value right, down, or stop. As discussed
previously, the useful properties of splitters are that if at least one process
arrives at a splitter, then (a) at least one process returns right or stop; and
(b) at least one process returns down or stop; (c) at most one process returns
stop; and (d) any process that runs by itself returns stop.
We proved the last two properties in §18.5.2; we’ll prove the first two here.
Another way of describing these properties is that of all the processes that
arrive at a splitter, some process doesn’t go down and some process doesn’t
go right. By arranging splitters in a grid, this property guarantees that every
row or column that gets at least one process gets to keep it—which means
that with k processes, no process reaches row k + 1 or column k + 1.
Algorithm 25.3 gives the implementation of a splitter (it’s identical to
Algorithm 18.6, but it will be convenient to have another copy here).
Lemma 25.4.1. If at least one process completes the splitter, at least one
process returns stop or right.
Proof. Suppose no process returns right; then every process sees open in
door, which means that every process writes its ID to race before any process
shared data:
1 atomic register race, big enough to hold an ID, initially ⊥
2 atomic register door, big enough to hold a bit, initially open
3 procedure splitter(id)
4 race ← id
5 if door = closed then
6 return right
7 door ← closed
8 if race = id then
9 return stop
10 else
11 return down
closes the door. Some process writes its ID last: this process will see its own
ID in race and return stop.
Lemma 25.4.2. If at least one process completes the splitter, at least one
process returns stop or down.
Proof. First observe that if no process ever writes to door, then no process
completes the splitter, because the only way a process can finish the splitter
without writing to door is if it sees closed when it reads door (which must
have been written by some other process). So if at least one process finishes,
at least one process writes to door. Let p be any such process. From the
code, having written door, it has already passed up the chance to return
right; thus it either returns stop or down.
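A quick way to convince yourself of these properties is to hammer a splitter with concurrent threads and check the outcomes, as in the Python sketch below. It leans on CPython's effectively-atomic attribute reads and writes to stand in for atomic registers, which is good enough for a demonstration but not a real implementation.

import random
import threading
import time

class Splitter:
    def __init__(self):
        self.race = None            # register race, holds an ID
        self.door = "open"          # register door, open or closed

    def enter(self, pid):
        self.race = pid
        if self.door == "closed":
            return "right"
        self.door = "closed"
        return "stop" if self.race == pid else "down"

def one_trial(n=8):
    s = Splitter()
    results = [None] * n            # each thread writes only its own slot
    def run(pid):
        time.sleep(random.random() * 1e-4)   # jitter the schedule
        results[pid] = s.enter(pid)
    threads = [threading.Thread(target=run, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert results.count("stop") <= 1                           # at most one stop
    assert results.count("stop") + results.count("right") >= 1  # Lemma 25.4.1
    assert results.count("stop") + results.count("down") >= 1   # Lemma 25.4.2

for _ in range(100):
    one_trial()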
(by Lemma 18.5.3); we also have to show that if at most m processes enter
the grid, every process stops at some splitter.
The argument for this is simple. Suppose some process p leaves the
grid on one of the 2m output wires. Look at the path it takes to get there
(see Figure 25.2, also taken from [Asp10]). Each splitter on this path must
handle at least two processes (or p would have stopped at that splitter, by
Lemma 18.5.4). So some other process leaves on the other output wire, either
right or down. If we draw a path from each of these wires that continues
right or down to the end of the grid, then at every step along this path
we either have a process stop or continue in this same direction as long as
there is a process left to do so. This means that on each of these m disjoint
paths, either some splitter stops a process, or some process reaches a final
output wire, each of which is at a distinct splitter. But this gives m distinct
processes in addition to p, for a total of m + 1 processes. It follows that:
If we don’t know k in advance, we can still guarantee names of size O(k 2)
by carefully arranging them so that each k-by-k subgrid contains the first k2
names. This gives an adaptive renaming algorithm (although the namespace
size is pretty high). We still have to choose our grid to be large enough for
the largest k we might actually encounter; the resulting space complexity is
O(n2 ).
With a slightly more clever arrangement of the splitters, it is possible to
reduce the space complexity to O(n^{3/2}) [Asp10]. Whether further reductions
are possible is an open problem. Note however that linear time complexity
makes splitter networks uncompetitive with much faster randomized algo-
rithms (as we’ll see in §25.5), so this may not be a very important open
problem.
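For completeness, here is a toy Python simulation of the grid construction itself, with the same caveat as the splitter sketch earlier about relying on CPython's atomic attribute updates. Each process walks from the top-left corner, moving right or down until some splitter returns stop, and takes its cell as its name; the naive cell numbering used here gives an O(n²) namespace rather than the tighter arrangements discussed above.

import threading

class Splitter:
    def __init__(self):
        self.race, self.door = None, "open"

    def enter(self, pid):
        self.race = pid
        if self.door == "closed":
            return "right"
        self.door = "closed"
        return "stop" if self.race == pid else "down"

def grid_rename(n):
    grid = [[Splitter() for _ in range(n)] for _ in range(n)]
    names = [None] * n
    def run(pid):
        r = c = 0
        while True:                          # walk until some splitter stops us
            d = grid[r][c].enter(pid)
            if d == "stop":
                names[pid] = r * n + c + 1   # naive cell numbering
                return
            r, c = (r, c + 1) if d == "right" else (r + 1, c)
    threads = [threading.Thread(target=run, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert len(set(names)) == n              # uniqueness: one stop per splitter
    return names

print(grid_rename(8))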
snapshot algorithm, we can't say much about the time complexity of this
combined one. Moir and Anderson suggest instead using an O(Nk²) algo-
rithm of Borowsky and Gafni to get O(k⁴) time for the combined algorithm.
This is close to the best known: a later paper by Afek and Merritt [AM99]
holds the current record for deterministic adaptive renaming into 2k − 1
names at O(k²) individual steps. On the lower bound side, it is known that
Ω(k) is a lower bound on the individual steps of any renaming protocol with
a polynomial output namespace [AAGG11].
(we can’t hope for less than O(n2 ) because of the birthday paradox). But
we want all processes to be guaranteed to have unique names, so we need
some more machinery.
We also need the processes to have initial names; if they don’t, there is al-
ways some nonzero probability that two identical processes will flip their coins
in exactly the same way and end up with the same name. This observation
was formalized by Buhrman, Panconesi, Silvestri, and Vitányi [BPSV06].
known.
There is a rather weak lower bound in the Alistarh et al. paper that shows
that Ω(log log n) steps are needed for some process in the worst case, under
the assumption that the renaming algorithm uses only test-and-set objects
and that a process acquires a name as soon as it wins some test-and-set
object. This does not give a lower bound on the problem in general, and
indeed the renaming-network based algorithms discussed previously do not
have this property. So the question of the exact complexity of randomized
loose renaming is still open.
Chapter 26
Software transactional
memory
Last updated 2011. Some material may be out of date. If you are interested
in software transactional memory from a theoretical perspective, there is a
more recent survey on this material by Attiya [Att14], available at
http://www.eatcs.org/images/bulletin/beatcs112.pdf.
26.1 Motivation
Some selling points for software transactional memory:
On the other hand, we now have to deal with the possibility that opera-
tions may fail. There is a price to everything.
increasing order means that I have to know which locks I want before
I acquire any of them, which may rule out dynamic transactions.
1 if LL(status) = ⊥ then // transaction not yet resolved
2 if LL(r) = oldValue then // register still holds the expected old value
3 if SC(status, ⊥) = true then // status unchanged since we looked
4 SC(r, newValue) // safe to install the new value
to the RMW, and an array oldValues[] of old values at these addresses (for
the R part of the RMW). These are all initialized by the initiator of the
transaction, who will be the only process working on the transaction until it
starts acquiring locks.
1. Initialize the record rec for the transaction. (Only the initiator does
this.)
Note that only an initiator helps; this avoids a long chain of helping and
limits the cost of each attempted transaction to the cost of doing two full
transactions, while (as shown below) still allowing some transaction to finish.
26.4 Improvements
One downside of the Shavit and Touitou protocol is that it uses LL/SC very
aggressively (e.g., with overlapping LL/SC operations) and uses non-trivial
(though bounded, if you ignore the ever-increasing version numbers) amounts
of extra space. Subsequent work has aimed at knocking these down; for
example a paper by Harris, Fraser, and Pratt [HFP02] builds multi-register
CAS out of single-register CAS with O(1) extra bits per register. The proof
of these later results can be quite involved; Harris et al., for example, base
their algorithm on an implementation of 2-register CAS whose correctness
has been verified only by machine (which may be a plus in some views).
26.5 Limitations
There has been a lot of practical work on STM designed to reduce overhead
on real hardware, but there’s still a fair bit of overhead. On the theory side,
a lower bound of Attiya, Hillel, and Milani [AHM09] shows that any STM
system that guarantees non-interference between non-overlapping RMW
transactions has the undesirable property of making read-only transactions
as expensive as RMW transactions: this conflicts with the stated goals
of many practical STM implementations, where it is assumed that most
transactions will be read-only (and hopefully cheap). So there is quite a bit
of continuing research on finding the right trade-offs.
Chapter 27
Obstruction-freedom
Last updated 2011. Some material may be out of date. In particular: §27.3 has
not been updated to include some more recent results [ACHS16, GHHW13];
and §27.4 mostly follows the conference version [FHS05] of the Ellen-Hendler-
Shavit paper and omits stronger results from the journal version [EHS12].
27.2 Examples
27.2.1 Lock-free implementations
Pretty much anything built using compare-and-swap or LL/SC ends up
being lock-free. A simple example would be a counter, where an increment
operation does
1 x ← LL(C)
2 SC(C, x + 1)
1 x←0
2 while true do
3 δ ← x − a[1 − i]
4 if δ = 2 (mod 5) then
5 return 0
6 else if δ = −1 (mod 5) do
7 return 1
8 else
9 x ← (x + 1) mod 5
10 a[i] ← x
1 procedure rightPush(v)
2 while true do
3 k ← oracle(right)
4 prev ← a[k − 1]
5 next ← a[k]
6 if prev.value ≠ RN and next.value = RN then
7 if CAS(a[k − 1], prev, [prev.value, prev.version + 1]) then
8 if CAS(a[k], next, [v, next.version + 1]) then
9 we win, go home
10 procedure rightPop()
11 while true do
12 k ← oracle(right)
13 cur ← a[k − 1]
14 next ← a[k]
15 if cur.value ≠ RN and next.value = RN then
16 if cur.value = LN and a[k − 1] = cur then
17 return empty
18 else if CAS(a[k], next, [RN, next.version + 1]) do
19 if CAS(a[k − 1], cur, [RN, cur.version + 1]) then
20 return cur.value
only if neither register was modified between the preceding read and the
CAS. If both registers are unmodified at the time of the second CAS, then
the two CAS operations act like a single two-word CAS, which replaces the
previous values (top, RN) with (top, value) in rightPush or (top, value) with
(top, RN) in rightPop; in either case the operation preserves the invariant.
So the only way we get into trouble is if, for example, a rightPush does a
CAS on a[k − 1] (verifying that it is unmodified and incrementing the version
number), but then some other operation changes a[k − 1] before the CAS on
a[k]. If this other operation is also a rightPush, we are happy, because it
must have the same value for k (otherwise it would have failed when it saw
a non-null in a[k − 1]), and only one of the two right-pushes will succeed
in applying the CAS to a[k]. If the other operation is a rightPop, then it
can only change a[k − 1] after updating a[k]; but in this case the update to
a[k] prevents the original right-push from changing a[k]. With some more
tedious effort we can similarly show that any interference from leftPush or
leftPop either causes the interfering operation or the original operation to
fail. This covers 4 of the 16 cases we need to consider. The remaining cases
will be brushed under the carpet to avoid further suffering.
don’t know what this ratio is.3 In particular, if I can execute more than R
steps without you doing anything, I can reasonably conclude that you are
dead—the semisynchrony assumption thus acts as a failure detector.
The fact that R is unknown might seem to be an impediment to using
this failure detector, but we can get around this. The idea is to start with
a small guess for R; if a process is suspected but then wakes up again, we
increment the guess. Eventually, the guessed value is larger than the correct
value, so no live process will be falsely suspected after this point. Formally,
this gives an eventually perfect (♦P ) failure detector, although the algorithm
does not specifically use the failure detector abstraction.
To arrange for a solo execution, when a process detects a conflict (because
its operation didn’t finish quickly), it enters into a “panic mode” where pro-
cesses take turns trying to finish unmolested. A fetch-and-increment register
is used as a timestamp generator, and only the process with the smallest
timestamp gets to proceed. However, if this process is too sluggish, other
processes may give up and overwrite its low timestamp with ∞, temporarily
ending its turn. If the sluggish process is in fact alive, it can restore its low
timestamp and kill everybody else, allowing it to make progress until some
other process declares it dead again.
The simulation works because eventually the mechanism for detecting
dead processes stops suspecting live ones (using the technique described
above), so the live process with the winning timestamp finishes its operation
without interference. This allows the next process to proceed, and eventually
all live processes complete any operation they start, giving the wait-free
property.
The actual code is in Algorithm 27.3. It’s a rather long algorithm but
most of the details are just bookkeeping.
The preamble before entering PANIC mode is a fast-path computation
that allows a process that actually is running in isolation to skip testing
any timestamps or doing any extra work (except for the one register read of
PANIC). The assumption is that the constant B is set high enough that any
process generally will finish its operation in B steps without interference. If
there is interference, then the timestamp-based mechanism kicks in: we grab
a timestamp out of the convenient fetch-and-add register and start slugging
it out with the other processes.
(A side note: while the algorithm as presented in the paper assumes
a fetch-and-add register, any timestamp generator that delivers increasing
³This is a much older model, which goes back to a famous paper of Dwork, Lynch, and Stockmeyer [DLS88].
1 if ¬PANIC then
2 execute up to B steps of the underlying algorithm
3 if we are done then return
4 PANIC ← true // enter panic mode
5 myTimestamp ← fetchAndIncrement()
6 A[i] ← 1 // reset my activity counter
7 while true do
8 T [i] ← myTimestamp
9 minTimestamp ← myTimestamp; winner ← i
10 for j ← 1 . . . n, j ≠ i do
11 otherTimestamp ← T [j]
12 if otherTimestamp < minTimestamp then
13 T [winner] ← ∞ // not looking so winning any more
14 minTimestamp ← otherTimestamp; winner ← j
15 else if otherTimestamp < ∞ do
16 T [j] ← ∞
17 if i = winner then
18 repeat
19 execute up to B steps of the underlying algorithm
20 if we are done then
21 T [i] ← ∞
22 PANIC ← false
23 return
24 else
25 A[i] ← A[i] + 1
26 PANIC ← true
27 until T [i] = ∞
28 repeat
29 a ← A[winner]
30 wait a steps
31 winnerTimestamp ← T [winner]
32 until a = A[winner] or winnerTimestamp ≠ minTimestamp
33 if winnerTimestamp = minTimestamp then
34 T [winner] ← ∞ // kill winner for inactivity
values over time will work. So if we want to limit ourselves to atomic registers,
we could generate timestamps by taking snapshots of previous timestamps,
adding 1, and appending process IDs for tie-breaking.)
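A minimal sketch of that register-only timestamp generator (my own rendering of the parenthetical above, with an invented helper name): snapshot the existing timestamps, take the maximum, add 1, and break ties by process ID, so that timestamps are totally ordered and strictly increase over time.

def new_timestamp(snapshot, my_id):
    # snapshot: per-process list of (counter, id) pairs, or None if unset
    top = max((t for t in snapshot if t is not None), default=(0, -1))
    return (top[0] + 1, my_id)     # lexicographic order breaks ties by id

# Example: the largest counter seen is 3, so process 2 generates (4, 2).
assert new_timestamp([(3, 0), None, (2, 1)], 2) == (4, 2)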
Once I have a timestamp, I try to knock all the higher-timestamp processes
out of the way (by writing ∞ to their timestamp registers). If I see a smaller
timestamp than my own, I’ll drop out myself (T [i] ← ∞), and fight on behalf
of its owner instead. At the end of the j loop, either I’ve decided I am the
winner, in which case I try to finish my operation (periodically checking T [i]
to see if I’ve been booted), or I’ve decided somebody else is the winner, in
which case I watch them closely and try to shut them down if they are too
slow (T [winner] ← ∞). I detect slow processes by inactivity in A[winner];
similarly, I signal my own activity by incrementing A[i]. The value in A[i]
is also used as an increasing guess for the time between increments of A[i];
eventually this exceeds the R(B + O(1)) operations that I execute between
incrementing it.
We still need to prove that this all works. The essential idea is to show
that whatever process has the lowest timestamp finishes in a bounded number
of steps. To do so, we need to show that other processes won’t be fighting it
in the underlying algorithm. Call a process active if it is in the loop guarded
by the “if i = winner” statement. Lemma 1 from the paper states:
Lemma 27.3.1 ([FLMS05, Lemma 1]). If processes i and j are both active,
then T [i] = ∞ or T [j] = ∞.
Proof. Assume without loss of generality that i last set T [i] to myTimestamp
in the main loop after j last set T [j]. In order to reach the active loop, i
must read T [j]. Either T [j] = ∞ at this time (and we are done, since only j
can set T [j] < ∞), or T [j] is greater than i’s timestamp (or else i wouldn’t
think it’s the winner). In the second case, i sets T [j] = ∞ before entering
the active loop, and again the claim holds.
The next step is to show that if there is some process i with a minimum
timestamp that executes infinitely many operations, it increments A[i] in-
finitely often (thus eventually making the failure detector stop suspecting it).
This gives us Lemma 2 from the paper:
Lemma 27.3.2 ([FLMS05, Lemma 2]). Consider the set of all processes that
execute infinitely many operations without completing an operation. Suppose
this set is non-empty, and let i hold the minimum timestamp of all these
processes. Then i is not active infinitely often.
Proof. Suppose that from some time on, i is active forever, i.e., it never
leaves the active loop. Then T [i] < ∞ throughout this interval (or else i
leaves the loop), so for any active j ≠ i, T [j] = ∞ by the preceding lemma. It
follows that any active j leaves the active loop after B + O(1) steps of j
(and thus at most R(B + O(1)) steps of i). Can j re-enter? If j’s timestamp
is less than i’s, then j will set T [i] = ∞, contradicting our assumption. But
if j’s timestamp is greater than i’s, j will not decide it’s the winner and
will not re-enter the active loop. So now we have i alone in the active loop.
It may still be fighting with processes in the initial fast path, but since i
sets PANIC every time it goes through the loop, and no other process resets
PANIC (since no other process is active), no process enters the fast path after
some bounded number of i’s steps, and every process in the fast path leaves
after at most R(B + O(1)) of i’s steps. So eventually i is in the loop alone
forever—and obstruction-freedom means that it finishes its operation and
leaves. This contradicts our initial assumption that i is active forever.
So now we want to argue that our previous assumption that there exists
a bad process that runs forever without winning leads to a contradiction, by
showing that the particular i from Lemma 27.3.2 actually finishes (note that
Lemma 27.3.2 doesn’t quite do this—we only show that i finishes if it stays
active long enough, but maybe it doesn’t stay active).
Suppose i is as in Lemma 27.3.2. Then i leaves the active loop infinitely
often. So in particular it increments A[i] infinitely often. After some finite
number of steps, A[i] exceeds the limit R(B + O(1)) on how many steps some
other process can take between increments of A[i]. For each other process j,
either j has a lower timestamp than i, and thus finishes in a finite number of
steps (from the premise of the choice of i), or j has a higher timestamp than
i. Once we have cleared out all the lower-timestamp processes, we follow the
same logic as in the proof of Lemma 27.3.2 to show that eventually (a) i sets
T [i] < ∞ and PANIC = true, (b) each remaining j observes T [i] < ∞ and
PANIC = true and reaches the waiting loop, (c) all such j wait long enough
(since A[i] is now very big) that i can finish its operation. This contradicts
the assumption that i never finishes the operation and completes the proof.
27.3.1 Cost
If the parameters are badly tuned, the potential cost of this construction is
quite bad. For example, the slow increment process for A[i] means that the
time a process spends in the active loop even after it has defeated all other
processes can be as much as the square of the time it would normally take
27.4.1 Contention
A limitation of real shared-memory systems is that physics generally won’t
permit more than one process to do something useful to a shared object
at a time. This limitation is often ignored in computing the complexity of
a shared-memory distributed algorithm (and one can make arguments for
ignoring it in systems where communication costs dominate update costs in
the shared-memory implementation), but it is useful to recognize it if we
can’t prove lower bounds otherwise. Complexity measures that take the cost
of simultaneous access into account go by the name of contention.
The particular notion of contention used in the Ellen et al. paper is an
adaptation of the contention measure of Dwork, Herlihy, and Waarts [DHW97].
⁴The result first appeared in FOCS in 2005 [FHS05], with a small but easily fixed bug in the definition of the class of objects the proof applies to. We'll use the corrected definition from the journal version.
The idea is that if I access some shared object, I pay a price in memory
stalls for all the other processes that are trying to access it at the same time
but got in first. In the original definition, given an execution of the form
Aφ1 φ2 . . . φk φA′, where all operations φi are applied to the same object as φ,
and the last operation in A is not, then φk incurs k memory stalls. Ellen et
al. modify this to only count sequences of non-trivial operations, where an
operation is non-trivial if it changes the state of the object in some states
(e.g., writes, increments, compare-and-swap—but not reads). Note that this
change only strengthens the bound they eventually prove, which shows that
in the worst case, obstruction-free implementations of operations on objects
in a certain class incur a linear number of memory stalls (possibly spread
across multiple base objects).
1. φ is an instance of Op executed by p,
2. no operation in A or A′ is executed by p,
then there exists a sequence of operations Q by q such that for every sequence
HφH′ where
So this definition includes both the fact that p incurs k stalls and some
other technical details that make the proof go through. The fact that p
incurs k stalls follows from observing that it incurs |Sj | stalls in each segment
σj , since all processes in Sj access Oj just before p does.
Note that the empty execution is a 0-stall execution (with i = 0) by the
definition. This shows that a k-stall execution exists for some k.
Note also that the weird condition is pretty strong: it claims not only
that there are no non-trivial operations on O1 . . . Oi in τ , but also that there
are no non-trivial operations on any objects accessed in σ1 . . . σi , which may
include many more objects accessed by p.6
We’ll now show that if a k-stall execution exists, for k ≤ n − 2, then a
(k +k 0 )-stall execution exists for some k 0 > 0. Iterating this process eventually
produces an (n − 1)-stall execution.
Start with some k-stall execution Eσ1 . . . σi . Extend this execution by
a sequence of operations σ in which p runs in isolation until it finishes its
operation φ (which it may start in σ if it hasn’t done so already), then each
process in S runs in isolation until it completes its operation. Now linearize
the high-level operations completed in Eσ1 . . . σi σ and factor them as AφA′
as in the definition of class G.
Let q be some process not equal to p or contained in any Sj (this is where
we use the assumption k ≤ n − 2). Then there is some sequence of high-level
operations Q of q such that Hφ does not return the same value as Aφ for
any interleaving HH′ of Q with the sequences of operations in AA′ satisfying
the conditions in the definition. We want to use this fact to shove at least
one more memory stall into Eσ1 . . . σi σ, without breaking any of the other
conditions that would make the resulting execution a (k + k′)-stall execution.
⁶And here is where I screwed up in class on 2011-11-14, by writing the condition as the weaker requirement that nobody touches O1 . . . Oi .
27.4.4 Consequences
We’ve just shown that counters and snapshots have (n − 1)-stall executions,
because they are in the class G. A further, rather messy argument (given in
the Ellen et al. paper) extends the result to stacks and queues, obtaining a
slightly weaker bound of n total stalls and operations for some process in
the worst case.7 In both cases, we can’t expect to get a sublinear worst-case
bound on time under the reasonable assumption that both a memory stall
and an actual operation take at least one time unit. This puts an inherent
bound on how well we can handle hot spots for many practical objects, and
means that in an asynchronous system, we can’t solve contention at the
object level in the worst case (though we may be able to avoid it in our
applications).
But there might be a way out for some restricted classes of objects. We saw
in Chapter 22 that we could escape from the Jayanti-Tan-Toueg [JTT00] lower
bound by considering bounded objects. Something similar may happen here:
the Fich-Herlihy-Shavit bound on fetch-and-increments requires executions
with n(n − 1)d + n increments to show n − 1 stalls for some fetch-and-
increment if each fetch-and-increment only touches d objects, and even for
d = log n this is already superpolynomial. The max-register construction
of a counter [AACH12] doesn’t help here, since everybody hits the switch
bit at the top of the max register, giving n − 1 stalls if they all hit it at the
same time. But there might be some better construction that avoids this.
⁷This is out of date: Theorem 6.2 of [EHS12] gives a stronger result than what's in [FHS05].
Chapter 28
BG simulation
for its input value v. At some point during the execution of the protocol, the
process receives a notification safei , followed later (if the protocol finishes)
by a second notification agreei (v 0 ) for some output value v 0 . It is guaranteed
that the protocol terminates as long as all processes continue to take steps
until they receive the safe notification, and that the usual validity (all
outputs equal some input) and agreement (all outputs equal each other)
conditions hold. There is also a wait-free progress condition that the safei
notices do eventually arrive for any process that doesn’t fail, no matter what
the other processes do (so nobody gets stuck in their unsafe section).
Pseudocode for a safe agreement object is given in Algorithm 28.1. This
is a translation of the description of the algorithm in [BGLR01], which is
specified at a lower level using I/O automata.1
// proposei (v)
1 A[i] ← ⟨v, 1⟩
2 if snapshot(A) contains ⟨j, 2⟩ for some j ≠ i then
// Back off
3 A[i] ← ⟨v, 0⟩
4 else
// Advance
5 A[i] ← ⟨v, 2⟩
// safei
6 repeat
7 s ← snapshot(A)
8 until s does not contain ⟨j, 1⟩ for any j
// agreei
9 return s[j].value where j is smallest index with s[j].level = 2
Algorithm 28.1: Safe agreement (adapted from [BGLR01])
loop before this, and guarantees termination if all processes leave their unsafe
interval, because no process can then wait forever for the last 1 to disappear.
To show agreement, observe that at least one process advances to level 2
(because the only way a process doesn’t is if some other process has already
advanced to level 2), so any process i that terminates observes a snapshot s
that contains at least one level-2 tuple and no level-1 tuples. This means
that any process j whose value is not already at level 2 in s can at worst
reach level 1 after s is taken. But then j sees a level-2 tuple and backs
off. It follows that any other process i′ that takes a later snapshot s′ that
includes no level-1 tuples sees the same level-2 tuples as i, and computes the
same return value. (Validity also holds, for the usual trivial reasons.)
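To see the level mechanics in action, here is a sequential Python simulation of Algorithm 28.1 under a random schedule. List copies stand in for atomic snapshots, and each branch of the scheduler loop is treated as one atomic step, which is a simplification of the real interleaving.

import random

def safe_agreement(inputs, seed=0):
    rng = random.Random(seed)
    n = len(inputs)
    A = [(None, 0)] * n                 # A[i] = (value, level)
    phase = ["propose"] * n
    out = [None] * n
    live = list(range(n))
    while live:
        i = rng.choice(live)            # random scheduler
        if phase[i] == "propose":
            A[i] = (inputs[i], 1)       # raise to level 1
            phase[i] = "check"
        elif phase[i] == "check":
            s = list(A)                 # snapshot
            if any(lvl == 2 for j, (_, lvl) in enumerate(s) if j != i):
                A[i] = (inputs[i], 0)   # back off
            else:
                A[i] = (inputs[i], 2)   # advance
            phase[i] = "safe"
        else:                           # safe_i: wait out level-1 entries
            s = list(A)
            if all(lvl != 1 for _, lvl in s):
                j = min(j for j, (_, lvl) in enumerate(s) if lvl == 2)
                out[i] = s[j][0]        # smallest-index level-2 value
                live.remove(i)
    assert len(set(out)) == 1           # agreement
    return out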
1. Make an initial guess for sjr by taking a snapshot of A and taking the
value with the largest round number for each component A[−][k].
2. Initiate the safe agreement protocol Sjr using this guess. It continues
to run Sjr until it leaves the unsafe interval.
4. If Sjr terminates, compute a new value vjr for j to write based on the
simulated snapshot returned by Sjr , and update A[i][j] with hvjr , ri.
for consensus or k-set agreement, but fails pretty badly for renaming. The
extended BG simulation, due to Gafni [Gaf09], solves this problem by
mapping each simulating process p to a specific simulated process qp , and
using a more sophisticated simulation algorithm to guarantee that qp doesn’t
crash unless p does; details can be found in Gafni’s paper. There is also a
later paper by Imbs and Raynal [IR09] that simplifies some details of the
construction. Here, we will limit ourselves to the basic BG simulation.
Chapter 29

Topological methods
has been divided into smaller triangles necessarily contain a small triangle
with three different colors on its corners. This connection between k-set
agreement and Sperner's Lemma became the basic idea behind each of the three
independent proofs of the conjecture that appeared shortly thereafter [HS99,
BG93, SZ00], all of which adopted an approach that reduces decision problems
in distributed systems to the existence of certain structures in combinatorial
topology.
Our plan is to give a sufficiently high-level description of the topological
approach that the connection between k-set agreement and Sperner’s Lemma
becomes obvious. It is possible to avoid this by approaching the problem
purely combinatorially, as is done, for example, in Section 16.3 of [AW04].
The presentation there is obtained by starting with a topological argument
and getting rid of the topology (in fact, the proof in [AW04] contains a proof
of Sperner’s Lemma with the serial numbers filed off). The disadvantage of
this approach is that it obscures what is really going on and makes it harder to
obtain insight into how topological techniques might help for other problems.
The advantage is that (unlike these notes) the resulting text includes actual
proofs instead of handwaving.
The general position part means that the xi are not all contained in some subspace of
dimension (k −1) or smaller (so that the simplex isn’t squashed flat somehow).
What this gives us is a body with (k + 1) corners and (k + 1) faces, each of
which is a (k − 1)-dimensional simplex (the base case is that a 0-dimensional
simplex is a point). Each face includes all but one of the corners, and each
corner is on all but one of the faces. So we have:
• 0-dimensional simplex: point.2
[Figure: the output complex for binary consensus, with an edge joining p0 and q0 and a disjoint edge joining q1 and p1.]
CHAPTER 29. TOPOLOGICAL METHODS 306
One thing to notice about this output complex is that it is not connected:
there is no path from the p0–q0 component to the q1–p1 component.
Here is a simplicial complex describing the possible states of two processes
p and q, after each writes 1 to its own bit then reads the other process’s bit.
Each node in the picture is labeled by a sequence of process IDs. The first
ID in the sequence is the process whose view this node represents; any other
process IDs are processes this first process sees (by seeing a 1 in the other
process’s register). So p is the view of process p running by itself, while pq
is the view of process p running in an execution where it reads q’s register
after q writes it.
p − qp − pq − q
The edges express the constraint that if we both write before we read,
then if I don’t see your value you must see mine (which is why there is no
p–q edge), but all other combinations are possible. Note that this complex
is connected: there is a path between any two points.
Here’s a fancier version in which each process writes its input (and
remembers it), then reads the other process’s register (i.e., a one-round full-
information protocol). We now have final states that include the process’s
own ID and input first, then the other process’s ID and input if it is visible.
For example, p1 means p starts with 1 but sees a null and q0p1 means q starts
with 0 but sees p’s 1. The general rule is that two states are compatible if p
either sees nothing or q’s actual input and similarly for q, and that at least
one of p or q must see the other’s input. This gives the following simplicial
complex:
[Figure: the one-round complex for two processes with binary inputs; four paths of the form pi − qjpi − piqj − qj, pasted together at the corners p0, q0, p1, q1.]
The fact that this looks like four copies of the p–qp–pq–q complex pasted
into each edge of the input complex is not an accident: if we fix a pair of
inputs i and j, we get pi–qjpi–piqj–qj, and the corners are pasted together
because if p sees only p0 (say), it can’t tell if it’s in the p0/q0 execution or
the p0/q1 execution.
The same process occurs if we run a two-round protocol of this form,
where the input in the second round is the output from the first round. Each
round subdivides one edge from the previous round into three edges:
p−q
p − qp − pq − q
Here (pq)(qp) is the view of p after seeing pq in the first round and seeing
that q saw qp in the first round.
29.3.2 Subdivisions
In the simple write-then-read protocol above, we saw a single input edge turn
into 3 edges. Topologically, this is an example of a subdivision, where we
represent a simplex using several new simplexes pasted together that cover
exactly the same points.
Certain classes of protocols naturally yield subdivisions of the input
complex. The iterated immediate snapshot (IIS) model, defined by
Borowsky and Gafni [BG97], considers executions made up of a sequence
of rounds (the iterated part) where each round is made up of one or more
mini-rounds in which some subset of the processes all write out their current
views to their own registers and then take snapshots of all the registers (the
immediate snapshot part). The two-process protocols of the previous section
are special cases of this model.
Within each round, each process p obtains a view vp that contains the
previous-round views of some subset of the processes. We can represent the
views as a subset of the processes, which we will abbreviate in pictures by
putting the view owner first: pqr will be the view {p, q, r} as seen by p, while
qpr will be the same view as seen by q. The requirements on these views
are that (a) every process sees its own previous view: p ∈ vp for all p; (b)
all views are comparable: vp ⊆ vq or vq ⊆ vp ; and (c) if I see you, then I see
everything you see: q ∈ vp implies vq ⊆ vp . This last requirement is called
immediacy and follows from the assumption that writes and snapshots are
done in the same mini-round: if I see your write, then I see all the values
you do, because your snapshot is either in an earlier mini-round than mine
or in the same mini-round. Note this depends on the peculiar structure of
the mini-rounds, where all the writes precede all the snapshots.
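These three conditions are easy to check mechanically. The Python sketch below generates one round of views from a random ordered partition of the processes into mini-rounds, with all writes in a mini-round preceding all snapshots, and verifies self-inclusion, comparability, and immediacy.

import random

def one_round_views(minirounds):
    # minirounds: ordered list of disjoint sets of process ids
    seen, views = set(), {}
    for group in minirounds:
        seen |= group                    # all writes in this mini-round
        for p in group:
            views[p] = frozenset(seen)   # then all snapshots
    return views

def check_iis(views):
    for p, vp in views.items():
        assert p in vp                          # (a) self-inclusion
        for q, vq in views.items():
            assert vp <= vq or vq <= vp         # (b) comparability
            if q in vp:
                assert vq <= vp                 # (c) immediacy

rng = random.Random(0)
procs = list(range(6))
rng.shuffle(procs)
cuts = sorted(rng.sample(range(1, 6), 2))       # random ordered partition
minirounds = [set(procs[:cuts[0]]),
              set(procs[cuts[0]:cuts[1]]),
              set(procs[cuts[1]:])]
check_iis(one_round_views(minirounds))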
The IIS model does not correspond exactly to a standard shared-memory
model (or even a standard shared-memory model augmented with cheap
snapshots). There are two reasons for this: standard snapshots don’t provide
immediacy, and standard snapshots allow processes to go back and perform
more than one snapshot on the same object. The first issue goes away if
we are looking at impossibility proofs, because the adversary can restrict
itself only to those executions that satisfy immediacy; alternatively, we can
get immediacy from the participating set protocol of [BG97], which we
will describe in §29.6.1. The second issue is more delicate, but Borowsky
and Gafni demonstrate that any decision protocol that runs in the standard
model can be simulated in the IIS model, using a variant of the BG simulation
algorithm described in Chapter 28.
For three processes, one round of immediate snapshots gives rise to the
simplicial complex depicted in Figure 29.1. The corners of the big triangle
are the solo views of processes that do their snapshots before anybody else
shows up. Along the edges of the big triangle are views corresponding to
2-process executions, while in the middle are complete views of processes that
run late enough to see everything. Each little triangle corresponds to some
execution. For example, the triangle with corners p, qp, rpq corresponds to
a sequential execution where p sees nobody, q sees p, and r sees both p and
q. The triangle with corners pqr, qpr, and rpq is the maximally-concurrent
execution where all three processes write before all doing their snapshots:
here everybody sees everybody. It is not terribly hard to enumerate all
possible executions and verify that the picture includes all of them. In higher
dimension, the picture is more complicated, but we still get a subdivision
that preserves the original topological structure [BG97].
Figure 29.2 shows (part of) the next step of this process: here we have
done two iterations of immediate snapshot, and filled in the second-round
subdivisions for the p–qpr–rpq and pqr–qpr–rpq triangles. (Please imagine
similar subdivisions of all the other triangles that I was too lazy to fill in
by hand.) The structure is recursive, with each first-level triangle mapping
to an image of the entire first-level complex. As in the two-process case,
adjacent triangles overlap because the relevant processes don’t have enough
[Figure 29.1: the subdivision given by one round of immediate snapshot for three processes, with corners p, q, r; edge vertices such as pq, qp, qr, rq, pr, rp; and central vertices pqr, qpr, rpq.]
[Figure 29.2: two rounds of immediate snapshot, with second-round subdivisions filled in for the p–qpr–rpq and pqr–qpr–rpq triangles.]
information; for example, the points on the qpr–rpq edge correspond to views
of q or r that don’t include p in round 2 and so can’t tell whether p saw p or
pqr in round 1.
The important feature of the round-2 complex (and the round-k complex
in general) is that it’s a triangulation of the original outer triangle: a
partition into little triangles where each corner aligns with corners of other
little triangles.
(Better pictures of this process in action can be found in Figures 25 and
26 of [HS99].)
[Figure 29.3: an example Sperner coloring of the one-round subdivision for a three-process protocol, with colors drawn from the inputs 1, 2, 3.]
The validity condition means that a process can only choose an output that it can see among the inputs in
its view. This means that at the corners of the outer triangle (corresponding
to views where the process thinks it’s alone), a process must return its input,
while along the outer edges (corresponding to views where two processes may
see each other but not the third), a process must return one of the two inputs
that appear in the corners incident to the edge. Internal corners correspond
to views that include—directly or indirectly—the inputs of all processes, so
these can be labeled arbitrarily. An example is given in Figure 29.3, for a
one-round protocol with three processes.
We now run into Sperner’s Lemma [Spe28], which says that, for any
subdivision of a simplex into smaller simplexes, if each corner of the original
simplex has a different color, and each corner that appears on some face of
the original simplex has a color equal to the color of one of the corners of
that face, then within the subdivision there are an odd number of simplexes
whose corners are all colored differently.3
³The proof of Sperner's Lemma is not hard, and is done by induction on the dimension k. For k = 0, any subdivision consists of exactly one zero-dimensional simplex whose single
consistent.
In simplicial complex terms, this means that the mapping from states
to outputs is a simplicial map, a function f from points in one simplicial
complex C to points in another simplicial complex D such that for any
simplex A ∈ C, f (A) = {f (x)|x ∈ A} gives a simplex in D. (Recall that
consistency is represented by including a simplex, in both the state complex
and the output complex.) A mapping from states to outputs that satisfies
the consistency requirements encoded in the output complex s always a
simplicial map, with the additional requirement that it preserves process IDs
(we don’t want process p to decide the output for process q). Conversely,
any id-preserving simplicial map gives an output function that satisfies the
consistency requirements.
Simplicial maps are examples of continuous functions, which have all
sorts of nice topological properties. One nice property is that a continuous
function can’t separate a path-connected space ( one in which there is a
path between any two points) into path-disconnected components. We can
prove this directly for simplicial maps: if there is a path of 1-simplexes
{x1 , x2 }, {x2 , x3 }, . . . {xk−1 , xk } from x1 to xk in C, and f : C → D is a
simplicial map, then there is a path of 1-simplexes {f (x1 ), f (x2 )}, . . . from
f (x1 ) to f (xk ). Since being path-connected just means that there is a path
between any two points, if C is connected we’ve just shown that f (C) is as
well.
Getting back to our consensus example, it doesn’t matter what simplicial
map f you pick to map process states to outputs; since the state complex C
is connected, so is f (C), so it lies entirely within one of the two connected
components of the output complex. This means in particular that everybody
always outputs 0 or 1: the protocol is trivial.
Protocol implies map Even though we don’t get a subdivision with the
full protocol, there is a restricted set of executions that does give a
subdivision. So if the protocol works on this restricted set of execu-
tions, an appropriate map exists. There are two ways to prove this:
Herlihy and Shavit do so directly, by showing that this restricted set
of executions exists, and Borowsky and Gafni [BG97] do so indirectly,
by showing that the IIS model (which produces exactly the standard
chromatic subdivision used in the ACT proof) can simulate an ordinary
snapshot model. Both methods are a bit involved, so we will skip over
this part.
Map implies protocol This requires an algorithm. The idea here is that
the participating set algorithm, originally developed to solve k-set
agreement [BG93], produces precisely the standard chromatic subdivi-
sion used in the ACT proof. In particular, it can be used to solve the
problem of simplex agreement, the problem of getting the processes
to agree on a particular simplex contained within the subdivision of
their original common input simplex. This is a little easier to explain,
so we’ll do it.
The following theorem shows that the return values from participating
set have all the properties we want for iterated immediate snapshot:
Theorem 29.6.2. Let Si be the output of the participating set algorithm for
process i. Then all of the following conditions hold:
1. Self-inclusion: i ∈ Si .
2. Atomic snapshot: for any i and j, either Si ⊆ Sj or Sj ⊆ Si .
3. Immediacy: if i ∈ Sj , then Si ⊆ Sj .
Proof. Self-inclusion is trivial, but we will have to do some work for the other
two properties.
We will show that Algorithm 29.1 neatly sorts the processes out into
levels, where each process that returns at level ℓ returns precisely the set of
processes at level ℓ and below.
For each process i, let Si be the set of process IDs that i returns, let ℓi
be the final value of level[i] when i returns, and let S′i = {j | ℓj ≤ ℓi }. Our
goal is to show that S′i = Si , justifying the above claim.
Because no process ever increases its level, if process i observes level[j] ≤ ℓi
in its last snapshot, then ℓj ≤ level[j] ≤ ℓi . So S′i is a superset of Si . We
thus need to show only that no extra processes sneak in; in particular, we
will show that |Si | = |S′i |, by showing that both equal ℓi .
The first step is to show that |S′i | ≥ |Si | ≥ ℓi . The first inequality follows
from the fact that S′i ⊇ Si ; the second follows from the code (if not, i would
have stayed in the loop).
The second step is to show that |S′i | ≤ ℓi . Suppose not; that is, suppose
that |S′i | > ℓi . Then there are at least ℓi + 1 processes with level ℓi or less, all
of which take a snapshot on level ℓi + 1. Let i′ be the last of these processes
to take a snapshot while on level ℓi + 1. Then i′ sees at least ℓi + 1 processes
at level ℓi + 1 or less and exits, contradicting the assumption that it reaches
level ℓi . So |S′i | ≤ ℓi .
The atomic snapshot property follows immediately from the fact that if
ℓi ≤ ℓj , then ℓk ≤ ℓi implies ℓk ≤ ℓj , giving Si = S′i ⊆ S′j = Sj . Similarly, for
immediacy we have that if i ∈ Sj , then ℓi ≤ ℓj , giving Si ⊆ Sj by the same
argument.
The missing piece for turning this into IIS is that in Algorithm 29.1, I
only learn the identities of the processes I am supposed to include but not
their input values. This is easily dealt with by the usual trick of adding an
extra register for each process, to which it writes its input before executing
participating set.
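To make this concrete, here is a minimal sequential sketch of the participating
set algorithm in Python. The explicit schedule, the shared level array, and
treating a list copy as an atomic snapshot are all assumptions of this sketch,
standing in for a real concurrent implementation.

# Hypothetical sketch: each scheduled step has process i descend one
# level, snapshot all levels, and return the set of processes at its
# level or below once that set is at least as big as its level.
def participating_set(n, schedule):
    level = [n + 1] * n              # level[i] is visible to everyone
    output = {}
    for i in schedule:               # scheduler picks who steps next
        if i in output:
            continue
        level[i] -= 1                # descend one level
        snap = list(level)           # atomic snapshot of all levels
        seen = {j for j in range(n) if snap[j] <= level[i]}
        if len(seen) >= level[i]:
            output[i] = seen         # S_i: my level and below
    return output

# Example: a round-robin schedule on three processes.
print(participating_set(3, [0, 1, 2] * 4))

Running this produces outputs like {2: {0, 1, 2}, 1: {0, 1}, 0: {0}}, which
are totally ordered by inclusion and satisfy self-inclusion and immediacy, as
Theorem 29.6.2 requires.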
because any such simplicial map can be turned into a continuous function
(on the geometric version of I, which includes the intermediate points in
addition to the corners). Fortunately, topologists have many tools for proving
non-existence of continuous functions.
29.7.1 k-connectivity
Define the m-dimensional disk to be the set of all points at most 1 unit away
from the origin in R^m, and the m-dimensional sphere to be the surface of
the (m + 1)-dimensional disk (i.e., all points exactly 1 unit away from the
origin in R^{m+1}). Note that what we usually think of as a sphere (a solid
body), topologists call a disk, leaving the term sphere for just the outside
part.
An object is k-connected if any continuous image of an m-dimensional
sphere can be extended to a continuous image of an (m + 1)-dimensional
disk, for all m ≤ k.⁴ This is a roundabout way of saying that if we can draw
something that looks like a deformed sphere inside our object, we can always
include the inside as well: there are no holes that get in the way. The punch
line is that continuous functions preserve k-connectivity: if we want to map
an object with no holes continuously into some other object, the image had
better not have any holes either.
Ordinary path-connectivity is the special case when k = 0; here, the
0-sphere consists of two points and the 1-disk is the path between them. So
0-connectivity says that for any two points, there is a path between them.
For 1-connectivity, if we draw a loop (a path that returns to its origin),
we can include the interior of the loop somewhere. One way to think
about this is to say that we can shrink the loop to a point without leaving
the object (the technical term for this is that the path is null-homotopic,
where a homotopy is a way to transform one thing continuously into another
thing over time and the null path sits on a single point). An object that is
1-connected is also called simply connected.
For 2-connectivity, the same idea moves up a dimension: any continuous
image of a sphere (or box, or the surface of a 2-simplex, or anything else
that looks like a sphere) can be extended to include its interior, so we can
contract it to a point.
The important thing about k-connectivity is that it is possible to prove
that any subdivision of a k-connected simplicial complex is also k-connected
(sort of obvious if you think about the pictures, but it can also be proved
formally), and that k-connectivity is preserved by simplicial maps (if not,
the map would have to tear open a hole somewhere, which continuous
functions can't do).
⁴This definition is for the topological version of k-connectivity. It is not related in any
way to the definition of k-connectivity in graph theory, where a graph is k-connected if
there are k disjoint paths between any two points.
Chapter 30

Approximate agreement
Validity Every process returns an output within the range of inputs.
Formally, for all i, it holds that (min_j x_j) ≤ y_i ≤ (max_j x_j).
1  A[i] ← ⟨x_i, 1, x_i⟩
2  repeat
3      ⟨x′_1, r_1, v_1⟩ . . . ⟨x′_n, r_n, v_n⟩ ← snapshot(A)
4      r_max ← max_j r_j
5      v ← midpoint{v_j | r_j = r_max}
6      A[i] ← ⟨x_i, r_max + 1, v⟩
7  until r_max ≥ 2 and r_max ≥ log₂(spread({x′_j})/ε)
8  return v
Algorithm 30.1: Approximate agreement
To analyze this, let V_r be the set of all values v that are ever written to
the snapshot object with round number r. Let U_r ⊆ V_r be the set of values
that are ever written to the snapshot object with round number r before
some process writes a value with round number r + 1 or greater; the intuition
here is that U_r includes only those values that might contribute to the
computation of some round-(r + 1) value. The key claim is that
spread(V_{r+1}) ≤ spread(U_r)/2.
Proof. Let U_r^i be the set of round-r values observed by a process i in the
iteration in which it sees r_max = r, if such an iteration exists. Note that
U_r^i ⊆ U_r, because if some value with round r + 1 or greater is written before
i's snapshot, then i will compute a larger value for r_max.

Given two processes i and j, we can argue from the properties of snapshot
that either U_r^i ⊆ U_r^j or U_r^j ⊆ U_r^i. The reason is that if i's snapshot comes
first, then j sees at least as many round-r values as i does, because the only
way for a round-r value to disappear is if it is replaced by a value in a later
round. But in this case, process j will compute a larger value for r_max and
will not get a view for round r. The same holds in reverse if j's snapshot
comes first.

Observe that if U_r^i ⊆ U_r^j, then

    |midpoint(U_r^i) − midpoint(U_r^j)| ≤ spread(U_r^j)/2.

This holds because midpoint(U_r^i) lies within the interval [min U_r^j, max U_r^j],
and every point in this interval is within spread(U_r^j)/2 of midpoint(U_r^j). The
same holds if U_r^j ⊆ U_r^i. So any two values written in round r + 1 are within
spread(U_r)/2 of each other.

In particular, the minimum and maximum values in V_{r+1} are within
spread(U_r)/2 of each other, so spread(V_{r+1}) ≤ spread(U_r)/2.
It remains to compute how long this takes. Let i be some process that
finishes in the fewest number of rounds. Process i can't finish until it reaches
round r_max + 1, where r_max ≥ log₂(spread({x′_j})/ε) for a vector of input
values x′ that it reads after some process writes round 2 or greater. We have
spread({x′_j}) ≥ spread(U_1), because every value in U_1 is included in x′. So
r_max ≥ log₂(spread(U_1)/ε), and spread(V_{r_max+1}) ≤ spread(U_1)/2^{r_max} ≤
spread(U_1)/(spread(U_1)/ε) = ε. Since any value returned is either included
in V_{r_max+1} or in some later V_{r′} ⊆ V_{r_max+1}, this gives us that the spread
of all the outputs is at most ε: Algorithm 30.1 solves approximate agreement.
The cost of Algorithm 30.1 depends on the cost of the snapshot operations,
on ε, and on the initial input spread D. For linear-cost snapshots, this works
out to O(n log(D/ε)).
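As a sanity check, the following Python sketch simulates Algorithm 30.1
against a randomized scheduler. The sequential interleaving and the list-copy
snapshot are assumptions of the sketch, and the guard for zero spread is
added to keep log₂ well-defined.

import math
import random

def spread(vals):
    return max(vals) - min(vals)

def midpoint(vals):
    return (max(vals) + min(vals)) / 2

def approximate_agreement(inputs, eps, seed=0):
    rng = random.Random(seed)
    n = len(inputs)
    A = [(x, 1, x) for x in inputs]        # A[i] = <input, round, value>
    out = [None] * n
    while any(o is None for o in out):
        i = rng.randrange(n)               # scheduler picks a process
        if out[i] is not None:
            continue
        snap = list(A)                     # snapshot(A)
        rmax = max(r for (_, r, _) in snap)
        v = midpoint([v for (_, r, v) in snap if r == rmax])
        A[i] = (inputs[i], rmax + 1, v)
        xs = [x for (x, _, _) in snap]
        if rmax >= 2 and (spread(xs) == 0
                          or rmax >= math.log2(spread(xs) / eps)):
            out[i] = v                     # the until-condition holds
    return out

outs = approximate_agreement([0.0, 3.0, 10.0], eps=0.5)
print(outs, spread(outs))                  # spread should be at most 0.5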
again and repeat until both p and q have pending writes that will change the
other process's preference. Let p_1 and q_1 be the new preferences that result
from these operations. The adversary can now choose between running P only
and getting to a configuration with preferences p_0 and q_1, Q only and getting
p_1 and q_0, or both and getting p_1 and q_1; each of these choices incurs at least
one step. By the triangle inequality, |p_0 − q_0| ≤ |p_0 − q_1| + |q_1 − p_1| + |p_1 − q_0|,
so at least one of these configurations has a spread between preferences that is
at least 1/3 of the initial spread. It follows that after k steps the best spread
we can get is D/3^k, requiring k ≥ log₃(D/ε) steps to get ε-agreement.
Herlihy uses this result to show that there are decision problems that have
wait-free but not bounded wait-free deterministic solutions using registers.
Curiously, the lower bound says nothing about the dependence on the number
of processes; it is conceivable that there is an approximate agreement protocol
with running time that depends only on D/ε and not on n.
Part III
Chapter 31
Overview
In this part, we consider models that don’t fit well into the standard
message-passing or shared-memory models. These include models where
processes can directly observe the states of nearby processes (Chapter 32);
where computation is inherently local and the emphasis is on computing
information about the communication graph (Chapter 33); where processes
wander about and exchange information only with processes they physically
encounter (Chapter 34); where processes (in the form of robots) communicate
only by observing each other's locations and movements (Chapter 35); and
where processes can transmit only beeps, and are able to observe only whether
at least one nearby process beeped (Chapter 36).
Despite the varying communication mechanisms, these models all share
the usual features of distributed systems, where processes must contend with
nondeterminism and incomplete local information.
Chapter 32
Self-stabilization
examples, the simplest of which we will discuss in §32.2 below. These became
the foundation for the huge field of self-stabilization, which spans thousands
of papers, at least one textbook [Dol00], a specialized conference (SSS, the
International Symposium on Stabilization, Safety, and Security in Distributed
Systems), and its own domain name http://www.selfstabilization.org/.
We won’t attempt to summarize all of this, but will highlight a few results
to give a sampling of what self-stabilizing algorithms look like.
32.1 Model
Much of the work in this area, dating back to Dijkstra’s original paper, does
not fit well in either the message-passing or shared-memory models that we
have been considering in this class, both of which were standardized much
later. Instead, Dijkstra assumed that processes could, in effect, directly
observe the states of their neighbors. A self-stabilizing program would
consist of a collection of what he later called guarded commands [Dij75],
statements of the form “if [some condition is true] then [update my state in
this way].” In any configuration of the system, one or more of these guarded
commands might have the if part (the guard) be true; these commands are
said to be enabled.
A step consists of one or more of these enabled commands being executed
simultaneously, as chosen by an adversary scheduler, called the distributed
daemon. The usual fairness condition applies: any process that has an
enabled command eventually gets to execute it. If no commands are enabled,
nothing happens. With the central daemon variant of the model, only one
step can happen at a time. With the synchronous daemon, every enabled
command is executed at each step. Note that both the central and synchronous
daemons are special cases of the distributed daemon.
More recent work has tended to assume a distinction between the part
of a process’s state that is visible to its neighbors and the part that isn’t.
This usually takes the form of explicit communication registers or link
registers, which allow a process to write a specific message for a specific
neighbor. This is still not quite the same as standard message-passing or
shared-memory, because a process is often allowed to read and write multiple
link registers atomically.
In this algorithm, the nonzero processes just copy the state of the process
to their left. The zero process increments its state if it sees the same state
to its left. Note that the nonzero processes have guards on their commands
that might appear useless at first glance, but these are there to ensure that
the adversary can’t waste steps by getting nonzero processes to carry out
operations that have no effect.
What does this have to do with tokens? The algorithm includes an additional
interpretation of the state, which says that:

1. If ℓ_0 = ℓ_{n−1}, then 0 has a token, and

2. If ℓ_i ≠ ℓ_{i−1}, for i ≠ 0, then i has a token.
Like the update rule, the token rule can be evaluated by a node that can
only see its predecessor. This allows it to detect when it acquires the
token and do whatever leaderly things it needs to do before applying an update
to pass the token on to the next process.
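Here is a small Python simulation of the protocol under a central daemon,
with a random scheduler standing in for the adversary. The choice K = n + 1
for the number of states is an assumption of this sketch, chosen to match
the mod (n + 1) arithmetic used in the analysis below.

import random

def tokens(l):
    # token rule: 0 holds a token if it matches its left neighbor;
    # anyone else holds a token if it differs from its left neighbor
    n = len(l)
    toks = [0] if l[0] == l[n - 1] else []
    return toks + [i for i in range(1, n) if l[i] != l[i - 1]]

def step(l, i, K):
    if i == 0:
        l[0] = (l[-1] + 1) % K    # 0 increments on seeing a match
    else:
        l[i] = l[i - 1]           # nonzero processes copy to the left

rng = random.Random(1)
n, K = 6, 7
l = [rng.randrange(K) for _ in range(n)]   # arbitrary initial configuration
while len(tokens(l)) > 1:                  # there is always at least one token
    step(l, rng.choice(tokens(l)), K)      # central daemon: one step at a time
print(l, tokens(l))                        # stabilized: exactly one token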
Using the token rule instantly guarantees that there is at least one token:
if none of the nonzero processes have a token, then all the ℓ_i variables are
equal.¹ But then 0 has a token. It remains, though, to show that we eventually
converge to a configuration where at most one process has a token.

¹In Dijkstra's paper, there are n + 1 processes numbered 0 . . . n, but this doesn't really
make any difference.
Define a configuration ℓ as legal if there is some value j such that ℓ_i = ℓ_j
for all i ≤ j and ℓ_i = ℓ_j − 1 (mod n + 1) for all i > j. When j = n − 1,
this makes all ℓ_i equal, and 0 has the only token. When j < n − 1, then
ℓ_0 ≠ ℓ_{n−1} (so 0 does not have a token), ℓ_j ≠ ℓ_{j+1} (so j + 1 has a token), and
ℓ_i = ℓ_{i+1} for all i ∉ {j, n − 1} (so nobody else has a token). That each legal
configuration has exactly one token partially justifies our definition of legal
configurations.

If a configuration ℓ is legal, then when j = n − 1, the only enabled step
sets ℓ_0 ← (ℓ_{n−1} + 1) mod (n + 1); when j < n − 1, the only enabled step
sets ℓ_{j+1} ← ℓ_j. In either case, we get a new legal configuration ℓ′. So the property
of being a legal configuration is stable, which is the other half of justifying
our definition.
Now we want to show that we eventually converge to a legal configuration.
Fix some initial configuration ℓ^0, and let c be some value such that ℓ^0_i ≠ c
for all i. (There is at least one such c by the Pigeonhole Principle.) We will
argue that there is a sequence of configurations with c as a prefix of the
values that forms a bottleneck forcing us into a legal configuration.

Lemma 32.2.1. For each t, either ℓ^t is legal, or there is some 0 ≤ j < n
such that ℓ^t_i = c if and only if i < j.

Proof. By induction on t. For the base case, ℓ^0 satisfies ℓ^0_i = c if and only if
i < j when j = 0.

If ℓ^t is legal, ℓ^{t+1} is also legal. So the interesting case is when ℓ^t is not
legal. In this case, there is some 0 ≤ j < n such that ℓ^t_i = c if and only if
i < j.

If j = 0, then ℓ^t_i ≠ c for all i. Then the only way to get ℓ^{t+1}_i = c is if
i = 0. But then ℓ^{t+1} satisfies the condition with j = 1.

If 0 < j < n, then ℓ^t_i = c for at least one i < j, and ℓ^t_{n−1} ≠ c since
n − 1 ≥ j. So we may get a transition that sets ℓ^{t+1}_j ← ℓ^t_{j−1} = c, giving a
new configuration ℓ^{t+1} that satisfies the condition with j + 1 in place of j,
or we may get a transition that does not create or remove any copies of c.
In either case the induction goes through.
Define g(ℓ) to be the gap between ℓ_0 and c: the number of times process 0
must increment its value before ℓ_0 = c. For each i ∈ {0, . . . , n − 2}, define
u_i(ℓ) = [ℓ_i ≠ ℓ_{i+1}] to be the indicator variable for whether i is unhappy with its
successor, because its successor has not yet agreed to adopt its value.² The
idea is that unhappiness moves right when some i ≠ 0 copies its predecessor
and that the gap drops when 0 increments its value. By weighting these
values appropriately, we can arrange for a function that always drops.
Let

    Φ(ℓ) = n·g(ℓ) + Σ_{i=0}^{n−2} (n − 1 − i)·u_i(ℓ).    (32.2.1)
Most of the work here is being done by the two terms. The g term tracks the
gap between ℓ_0 and c, weighted by n. The sum tracks unhappiness, weighted
by distance to position n − 1.
In the initial configuration ℓ^0, g is at most n, and each u_i is at most 1,
so Φ(ℓ^0) = O(n²). We also have that Φ ≥ 0 always; and if Φ = 0, then g = 0
and u_i = 0 for all i, which implies we are in an all-c configuration, which is
legal. So we'd like to argue that every step of the algorithm in a non-legal
configuration reachable from ℓ^0 reduces Φ by at least 1, forcing us into a
legal configuration after O(n²) steps.
Consider any step of the algorithm starting from a non-legal configuration
ℓ^t with Φ(ℓ^t) > 0 that satisfies the condition in Lemma 32.2.1. Since the
condition of Lemma 32.2.1 holds for any reachable ℓ^t, as long as we are in a
non-legal configuration, Φ drops by at least 1 per step. If we do not reach a
legal configuration otherwise, Φ can only drop O(n²) times before hitting 0,
giving us a legal configuration. Either way, the configuration stabilizes in
O(n²) steps.
²The notation [P], where P is some logical predicate, is called an Iverson bracket
and means the function that is 1 when P is true and 0 when P is false.
32.3 Synchronizers
Self-stabilization has a curious relationship with failures: the arbitrary initial
state corresponds to an arbitrarily bad initial disruption of the system, but
once we get past this there are no further failures. So it is not surprising that
many of the things we can do in a failure-free distributed system we can also
do in a self-stabilizing system. One of these is to implement a synchronizer,
which will allow us to pretend that our system is synchronous even if it isn’t.
The synchronizer we will describe here, due to Awerbuch et al. [AKM+ 93,
AKM+ 07], is a variant of the alpha synchronizer. It assumes that each
process can observe the states of its neighbors and that we have a central
daemon (meaning that one process takes a step at a time).
To implement this synchronizer in a self-stabilizing system, each process
v has a variable P (v), its current pulse. We also give each process a rule for
adjusting P (v) when it takes a step. Our goal is to arrange for every v to
increment its pulse infinitely often while staying at most one ahead of its
neighbors N (v). Awerbuch et al. give several possible rules for achieving
this goal, and consider the effectiveness of each.
The simplest rule is taken directly from the alpha synchronizer. When
activated, v sets
    P(v) ← min_{u∈N(v)} (P(u) + 1)
This rule works fine as long as every process starts synchronized. But
it’s not self-stabilizing. A counterexample, given in the paper, assumes we
have 10 processes organized in a ring. By carefully choosing which processes
are activated at each step, we can go through the following sequence of
configurations, where in each configuration the updated node is shown in
boldface:
1234312343
1234212343
1234232343
1234234343
1234234543
1234234542
3234234542
3434234542
3454234542
The adversary can continue this process indefinitely. But at the same time, each configuration has at
least one pair of adjacent nodes whose values differ by more than one.
The problem that arises in the counterexample is that sometimes values
can go backwards. A second rule proposed by Awerbuch et al. prevents this:

    P(v) ← max(P(v), min_{u∈N(v)} (P(u) + 1))
This fixes the previous counterexample, but convergence can be slow. Consider
three nodes in a line with pulses

1 1 1050

If we run the nodes in round-robin order, the left two nodes will eventually
catch up to the rightmost, but it will take a while.
After some further tinkering, the authors present their optimal rule, which
they call max minus one:
    P(v) ← min_{u∈N(v)} (P(u) + 1)   if P(v) looks legal,
    P(v) ← max_{u∈N(v)} (P(u) − 1)   otherwise.
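The following Python sketch compares the three rules on the ring. Reading
"P(v) looks legal" as "P(v) is within one of every neighbor's pulse" is an
assumption of this sketch, as is the round-robin central daemon.

def update(P, nbrs, v, rule):
    lo = min(P[u] + 1 for u in nbrs[v])
    hi = max(P[u] - 1 for u in nbrs[v])
    if rule == "alpha":                  # not self-stabilizing
        P[v] = lo
    elif rule == "max":                  # monotone but can be very slow
        P[v] = max(P[v], lo)
    else:                                # "max minus one"
        legal = all(abs(P[v] - P[u]) <= 1 for u in nbrs[v])
        P[v] = lo if legal else hi

n = 10
nbrs = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}   # ring of 10
P = [1, 2, 3, 4, 3, 1, 2, 3, 4, 3]       # bad configuration from above
for t in range(30 * n):                  # round-robin scheduling
    update(P, nbrs, t % n, "max minus one")
print(P, max(abs(P[v] - P[u]) for v in nbrs for u in nbrs[v]))
# after stabilization, adjacent pulses should differ by at most 1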
The proof that this rule works uses a potential function

    φ(v) = max_u (P(u) − P(v) − d(u, v)),

where d(u, v) is the distance between u and v in the graph. This is zero if the
skew between any pair of nodes is at most the distance, which is the
most we can expect from a synchronizer. The proof shows that applying
the max-minus-one rule never increases φ(v), and decreases it by at least 1
whenever a node v with positive φ(v) changes P(v). Because this only gives
a bound of Σ_v φ(v), which can be arbitrarily big, the rest of the proof uses a
second potential function Φ(v), which measures the distance from v to the
nearest node u that supplies the maximum in φ(v). It is shown that Φ(v)
drops by 1 per time unit.³ When it reaches 0, then φ(v) = P(v) − P(v) −
d(v, v) = 0. Since Φ(v) can never start at more than the diameter D, this
implies convergence in D time units.

³Defining a time unit as a minimum interval in which every process takes at least one
step.
The intuition for why this works is that if the closest node u to v with
P (u) too high is at distance d, then max-minus-one will pull P (w) up for
some node w at distance d − 1 the next time w takes a step. The full set of
cases is more complicated, and we’ll skip over the details of the argument
here. If you are interested, the presentation in the paper is not too hard to
follow.
The important part is that once we have a synchronizer, we can effectively
assume synchrony in other self-stabilizing algorithms. We just run the
synchronizer underneath our main protocol, and when the synchronizer
stabilizes, that gives us the initial starting point for the main protocol.
Because the main protocol itself should stabilize starting from an arbitrary
configuration, any insanity produced while waiting for the synchronizer to
converge is eventually overcome.
This fake root will rapidly propagate through the other nodes, with distances
increasing without bound over time. For most graphs, the algorithm will
never converge to a single spanning tree.
Awerbuch et al. [AKM+ 93] solve this problem by assuming a known
upper bound on the maximum diameter of the graph. Because the distance
to a ghost root will steadily increase over time, eventually only real roots
will remain, and the system will converge to a correct BFS tree.
It’s easiest to show this if we assume synchrony, or at least some sort
of asynchronous round structure. Define a round as the minimum time for
every node to update at least once. Then the minimum distance for any
ghost root rises by at least one per round, since any node with the minimum
distance either has no neighbor with the ghost root (in which case it picks
a different root), or any neighbor that has the ghost root has at least the
same distance (in which case it increases its distance). Once the minimum
distance exceeds the upper bound D0 , all the ghost roots will have been
eliminated, and only real distances will remain. This gives a stabilization
time (in rounds) linear in the upper bound on the diameter.
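A minimal synchronous sketch of the ghost-root argument: nodes adopt the
best (root, distance) claim among their neighbors, preferring smaller root ids,
and discard any claim whose distance exceeds the bound D0. The min-id-root
convention and the tuple ordering are assumptions of this sketch.

def bfs_round(nbrs, ids, state, D0):
    # one synchronous round: every node recomputes its (root, dist) pair
    new = {}
    for v in nbrs:
        cands = [(ids[v], 0)]            # a node can claim itself as root
        for u in nbrs[v]:
            r, d = state[u]
            if d + 1 <= D0:              # ghost distances die beyond D0
                cands.append((r, d + 1))
        new[v] = min(cands)              # smallest root id, then distance
    return new

# path 0 - 1 - 2; nodes 1 and 2 start out pointing at a ghost root -5
nbrs = {0: [1], 1: [0, 2], 2: [1]}
ids = {0: 10, 1: 11, 2: 12}
state = {0: (10, 0), 1: (-5, 1), 2: (-5, 0)}
for _ in range(2 * 3):                   # a few multiples of D0 suffice here
    state = bfs_round(nbrs, ids, state, 3)
print(state)   # every node reports the real root 10 and its BFS distance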
We can now argue that, after stabilization, this process eventually converges
to T_u consisting precisely of the set of all pairs ⟨w, x_v⟩ where w is a
u–v path of length at most f(n) and x_v is the input to v. Indeed, this works
under almost any reasonable assumption about scheduling. The relevant
lemma:

Lemma 32.5.1. Starting from any initial configuration, for any sequence
w of at most f(n) vertices starting at u and ending at v, if (32.5.1) fires for
each node in w in reverse order, then T_u(w) = x_v if w is a u–v path, and
T_u(w) = ⊥ otherwise.

Proof. The proof is by induction on the length of w. The base case is when
|w| = 1, implying w = u = v. Here rule (32.5.1) writes ⟨u, x_u⟩ to T_u, giving
T_u(u) = x_u as claimed.

For a sequence w = uw′ where w′ is a nonempty path from some node u′
to v, if u′ is a neighbor of u, then firing rule (32.5.1) at u after firing the rule
for each node in w′ gives T_u(uw′) ← T_{u′}(w′) = x_v by the induction hypothesis.
If uw′ is not a path from u to v, then either u′ is not a neighbor of u, or w′
is not a path from u′ to v and T_{u′}(w′) = ⊥ by the induction hypothesis. In
either case, T_u(uw′) ← ⊥.
might require the graph to be from some restricted class (e.g., rings, trees,
cliques).
We’ll mostly focus on the LOCAL model in this chapter, studying it
through the problem of graph coloring.
Assuming that the largest initial color is N, the largest possible value
for i is ⌊lg N⌋, and so the largest possible value for 2i + x_i is 2⌊lg N⌋ + 1.
Iterating the function 2⌊lg N⌋ + 1 converges to at most 5 after O(log∗ N)
rounds, which gives us six colors 0, . . . , 5, where no two adjacent processes
have the same color.
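Here is a sketch of the color-reduction step on a directed ring in Python,
following the 2i + x_i recipe: each node finds the lowest bit position i where
its color differs from its successor's and takes 2i plus its own bit at that
position. The specific ring and initial IDs are arbitrary, and the sketch
assumes adjacent colors start out distinct.

def cv_step(colors):
    n = len(colors)
    new = []
    for v in range(n):
        diff = colors[v] ^ colors[(v + 1) % n]   # compare with successor
        i = (diff & -diff).bit_length() - 1      # lowest differing bit
        new.append(2 * i + ((colors[v] >> i) & 1))
    return new

colors = [10, 25, 7, 30, 3, 18]    # distinct initial IDs as colors
while max(colors) > 5:
    colors = cv_step(colors)
print(colors)                      # a proper coloring using colors 0..5

On this input the colors drop into {0, . . . , 5} in two iterations, and each
step preserves the property that adjacent nodes have distinct colors.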
To reduce this to three colors, add a phase for each c ∈ {3, 4, 5} to
eliminate c. In each phase, we carry out a two-stage process. The first stage
cleans up the neighborhood around each node, and the second stage replaces
all copies of c with some color in {0, 1, 2}.
In the first stage, we shift all colors down, by having each node switch
its color to that of its successor (or some new color chosen from {0, 1, 2} if
it doesn’t have a successor). The reason for doing this is that it guarantees
that each node’s predecessors will all share the same color, meaning that
that node now has at most two colors represented among its predecessors
and successor. At the same time, it doesn’t create any new pair of adjacent
nodes with the same color.
For the second stage, each node v that currently has color c chooses
a new color from {0, 1, 2} that is the smallest color that doesn’t appear
in its neighborhood. Since none of v’s neighbors change color during this
stage (they don’t have color c), this replaces all instances of c with a color
from {0, 1, 2} while keeping all edges two-colored. After doing this for all
c ∈ {3, 4, 5}, the only colors left are in {0, 1, 2}.
Doing the 6 to 3 reduction in the obvious way takes an additional 6
rounds, which is (asymptotically) dominated by the O(log∗ N ) rounds of
reducing from initial IDs with values up to N .
Because the reduction to 6 colors technically requires more than constant
time, it's theoretically necessary for the nodes to know an upper bound on
log∗ N, so that they know when to switch to the 6 → 3 step. In practice, log∗ N ≤ 7
for any N that can be represented by bits encoded using subatomic particles
contained in the visible universe, so we may be able to get away with fixing
a constant. Despite this useful property of log∗ in practice, we can’t get rid
of it in theory, because of an Ω(log∗ n) bound on coloring rings shown in the
next section.
Lemma 33.2.1 ([LS14, Lemma 2]). For k > 1, given a k-ary c-coloring
function A, it is possible to construct a (k − 1)-ary 2^c-coloring function B.
Suppose now that (33.2.1) does not hold for B; that is, there is some
increasing sequence (x_1, . . . , x_k) such that B(x_1, . . . , x_{k−1}) = B(x_2, . . . , x_k),
or equivalently B′(x_1, . . . , x_{k−1}) = B′(x_2, . . . , x_k).

We will feed this bad sequence to A and see what happens. Let α =
A(x_1, . . . , x_k). Since x_k is one of the possible extensions of (x_1, . . . , x_{k−1})
used to generate B′(x_1, . . . , x_{k−1}), we get α ∈ B′(x_1, . . . , x_{k−1}). But then α
is also contained in B′(x_2, . . . , x_k) = B′(x_1, . . . , x_{k−1}). From the definition
of B′(x_2, . . . , x_k), this implies that there is some x_{k+1} > x_k such that α =
A(x_2, . . . , x_k, x_{k+1}) = A(x_1, x_2, . . . , x_k). But then A is not a k-ary c-coloring
function.
To get the Ω(log∗ n) lower bound, start with a k-ary 3-coloring function
and iterate Lemma 33.2.1 to get a 1-ary f(k − 1)-coloring function, where
f(k) is the result of iteratively applying the function x ↦ 2^x to 3, k − 1 times.
Then f(k − 1) ≥ n, which implies k = Ω(log∗ n).
Chapter 34

Population protocols
H2 + O2 → H2 O + O
graph (which means uniformly from all possible pairs when the interaction
graph is complete).
Assuming random scheduling (and allowing for a small probability of
error) greatly increases the power of population protocols. So when using this
time measure we have to be careful to mention whether we are also assuming
random scheduling to improve our capabilities. Most of the protocols in
this section are designed to work as long as the scheduling satisfies global
fairness—they don’t exploit random scheduling—but we will discuss running
time in the random-scheduling case as well.
34.2.2 Examples
These examples are mostly taken from the original paper of Angluin et
al. [AAD+ 06].
L, L → L, F.
It is easy to see that in any configuration with more than one leader,
there exists a transition that reduces the number of leaders, and global
fairness says that such a transition eventually happens. So we converge to
a single leader after some finite number of interactions.
If we assume random scheduling, the expected number of transitions to
get down to one leader is exactly

    Σ_{k=2}^{n} n(n−1)/(k(k−1)) = n(n−1) Σ_{k=2}^{n} 1/(k(k−1))
                                = n(n−1) Σ_{k=2}^{n} (1/(k−1) − 1/k)
                                = n(n−1)(1 − 1/n)
                                = (n−1)².
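A quick Monte Carlo check of this computation; the uniform choice of a
random pair per step is the random-scheduling assumption, and the trial
count is arbitrary.

import random

def fratricide(n, rng):
    # count interactions until one leader survives under L,L -> L,F
    leader = [True] * n
    steps = 0
    while sum(leader) > 1:
        steps += 1
        a, b = rng.sample(range(n), 2)   # scheduler picks a random pair
        if leader[a] and leader[b]:
            leader[b] = False            # responder becomes a follower
    return steps

rng = random.Random(42)
n = 50
avg = sum(fratricide(n, rng) for _ in range(200)) / 200
print(avg, (n - 1) ** 2)                 # the average should be close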
This protocol satisfies the invariant that the sum over all agents of
the second component, mod m, is unchanged by any transition. Since the
second component of any follower is zero, this means that when we converge to
a unique leader, it will contain the count of initial A's mod m.
The protocol computes threshold predicates of the form

    Σ_i a_i·x_i ≥ b,    (34.2.1)

where the x_i are the counts of various possible inputs and the a_i and b are
integer constants. This includes comparisons like x_1 > x_2 as a special case.
The idea is to compute a truncated version of the left-hand side of (34.2.1)
as a side-effect of leader election.
Fix some k > max(|b|, max_i |a_i|). In addition to the leader bit, each agent
stores an integer in the range −k through k. The input map sends each xi to
the corresponding coefficient ai , and the transition rules cancel out positive
and negative ai , and push any remaining weight to the leader as much as
possible subject to the limitation that values lie within [−k, k].
Formally, define a truncation function t(x) = max(−k, min(k, x)), and a
remainder function r(x) = x − t(x). These have the property that if |x| ≤ 2k,
then t(x) and r(x) both have their absolute value bounded by k. If we have
the stronger condition |x| ≤ k, then t(x) = x and r(x) = 0.
We can now define the transition rules:
These have the property that the sum of the second components is
preserved by all transitions. Formally, if we write y_i for the second component
of agent i, then Σ_i y_i does not change through the execution of the protocol.

When agents with positive and negative values meet, we get cancellation.
This reduces the quantity Σ_i |y_i|. Global fairness implies that this quantity
will continue to drop until eventually all nonzero y_i have the same sign.

Once this occurs, and there is a unique leader, then the leader will
eventually absorb as much of the total as it can. This will leave the leader
with y = min(k, max(−k, Σ_i y_i)). By comparing this quantity with b, the
leader can compute the threshold predicate.
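The following Python sketch illustrates the cancellation dynamics, tracking
only the second components y_i; ignoring the leader bit and pushing all
transferable weight onto the first agent of each pair are simplifying
assumptions of this sketch.

import random

K = 5

def t(x):                                # truncation function
    return max(-K, min(K, x))

def interact(u, v):
    # preserves u + v and keeps both components in [-K, K]
    total = u + v
    return t(total), total - t(total)    # (t(x), r(x))

rng = random.Random(0)
y = [1] * 7 + [-1] * 4                   # seven +1 inputs, four -1 inputs
for _ in range(10_000):
    i, j = rng.sample(range(len(y)), 2)
    y[i], y[j] = interact(y[i], y[j])
print(sum(y), sorted(y))
# the sum is invariant (here 7 - 4 = 3); after enough interactions one
# agent is left holding min(K, max(-K, 3)) = 3 and the rest hold 0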
∃y : x = y + y
∃z : x = y + z + 1
A, − → A0 , B
xy → xb
yx → yb
xb → xx
bx → xx
yb → yy
by → yy
What if there is no x agent? Then the query goes out but nothing comes
back. If the leader can count off Θ(log n) time units (with an appropriate
constant), it can detect this. But it does not have enough states by itself to
count to Θ(log n).
The solution is to take advantage of the known spreading time for epi-
demics to build a phase clock out of epidemics. The idea here is that
the leader will always be in some phase 0 . . . m − 1. Non-leader agents
try to catch up with the leader by picking up on the latest rumor of the
leader’s phase, which is implemented formally by transitions of the form
hx, ii hF, ji → hx, ii hF, ii when 0 < i − j < m/2 (mod m). The leader on
the other hand is a hipster and doesn’t like it when everybody catches up; if
it sees a follower in the same phase, it advances to the next phase to maintain
its uniqueness: hL, ii hF, ii → hL, i + 1i hF, ii.
Because the current phase spreads like an epidemic, when the leader
advances to i + 1, every agent catches up in a log n time w.h.p., for some
constant a. This means both that the leader doesn't spend too much time in
i + 1 before meeting a same-phase follower and that followers don't get too
far behind. (In particular, followers don't get so far behind that they start
pulling other followers forward.) But we also have that it takes at least b log n
time w.h.p. before more than εn followers catch up. This gives at most an
n^{−1} probability that the leader advances twice in b log n time. By
making m large enough, the chances that this happens enough times to get all
the way around the clock in less than, say, b(m/2) log n time can be made
at most n^{−c} for any fixed c. So the leader can now count off Θ(log n) time
w.h.p., and in particular can use this to time any other epidemics that are
propagating around in parallel with the phase clock.
Angluin et al. use these techniques to implement various basic arithmetic
operations such as addition, multiplication, division, etc., on the counts of
agents in various states, which gives the register machine simulation. The
simulation can fail with nonzero probability, which is necessary because
otherwise it would allow implementing non-semilinear operations in the
adversarial scheduling model.
The assumption of an initial leader can be replaced by a leader election
algorithm, but at the time of the Angluin et al. paper, no leader election
algorithm better than the Θ(n)-time fratricide protocol described in §34.2.2.1
was known, and even using this protocol requires an additional polynomial-
time cleaning step before we can run the main algorithm, to be sure that
there are no leftover phase clock remnants from deposed leaders to cause
trouble. So the question of whether this could be done faster remained open.
Hopes of finding a better leader election protocol without changing the
model ended when Doty and Soloveichik [DS15] proved a matching Ω(n)
lower bound on the expected time to convergence for any leader election
algorithm in the more general model of chemical reaction networks. This
result holds assuming a constant number of states and a dense initial population,
where any state that appears is represented by a constant fraction of the agents.
Because of this and related lower bounds, recent work on fast population
protocols has tended to assume more states. This is a fast-moving area of
research, so I will omit trying to summarize the current state of the art here.
For an introduction to this work see [AG18, ER+ 18].
Chapter 35
Mobile robots
35.1 Model
We will start by describing the Suzuki-Yamashita model [SY99], the
CORDA model [Pri01], and some variants. We’ll follow the naming con-
ventions used by Agmon and Peleg [AP06].
Basic idea:
– Anonymity: any two robots that see the same view take the
same action.
– Oblivious: The output of the compute phase is based only on the
results of the last look phase, and not on any previous observations.
Robots have no memory!
– No absolute coordinates: Translations of the space don’t change
the behavior of the robots.
– No sense of direction: robots don’t know which way is north.
More formally, if view v can be rotated to get view v′, then a
robot that sees v′ will make the same move as in v, subject to the
same rotation.
– No sense of scale: robots don't have a consistent linear measure.
If view v can be scaled to get view v′, then a robot that sees v′
will move to the same point as in v, after applying the scaling.
– No sense of chirality: robots can’t tell counter-clockwise from
clockwise. Flipping a view flips the chosen move but has no other
effect.
– No ability to detect multiplicities: the view of other robots is a
set of points (rather than a multiset), so if two robots are on the
same point, they look like one robot.
– Fat robots: robots block the view of the robots behind them.
2. Only one robot moves. Without loss of generality, suppose the robot
at p moves to q. Then there is a different execution where q also moves
to p and the robots switch places.
In either case the distance between the two robots in the modified
execution is at least half the original distance. In particular, it’s not zero.
Note that this works even if the adversary can’t stop a robot in mid-move.
Both obliviousness and the lack of coordinates and sense of direction
are necessary. If the robots are not oblivious, then they can try moving
to the midpoint, and if only one of them moves then it stays put until the
other one catches up. If the robots have absolute coordinates or a sense of
direction, then we can deterministically choose one of the two initial positions
as the ultimate gathering point (say, the northmost position, or the westmost
position if both are equally far north). But if we don’t have any of this we
are in trouble.
Like the 3-process impossibility result for Byzantine agreement, the 2-
process impossibility result for robot gathering extends to any even number of
robots where half of them are on one point and half on the other. Anonymity
then means that each group of robots acts the same way a single robot would
if we activate them all together. Later work (e.g., [BDT12]) refers to this as
a bivalent configuration, and it turns out to be the only initial configuration
for which it is not possible to solve gathering absent Byzantine faults.
However, once we have a Byzantine fault, this blows up. This is shown
by considering a lot of cases, and giving a strategy for the adversary and the
Byzantine robot to cooperate to prevent the other two robots from gathering
in each case. This applies to both algorithms for gathering and convergence:
the bad guys can arrange so that the algorithm eventually makes no progress
at all.
The first trick is to observe that any working algorithm for the n =
3, f = 1 case must be hyperactive: every robot attempts to move in every
configuration with multiplicity 1. If not, the adversary can (a) activate the
non-moving robot (which has no effect); (b) stall the moving non-faulty robot
if any, and (c) move the Byzantine robot to a symmetric position relative to
the first two so that the non-moving robot becomes the moving robot in the
next round and vice versa. This gives an infinite execution with no progress.
The second trick is to observe that if we can ever reach a configuration
where two robots move in a way that places them further away from each
other (a diverging configuration), then we can keep those two robots at
the same or greater distance forever. This depends on the adversary being
able to stop a robot in the middle of its move, which in turn depends on the
robot moving at least δ before the adversary stops it. But if the robots have
no sense of scale, then we can scale the initial configuration so that this is
not a problem.
Chapter 36

Beeping
The (discrete) beeping model was introduced by Cornejo and Kuhn [CK10]
to study what can be computed in a wireless network where communication
is limited to nothing but carrier sensing. According to the authors, the model
is inspired in part by some earlier work on specific algorithms based on carrier
sensing due to Scheideler et al. [SRS08] and Flury and Wattenhofer [FW10].
It has in turn spawned a significant literature, not only in its original domain
of wireless networking, but also in analysis of biological systems, which often
rely on very limited signaling mechanisms. Some of this work extends or
adjusts the capabilities of the processes in various ways, but the essential
idea of tightly limited communication remains.
In its simplest form, the model consists of synchronous processes organized
in an undirected graph. Processes wake up at arbitrary rounds chosen by the
adversary, and do not know which round they are in except by counting the
number of rounds since they woke. Once awake, a process chooses in each
round to either send (beep) or listen. A process that sends learns nothing in
that round. A process that listens learns whether any of its neighbors sends,
but not how many or which one(s).
From a practical perspective, the justification for the model is that carrier
sensing is cheap and widely available in radio networks. From a theoretical
perspective, the idea is to make the communication mechanism as restrictive
as possible while still allowing some sort of distributed computing. The
assumption of synchrony both adds to and limits the power of the model.
With no synchrony at all, it’s difficult to see how to communicate anything
with beeps, since each process will just see either a finite or infinite sequence
of beeps with not much correlation to its own actions. With continuous
or there are constants ℓ and p such that the process beeps in round ℓ with
probability p. This follows because if the process is ever going to beep, there
is some first round ℓ where it might beep, and the probability that it does
so is constant because it depends only on the algorithm and the sequence b,
and not on n.
If an algorithm that hears only silence remains silent, then nobody ever
beeps, and nobody learns anything about the graph. Without knowing
anything, it’s impossible to correctly compute an MIS (consider a graph with
only two nodes that might or might not have an edge between them). This
means that in any working algorithm, there is some round ℓ and probability
p such that each process beeps with probability p after ℓ rounds of silence.
We can now beep the heck out of everybody by assembling groups of
Θ((1/p) log n) processes and waking up each one ℓ rounds before we want them
to deliver their beeps. But we need to be a little bit careful to keep the
graph from being so connected that the algorithm finds an MIS despite this.
There are two cases, depending on what a process that hears only beeps
does:
1. If a process that hears only beeps stays silent forever, then we build
a graph with k − 1 cliques C1, . . . , C_{k−1} of size Θ((k/p) log n) each, and
a set of k cliques U1, . . . , Uk of size Θ(log n) each. Here k ≫ ℓ is a
placeholder that will be filled in later (foreshadowing: it's the biggest
value that doesn't give us more than n processes). Each Ci clique is
further partitioned into subcliques Ci1, . . . , Cik of size Θ((1/p) log n) each.
Each Cij is attached to Uj by a complete bipartite graph.
We wake up each clique Ci in round i, and wake up all the U cliques
in round ℓ. We can prove by induction on rounds that with high
probability, at least one process in each Cij beeps in round i + ℓ, which
means that every process in every Ui hears a beep in the first k − 1
rounds that it is awake, and remains silent, causing the later C cliques
to continue to beep.
Because each Ci is a clique, each contains at most one element of the
MIS, and so between them they contain at most k − 1 elements of
the MIS. But there are k U cliques, so one of them is not adjacent to
any MIS element in a C clique. This means that one of the Uj must
contain an MIS element.
So now we ask when this extra Uj picks an MIS element. If it's in the
first k − 1 rounds after it wakes up, then all elements have seen the
same history, so if any of them attempt to join the MIS then all of
them do so with the same probability, and with high probability more
than one joins, violating independence.
1   Leave MIS and restart the algorithm here upon hearing a beep
2   for c lg² N rounds do
3       listen
4   for i ← 1 to lg N do
5       for c lg N rounds do
6           with probability 2^i/8N do
7               beep
8           else
9               listen
10  Join MIS
11  while I don't hear any beeps do
12      with probability 1/2 do
13          beep
14          listen
15      else
16          listen
17          beep
4. The hard part: After O(log² N log n) rounds, it holds with high prob-
ability that every node is either in the MIS or has a neighbor in the
MIS. This will give us that the alleged MIS is in fact maximal.
The bad case for termination is when some node u hears a neighbor v
that is then knocked out by one of its neighbors w. So now u is not in
the MIS, but neither is its (possibly only) neighbor v. The paper gives
a rather detailed argument that this can’t happen too often, which we
will not attempt to reproduce here. The basic idea is that if one of v’s
neighbors were going to knock v out shortly after v first beeps, then the
sum of the probabilities of those neighbors beeping must be pretty high
(because at least one of them has to be beeping instead of listening
when v beeps). But they don’t increase their beeping probabilities
very fast, so if this is the case, then with high probability one of them
would have beeped in the previous c log N rounds before v does. So
the most likely scenario is that v knocks out u and knocks out the
rest of its neighbors at the same time, causing it to enter the MIS and
remain there forever. This doesn’t happen always, so we might have to
have some processes go through the whole O(log² N) initial rounds of
the algorithm more than once before the MIS converges. But O(log n)
attempts turn out to be enough to make it work in the end.
Appendix
Appendix A
Assignments
Solution
1. This case is impossible. The proof is the same as for leader election in
an anonymous ring. In a synchronous execution, symmetry is never
broken, and so if any process returns, all processes return the same value.
This either yields S = ∅ (not maximal) or S = V (not independent).
B.1.2 Deanonymization
Suppose you have an asynchronous bidirectional message-passing network in
the form of an arbitrary connected graph, which is mostly anonymous in the
sense that every node but one runs the same code, and that each node can
1  initially do
2      if I am the leader then
3          position ← 0
4          send 0 clockwise
5  upon receiving m do
6      if I am not the leader then
7          position ← m + 1
8          send m + 1 clockwise
Solution
We’ll show that it is possible by constructing an algorithm.
First observe that we can apply the alpha synchronizer to this system,
since the alpha synchronizer only requires that a node be able to detect
when it has received a message (or noMsg) from each of its neighbors, and
the assumptions on port numbers are sufficient to do this. We also don't
care about message complexity. So we can simplify our life by assuming that
the model is synchronous. (Alternatively we can replace the synchronous
breadth-first search protocol in the first step below with an asynchronous
breadth-first search protocol, but the end result is pretty much the same
either way.)
Run a synchronous breadth-first search protocol to construct a shortest-
path tree rooted at the initiator. This takes O(D) time and yields a tree with
depth at most D. Note that the parent pointers in the usual protocol will
now have port numbers rather than ids, but this doesn’t affect the algorithm.
Using convergecast, compute the size of every subtree and have each node
pass this information on to its parent. This takes an additional O(D) time.
We can now recursively assign ids through the tree. The initiator starts
the process by sending itself a message containing the id range {1 . . . n}. Each
node that receives an id range {i . . . j} assigns i to itself and then partitions
the remaining range {i + 1 . . . j} into subranges {i1 . . . j1 } , . . . {ik . . . jk },
where k is the number of children it has and each range has length equal to
the number of nodes in the subtree rooted at the corresponding child when
sorted by port number. Now send each child its range. A straightforward
induction argument shows that this assigns a unique identifier to every node.
The time to perform this broadcast-like operation is proportional to the
depth of the tree, giving another O(D) time. So the total time for all steps
is O(D).
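A sketch of the recursive id-assignment step on an explicit tree: each node
takes the first id in its range and splits the rest among its children in
proportion to their subtree sizes. The dictionary-of-children representation
(ordered by port number) is an assumption of this sketch.

def subtree_size(children, v):
    return 1 + sum(subtree_size(children, c) for c in children[v])

def assign_ids(children, v, lo, ids):
    ids[v] = lo                            # take the first id in my range
    nxt = lo + 1
    for c in children[v]:                  # children in port-number order
        sz = subtree_size(children, c)
        assign_ids(children, c, nxt, ids)  # child gets {nxt .. nxt+sz-1}
        nxt += sz
    return ids

# a small shortest-path tree rooted at node 0
children = {0: [1, 2], 1: [3, 4], 2: [5], 3: [], 4: [], 5: []}
print(assign_ids(children, 0, 1, {}))      # unique ids 1..6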
You should justify your answers with matching upper and lower bounds.
For the upper bound side, you may find it helpful to give a single algorithm
that applies to both cases.
Solution
For our algorithm, we’ll run Dolev-Strong (see §9.2) largely unmodified,
except that we will run for 2t + 2 rounds and each process will send messages
only to its neighbors.
Observe that (a) if t ≤ n − 1, there is at least one non-faulty pi and at
least one non-faulty qj ; and (b) since we can divide the rounds into t + 1
phases of two rounds each, there are at least two consecutive rounds 2s and
2s + 1 with no new crash failures in either round. Let hk, vi appear in Spi
at the beginning of round 2s, where pi has not yet crashed. Then hk, vi is
transmitted to all surviving qj in round 2s, and at least one such qj forwards
hk, vi to all surviving pi in round 2s + 1. Similarly, any hk, vi that appears
in Sqj at the beginning of round 2s is transmitted to all pi and qj 0 by the
end of round 2s + 1. It follows that Sp2s+1
i
= Sq2s+1
j
for all pi and qj that
do not crash in round 2s + 1 or earlier. The same argument as used for
the original algorithm shows that this continues to hold for all subsequent
rounds, and so all processes choose the same value from the same set at the
end of the protocol, giving agreement. Termination is trivial as usual, and
validity follows from the same argument as for the original algorithm.
This shows that consensus is possible in O(t) time when t ≤ n − 1. Now
we just need the corresponding lower bounds.
2. For the time bound, suppose that we can solve the problem in t rounds.
Then in a complete network we can also solve synchronous agreement
with t failures in t rounds, since nothing prevents the processes in
the complete network from choosing only to communicate using a
Solution
It turns out that the diameter is not important. There exists a protocol with
responsiveness Θ(1), which is also the lower bound.
We’ll start by showing that no protocol with n ≥ 2 can have responsiveness
less than 1.
Consider a synchronous execution Ξ, and suppose that there are two
consecutive special actions si and si+1 such that the time between si and
si+1 is less than 1. Then si and si+1 are not causally ordered, and there is
a causal shuffle Ξ0 of Ξ in which si+1 occurs before si but the other special
actions occur in the same order as before. Let pi and pi+1 be the processes
execution si and si+1 . Then in Ξ0 , pi executes both the (i + 1)-th special
action and the (i + n)-th special action, which is requires pi to have both
1   for ever do
2       if I am not the root then
3           wait to receive token from my parent
4       if my depth is even then
5           perform special action
6       for each child c in increasing order by id do
7           send token to c
8           wait to receive token from c
9       if my depth is odd then
10          perform special action
first three events along edges of the tree. Classify these messages as D or U
depending on whether the message goes down the tree (is sent to a child) or
goes up (is sent to a parent). There are eight possible patterns DDD, DDU,
DUD, . . . , UUU for the three messages.
Solution
We can solve the problem for t < n/3, by simulating any standard syn-
chronous Byzantine agreement algorithm with optimal fault tolerance, with
an extra round at the end to handle processes that have evil twins but still
need to decide on the common value.
To enforce synchrony, we use the alpha synchronizer. Since every good
process sends a message to every other process in every simulated round,
nobody gets stuck waiting for messages from all other processes, and the
worst that happens is that some process might receive a round-r message
from p̌i instead of pi . In this case we treat pi as Byzantine for the simulated
execution.
We will also assume that any pi with an evil twin behaves arbitrarily
during the main protocol. This absolves us from worrying about good
processes sending bad messages, and again bad messages from a twinned pi
are indistinguishable from bad messages from a Byzantine pi in the simulated
execution.
Running EIG or a similar algorithm then gives agreement among all
the processes that do not have evil twins. We add one more round where
each process announces its decision value, and all good processes (including
twinned processes) wait to receive decision values from all n processes and
decide on the majority. Since at least (2/3)n processes agree coming out of the
Byzantine agreement protocol, all good processes will see the same majority
value and reach the same decision.
Solution
The largest number of crash failures we can tolerate is t = n−1. At t = n, it is
possible for every process to crash immediately, erasing all inputs. Since this
gives the same configuration in both an all-0-input and all-1-input execution,
whatever the processes decide will violate validity in one of these executions.
To solve consensus with t = n − 1, we’ll first show how to simulate a
system with standard crash failures and a perfect failure detector, then adapt
the consensus protocol from Chandra and Toueg [CT96] for the strong failure
detector (see Algorithm 13.2) to solve the problem in the crash-with-recovery
model.
The idea is that whenever a process recovers, it will send a message failed
to all other processes, and otherwise act like a crashed process by no longer
participating in the simulated consensus protocol. A never-crashed process
that receives a failed message from some process p will (a) add p to its list of
suspect processes; and (b) send p its decision value, if it has already decided,
or add p to a list of processes to be notified of its decision value when it
decides, if it has not already decided. A previously-crashed process that is
notified of a decision value decides on that value. Other than these changes,
the never-crashed processes run Algorithm 13.2 essentially unmodified.
Agreement follows from the fact that all never-crashed processes agree
in Algorithm 13.2 and all crashed processes that decide choose a value
sent to them by a never-crashed process. Validity follows from validity of
Algorithm 13.2 and the same argument.
Termination is a bit trickier since we have to allow for the possibility
that a process might crash more than once. Any process that doesn’t crash
decides at the end of Algorithm 13.2 (but note that it may still need to
respond to failed messages). For a process p that does crash, consider what
happens when it recovers for the last time. At this point the process sends
failed to all processes, including at least one process q that does not crash.
Eventually q sends a value to p (either immediately in response to p’s message
or eventually when it decides). This value is sent after p’s last crash, so
eventually p receives it and decides.
1. Show that Algorithm B.3 can violate both mutual exclusion and
deadlock-freedom.
2. Prove or disprove: For any algorithm, if (a) it uses only one fetch-and-
add object and no other objects and (b) it works for an arbitrarily
Solution
1. Since I am lazy I will give a single execution that violates both mutual
exclusion and deadlock-freedom.
Send in K + 2 processes p_0 . . . p_{K+1}, and have all of them execute Line 2
in order. Then each process p_i gets ticket i mod K, and in particular
p_1 and p_{K+1} both get 1. This is unfortunate, because ⌊r/K⌋ = 1, so
both of these processes leave the loop in Line 3 and enter the critical
section together. Mutex is violated!
Even worse, since r never decreases, poor process p_0 can never see
⌊r/K⌋ = 0 and thus remains stuck at Line 3 forever. This is true even
if every other process runs to completion and makes no attempt to
re-enter the critical section. We haven’t actually shown that every
other process can run to completion, but we eventually reach some
configuration where either (a) every remaining process is stuck, or (b)
p0 is alone and stuck. In either case, deadlock-freedom is violated.
Conversely, these three lines are also the only places where c or d
change. Since we have already shown that they preserve r = c + d, the
invariant holds throughout any execution of the algorithm.
The invariant directly gives mutual exclusion: If in some configuration
there is already a process in the critical section, then r = c + d ≥ c ≥ 1
and so no process can observe r = 0 in Line 2 and enter the critical
section.
For deadlock-freedom we want to show that if there is at least one
process in the entry section, r eventually reaches 0 and stays there long
enough for some process to see it in Line 2. Start in any reachable
configuration. If c = 1, then we can run until the process in the critical
section leaves, reducing c to 0. Suppose that c remains 0 forever (if
not, some process entered the critical section and we are done). If r
never reaches 0, every process in the entry section eventually gets stuck
at Line 4. But then d = 0 implies r = 0, a contradiction. If instead r
reaches 0, then in that configuration no process is in Line 3, so every
process is either at Line 4 or Line 2. Processes in Line 4 see r = 0 and
move to Line 2; this does not change r. So eventually some process
executes Line 2, sees r = 0, and enters the critical section.
1  procedure write(ℓ, v)
2      atomically do
3          if ℓ = ⊥ then ℓ ← v
4  procedure read(ℓ)
5      atomically do
6          v ← ℓ
7          ℓ ← ⊥
8          return v
Solution
The consensus number of this object is 2.
To solve consensus for n = 2, initialize the locker with some non-null
default value, say 1, and have each process attempt to read the locker after
writing its input to a register. Then whichever process gets 1 has won and
can return its own input, while the other process can read the winning input
from the winner’s register as usual.
To show that we cannot solve consensus for n = 3, we'll use an argument
similar to that for queues without peek. Consider an alleged three-process consensus protocol
using locker objects and atomic registers. Do the usual thing to get to a
bivalent configuration C with pending operations x and y on the same locker
object ℓ by processes p and q such that Cx is 0-valent and Cy is 1-valent.
Let z be a pending operation by the third process r. We have that Cz is
univalent, but we don't care about this for the purpose of the argument.
We want to show that for any choice of x and y, we can construct an
execution in which r can’t tell which of x and y went first. As usual we know
that x and y must be operations on the same object and that this object
must be a locker.
If x and y are both read operations, then Cxy and Cyx both leave an
empty locker and are indistinguishable to r.
If x is a write and y is a read, then we need to consider two cases
depending on whether the locker is empty in C or not. If the locker is empty,
then Cyx ∼r Cx, since in either case only q knows if y occurred or not. If the
locker is not empty, then Cxy ∼r Cy, since x has no effect on a non-empty locker.
Solution
To implement a writable max register r, we'll use a standard bounded max
register mr to store lexicographically-ordered tuples ⟨g, i, v⟩, where g is a
generation number in {0 . . . w}, i is a process id in the range 0 . . . n − 1,
and v is a value in {0 . . . m − 1}. We can do this by encoding ⟨g, i, v⟩ as
mn · g + m · i + v, which is both bijective and order-preserving. To simplify
the presentation of the algorithm, we will treat this encoding as happening
implicitly. We assume that mr starts with its minimum value 0, corresponding
to the tuple ⟨0, 0, 0⟩.
We can then increment the generation to reset the register in response
to write operations, and use the max-register property within a generation
to implement writeMax. Pseudocode for the resulting algorithm is given in
Algorithm B.6.
1 procedure read(r)
2 ⟨−, −, v⟩ ← read(mr)
3 return v
4 procedure write(r, v)
5 ⟨g, −, −⟩ ← read(mr)
6 writeMax(mr, ⟨g + 1, myId, v⟩)
7 procedure writeMax(r, v)
8 ⟨g, i, −⟩ ← read(mr)
9 writeMax(mr, ⟨g, i, v⟩)
Algorithm B.6: Writable max register
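Since the encoding does most of the work here, a Python sketch may help; it simulates the underlying max register with a plain field (a real implementation would use a linearizable bounded max register), and the class and method names are ours.

class WritableMaxRegister:
    """Sketch of Algorithm B.6: encode the tuple (g, i, v) as m*n*g + m*i + v,
    which is order-preserving, so writeMax on encodings is writeMax on tuples."""
    def __init__(self, n, m):
        self.n, self.m = n, m
        self.mr = 0                      # stands in for the bounded max register

    def _enc(self, g, i, v):
        return self.m * self.n * g + self.m * i + v

    def _dec(self, x):
        g, rest = divmod(x, self.m * self.n)
        i, v = divmod(rest, self.m)
        return g, i, v

    def _writemax_mr(self, x):           # simulated atomic writeMax on mr
        self.mr = max(self.mr, x)

    def read(self):
        _, _, v = self._dec(self.mr)
        return v

    def write(self, my_id, v):           # reset by starting a new generation
        g, _, _ = self._dec(self.mr)
        self._writemax_mr(self._enc(g + 1, my_id, v))

    def write_max(self, v):              # ordinary writeMax within a generation
        g, i, _ = self._dec(self.mr)
        self._writemax_mr(self._enc(g, i, v))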
Validity For each position i and process p, y_i^p is equal to some x_i^q.
Solution
We'll use a safe agreement object [BGLR01] (see §28.2) for each position i.
Since it takes a distinct failure to knock out each safe agreement object, at
most n − 1 of these objects will get stuck. So when a process p sees return
values from m − (n − 1) objects, it will combine these with its own inputs
for the missing positions to produce its output y^p.
To avoid a lot of handwaving about how the safe agreement objects
interact, we'll break the abstraction barriers around their implementations
and build an explicit loop for managing the unsafe phases. This also allows
us to skip looping in the safe phase. Pseudocode is given in Algorithm B.7.
We claim that this satisfies all three requirements for k = 2n − 3.
Validity is easy. Any y_i^p is either x_i^p or a proposal derived from some x_i^q.
Termination is also easy, since the algorithm contains no unbounded
loops.
For maximum distance, observe that the final snapshot s always contains
at least one level 2 proposal for each position, since every process that reaches
this line either observes a level 2 proposal in Line 5 or writes one in Line 8.
We can argue that any two such level 2 proposals that are used in Line 14
are equal, because if I take a snapshot that includes a level 2 proposal in
position i and no level 1 proposal, any process working on position i that
has not yet written a level 1 proposal will see the level 2 proposal and back
off instead of writing a new one. So the only places where y^p and y^q can
1 procedure vectorAgreement(x)
// unsafe phase of safe agreement for each i
2 for i ← 1 to m do
// propose x_i at level 1 as in safe agreement
3 a[p]_i ← ⟨1, x_i⟩
4 s ← snapshot(a)
5 if s contains a[q]_i with level 2 then
// back off
6 a[p]_i ← ⟨0, x_i⟩
7 else
// advance
8 a[p]_i ← ⟨2, x_i⟩
15 return y
Algorithm B.7: Solution to vector agreement problem
1. An anonymous system in which all processes run the same code and
do not have unique IDs.
2. A uniform system with IDs, where uniformity means that the code for
each process depends only on its ID and not on the size of the system.
4. A uniform system with IDs, but where the broadcast channel is replaced
by an ordered broadcast channel that guarantees for each pair of
messages m1 and m2 , that if m1 is sent before m2 , each process receives
m1 before it receives m2 .
Solution
For computing message complexity, there is an ambiguity in the problem
description: does sending a single broadcast count as n messages or one
message? Below, we assume a broadcast counts as n messages, but one
message is also a reasonable interpretation, so either assumption is acceptable
as long as it is clear.
3. Possible. Have each process broadcast its ID then wait to collect n IDs.
The process with the smallest ID among these n IDs sets its leader bit.
Message complexity is n² and time complexity is 1.
4. Possible. Have each process broadcast its ID. If a process receives its
own ID before any others, it sets its leader bit. Since the broadcast
channel is ordered, only the first process to do a broadcast wins.
Message complexity is n² and time complexity is 1.
Solution
For each r, let S_i^r be the value of S_i after r rounds of messages. Define
G^r = (V, E^r) as the graph where V is the set of processes and ij ∈ E^r if and
only if p_j ∈ S_i^r. From the definition we have G^0 = G.
It is convenient to work with undirected graphs. Let H^r be the undirected
graph that contains an edge ij if and only if ij and ji are both edges in G^r.
Note that H^r is always a subgraph of G^r.
Claim: H^1 is connected. Proof: For each edge ij ∈ G^0, p_i sends S_i
(which contains p_i) to p_j, so S_j^1 includes p_i and ji is an edge of G^1; since
ij remains an edge as well, H^1 contains the undirected version of G^0 as a
subgraph. Since G^0 is weakly connected, H^1 is connected.
Because H^1 is connected, there is a path in H^1 between any two nodes,
and the diameter d(H^1) of H^1 is at most n − 1. We now show that each
round of the protocol reduces the diameter of H by roughly half.
Claim: If uv and vw are both edges in H^r, then uw is an edge in H^{r+1}.
Proof: From the definition of H^r, we have {u, w} ⊆ S_v^r. So both of u and w
add the other upon receiving S_v^r from v.
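To see the squaring step concretely, here is a small Python simulation (names ours) of the claim applied to an undirected path: each round adds the edge uw for every two-path uvw, and the diameter roughly halves each round.

import itertools

def one_round(h):
    """One round on undirected H: add uw whenever uv and vw are both edges."""
    new = {u: set(s) for u, s in h.items()}
    for v, s in h.items():
        for u, w in itertools.combinations(s, 2):
            new[u].add(w)
            new[w].add(u)
    return new

# Path on 8 nodes: the largest distance shrinks 7 -> 4 -> 2 -> 1.
h = {i: {j for j in (i - 1, i + 1) if 0 <= j < 8} for i in range(8)}
for _ in range(3):
    h = one_round(h)
assert all(len(h[i]) == 7 for i in range(8))   # complete graph: diameter 1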
Now consider arbitrary u, v ∈ H^r with d(u, v) = m. This means that
there is a path u = u_0 u_1 . . . u_m = v in H^r. From the claim, we have
that u = u_0 u_2 u_4 . . . u_m = v is a path in H^{r+1} if m is even, and u =
Solution
1. One round is enough. Each process sends xi to all processes (including
itself), and each process returns yi equal to the largest of all xj it
received.
Condition (a) follows immediately from yi being equal to some xj. For
(b), if pj is non-faulty, pi receives xj from pj, so it returns either xj or
some larger xj′.
For the lower bound, if a protocol uses zero rounds, then no messages
are sent. If process pi decides xi in some execution, then for n ≥ 2
there exists an execution indistinguishable to pi from this one in which
some non-faulty pj with j ≠ i has xj > xi, violating (b). Similarly, if
pi decides a value yi ≠ xi, there exists an indistinguishable execution
where no process has yi as its input value, violating (a).
2. Here we need f + 1 rounds. For the lower bound we can reduce from
synchronous consensus and apply Dolev-Strong ([DS83]; see also §9.3).
To solve consensus using this problem, have each process pi decide on yi .
This satisfies validity from (a) and agreement from the added condition
that yi = yj for all non-faulty i and j. So if we have an algorithm that
uses less than f + 1 rounds, we get an algorithm for consensus that
also uses less than f + 1 rounds, contradicting the known lower bound
for consensus.
For the upper bound, we can use the flooding mechanism from Dolev-
Strong ([DS83]; see also §9.2). This guarantees that after f + 1 rounds,
every non-faulty process obtains the same set S of input values, which
includes the inputs of all non-faulty processes. So taking max S gives
a common return value for all non-faulty processes that satisfies both
(a) and (b).
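For intuition, here is a hedged Python simulation of the f + 1-round flooding upper bound; the crash model, adversary, and names here are our own simplification. A process that crashes mid-round reaches only a prefix of the recipients, and after any crash-free round all survivors hold the same set, so deciding the maximum satisfies both (a) and (b).

import random

def flood_max(inputs, f, seed=0):
    rng = random.Random(seed)
    n = len(inputs)
    known = [{x} for x in inputs]            # the sets of values seen so far
    alive = set(range(n))
    crashers = set(rng.sample(range(n), f))  # adversary's choice of who may crash
    for _ in range(f + 1):
        inbox = [set() for _ in range(n)]
        for i in sorted(alive):
            cutoff = rng.randrange(n + 1) if i in crashers else n
            for j in range(cutoff):          # a crash cuts the broadcast short
                inbox[j] |= known[i]
            if i in crashers and cutoff < n:
                alive.discard(i)             # i crashed during this round
                crashers.discard(i)
        for j in alive:
            known[j] |= inbox[j]
    return {max(known[j]) for j in alive}    # decisions of surviving processes

assert len(flood_max([3, 1, 4, 1, 5, 9], f=2)) == 1   # all survivors agree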
Solution
Possible. The idea is to reduce the problem to four processes of which at
most one is Byzantine, then use any Byzantine agreement algorithm that
tolerates f < n/3 Byzantine faults to solve agreement. One possibility would
be exponential information gathering [PSL80] (see §10.2.1), since we don’t
particularly care about anything but fault tolerance and 4 is a constant
anyway.
For each color group, let the process with maximum ID represent the group
(this does not require any rounds of communication under the assumption
that all IDs are known to all processes). We then have four representatives
that can execute EIG in f + 1 = 2 rounds to solve Byzantine agreement
among themselves. Each representative then broadcasts its decision value to
all n processes, and each non-faulty process decides on the value broadcast
by the majority of representatives. (Note that it is not enough for a process
to follow its own representative: the group containing the Byzantine process
may also contain non-faulty processes, and its representative may be the
Byzantine process itself.)
We would like to show that this algorithm solves Byzantine agreement
for all n processes. Termination is immediate. For validity, if all non-faulty
processes have the same input v, then so do the three non-faulty representa-
tives; validity in the four-process protocols implies that all three non-faulty
representatives broadcast this value and thus all non-faulty processes decide
it. Agreement is similar: because all three non-faulty representatives agree
on the same value v, each non-faulty process will see a majority for v and
decide on v.
Solution
1. Disproof: With no failure detector, consider two executions of a two-
process system. In one execution, process p1 takes no steps because
it crashes immediately. In the other, p1 takes no steps for a very long
time.
If p2 eventually sets c2 to 1, this violates c2 ≥ n − f in the execution
where p1 has not crashed.
If p2 does not eventually set c2 to 1, this violates c2 converging to
n − f in the execution where p1 has crashed.
2. Disproof: Consider the two executions in the previous case, and suppose
that ♦P correctly suspects p1 throughout the crash execution and
incorrectly suspects p1 in the no-crash execution.
If p2 sets c2 to 1, it violates (a) again in the no-crash execution, and
afterwards we can both wake up p1 and have ♦P stop suspecting p1 .
If p2 doesn’t set c2 to 1, it violates (b) in the crash execution.
3. Proof: Recall that P eventually permanently suspects every crashed
process and never suspects a process before it crashes. So have each
process pi set ci to n − fi , where fi is the number of processes that
pi ’s instance of P currently suspects. Because P only suspects crashed
processes, fi ≤ f and thus ci = n − fi ≥ n − f , satisfying (a). Because
P eventually permanently suspects all crashed processes, once every
process that will crash has crashed, P will eventually suspect all of
them at each pi . This gives fi = f and ci = n − fi = n − f .
Solution
We can do this when n ≥ 4f + 1 by modifying ABD (see §17.2).
To make things easier, we will assume that the honest servers keep track
of every timestamp-value pair ⟨t, v⟩ they have ever received, instead of just
the one with the maximum timestamp; upon receiving a write(t, v) message,
a server adds ⟨t, v⟩ to its list (if it wasn't there already). Upon receiving a
read(u) message, the server responds with its entire list.
To perform a write operation with value v, the writer increments its
local timestamp t, sends write(t, v) to all servers, and waits for n − f
acknowledgments.
To perform a read operation, a reader sends read(u) to all servers, waits
for n − f replies, and then chooses a pair ⟨t, v⟩ that (a) is sent by at least
f + 1 servers, and (b) has the largest t out of all such pairs. If there is no pair
sent by at least f + 1 servers, the reader returns the default initial register value ⊥.
Otherwise, it sends write(t, v) to all servers, waits for n − f acknowledgments,
then returns v.
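Here is a Python sketch of just the reader's selection rule (the function name and reply format are ours): take the pair reported by at least f + 1 of the responding servers that has the largest timestamp, falling back to the initial value when no pair has enough support.

from collections import Counter

def choose_pair(replies, f):
    """replies: one list of (t, v) pairs per responding server."""
    support = Counter()
    for server_list in replies:
        for pair in set(server_list):     # count each server at most once
            support[pair] += 1
    candidates = [p for p, c in support.items() if c >= f + 1]
    if not candidates:
        return (0, None)                  # default initial value, i.e. bottom
    return max(candidates)                # tuples compare by timestamp first

# Example: n = 9, f = 2, so the reader hears from n - f = 7 servers.
replies = [[(1, "a"), (2, "b")]] * 3 + [[(1, "a")]] * 3 + [[(7, "zzz")]]
assert choose_pair(replies, f=2) == (2, "b")   # the forged pair lacks support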
To show this gives a linearizable implementation of a single-writer multi-
reader register, we will largely follow the original proof for ABD, constructing
an explicit linearization of any complete execution. We start with a simple
invariant:
Lemma C.3.1. Let ⟨t, v⟩ be a pair that is (a) in some honest server's list,
(b) in a write(t, v) message, or (c) adopted by a reader. Then ⟨t, v⟩ was
previously sent by the writer.
Proof. It is easy to see that if (b) and (c) hold in some configuration, then
(a) and (b) hold in any successor configuration, since we can only add a
tuple to an honest server if it was in a write(t, v) message, and we can only
generate a write(t, v) message if ⟨t, v⟩ is sent by the writer or was previously
adopted by a reader. To show that (c) holds, observe that if a reader adopts
⟨t, v⟩, it must first receive it from f + 1 servers. At least one of these servers
is honest, so (a) applies.
For any operation a, let t(a) be the timestamp of the pair ⟨t, v⟩ that a
sends in its write(t, v) messages. Observe that if a finishes, then it receives
acknowledgments from n − f servers, of which at least n − 2f are not faulty:
this implies that by the time a finishes, at least n − 2f honest servers have
⟨t, v⟩ in their lists. If b is a read operation with a <H b, then b receives
responses from at least n − 3f of these servers. With n ≥ 4f + 1, this is at
least f + 1. So b either adopts ⟨t, v⟩ or adopts some other ⟨t′, v′⟩ with t′ > t.
So whenever a <H b, t(a) ≤ t(b).
To define <S, put a before b if (1) t(a) < t(b) (which we've just shown is
consistent with <H); or (2) t(a) = t(b), a is a write, and b is a read (which
is consistent with <H by Lemma C.3.1); or (3) t(a) = t(b), both operations
are reads, and a <H b (definitely consistent with <H!). Then extend the
resulting partial order to a total order. As in the original ABD algorithm, we
get a sequence of blocks of operations where all operations in a block have
the same ⟨t, v⟩ pair, and the first operation in each block (except possibly
the first block) is a write of v and the rest are reads that return v. So the
resulting sequential execution is consistent both with H and the specification
of a register, and we have shown that the implementation is linearizable.
Solution
Proof: We’ll show that an unsigned arithmetic register implements consensus
for any fixed number of processes n, then use universality of consensus to
get an implementation of a signed arithmetic register.
The consensus construction follows an argument similar to that of Ellen et al. [EGSZ20]
for registers supporting multiplication and decrement, but we have to be a
little careful to only use non-negative values. Start with a single unsigned
arithmetic register r initialized to 1. A process with input 0 applies add(1)
to r. A process with input 1 applies multiply(n + 2) to r, where n is the
number of processes.
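As a sanity check on the two-process case, here is a Python sketch; the decoding rule at the end is our own reconstruction (the decision procedure is not spelled out here), chosen so that every reachable read value identifies which operation was applied first.

import threading

class ArithmeticRegister:
    def __init__(self):
        self.value = 1
        self._lock = threading.Lock()
    def add(self, a):
        with self._lock:
            self.value += a
    def multiply(self, m):
        with self._lock:
            self.value *= m
    def read(self):
        with self._lock:
            return self.value

def decide(r, my_input, n=2):
    """Apply my operation, then decode the winner from the value I read.
    With n = 2 and r starting at 1, the values a process can read after
    its own operation are 2, 3, or 8 when the first operation applied
    was add(1) (input 0), and 4, 5, or 16 when it was multiply(4)
    (input 1); so the input of whoever went first wins."""
    if my_input == 0:
        r.add(1)
    else:
        r.multiply(n + 2)
    return 0 if r.read() in (2, 3, 8) else 1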
Solution
Proof: In fact, we can do this for any n, not just n = 4.
Use two of the three registers to build a splitter (Algorithm 18.6). The
third register, initially 0, will be a flag indicating at least two increments.
shared data:
1 atomic register race, big enough to hold an ID, initially ⊥
2 atomic register door, big enough to hold a bit, initially open
3 atomic register flag, big enough to hold a bit, initially 0
4 procedure increment(id)
5 race ← id
6 if door = closed then
7 flag ← 1
8 door ← closed
9 if race 6= id then
10 flag ← 1
11 procedure read
12 if door = open then
13 return 0
14 else if flag = 0 then
15 return 1
16 else
17 return 2
If some increment I1 wins the splitter, it never sets the
flag; assign its linearization point to the step where the door closes (whether
I1 closes the door or not). Then I1 linearizes between all reads that return 0
and all reads that return 1 or 2. Because every other increment loses the
splitter, every other increment sets the flag; make each such increment’s
linearization point be the step where it sets the flag. The first such increment
I2 linearizes between all reads that return 0 or 1 and all reads that return 2.
If no increment wins the splitter, then no increment finishes before setting
the flag, so at the point where the flag is first set, there are at least two increments
in progress and none have already finished. Let I1 be one of these increments
that starts before the door closes, and assign its linearization point to the
step where the door closes. Let I2 be any other increment in progress when
the flag is first set, and assign its linearization point to when the flag is first
set. Assign the linearization points of any other increment anywhere during
its execution interval that is after I2 ’s. Again we get one increment linearized
between the 0 and 1 reads, and at least one between the 1 and 2 reads. We
are done.
1. Show that O(n²) total operations are sufficient to increment the counter
at least n² times.
2. Show that T(n) total operations are sufficient to increment the counter
at least n times, for some T(n) = o(n²).
Solution
1. We'll have each process i alternate between incrementing the counter
and writing out the total number of increments it has done so far to
its register ri.
After n increments, the process will read all the registers rj, and if
Σj rj ≥ n², return.
1. Show that no algorithm that allows tokens to move can guarantee that
there are exactly m tokens in any reachable configuration.
To keep things simple, you may assume that the processes can make
non-deterministic choices. For example, a process p might choose
arbitrarily between sending a message to a neighbor q or to a different
neighbor r, and each choice leads to a different possible execution.
Solution
1. Suppose that we are preserving total tokens. Consider some transition
between configurations C1 and C2 . If some process switches hasToken
from 1 to 0 between these configurations, then some other process must
switch hasToken from 0 to 1. But the definition of delivery events in
the asynchronous message-passing model only allows one process at a
time to change its state. It follows that no process can change hasToken
from 1 to 0 in any transition, so tokens can’t move.
1 initially do
2 if I am a leader then
3 parent ← id
4 send recruit to both neighbors
5 else
6 parent ← ⊥
Solution
1. In a synchronous execution, we can prove by induction that for each t
with 0 ≤ t ≤ (k − 1)/2, and each 0 ≤ i ≤ m − 1, each node at position ik ± t
joins the tree rooted at ik at time t. This puts exactly k nodes in each
tree.
3. The easiest fix may be to have each leader initially send just one recruit
message to the right. For each i, this recruits all agents ik, . . . , ik+(k−1)
to a tree of size k rooted at ik.
Solution
We'll use the flooding algorithm of Dolev and Strong [DS83] (see §9.2), but
replace sending S to all n processes in each round with sending S to all n^k
possible recipient lists. As in the original algorithm, we want to prove that
after some round with few enough failures, all the non-faulty processes have
the same set.
Let S_i^r be the set stored by process i after r rounds. Suppose there is
some round r + 1 in which fewer than k processes fail. Then every recipient
list in round r includes a process that does not fail in round r + 1. Let L
be the set of processes that successfully deliver a message to at least one
recipient list in round r, and let S = ∪_{i∈L} S_i^r. Then for each value v ∈ S,
there is some process that receives v during round r, does not crash in round
r + 1, and so retransmits v to all processes in round r + 1, causing it to be
added to S_i^{r+2}. On the other hand, for any v ∉ S, v is not transmitted to any
recipient list in round r, which means that no non-faulty process i includes
v in S_i^{r+1}. So S ⊆ S_i^{r+2} ⊆ ∪_j S_j^{r+1} ⊆ S for all i, and the usual induction
argument shows that S_i^{r′} continues to equal S for all non-faulty i and all
r′ ≥ r + 2.
We can have at most ⌊f/k⌋ rounds with ≥ k crashes before we run out,
so the latest possible round in which we have fewer than k crashes is
r = ⌊f/k⌋ + 1, giving agreement after ⌊f/k⌋ + 2 rounds (since we don't need
to send any messages in round r + 2).
(With some tinkering, it is not too hard to adapt the Dolev-Strong lower
bound to get a ⌊f/k⌋ + 1 lower bound for this model. The main issue is
that now we have to crash k processes fully in round r + 1 before we can remove
one outgoing broadcast from a process in round r, which means we need to
budget tk failures to break a t-round protocol. The details are otherwise
pretty much the same as described in §9.3.)
1 preference ← input
2 for i ← 1 to m do
3 send ⟨i, preference⟩ to all processes
4 wait to receive ⟨i, v⟩ from n − f processes
5 for each ⟨i, v⟩ received do
6 preference ← min(preference, v)
7 decide preference
Algorithm D.2: Candidate algorithm for asynchronous agreement
Solution
We know from the FLP bound ([FLP85], Chapter 11) that Algorithm D.2
can’t work. So the only question is how to find an execution that shows it
doesn’t work.
It’s not too hard to see that Algorithm D.2 satisfies both termination
and validity. So we need to find a problem with agreement.
The easiest way I can see to do this is to pick a patsy process p and give it
input 0, while giving all the other processes input 1. Now run Algorithm D.2
while delaying all outgoing messages ⟨i, v⟩ from p until after the receiver has
finished the protocol. Because each other process is waiting for n − f ≤ n − 1
messages, this will not prevent the other processes from finishing. But all
the other processes have input 1, so we have an invariant that all messages in
transit from processes other than p, and all preferences of processes other than p,
are 1; this invariant holds as long as no messages from p are delivered. This results
in the non-p processes all deciding 1. We can then run p to completion, at
which point it will decide 0.
Solution
1. Termination: The algorithm always terminates in f + 1 synchronous
rounds, so f doesn’t matter.
2. Validity: To violate validity, we need to convince some non-faulty
process to decide on the wrong value when all non-faulty processes
have the same input.
Suppose all the non-faulty processes have input 0, and we want to
introduce a 1 somewhere. Each process updates its preference in
each round to be either the majority value it sees, if this value has
multiplicity greater than n/2 + f , or the kingMajority broadcast by the
phase king otherwise.
If f < n/2, it’s going to be hard to show a process a bogus majority.
But a Byzantine phase king gives us more options. Suppose that all
the f Byzantine processes send out 1 in all rounds. Then for f ≥ n/4,
the multiplicity of the correct value 0 will be n − f ≤ (3/4)n, while the
required multiplicity to ignore the phase king will be strictly greater
than n/2 + f ≥ (3/4)n. So at f = n/4, all non-faulty processes adopt
the phase king’s bad value 1. In any subsequent round, we can just run
the algorithm with the Byzantine agents pretending to be non-faulty
processes with preference 1, and eventually all processes incorrectly
decide 1.
Give an algorithm that solves this problem, and show that it satisfies
these requirements.
(For the purpose of defining when a process starts or ends the protocol,
imagine that it uses explicit invoke and respond events. Your protocol should
have the property that all non-faulty processes eventually terminate.)
Solution
The easiest way to do this may be to use ABD (see §17.2). Algorithm D.3
has each process read the simulated register, which we assume is initialized
to 1, then write a 0 before returning the value it read.
1. V = {0, 1}.
1 procedure inc(v)
2 A[i] ← A[i] + v
3 procedure read()
4 s←0
5 for j ← 1 to n do
6 s ← s + A[j]
7 return s
Algorithm D.4: An alleged counter. Code for process i.
2. V = {−1, 1}.
3. V = {1, 2}.
Solution
1. The {0, 1} case is linearizable. Given an execution S of Algorithm D.4,
we assign a linearization point to each inc operation at the step
where it writes to A, and assign a linearization point to each read
operation ρ that returns s at the later of the first step that leaves
Σ_j A[j] = s or the first step of ρ. Since this may assign the same
linearization point to some write operation π and one or more read
operations ρ1, . . . , ρk, when this occurs we order the write before the
reads, and order the reads arbitrarily.
Observe that:
• the value of each A[j] never decreases; and
• Σ_j A[j] increases by at most one per step.
These are easily shown by induction on the steps of the execution, since
each inc operation only changes at most one A[j] and only changes it
by increasing it by 1.
The first condition implies that the value vj of A[j] used by a particular
read operation ρ lies somewhere between the minimum and maximum
values of A[j] during the operation's interval, which implies the same
about the total Σ_j A[j]. In particular, if ρ returns s, the value of
Σ_j A[j] at the start of ρ is no greater than s, and it reaches s no later
than the end of ρ.
Because Σ_j A[j] increases by at most one per step, this means that
either Σ_j A[j] = s at the first step of ρ, or Σ_j A[j] = s at some step
within the execution interval of ρ. In either case, ρ is assigned a
linearization point within its interval that follows exactly s non-trivial
increments. This means that the return values of all read operations
are consistent with a sequential generalized counter execution, and
because both read and inc operations are ordered consistently with
the execution ordering in S, we have a linearization of S.
2. For increments in {−1, 1}, there are executions of Algorithm D.4 that
are not linearizable. We will construct a specific bad execution for
n = 3. Let p1 perform inc(1) and p2 perform inc(−1), where p1 finishes
its operation before p2 starts. Because the inc(1) must be linearized
before the inc(−1), the values of the counter in any linearization will
be 0, 1, 0 in this order.
Now add a read operation by p3 that is concurrent with both inc
operations. Suppose that in the execution, the following operations are
performed on the registers A[1] through A[3]:
(a) p3 reads 0 from A[1].
(b) p1 writes 1 to A[1].
(c) p2 writes −1 to A[2].
(d) p3 reads −1 from A[2].
(e) p3 reads 0 from A[3].
Now p3 returns −1. There is no point in the sequential execution at
which this is the correct return value, so there is no linearization of
this execution.
3. For increments in {1, 2}, essentially the same counterexample works.
Here we let p1 do inc(1) and p2 do inc(2), while p3 again does a
concurrent read. The bad execution is:
(a) p3 reads 0 from A[1].
(b) p1 writes 1 to A[1].
(c) p2 writes 2 to A[2].
(d) p3 reads 2 from A[2].
(e) p3 reads 0 from A[3].
Now p3 returns 2, but in any linearization of the two write operations,
the values in the counter are 0, 1, 3.
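Replaying the {1, 2} counterexample as straight-line Python makes the interleaving concrete (the variable names are ours):

A = {1: 0, 2: 0, 3: 0}   # registers A[1..3], all initially 0

s = A[1]                  # (a) p3 reads 0 from A[1]
A[1] += 1                 # (b) p1's inc(1) writes 1 to A[1]
A[2] += 2                 # (c) p2's inc(2) writes 2 to A[2]
s += A[2]                 # (d) p3 reads 2 from A[2]
s += A[3]                 # (e) p3 reads 0 from A[3]

assert s == 2             # but a sequential counter only passes through 0, 1, 3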
D.4.2 Rock-paper-scissors
Define a rock-paper-scissors object as having three states 0 (rock), 1
(paper), and 2 (scissors), with a read operation that returns the current
state and a play(v) operation for v ∈ {0, 1, 2} that changes the state from s
to v if v = (s + 1) (mod 3) and has no effect otherwise.
Prove or disprove: There exists a deterministic wait-free linearizable
implementation of a rock-paper-scissors object from atomic registers.
Solution
Proof: We will show how to implement a rock-paper-scissors object using
an unbounded max register, which can be built from atomic registers using
snapshots. The idea is to store a value v such that v mod 3 gives the value
of the rock-paper-scissors object. Pseudocode for both operations is given in
Algorithm D.5.
1 procedure play(v)
2 s ← read(m)
3 if v = (s + 1) mod 3 then
4 writeMax(m, s + 1)
5 procedure read()
6 return (m mod 3)
Algorithm D.5: Implementation of a rock-paper-scissors object
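Before the linearization argument, here is a Python sketch of the same construction, with the max register's two operations made individually atomic by a lock purely for simulation (class names ours; the real construction builds the max register from snapshots, as stated above):

import threading

class MaxRegister:
    def __init__(self):
        self.m, self._lock = 0, threading.Lock()
    def read(self):
        with self._lock:
            return self.m
    def write_max(self, v):
        with self._lock:
            self.m = max(self.m, v)

class RockPaperScissors:
    """The state is m mod 3; play(v) advances m by one only when v is the
    successor of the current state, so successful plays write increasing values."""
    def __init__(self):
        self.mr = MaxRegister()
    def play(self, v):
        s = self.mr.read()
        if v == (s + 1) % 3:
            self.mr.write_max(s + 1)
    def read(self):
        return self.mr.read() % 3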
Linearize each play operation that does not write m at the step at which
it reads m.
Linearize each play operation that writes s + 1 to m at the first step
at which m ≥ s + 1. If this produces ties, break first in order of increasing
s + 1 and then arbitrarily. Since each such operation has m ≤ s when the
operation starts and m ≥ s + 1 when it finishes, these linearization points fit
within the intervals of their operations.
Linearize each read() operation at the step where it reads m.
Since each of these linearization points is within the corresponding oper-
ation’s interval, this preserves the observed execution ordering.
Observe that the play operations that write are linearized in order of
increasing values written, and there are no gaps in this sequence because
no process writes s + 1 without first seeing s. (This actually shows there is
Solution
We’ll disprove it.
Let p0 and p1 be the two processes. The idea is to consider, for each
i ∈ {0, 1} some nonzero-probability solo terminating execution ξi of pi with
input i, then show that ξ0 and ξ1 can be interleaved to form a two-process
execution ξ that is indistinguishable by each pi from ξi .
The oblivious adversary simply chooses the schedule corresponding to
ξ. Since the processes flip only a finite number of coins in this execution, there
is a nonzero chance that the adversary gets lucky and they flip their coins
exactly the right way.
Fix ξ0 and ξ1 as above. Partition each ξ_i as α_i β_{i1} β_{i2} . . . β_{ik_i}, where α_i
contains only read operations and each β_{ij} starts with a write operation of a
value v_{ij} strictly larger than any previously written value.
Let ξ = α_0 α_1 β_{i_1 j_1} β_{i_2 j_2} . . . β_{i_k j_k}, where k = k_0 + k_1 and the blocks β_{i_ℓ j_ℓ} are
the blocks {β_{0j}} and {β_{1j}} sorted in order of non-decreasing v_{ij}. Then each
block β_{i_ℓ j_ℓ} in ξ starts with a write of a value no smaller than the previous
value in the max register, causing each read operation within the block to
return the value of this write, just as in the solo execution ξ_{i_ℓ}. Assuming
both processes flip their coins as in the solo executions, they both perform
the same operations and return the same values. These values will either
violate agreement in ξ or validity in at least one of ξ0 or ξ1.
2. There is such an implementation that uses O(n) registers, but not o(n)
registers.
Solution
Case (2) holds.
To implement the object, use a snapshot array to hold the total votes
from each process, and have the winner operation take a snapshot, add up
all the votes and return the correct result. This can be done using n registers.
To show that it can’t be done with o(n) registers, use the JTT bound (see
Chapter 21). We need to argue that the object is perturbable. Let ΛΣπ be
an execution that needs to be perturbed, and let m be the maximum number
of vote(v) operations that start in Λ for any value v. Then a sequence γ of
m + 1 votes for some v′ that does not appear in Λ will leave the object with v′
as the plurality value, no matter how the remaining operations are linearized.
Since v′ did not previously appear in Λ, this gives a different return value
for π in ΛγΣπ than in ΛΣπ, as required. The JTT bound now implies that any
implementation of the object requires at least n − 1 registers.
Appendix E
Solution
Time complexity Observe that Alice sends at least k messages by time
2k − 2. This is easily shown by induction on k, because Alice sends at least
1 message by time 0, and if Alice has sent at least k − 1 messages by time
1 Alice:
2 initially do
3 send message to Bob
4 upon receiving message from Bob do
5 send message to Bob
6 Bob:
7 upon receiving message from Alice do
8 send message to Alice
9 send message to Charlie 1
10 Charlie i, for i < n:
11 initially do
12 c←0
13 upon receiving message from Bob or Charlie i − 1 do
14 c←c+1
15 if c = 3 then
16 c←0
17 send message to Charlie i + 1
T_i(k) = (2 · 3^i · k − 1) + k.
differ in exactly one coordinate. We also assume that each node knows
its own coordinate vector and those of its neighbors.
Show that any algorithm for an asynchronous ring can be adapted to
an asynchronous d-dimensional hypercube with no increase in its time
or message complexity.
Solution
1. The idea is to embed the ring in the hypercube, so that each node is
given a clockwise and a counterclockwise neighbor, and any time the ring
algorithm asks to send a message clockwise or counterclockwise, we send
to the appropriate neighbor in the hypercube. We can then argue that
for any execution of the hypercube algorithm there is a corresponding
execution of the ring algorithm and vice versa; this implies that the
worst-case time and message-complexity in the hypercube is the same
as in the ring.
It remains only to construct an embedding. For d = 0, d = 1, and d = 2,
the ring and hypercube are the same graph, so it’s easy. For larger
d, split the hypercube into two subcubes Q_{d−1}, consisting of the nodes
with coordinate vectors of the form 0x and 1x respectively. Use the previously
constructed embedding for d − 1 to embed a ring on each subcube,
using the same embedding for both. Pick a pair of matching edges
(0x, 0y) and (1x, 1y) and remove them, replacing them with (0x, 1x)
and (0y, 1y). We have now constructed an undirected Hamiltonian
cycle on Qd . Orient the edges to get a directed cycle, and we’re done.
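This recursive construction is essentially the reflected Gray code; here is a Python sketch of one concrete instance (the function name is ours):

def ring_on_hypercube(d):
    """Cyclic order on the 2^d hypercube nodes (as d-bit integers) in which
    consecutive nodes differ in exactly one coordinate: each iteration
    appends a reflected copy of the current path with a new bit set,
    mirroring the doubling step described above."""
    order = [0]
    for bit in range(d):
        order += [x | (1 << bit) for x in reversed(order)]
    return order

cycle = ring_on_hypercube(3)
assert sorted(cycle) == list(range(8))              # visits every node once
for a, b in zip(cycle, cycle[1:] + cycle[:1]):
    assert bin(a ^ b).count("1") == 1               # every hop flips one bit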
initial start-up cost to map the graph, adding to the time and
message complexity of the ring algorithm.
Solution
This is pretty much the same as a Chandy-Lamport snapshot [CL85], as
described in §6.3. The main difference is that instead of recording its state
upon receiving a stop message, a process shuts down the underlying protocol.
Pseudocode is given in Algorithm E.2. We assume that the initial stop order
takes the form of a stop message delivered by a process to itself.
1 initially do
2 stopped ← false
3 upon receiving stop do
4 if ¬stopped then
5 stopped ← true
6 send stop to all neighbors
7 replace all events in underlying protocol with no-ops
Solution
We need f < n/2.
To show that f < n/2 is sufficient, observe that we can use the oracle to
construct an eventually strong (♦S) failure detector.
Recall that ♦S has the property that there is some non-faulty process
that is eventually never suspected, and every faulty process is eventually
permanently suspected. Have each process broadcast the current value of its
leader oracle whenever it increases; when a process p receives i from some
process q, it stops suspecting q if i is greater than any value p has previously
seen, and starts suspecting all other processes. The guarantee that eventually
some non-faulty q gets a maximum value that never changes ensures that
eventually q is never suspected, and all other processes (including faulty
processes) are suspected. We can now use Algorithm 13.2 to solve consensus.
To show that f < n/2 is necessary, apply a partition argument. In
execution Ξ0 , processes n/2 + 1 through n crash, and processes 1 through
n/2 run with input 0 and with the oracle assigning value 1 to process 1 (and
no others). In execution Ξ1, processes 1 through n/2 crash, and processes
n/2 + 1 through n run with input 1 and with the oracle assigning value 2
to process n (and no others). In each of these executions, termination and
validity require that eventually the processes all decide on their respective
input values 0 and 1.
Now construct an execution Ξ2 , in which both groups of processes run
as in Ξ0 and Ξ1 , but no messages are exchanged between the groups until
after both have decided (which must occur after a finite prefix because this
execution is indistinguishable to the processes from Ξ0 or Ξ1 ). We now
violate agreement.
Solution
No such implementation is possible. The proof is by showing that if some
such implementation could work, we could solve asynchronous consensus with
1 crash failure, contradicting the Fischer-Lynch-Paterson bound [FLP85]
(see Chapter 11).
An implementation of consensus based on totally-ordered partial broad-
cast for k = 3n/4 is given in Algorithm E.3. In fact, k = 3n/4 is overkill
when f = 1; k > n/2 + f is enough.
1 first ← ⊥
2 for i ← 1 to n do
3 count[i] ← 0
4 value[i] ← ⊥
5 broadcast ⟨i, input⟩
6 upon receiving ⟨j, v⟩ do
7 if first = ⊥ then
8 first ← ⟨j, v⟩
9 send received(⟨j, v⟩) to all processes
Lemma E.2.1. In any execution of Algorithm E.3 with k > n/2 + f, there
is a unique pair ⟨j, v⟩ such that at least k − f non-faulty processes resend
received(⟨j, v⟩).
Proof. Because all processes that receive messages m1 and m2 through the
broadcast mechanism receive them in the same order, we can define a partial
order on messages by letting m1 < m2 if any process receives m1 before m2.
There are only finitely many messages, so there is at least one pair ⟨j, v⟩
that is minimal in this partial order. This message is received by at least k
processes, of which at least k − f are non-faulty. Each such process receives
⟨j, v⟩ before any other broadcast messages, so it sets first to ⟨j, v⟩ and resends
received(⟨j, v⟩).
To show that ⟨j, v⟩ is unique, observe that k − f > n/2 implies that if
there is some other pair ⟨j′, v′⟩ that is resent by k − f non-faulty processes,
then there is some process that resends both ⟨j, v⟩ and ⟨j′, v′⟩. But each
process resends at most one pair.
shared data:
1 waiting, atomic register, initially arbitrary
2 count, atomic counter, initially 0
3 Code for process i:
4 while true do
// trying
5 increment count
6 waiting ← i
7 while true do
8 if count = 1 then
9 break
10 if waiting = i + 1 (mod n) then
11 break
// critical
12 (do critical section stuff)
// exiting
13 decrement count
// remainder
14 (do remainder stuff)
Algorithm E.4: Peterson’s mutual exclusion algorithm using a counter
Solution
The proof that this works for two processes is essentially the same as in the
original algorithm. The easiest way to see this is to observe that process
pi sees count = 1 in Line 8 under exactly the same circumstances as it sees
present[¬i] = 0 in Line 8 in the original algorithm; and similarly with two
processes waiting is always set to the same value as waiting in the original
algorithm. So we can map any execution of Algorithm E.4 for two processes
to an execution of Algorithm 18.5, and all of the properties of the original
algorithm carry over to the modified version.
To show that the algorithm doesn’t work for three processes, we construct
an explicit bad execution:
1. p0 increments count
2. p1 increments count
3. p2 increments count
4. p0 writes 0 to waiting
5. p1 writes 1 to waiting
6. p2 writes 2 to waiting
Solution
One possible implementation is given in Algorithm E.5. This requires O(1)
space and O(1) steps per call to inc or read.
1 procedure inc
2 if c[1] = 1 then
// somebody already did inc
3 c[2] ← 1
4 else
5 c[1] ← 1
// maybe somebody else is doing inc
6 if splitter returns right or down then
7 c[2] ← 1
8 procedure read
9 if c[2] = 1 then
10 return 2
11 else if c[1] = 1 then
12 return 1
13 else
14 return 0
The implementation uses two registers c[1] and c[2] to represent the value
of the counter. Two additional registers implement a splitter object as in
Algorithm 18.6.
Claim: For any two calls to inc, at least one sets c[2] to 1. Proof: Suppose
otherwise. Then both calls are by different processes p and q (or else the
second call would see c[1] = 1) and both execute the splitter. Since a splitter
returns stop to at most one process, one of the two processes gets right or
down, and sets c[2].
It is also straightforward to show that a single inc running alone will set
c[1] but not c[2], since in this case the splitter will return stop.
Now we need to argue linearizability. We will do so by assigning lineariza-
tion points to each operation.
If some inc does not set c[2], assign it the step at which it sets c[1].
Assign each other inc the step at which it first sets c[2].
If every inc sets c[2], assign the first inc to set c[1] the step at which it
does so, and assign all others the first point during its execution interval at
which c[2] is nonzero.
For a read operation that returns 2, assign the step at which it reads c[2].
For a read operation that returns 1, assign the first point in the execution
interval after it reads c[2] at which c[1] = 1. For a read operation that
returns 0, assign the step at which it reads c[1].
This will assign the same linearization point to some operations; in this
case, put incs before reads and otherwise break ties arbitrarily.
These choices create a linearization which consists of (a) a sequence of
read operations that return 0, all of which are assigned linearization points
before the first step at which c[1] = 1; (b) the first inc operation that
sets c[1]; (c) a sequence of read operations that return 1, all of which are
linearized after c[1] = 1 but before c[2] = 1; (d) some inc that is either
the first to set c[2] or spans the step that sets c[2]; and (e) additional inc
operations together with read operations that all return 2. Since each read
returns the minimum of 2 and the number of incs that precede it, this is a
correct linearization.
Solution
The worst-case step complexity of an operation is Θ(n).
For the upper bound, implement a counter on top of snapshots (or just
collect), and have read compute log∗ of whatever value is read.
For the lower bound, observe that a slow counter has the perturbability
property needed for the JTT proof. Given an execution of the form Λ_k Σ_k π
as described in Chapter 21, we can always insert some sequence of inc
operations between Λ_k and Σ_k that will change the return value of π. The
number of incs needed will be the number needed to raise log∗ v, plus an
extra n to overcome the possibility of pending incs in Σ_k being linearized
before or after π. Since this object is perturbable, and the atomic registers
we are implementing it from are historyless, JTT applies and gives an Ω(n)
lower bound on the cost of read in the worst case.
• The operation close(i, j) sets Ai to zero and adds the previous value
to Aj . It is equivalent to atomically executing transfer(i, j, read(i)).
Solution
1. The consensus number of the object is infinite. Initialize A0 to 1 and
the remaining Ai to 0. We can solve ID consensus by having each process
i (where i > 0) execute close(0, i) and then apply read to scan all
the Aj values for itself and the other processes. Whichever process gets
the 1 wins.
Solution
You will need exactly n registers (Θ(n) is also an acceptable answer).
For the upper bound, have each process write its ID to its own register,
and use a double-collect snapshot to read all of them. This uses exactly n
registers. The double-collect snapshot is wait-free because after each process
has called announce once, the contents of the registers never change, so read
finishes after O(n) collects or O(n2 ) register reads. It’s linearizable because
double-collect snapshot returns the exact contents of the registers at some
Give such a protocol and prove that it works, or show that no such
protocol is possible.
Solution
It turns out that this problem is a good example of what happens if you
don't remember to include some sort of validity condition. As pointed out in
several student solutions, having each process pick a fixed constant xi the
first time it updates works.
Here is a protocol that also works, and satisfies the validity condition that
the common output was some process’s input (which was not required in the
problem statement). When pi takes a step, it sets xi to max(xi , x(i−1) mod n ).
To show that this works, we argue by induction that the maximum value
eventually propagates to all processes. Let x = xi be the initial maximum
value. The induction hypothesis is that for each j ∈ {0, . . . , n − 1}, eventually
all processes in the range i through i + j (mod n) hold value x forever.
Suppose that the hypothesis holds for j; to show that it holds for j + 1,
start in a configuration where xi through xi+j are all x. No transition can
change any of these values, because taking the max of x and any other value
yields x. Because each process is scheduled infinitely often, eventually pi+j+1
takes a step; when this happens, xi+j+1 is set to max(x, xi+j+1 ) = x.
Since the hypothesis holds for all j ∈ {0, . . . , n − 1}, it holds for j = n−1;
but this just says that eventually all n processes hold x forever.
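As a sanity check, here is a small Python simulation of this argument under a round-robin schedule, which is fair (the function name is ours):

def stabilizes(x):
    """Each scheduled process i sets x[i] = max(x[i], x[(i-1) mod n]);
    after at most n round-robin passes every process holds the maximum."""
    x, n = list(x), len(x)
    for _ in range(n):
        for i in range(n):
            x[i] = max(x[i], x[(i - 1) % n])
    return x

assert stabilizes([3, 1, 4, 1, 5]) == [5, 5, 5, 5, 5]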
Solution
We need one round. Every process transmits its input to all processes,
including itself. From the all-or-nothing property, all processes receive the
same set of messages. From the assumption that some process is not faulty
in this round, this set is nonempty. So the processes can reach agreement by
applying any consistent rule to choose an input from the set.
Solution
The consensus number is 1.
Proof: We can implement it from atomic snapshot, which can be imple-
mented from atomic registers, which have consensus number 1.
For my first write(v) operation, write v to my component of the snapshot;
for subsequent write(v) operations, write fail. For a read operation, take
a snapshot and return (a) ⊥ if all components are empty; (b) v if exactly
one component is non-empty and has value v; and (c) fail if more than one
component is non-empty or any component contains fail.
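A minimal Python sketch of this construction, with the snapshot simulated by copying the array (class and method names are ours; a real implementation would use an atomic snapshot built from registers, as described):

class WriteDetector:
    """read() returns None (standing in for the empty value) if nothing was
    written, the unique value if exactly one write occurred, and "fail"
    otherwise, per the rules above."""
    def __init__(self, n):
        self.a = [None] * n              # one snapshot component per process

    def write(self, i, v):
        # first write by i stores v; any subsequent write by i stores "fail"
        self.a[i] = v if self.a[i] is None else "fail"

    def read(self):
        snap = list(self.a)              # stand-in for an atomic snapshot
        vals = [v for v in snap if v is not None]
        if not vals:
            return None
        if len(vals) == 1 and vals[0] != "fail":
            return vals[0]
        return "fail"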
Appendix F
1. Your name.
3. Whether you are taking the course as CPSC 465 or CPSC 565.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
where they have a cookie but child i + 1 does not, child i gives one cookie to
child i + 1. If child i + 1 already has a cookie, or child i has none, nothing
happens. We assume that a fairness condition guarantees that even though
some children are fast, and some are slow, each of them takes a step infinitely
often.
1. Show that after some finite number of steps, every child has exactly
one cookie.
Solution
1. First observe that in any configuration reachable from the initial con-
figuration, child 0 has k cookies, n − k of the remaining children have
one cookie each, and the rest have zero cookies. Proof: Suppose we
are in a configuration with this property, and consider some possible
step that changes the configuration. Let i be the child that takes the
step. If i = 0, then child i goes from k to k − 1 cookies, and child 1
goes from 0 to 1 cookies, increasing the number of children with one
cookie to n − k + 1. If i > 0, then child i goes from 1 to 0 cookies and
child i + 1 from 0 to 1 cookies, with k unchanged. In either case, the
invariant is preserved.
Now let us show that k must eventually drop as long as some cookie-less
child remains. Let i be the smallest index such that the i-th child has
no cookie. Then after finitely many steps, child i − 1 takes a step and
gives child i a cookie. If i − 1 = 0, k drops. If i − 1 > 0, then the
leftmost 0 moves one place to the left. It can do so only finitely many
times until i = 1 and k drops the next time child 0 takes a step. It
follows that after finitely many steps, k = 1, and by the invariant all
n − 1 remaining children also have one cookie each.
cookie i, let x_i^t be the position of the i-th cookie after t asynchronous
rounds, where an asynchronous round is the shortest interval in which
each child takes at least one step.
Observe that no child j > 0 ever gets more than one cookie, since no
step adds a cookie to a child that already has one. It follows that cookie
0 never moves, because if child 0 has one cookie, so does everybody
else (including child 1). We can thus ignore the fact that the children
are in a cycle and treat them as being in a line 0 . . . n − 1.
We will show by induction on t that, for all i and t, x_i^t ≥ y_i^t =
max(0, min(i, z_i^t)), where z_i^t = t + 2(i − n + 1).
Proof: The base case is when t = 0. Here x_i^0 = 0 for all i. We also have
z_i^0 = 2(i − n + 1) ≤ 0, so y_i^0 = max(0, min(i, z_i^0)) = max(0, z_i^0) = 0. So
the induction hypothesis holds with x_i^0 = y_i^0 = 0.
Now suppose that the induction hypothesis holds for t. For each i,
there are several cases to consider:
(a) x_i^t = x_{i+1}^t = 0. In this case cookie i will not move, because it's not
at the top of child 0's stack. But from the induction hypothesis
we have that x_{i+1}^t = 0 implies z_{i+1}^t = t + 2(i + 1 − n + 1) ≤ 0,
which gives z_i^t = z_{i+1}^t − 2 ≤ −2. So z_i^{t+1} = z_i^t + 1 ≤ −1 and
y_i^{t+1} = 0, and the induction hypothesis holds for x_i^{t+1}.
(b) x_i^t = i. Then even if cookie i doesn't move (and it doesn't), we
have x_i^{t+1} ≥ x_i^t = i ≥ y_i^{t+1}.
(c) x_i^t < i and x_{i+1}^t = x_i^t + 1. Again, even if cookie i doesn't move, we
still have x_i^{t+1} ≥ x_i^t = x_{i+1}^t − 1 ≥ y_{i+1}^t − 1 ≥ t + 2(i + 1 − n + 1) − 1 =
t + 2(i − n + 1) + 1 > y_i^t; since y_i can increase by at most 1 per round,
this gives x_i^{t+1} ≥ y_i^t + 1 ≥ y_i^{t+1}.
(d) x_i^t < i and x_{i+1}^t > x_i^t + 1. Nothing is blocking cookie i, so it moves:
x_i^{t+1} = x_i^t + 1 ≥ y_i^t + 1 ≥ y_i^{t+1}.
F.1.2 Eccentricity
Given a graph G = (V, E), the eccentricity ε(v) of a vertex v is the
maximum distance max_{v′} d(v, v′) from v to any vertex in the graph.
Suppose that you have an anonymous asynchronous message-passing
system with no failures whose network forms a tree.
1. Give an algorithm that allows each node in the network to compute its
eccentricity.
Solution
1. Pseudocode is given in Algorithm F.1. For each edge vu, the algorithm
sends a message d from v to u, where d is the maximum length of
any simple path starting with uv. This can be computed as soon as v
knows the maximum distances from all of its other neighbors u′ ≠ u.
1 initially do
2 notify()
3 upon receiving d from u do
4 d[u] ← d
5 notify()
6 procedure notify()
7 foreach neighbor u do
8 if ¬notified[u] and d[u′] ≠ ⊥ for all u′ ≠ u then
9 Send 1 + max_{u′≠u} d[u′] to u
10 notified[u] ← true
algorithm computes the correct values, we will prove the invariant that
d_v[u] ∈ {⊥, ℓ_v[u]} always, and for any message d in transit from u to v,
d = ℓ_v[u].
In the initial configuration, d_v[u] = ⊥ for all v and u, and there are no
messages in transit. So the invariant holds.
Now let us show that calling notify at some process v preserves the
invariant. Because notify() does not change d_v, we need only show
that the messages it sends contain the correct distances.
Suppose notify() causes v to send a message d to u. Then d = 1 +
max_{u′≠u} d_v[u′] = 1 + max_{u′≠u} ℓ_v[u′], because d_v[u′] ≠ ⊥ for all neighbors
u′ ≠ u by the condition on the if statement, and thus d_v[u′] = ℓ_v[u′] for
all u′ ≠ u by the invariant.
So the invariant will continue to hold in this case provided ℓ_u[v] =
1 + max_{u′≠u} ℓ_v[u′]. The longest simple path starting with uv either
consists of uv alone, or is of the form uvw . . . for some neighbor w of v
with w ≠ u. In the former case, v has no other neighbors u′, in which
case d = 1 + max_{u′≠u} ℓ_v[u′] = 1 + 0 = 1, the correct answer. In the
latter case, d = 1 + max_{u′≠u} ℓ_v[u′] = 1 + ℓ_v[w], again the length of the
longest path starting with uv.
This shows that notify preserves the invariant. We must also show
that assigning d_v[u] ← d upon receiving d from u does so. But in this
case we know from the invariant that d = ℓ_v[u], so assigning this value
to d_v[u] leaves d_v[u] ∈ {⊥, ℓ_v[u]} as required.
3. First let’s observe that at most one message is sent in each direction
across each edge, for a total of 2|E| = 2(n − 1) messages. This is
optimal, because if in some execution we do not send a message across
some edge uv, then we can replace the subtree rooted at u with an
arbitrarily deep path, and obtain an execution indistinguishable to v
in which its eccentricity is different from whatever it computed.
For time complexity (and completion!) we'll argue by induction on
ℓ_v[u] that we send a message across uv by time ℓ_v[u] − 1.
If ℓ_v[u] = 1, then u is a leaf; as soon as notify is called in its initial
computation event (which we take as occurring at time 0), u notices it
has no neighbors other than v and sends a message to v.
If ℓ_v[u] > 1, then since ℓ_v[u] = 1 + max_{v′≠v} ℓ_u[v′], we have ℓ_u[v′] ≤
ℓ_v[u] − 1 for all neighbors v′ ≠ v of u, which by the induction hypothesis
means that each such neighbor v′ sends a message to u no later than
time ℓ_v[u] − 2. These messages all arrive at u no later than time
ℓ_v[u] − 1; when the last one is delivered, u sends a message to v.
It follows that the last time a message is sent is no later than time
max_{uv}(ℓ_v[u] − 1), and so the last delivery event occurs no later than
time max_{uv} ℓ_v[u]. This is just the diameter D of the tree, giving a
worst-case time complexity of exactly D.
To show that this is optimal, consider an execution of some hypothetical
algorithm that terminates by time D − 1 in the worst case. Let u and
v be nodes such that d(u, v) = D. Then there is an execution of this
algorithm in which no chain of messages passes from u to v, meaning that
no event of u is causally related to any event of v. So we can replace
u with a pair uw of adjacent nodes with d(w, v) = d(u, v) + 1, which
changes ε(v) but leaves an execution that is indistinguishable to v
from the original. It follows that v returns an incorrect value in some
executions, and this hypothetical algorithm is not correct. So time
complexity D is the best possible in the worst case.
Solution
For sufficiency, ignore the extra edges and use Hirschberg-Sinclair [HS80]
(see §5.2.2).
For necessity, we’ll show that an algorithm that solves leader election in
this system using at most T (n) messages can be modified to solve leader
election in a standard ring without the extra edges using at most 3T (n)
messages. The idea is that whenever a process i attempts to send to i + 3,
we replace the message with a sequence of three messages relayed from i
to i + 1, i + 2, and then i + 3, and similarly for messages sent in the other
direction. Otherwise the original algorithm is unmodified. Because both
systems are asynchronous, any admissible execution in the simulated system
has a corresponding admissible execution in the simulating system (replace
each delivery event by three delivery events in a row for the relay messages)
and vice versa (remove the initial two relay delivery events for each message
and replace the third delivery event with a direct delivery event). So in
particular if there exists an execution in the simulating system that requires
Ω(n log n) messages, then there is a corresponding execution in the simulated
system that requires at least Ω(n log n/3) = Ω(n log n) messages as well.
1 procedure write(A, v)
2 atomically do
3 A[r] ← v; r ← (r + 1) mod n
4 procedure read(A)
5 return A[i]
Algorithm F.2: Rotor array: code for process i
Solution
First let’s show that it is at least 2, by exhibiting an algorithm that uses
a single rotor array plus two atomic registers to solve 2-process wait-free
consensus.
1 procedure consensus(v)
2 input[i] ← v
3 write(A, i)
4 i0 ← read(A)
5 if i0 = i then
// Process 0 wrote first
6 return input[0]
7 else
// Process 1 wrote first
8 return input[1]
The algorithm is given as Algorithm F.3. Each process i first writes its
input value to a single-writer register input[i]. The process then writes its
ID to the rotor array. There are two cases:
1. If process 0 writes first, then process 0 reads 0 and process 1 reads
1. Thus both processes see i0 = i and return input[0], which gives
agreement, and validity because input[0] is then equal to 0’s input.
2. If process 1 writes first, then process 0 reads 1 and process 1 reads
either 0 (if 0 wrote quickly enough) or ⊥ (if it didn’t). In either case,
both processes see i0 6= i and return input[1].
Now let us show that a rotor array can’t be used to solve wait-free
consensus with three processes. We will do the usual bivalence argument,
Solution
1. It’s probably possible to do this with some variant of ABD, but getting
linearizability when there are multiple concurrent insert operations
will be tricky.
Instead, we’ll observe that it is straightforward to implement a set
register using a shared-memory snapshot: each process writes to A[i]
the set of all values it has ever inserted, and a read consists of taking
a snapshot and then taking the union of the values. Because we can
implement snapshots using atomic registers, and we can implement
atomic registers in a message-passing system with f < n/2 crash failures
using ABD, we can implement this construction in a message-passing
system with f < n/2 failures.
2. This we can’t do. The problem is that an ordered set register can solve
agreement: each process inserts its input, and the first input wins. But
FLP says we can’t solve agreement in an asynchronous message-passing
system with one crash failure.
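Returning to the first part, here is a rough Python sketch of the snapshot-based set register, with the snapshot replaced by a trivial sequential stand-in; in the real construction the snapshot would itself be built from atomic registers, and those from ABD:

class SetRegister:
    def __init__(self, n):
        self.seg = [set() for _ in range(n)]     # one segment per process

    def insert(self, i, v):
        # process i updates only its own segment (a single-writer update)
        self.seg[i] = self.seg[i] | {v}

    def read(self):
        snap = [frozenset(s) for s in self.seg]  # "snapshot" of all segments
        return set().union(*snap)                # union of everything inserted

r = SetRegister(3)
r.insert(0, "a"); r.insert(2, "b")
assert r.read() == {"a", "b"}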
Solution
We can solve agreement using the k-bounded failure detector for n ≥ 2
processes if and only if f ≤ k and f < n/2.
Proof:
If k ≥ f , then every faulty process is eventually permanently suspected,
and the k-bounded failure detector is equivalent to the ♦S failure detector.
The Chandra-Toueg protocol [CT96] then solves consensus for us provided
f < n/2.
1 procedure fetchAndMax(r, 0 : x)
2 if switch = 0 then
3 return 0 : fetchAndMax(left, x)
4 else
5 return 1 : fetchAndMax(right, 0)
6 procedure fetchAndMax(r, 1 : x)
7 v ← fetchAndMax(right, x)
8 if TAS(switch) = 0 then
9 return 0 : fetchAndMax(left, 0)
10 else
11 return 1 : v
Algorithm F.4 replaces the switch bit in the max register implementation
from Algorithm 22.2 with a test-and-set, and adds some extra machinery to
Solution
Here is a bad execution (there are others). Let k = 1, and let π1 do
fetchAndMax(01) and π2 do fetchAndMax(10). Run these operations
concurrently as follows:
F.3.2 Median
Define a median register as an object r with two operations addSample(r, v),
where v is any integer, and computeMedian(r). The addSample operation
adds a sample to the multiset M of integers stored in the register, which
is initially empty. The computeMedian operation returns a median of this
multiset, defined as a value x with the property that (a) x is in the multiset;
(b) at least |M|/2 values v in the multiset are less than or equal to x; (c) at
least |M|/2 values v in the multiset are greater than or equal to x.
For example, if we add the samples 1, 1, 3, 5, 5, 6, in any order, then a
subsequent computeMedian can return either 3 or 5.
Solution
For the upper bound, we can do it with O(n) registers using any linear-space
snapshot algorithm (for example, Afek et al. [AAD+ 93]). Each process stores
in its own segment of the snapshot object the multiset of all samples added
by that process; addSample just adds a new sample to the process’s segment.
For computeMedian, take a snapshot, then take the union of all the multisets,
then compute the median of this union. Linearizability and wait-freedom
of both operations are immediate from the corresponding properties of the
snapshot object.
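Here is a quick Python sketch of this upper bound, again with a sequential stand-in for the snapshot object; the index chosen in compute_median is one position that always satisfies conditions (a)-(c):

from collections import Counter

class MedianRegister:
    def __init__(self, n):
        self.seg = [Counter() for _ in range(n)]   # one multiset per process

    def add_sample(self, i, v):
        self.seg[i][v] += 1                        # update own segment only

    def compute_median(self):
        union = sum(self.seg, Counter())           # union of all multisets
        samples = sorted(union.elements())
        return samples[(len(samples) - 1) // 2]    # a valid median position

m = MedianRegister(2)
for v in (1, 1, 3, 5, 5, 6):
    m.add_sample(0, v)
assert m.compute_median() in (3, 5)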
For the lower bound, use JTT [JTT00]. Observe that both atomic
registers and resettable test-and-sets are historyless: for both types, the new
state after an operation doesn’t depend on the old state. So JTT applies if
we can show that the median register is perturbable.
Suppose that we have a schedule Λk Σk π in which Λk consists of an
arbitrary number of median-register operations of which at most k are
incomplete, Σk consists of k pending base object operations (writes, test-and-
sets, or test-and-set resets) covering k distinct base objects, and π is a read
operation by a process not represented in Λk Σk . We need to find a sequence
of operations γ that can be inserted between Λk and Σk that changes the
outcome of π.
Let S be the multiset of all values appearing as arguments to addSample
operations that start in Λk or Σk . Let x = max S (or 0 if S is empty), and let
γ consist of |S| + 1 addSample(r, x + 1) operations. Write T for the multiset
of |S| + 1 copies of x + 1. Then in any linearization of Λk γΣk π, the multiset
U of samples contained in r when π executes includes at least all of T and at
most all of S; this means that a majority of values in U are equal to x + 1,
and so the median is x + 1. But x + 1 does not appear in S, so π can’t
return it in Λk Σk π. It follows that a median register is in fact perturbable,
and JTT applies, which means that we need at least Ω(n) base objects to
implement a median register.
1 procedure TAS(i)
2 myPosition ← 0
3 while true do
4 otherPosition ← read(a¬i )
5 x ← myPosition − otherPosition
6 if x ≡ 2 (mod m) then
7 return 0
8 else if x ≡ −1 (mod m) do
9 return 1
10 else if fair coin comes up heads do
11 myPosition ← (myPosition + 1) mod m
12 write(ai , myPosition)
1. An oblivious adversary?
2. An adaptive adversary?
Solution
For the oblivious adversary, we can quickly rule out m < 5, by showing that
there is an execution in each case where both processes return 0:
for any fixed k, because the coin-flips are uncorrelated with the oblivious
adversary’s choice of which process is fast). Then for k sufficiently large, the
fast process eventually sees a0 − a1 congruent to either 2 or −1 and returns.
Since this event occurs with independent nonzero probability in each interval
of length 2k, eventually it occurs.2
Once one process has terminated, the other increments myPosition in-
finitely often, so it too eventually sees a gap of 2 or −1.
For the adaptive adversary, the adversary can prevent the algorithm from
terminating. Starting from a state in which both processes are about to
read and a0 = a1 = k, run p0 until it is about to write (k + 1) mod m to a0
(unlike the oblivious adversary, the adaptive adversary can see when this will
happen). Then run p1 until it is about to write (k + 1) mod m to a1 . Let
both writes go through. We are now in a state in which both processes are
about to read, and a0 = a1 = (k + 1) mod m. So we can repeat this strategy
forever.
audience, you can assume that your listeners know at least everything
that we’ve talked about so far in the class.
3. A description of where this result fits into the literature (e.g., solves
an open problem previously proposed in [...], improves on the previous
best running time for an algorithm from [...], gives a lower bound or
impossibility result for a problem previously proposed by [...], opens
up a new area of research for studying [...]), and why it is interesting
and/or hard.
You do not have to prepare slides for your presentation if you would
prefer to use the blackboard, but you should practice your talk in
advance to make sure it fits in the allocated time. The instructor will be
happy to offer feedback on draft versions if available far enough before the
actual presentation date.
Relevant dates:
2016-04-22 Last date to send draft slides or arrange for a practice presen-
tation with the instructor if you want guaranteed feedback.
Solution
The consensus number of this object is 2.
For two processes, have each process i write its input to a standard
atomic register r[i], and then write its ID to a shared second-to-last-value
register s. We will have whichever process writes to s first win. After writing,
process i can detect which process wrote first by reading s once, because
it either sees ⊥ (meaning the other process has not written yet) or it sees
the identity of the process that wrote first. In either case it can return the
winning process’s input.
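Here is a sequential Python sketch of this two-process protocol, assuming a second-to-last-value register that returns ⊥ (None here) until at least two values have been written; it replays the two possible orders in which the processes can write to s:

class SecondToLast:
    def __init__(self):
        self.hist = []
    def write(self, v):
        self.hist.append(v)
    def read(self):                  # second-to-last value written, or ⊥
        return self.hist[-2] if len(self.hist) >= 2 else None

def run(order, inputs):
    r, s, outs = {}, SecondToLast(), {}
    for i in order:                  # each process writes its input, then its ID
        r[i] = inputs[i]
        s.write(i)
    for i in (0, 1):                 # then reads s once and decides
        w = s.read()
        winner = i if w is None else w
        outs[i] = r[winner]
    return outs

assert run([0, 1], {0: "x", 1: "y"}) == {0: "x", 1: "x"}
assert run([1, 0], {0: "x", 1: "y"}) == {0: "y", 1: "y"}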
For three processes, the usual argument gets us to a configuration C
where all three processes are about to execute operations x, y, and z on
the same object, where each operation moves from a bivalent to a univalent
state. Because we know that this object can’t be a standard atomic register,
it must be a second-to-last register. We can also argue that all of x, y, and
z are writes, because if one of them is not, the processes that don’t perform
it can’t tell if it happened or not.
Suppose that Cx is 0-valent and Cy is 1-valent. Then Cxyz is 0-valent
and Cyz is 1-valent. But these configurations are indistinguishable to any
process other than the one performing x. It follows that the second-to-last
register can't solve consensus for three processes.
Solution
Here is an algorithm.
If there are two processes p and q with the same ID that are adjacent
to each other, they can detect this in the initial configuration, and transmit
this fact to all the other processes by flooding.
If these processes p and q are not adjacent, we will need some other
mechanism to detect them. Define the extended ID of a process as its own
ID followed by a list of the IDs of its neighbors in some fixed order. Order
the extended IDs lexicographically, so that a process with a smaller ID also
has a smaller extended ID.
Suppose now that p and q are not adjacent and have the same extended
ID. Then they share the same neighbors, and each of these neighbors will see
that p and q have duplicate IDs. So we can do an initial round of messages
in which each process transmits its extended ID to its neighbors; if p and q
observe from these messages that their own ID is duplicated, they can again
notify all the processes by flooding to return that there are two leaders.
The remaining case is that p and q have distinct extended IDs, or that
only one process with the minimum ID exists. In either case we can run any standard
broadcast-based leader-election algorithm, using the extended IDs, which will
leave us with a tree rooted at whichever process has the minimum extended
ID. This process can then perform convergecast to detect if there is another
process with the same ID, and perform broadcast to inform all processes of
this fact.
Solution
The implementation is correct.
If one process runs alone, it sets A[i][IDi ] for each i, sees 0 in door, then
sees 0 in each location A[i][¬IDi ] and wins. So we have property (a).
Now suppose that some process with ID p wins in an execution that may
involve other processes. Then p writes A[i][pi ] for all i before observing 0 in
door, which means that it sets all these bits before any process writes 1 to
door. If some other process q also wins, then there is at least one position i
where pi = ¬qi , and q reads A[i][pi ] after writing 1 to door. But then q sees
1 in this location and loses, a contradiction.
shared data:
1 one-bit atomic registers A[i][j] for i = 0 . . . ⌈lg n⌉ − 1 and j ∈ {0, 1}, all
initially 0
2 one-bit atomic register door, initially 0
3 procedure splitter(ID)
4 for i ← 0 to k − 1 do
5 A[i][IDi ] ← 1
6 if door = 1 then
7 return lose
8 door ← 1
9 for i ← 0 to k − 1 do
10 if A[i][¬IDi ] = 1 then
11 return lose
12 return win
Algorithm F.6: Splitter using one-bit registers
Solution
Disproof by counterexample: Fix some f , and consider a graph with two
processes p0 and p1 connected by an edge. Let p0 start with 0 and p1 start
with 1. Then p0 's next state is f (0, 0, 1) = ¬f (1, 1, 0) ≠ f (1, 1, 0), which is
p1 ’s next state. So either p0 still has 0 and p1 still has 1, in which case we
never make progress; or they swap their bits, in which case we can apply the
same analysis with p0 and p1 reversed to show that they continue to swap
back and forth forever. In either case the system does not converge.
Appendix G
1. Your name.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
evil, and knows the identities of all of its neighbors. However, the processes
do not know the number of processes n or the diameter of the network D.
Give a protocol that allows every process to correctly return the number
of evil processes no later than time D. Your protocol should only return a
value once for each process (no converging to the correct answer after an
initial wrong guess).
Solution
There are a lot of ways to do this. Since the problem doesn’t ask about
message complexity, we’ll do it in a way that optimizes for algorithmic
simplicity.
At time 0, each process initiates a separate copy of the flooding algorithm
(Algorithm 3.1). The message ⟨p, N (p), e⟩ it distributes consists of its own
identity, the identities of all of its neighbors, and whether or not it is evil.
In addition to the data for the flooding protocol, each process tracks a
set I of all processes it has seen that initiated a protocol and a set N of all
processes that have been mentioned as neighbors. The initial values of these
sets for process p are {p} and N (p), the neighbors of p.
Upon receiving a message ⟨q, N (q), e⟩, a process adds q to I and N (q) to
N . As soon as I = N , the process returns a count of all processes for which
e = true.
Termination by D: Follows from the same analysis as flooding. Any
process at distance d from p has p ∈ I by time d, so I is complete by time D.
Correct answer: Observe that N = ⋃i∈I N (i) always. Suppose that there
is some process q that is not in I. Since the graph is connected, there is a
path from p to q. Let r be the last node in this path in I, and let s be the
following node. Then s ∈ N \ I and N ≠ I. By contraposition, if I = N
then I contains all nodes in the network, and so the count returned at this
time is correct.
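For concreteness, here is a synchronous-round Python simulation of this protocol; the function name and graph encoding are invented for the sketch:

def count_evil(adj, evil):
    # adj: dict id -> set of neighbor ids; evil: dict id -> bool.
    # know[p] holds the triples <q, N(q), e_q> that p has received so far.
    know = {p: {(p, frozenset(adj[p]), evil[p])} for p in adj}
    answer = {}
    for _ in range(len(adj)):                  # diameter many rounds suffice
        for p in adj:                          # test the condition I = N
            if p not in answer:
                I = {q for (q, _, _) in know[p]}
                N = set().union(*(nbrs for (_, nbrs, _) in know[p]))
                if I == N:
                    answer[p] = sum(e for (_, _, e) in know[p])
        nxt = {p: set(know[p]) for p in adj}   # one round of flooding
        for p in adj:
            for q in adj[p]:
                nxt[q] |= know[p]
        know = nxt
    return answer

adj = {0: {1}, 1: {0, 2}, 2: {1}}              # a path 0 - 1 - 2
assert count_evil(adj, {0: True, 1: False, 2: True}) == {0: 2, 1: 2, 2: 2}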
its neighbors as its parent, and following the parent pointers always gives a
path of minimum total weight to the initiator.1
Give a protocol that solves this problem with reasonable time, message,
and bit complexity, and show that it works.
Solution
There’s an ambiguity in the definition of total weight: does it include the
weight of the initiator and/or the initial node in the path? But since these
values are the same for all paths to the initiator from a given process, they
don’t affect which is lightest.
If we don’t care about bit complexity, there is a trivial solution: Use an
existing BFS algorithm followed by convergecast to gather the entire structure
of the network at the initiator, run your favorite single-source shortest-path
algorithm there, then broadcast the results. This has time complexity O(D)
and message complexity O(DE) if we use the BFS algorithm from §4.3. But
the last couple of messages in the convergecast are going to be pretty big.
A solution by reduction: Suppose that we construct a new graph G0
where each weight-2 node u in G is replaced by a clique of nodes u1 , u2 , . . . uk ,
with each node in the clique attached to a different neighbor of u. We then
run any breadth-first search protocol of our choosing on G0 , where each
weight-2 node simulates all members of the corresponding clique. Because
any path that passes through a clique picks up an extra edge, each path in
the breadth-first search tree has a length exactly equal to the sum of the
weights of the nodes other than its endpoints.
A complication is that if I am simulating k nodes, between them they
may have more than one parent pointer. So we define u.parent to be ui .parent
where ui is a node at minimum distance from the initiator in G0 . We also
re-route any incoming pointers to uj ≠ ui to point to ui instead. Because ui
was chosen to have minimum distance, this never increases the length of any
path, and the resulting modified tree is still a shortest-path tree.
Adding nodes blows up |E′|, but we don't need to actually send messages
between different nodes ui represented by the same process. So if we use the
§4.3 algorithm again, we only send up to D messages per real edge, giving
O(D) time and O(DE) messages.
If we don’t like reductions, we could also tweak one of our existing
algorithms. Gallager’s layered BFS (§4.2) is easily modified by changing the
1
Clarification added 2014-01-26: The actual number of hops is not relevant for the
construction of the shortest-path tree. By shortest path, we mean path of minimum total
weight.
Solution
The par solution for this is an Ω(√f ) lower bound and O(f ) upper bound. I
don't know if it is easy to do better than this.
For the lower bound, observe that the adversary can simulate an ordinary
crash failure by jamming a process in every round starting in the round it
crashes in. This means that in an r-round protocol, we can simulate k crash
failures with kr jamming faults. From the Dolev-Strong lower bound [DS83]
(see also Chapter 9), we know that there is no r-round protocol with k = r
crash failures, so there is no r-round protocol with r² jamming faults.
This gives a lower bound of ⌊√f⌋ + 1 on the number of rounds needed to
solve synchronous agreement with f jamming faults.3
2
Clarifications added 2014-02-10: We assume that processes don’t know that they are
being jammed or which messages are lost (unless the recipient manages to tell them that a
message was not delivered). As in the original model, we assume a complete network and
that all processes have known identities.
3
Since Dolev-Strong only needs to crash one process per round, we don't really need
the full r jamming faults for processes that crash late. This could be used to improve
the constant for this argument.
For the upper bound, have every process broadcast its input every round.
After f + 1 rounds, there is at least one round in which no process is jammed,
so every process learns all the inputs and can take, say, the majority value.
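Here is a small Python simulation of this upper bound, with the adversary's jamming schedule supplied explicitly; all names are invented for the sketch:

def agree(inputs, jam):
    # inputs: one value per process; jam[r]: set of processes whose round-r
    # broadcasts are jammed; len(jam) = f + 1 rounds are executed.
    n = len(inputs)
    heard = [{(i, inputs[i])} for i in range(n)]
    for r in range(len(jam)):
        msgs = [set(h) for h in heard]     # everyone broadcasts its view
        for i in range(n):
            if i not in jam[r]:            # a jammed sender reaches nobody
                for j in range(n):
                    heard[j] |= msgs[i]
    # after the jam-free round everyone has heard every input, so any
    # deterministic rule (here: the middle of the sorted value list,
    # which is the majority value when one exists) gives agreement
    return [sorted(v for _, v in h)[n // 2] for h in heard]

# f = 2 jamming faults, so f + 1 = 3 rounds; process 0 is jammed twice.
assert agree([1, 1, 0], [{0}, {0}, set()]) == [1, 1, 1]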
Solution
The relevant bound here is the requirement that the network have enough
connectivity that the adversary can’t take over half of a vertex cut (see
§10.1.3). This is complicated slightly by the requirement that the faulty
nodes be contiguous.
The smallest vertex cut in a sufficiently large torus consists of the four
neighbors of a single node; however, these nodes are not connected. But we
can add a third node to connect two of them (see Figure G.1).
By adapting the usual lower bound we can use this construction to show
that f = 3 faults are enough to prevent agreement when m ≥ 3. The question
4
Problem modified 2014-02-03. In the original version, it asked to compute f for all m,
but there are some nasty special cases when m is small.
Solution
We can tolerate f < n/2, but no more.
If f < n/2, the following algorithm works: Run Paxos, where each
process i waits to learn that it is non-faulty, then acts as a proposer for
proposal number i. The highest-numbered non-faulty process then carries
out a proposal round that succeeds because no higher proposal is ever issued,
and both the proposer (which is non-faulty) and a majority of accepters
participate.
If f ≥ n/2, partition the processes into two groups of size ⌊n/2⌋, with
any leftover process crashing immediately. Make all of the processes in both
groups non-faulty, and tell each of them this at the start of the protocol.
Now do the usual partitioning argument: Run group 0 with inputs 0 with
no messages delivered from group 1 until all processes decide 0 (we can do
this because the processes can’t distinguish this execution from one in which
the group 1 processes are in fact faulty). Run group 1 similarly until all
processes decide 1. We have then violated agreement, assuming we didn’t
previously violate termination or validity.
Solution
First observe that ♦S can simulate ♦Sk for any k by having n − k processes
ignore the output of their failure detectors. So we need f < n/2 by the usual
lower bound on ♦S.
If f ≥ k, we are also in trouble. The f > k case is easy: If there exists
a consensus protocol for f > k, then we can transform it into a consensus
protocol for n − k processes and f − k failures, with no failure detectors at all,
by pretending that there are an extra k processes with real failure detectors
that crash immediately. The FLP impossibility result rules this out.
If f = k, we have to be a little more careful. By immediately crashing
f − 1 processes with real failure detectors, we can reduce to the f = k = 1
case. Now the adversary runs the FLP strategy. If no processes crash, then
all n − k + 1 surviving processes report no failures; if it becomes necessary to
crash a process, this becomes the one remaining process with the real failure
detector. In either case the adversary successfully prevents consensus.
So let f < k. Then we have weak completeness, because every faulty
process is eventually permanently suspected by at least k − f > 0 processes.
We also have weak accuracy, because it is still the case that some process
is eventually permanently never suspected by anybody. By boosting weak
completeness to strong completeness as described in §13.2.3, we can turn
our failure detector into ♦S, meaning we can solve consensus precisely when
f < min(k, n/2).
Solution
No. We can adapt the lower bound on the session problem from §7.4.2 to
apply in this model.
Consider an execution of an algorithm for the session problem in which
each message is delivered exactly one time unit after it is sent. Divide
it as in the previous proof into a prefix β containing special actions and
a suffix δ containing no special actions. Divide β further into segments
Solution
This algorithm is basically implementing an array of ABD registers [ABND95],
but it omits the second phase on a read where any information the reader
learns is propagated to a majority. So we expect it to fail the same way ABD
would without this second round, by having two read operations return
values that are out of order with respect to their observable ordering.
Here is one execution that produces this bad outcome (with n = 5
processes p0 , . . . , p4 , all counters initially zero): p0 starts an inc, and its
update is delivered to and acknowledged by p1 , but its remaining messages
are delayed. Now p4 executes read and gets responses from p0 , p1 , and p2 ,
so it returns 1. After p4 's read completes, p3 executes read and gets
responses from p2 , p3 , and p4 , none of which have seen p0 's update, so it
returns 0. The second read follows the first in real time but returns an
older value, so the implementation is not linearizable.
1 procedure inc
2 ci [i] ← ci [i] + 1
3 Send ci [i] to all processes.
4 Wait to receive ack(ci [i]) from a majority of processes.
5 upon receiving c from j do
6 ci [j] ← max(ci [j], c)
7 Send ack(c) to j.
8 procedure read
9 ri ← ri + 1
10 Send read(ri ) to all processes.
11 Wait to receive respond(ri , cj ) from a majority of processes j.
12 return Σk maxj cj [k]
13 upon receiving read(r) from j do
14 Send respond(r, ci ) to j
Algorithm G.1: Counter algorithm for Problem G.4.2.
Solution
It is not possible to implement this object using atomic registers.
Suppose that there were such an implementation. Algorithm G.2 im-
plements two-process consensus using two atomic registers and a single
concurrency detector, initialized to the state following enter1 .
will equal process 1’s value, because process 2’s read follows its call to
enter2 , which follows exit1 and thus process 1’s write to r1 .
2. Process 1 executes exit1 after process 2 executes enter2 . Now both
exit operations return 1, and so process 2 returns its own value while
process 1 returns the contents of r2 , which it reads after process 2
writes its value there.
In either case, both processes return the value of the first process to access
the concurrency detector, satisfying both agreement and validity. This would
give a consensus protocol for two processes implemented from atomic registers,
contradicting the impossibility result of Loui and Abu-Amara [LAA87].
Solution
If n = 2, then a two-writer sticky bit is equivalent to a sticky bit, so we can
solve consensus.
If n ≥ 3, suppose that we maneuver our processes as usual to a bivalent
configuration C with no bivalent successors. Then there are three pending
operations x, y, and z, that among them produce both 0-valent and 1-valent
configurations. Without loss of generality, suppose that Cx and Cy are both
0-valent and Cz is 1-valent. We now consider what operations these might
be.
Solution
The necessary part is easier, although we can’t use JTT (Chapter 21) di-
rectly because having write operations means that our rotate register is not
perturbable. Instead, we argue that if we initialize the register to 1, we
get a mod-m counter, where increment is implemented by RotateLeft and
read is implemented by taking the log of the actual value of the counter.
Letting m ≥ 2n gives the desired Ω(n) lower bound, since a mod-2n counter
is perturbable.
For sufficiency, we’ll show how to implement the rotate register using
snapshots. This is pretty much a standard application of known tech-
niques [AH90b, AM93], but it’s not a bad exercise to write it out.
Pseudocode for one possible solution is given in Algorithm G.3.
The register is implemented using a single snapshot array A. Each
entry in the snapshot array holds four values: a timestamp and process ID
indicating which write the process’s most recent operations apply to, the
initial write value corresponding to this timestamp, and the number of rotate
operations this process has applied to this value. A write operation generates
a new timestamp, sets the written value to its input, and resets the rotate
count to 0. A rotate operation updates the timestamp and associated write
value to the most recent that the process sees, and adjusts the rotate count
as appropriate. A read operation combines all the rotate counts associated
with the most recent write to obtain the value of the simulated register.
1 procedure write(A, v)
2 s ← snapshot(A)
3 A[id] ← ⟨maxi s[i].timestamp + 1, id, v, 0⟩
4 procedure RotateLeft(A)
5 s ← snapshot(A)
6 Let i maximize ⟨s[i].timestamp, s[i].process⟩
7 if s[i].timestamp = A[id].timestamp and
s[i].process = A[id].process then
// Increment my rotation count
8 A[id].rotations ← A[id].rotations + 1
9 else
// Reset and increment my rotation count
10 A[id] ← ⟨s[i].timestamp, s[i].process, s[i].value, 1⟩
11 procedure read(A)
12 s ← snapshot(A)
13 Let i maximize ⟨s[i].timestamp, s[i].process⟩
14 Let r = Σ_{j : s[j].timestamp = s[i].timestamp ∧ s[j].process = s[i].process} s[j].rotations
15 return s[i].value rotated r times.
Algorithm G.3: Implementation of a rotate register
Since each operation requires one snapshot and at most one update, the
cost is O(n) using the linear-time snapshot algorithm of Inoue et al. [IMCT94].
1 procedure TASi ()
2 while true do
3 with probability 1/2 do
4 ri ← ri + 1
5 else
6 ri ← ri
7 s ← r¬i
8 if s > ri then
9 return 1
10 else if s < ri − 1 do
11 return 0
1. Show that any return values of the protocol are consistent with a
linearizable, single-use test-and-set.
Solution
1. To show that this implements a linearizable test-and-set, we need to
show that exactly one process returns 0 and the other 1, and that if one
process finishes before the other starts, the first process to go returns
0.
Suppose that pi finishes before p¬i starts. Then pi reads only 0 from
r¬i , and cannot observe ri < r¬i : pi returns 0 in this case.
We now show that the two processes cannot return the same value.
Suppose that both processes terminate. Let i be such that pi reads r¬i
for the last time before p¬i reads ri for the last time. If pi returns 0,
then it observes ri ≥ r¬i + 2 at the time of its read; p¬i can increment
r¬i at most once before reading ri again, and so observed r¬i < ri and
returns 1.
Alternatively, if pi returns 1, it observed ri < r¬i . Since pi performs
no more increments on ri , p¬i also observes ri < r¬i in all subsequent
reads, and so cannot also return 1.
2. Let’s run the protocol with an oblivious adversary, and track the value
of r0t − r1t over time, where rit is the value of ri after t writes (to either
register). Each write to r0 increases this value by 1/2 on average, with
a change of 0 or 1 equally likely, and each write to r1 decreases it by
1/2 on average.
To make things look symmetric, let ∆t be the change caused by the
t-th write and write ∆t as ct + X t where ct = ±1/2 is a constant
determined by whether p0 or p1 does the t-th write and X t = ±1/2 is
a random variable with expectation 0. Observe that the X t variables
are independent of each other and the constants ct (which depend only
on the schedule).
For the protocol to run forever, at every time t it must hold that
|r0t − r1t | ≤ 3; otherwise, even after one or both processes does its
next write, the gap between the registers will still be at least 2, and the
next process to read will terminate. But
r0t − r1t = Σ_{s=1}^{t} ∆s = Σ_{s=1}^{t} (cs + X s ) = Σ_{s=1}^{t} cs + Σ_{s=1}^{t} X s .
Solution
It’s not possible.
Consider an execution with n = 3 processes, each with input 0. If the
protocol is correct, then after some finite number of rounds t, each process
returns 0. By symmetry, the processes all have the same states and send the
same messages throughout this execution.
Now consider a ring of size 2(t + 1) where every process has input 0,
except for one process p that has input 1. Let q be the process at maximum
distance from p. By induction on r, we can show that after r rounds of
communication, every process that is more than r + 1 hops away from p has
the same state as all of the processes in the 3-process execution above. So in
particular, after t rounds, process q (at distance t + 1) is in the same state
as it would be in the 3-process execution, and thus it returns 0. But—as it
learns to its horror, one round too late—the correct maximum is 1.
Solution
Test-and-sets are (a) historyless, and (b) have consensus number 2, so n is
at least 2.
To show that no historyless object can solve wait-free 3-process consensus,
consider an execution that starts in a bivalent configuration and runs to a
configuration C with two pending operations x and y such that Cx is 0-valent
and Cy is 1-valent. By the usual arguments x and y must both be operations
on the same object. If either of x and y is a read operation, then (0-valent)
Cxy and (1-valent) Cyx are indistinguishable to a third process pz if run
alone, because the object is left in the same state in both configurations;
whichever way pz decides, it will give a contradiction in an execution starting
with one of these configurations. If neither of x and y is a read, then x
overwrites y, and Cx is indistinguishable from Cyx to pz if pz runs alone;
again we get a contradiction.
Solution
Consider an execution in which the client orders ham. Run the northern
server together with the client until the server is about to issue a launch
action (if it never does so, the client receives no ham when the southern
server is faulty).
Now run the client together with the southern server. There are two
cases:
1. If the southern server ever issues launch, execute both this and the
northern server’s launch actions: the client gets two hams.
2. If the southern server never issues launch, never run the northern
server again: the client gets no hams.
In either case, the one-ham rule is violated, and the protocol is not
correct.5
5
It’s tempting to try to solve this problem by reduction from a known impossibility
result, like Two Generals or FLP. For these specific problems, direct reductions don’t
appear to work. Two Generals assumes message loss, but in this model, messages are not
lost. FLP needs any process to be able to fail, but in this model, the client never fails.
Indeed, we can solve consensus in the Hamazon model by just having the client transmit
its input to both servers.
1 procedure mutex()
2 predecessor ← swap(s, myId)
3 while r ≠ predecessor do
4 try again
// Start of critical section
5 ...
// End of critical section
6 r ← myId
Algorithm G.5: Mutex using a swap object and register
Solution
Because processes use the same ID if they try to access the mutex twice, the
algorithm doesn’t work.
Here’s an example of a bad execution:
1. Process 1 runs mutex to completion: it swaps 1 into s (getting the
initial value ⊥), sees r = ⊥ = predecessor, enters and then leaves the
critical section, setting r to 1. It then calls mutex a second time and
swaps 1 into s, getting back its own previous value 1; since r = 1 =
predecessor, it enters the critical section again.
2. Process 2 swaps 2 into s and gets 1, reads 1 from r, and enters the
critical section.
Both processes are now in the critical section at the same time.
critical section.
I believe this works if each process adopts a new ID every time it calls
mutex, but the proof is a little tricky.6
6
The simplest proof I can come up with is to apply an invariant that says that (a)
the processes that have executed swap(s, myId) but have not yet left the while loop have
predecessor values that form a linked list, with the last pointer either equal to ⊥ (if no
process has yet entered the critical section) or the last process to enter the critical section;
(b) r is ⊥ if no process has yet left the critical section, or the last process to leave the
critical section otherwise; and (c) if there is a process that is in the critical section, its
predecessor field points to the last process to leave the critical section. Checking the effects
of each operation shows that this invariant is preserved through the execution, and (a)
combined with (c) show that we can’t have two processes in the critical section at the same
time. Additional work is still needed to show starvation-freedom. It’s a good thing this
algorithm doesn’t work as written.
Appendix H
1. Your name.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
Solution
Disproof: Consider two executions, one in an n × m torus and one in an
m × n torus where n > m and both n and m are at least 2.2 Using the same
argument as in Lemma 5.1.1, show by induction on the round number that,
for each round r, all processes in both executions have the same state. It
follows that if the processes correctly detect n > m in the n × m execution,
then they incorrectly report m > n in the m × n execution.
H.1.2 Clustering
Suppose that k of the nodes in an asynchronous message-passing network
are designated as cluster heads, and we want to have each node learn the
identity of the nearest head. Give the most efficient algorithm you can for
this problem, and compute its worst-case time and message complexities.
You may assume that processes have unique identifiers and that all
processes know how many neighbors they have.3
Solution
The simplest approach would be to run either of the efficient distributed
breadth-first search algorithms from Chapter 4 simultaneously starting at all
cluster heads, and have each process learn the distance to all cluster heads
at once and pick the nearest one. This gives O(D²) time and O(k(E + V D))
messages if we use layering and O(D) time and O(kDE) messages using
local synchronization.
We can get rid of the dependence on k in the local-synchronization
algorithm by running it almost unmodified, with the only difference being
the attachment of a cluster-head ID to each exactly message. The simplest
way to show that the resulting algorithm works is to imagine coalescing
1
Clarification added 2011-09-28.
2
This last assumption is not strictly necessary, but it avoids having to worry about
what it means when a process sends a message to itself.
3
Clarification added 2011-09-26.
all cluster heads into a single initiator; the clustering algorithm effectively
simulates the original algorithm running in this modified graph, and the
same proof goes through. The running time is still O(D) and the message
complexity O(DE).
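The coalesced-initiator picture can be checked with a short sequential Python sketch that runs breadth-first search from all cluster heads at once; it computes what the distributed algorithm computes, though of course not how the algorithm computes it:

from collections import deque

def nearest_heads(adj, heads):
    # adj: dict node -> iterable of neighbors; heads: iterable of nodes.
    # Returns dict node -> (distance, head) for the closest cluster head.
    best = {h: (0, h) for h in heads}
    q = deque(heads)
    while q:
        u = q.popleft()
        d, h = best[u]
        for v in adj[u]:
            if v not in best:          # first visit gives shortest distance
                best[v] = (d + 1, h)
                q.append(v)
    return best

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
assert nearest_heads(adj, [0, 3]) == {0: (0, 0), 3: (0, 3), 1: (1, 0), 2: (1, 3)}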
H.1.3 Negotiation
Two merchants A and B are colluding to fix the price of some valuable
commodity, by sending messages to each other for r rounds in a synchronous
message-passing system. To avoid the attention of antitrust regulators, the
merchants are transmitting their messages via carrier pigeons, which are
unreliable and may become lost. Each merchant has an initial price pA or
pB , which are integer values satisfying 0 ≤ p ≤ m for some known value
m, and their goal is to choose new prices p′A and p′B , where |p′A − p′B | ≤ 1.
If pA = pB and no messages are lost, they want the stronger goal that
p′A = p′B = pA = pB .
Prove the best lower bound you can on r, as a function of m, for all
protocols that achieve these goals.
Solution
This is a thinly-disguised version of the Two Generals Problem from Chap-
ter 8, with the agreement condition p′A = p′B replaced by an approximate
agreement condition |p′A − p′B | ≤ 1. We can use a proof based on the
indistinguishability argument in §8.2 to show that r ≥ m/2.
Fix r, and suppose that in a failure-free execution both processes send
messages in all rounds (we can easily modify an algorithm that does not
have this property to have it, without increasing r). We will start with a
sequence of executions with pA = pB = 0. Let X0 be the execution in which
no messages are lost, X1 the execution in which A’s last message is lost,
X2 the execution in which both A and B’s last messages are lost, and so
on, with Xk for 0 ≤ k ≤ 2r losing k messages split evenly between the two
processes, breaking ties in favor of losing messages from A.
When i is even, Xi is indistinguishable from Xi+1 by A; it follows that
p′A is the same in both executions. Because we no longer have agreement,
it may be that p′B (Xi ) and p′B (Xi+1 ) are not the same as p′A in either
execution; but since both are within 1 of p′A , the difference between them is
at most 2. Next, because Xi+1 and Xi+2 are indistinguishable to B, we have
p′B (Xi+1 ) = p′B (Xi+2 ), which we can combine with the previous claim to get
|p′B (Xi ) − p′B (Xi+2 )| ≤ 2. A simple induction then gives p′B (X2r ) ≤ 2r, where
Suppose that we augment the system so that senders are notified imme-
diately when their messages are delivered. We can model this by making the
delivery of a single message an event that updates the state of both sender
and recipient, both of which may send additional messages in response. Let
us suppose that this includes attempted deliveries to faulty processes, so that
any non-faulty process that sends a message m is eventually notified that m
has been delivered (although it might not have any effect on the recipient if
the recipient has already crashed).
1. Show that this system can solve consensus with one faulty process
when n = 2.
2. Show that this system cannot solve consensus with two faulty processes
when n = 3.
Solution
1. To solve consensus, each process sends its input to the other. Whichever
input is delivered first becomes the output value for both processes.
2. To show impossibility with n = 3 and two faults, run the usual FLP
proof until we get to a configuration C with events e′ and e such that
Ce is 0-valent and Ce′ e is 1-valent (or vice versa). Observe that e
and e′ may involve two processes each (sender and receiver), for up
to four processes total, but only a process that is involved in both e
and e′ can tell which happened first. There can be at most two such
processes. Kill both, and get that Ce′ e is indistinguishable from Cee′
for the remaining process, giving the usual contradiction.
Solution
There is an easy reduction to FLP that shows f ≤ n/2 is necessary (when n
is even), and a harder reduction that shows f < 2√n − 1 is necessary. The
easy reduction is based on crashing every other process; now no surviving
process can suspect any other survivor, and we are back in an asynchronous
message-passing system with no failure detector and 1 remaining failure (if
f is at least n/2 + 1).
The harder reduction is to crash every (√n)-th process. This partitions
the ring into √n segments of length √n − 1 each, where there is no failure
detector in any segment that suspects any process in another segment. If an
algorithm exists that solves consensus in this situation, then it does so even
if (a) all processes in each segment have the same input, (b) if any process in
one segment crashes, all √n − 1 processes in the segment crash, and (c) if any
process in a segment takes a step, all take a step, in some fixed order. Under
these additional conditions, each segment can be simulated by a single process
in an asynchronous system with no failure detectors, and the extra √n − 1
failures in 2√n − 1 correspond to one failure in the simulation. But we can't
solve consensus in the simulating system (by FLP), so we can’t solve it in
the original system either.
On the other side, let's first boost completeness of the failure detector,
by having any process that suspects another transmit this suspicion by
reliable broadcast. So now if any non-faulty process i suspects i + 1, all the
non-faulty processes will suspect i + 1. Now with up to t failures, whenever
I learn that process i is faulty (through a broadcast message passing on the
suspicion of the underlying failure detector), I will suspect processes i + 1
through i + t − f as well, where f is the number of failures I have heard
about directly. I don’t need to suspect process i + t − f + 1 (unless there is
some intermediate process that has also failed), because the only way that
this process will not be suspected eventually is if every process in the range
i to i + t − f is faulty, which can’t happen given the bound t.
Now if t is small enough that I can’t cover the entire ring with these
segments, then there is some non-faulty processes that is far enough away
from the nearest preceding faulty process that it is never suspected: this gives
us an eventually strong failure detector, and we can solve consensus using the
standard Chandra-Toueg ♦S algorithm from §13.4 or [CT96]. The inequality
I am looking for is f (t − f ) < n, where the left-hand side is maximized by
setting f = t/2, which gives t²/4 < n or t < 2√n. This leaves a gap of about
√2 between the upper and lower bounds; I don't know which one can be
improved.
I am indebted to Hao Pan for suggesting the Θ(√n) upper and lower
bounds, which corrected an error in my original draft solution to this problem.
Termination If at some time an odd number of sensors are active, and from
that point on no sensor changes its state, then some process eventually
sets off an alarm.
For what values of n is it possible to construct such a protocol?
Solution
It is feasible to solve the problem for n < 3.
For n = 1, the unique process sets off its alarm as soon as its sensor
becomes active.
For n = 2, have each process send a message to the other containing
its sensor state whenever the sensor state changes. Let s1 and s2 be the
states of the two processes' sensors, with 0 representing inactive and 1 active,
and let pi set off its alarm if it receives a message s such that s ⊕ si = 1.
This satisfies termination, because if we reach a configuration with an odd
number of active sensors, the last sensor to change causes a message to be
sent to the other process that will cause it to set off its alarm. It satisfies
no-false-positives, because if pi sets off its alarm, then s¬i = s because at
most one time unit has elapsed since p¬i sent s; it follows that s¬i ⊕ si = 1
and an odd number of sensors are active.
No such protocol is possible for n ≥ 3. Make p1 ’s sensor active. Run the
protocol until some process pi is about to enter an alarm state (this occurs
• enq(Q) always pushes the identity of the current process onto the tail
of the queue.
• deq(Q) tests if the queue is nonempty and its head is equal to the
identity of the current process. If so, it pops the head and returns
true. If not, it does nothing and returns false.
The rationale for these restrictions is that this is the minimal version of
a queue needed to implement a starvation-free mutex using Algorithm 18.2.
What is the consensus number of this object?
Solution
The restricted queue has consensus number 1.
Suppose we have 2 processes, and consider all pairs of operations on Q
that might get us out of a bivalent configuration C. Let x be an operation
carried out by p that leads to a b-valent state, and y an operation by q that
leads to a (¬b)-valent state. There are three cases:
• One enq and one deq operation. Suppose x is an enq and y a deq. If
Q is empty or the head is not q, then y is a no-op: p can’t distinguish
Cx from Cyx. If the head is q, then x and y commute. The same holds
in reverse if x is a deq and y an enq.
• Two enq operations. This is a little tricky, because Cxy and Cyx are
different states. However, if Q is nonempty in C, whichever process
isn’t at the head of Q can’t distinguish them, because any deq operation
returns false and never reaches the newly-enqueued values. This leaves
the case where Q is empty in C. Run p until it is poised to do
x′ = deq(Q) (if this never happens, p can't distinguish Cxy from Cyx);
then run q until it is poised to do y′ = deq(Q) as well (same argument
as for p). Now allow both deq operations to proceed in whichever order
causes them both to succeed. Since the processes can’t tell which deq
happened first, they can't tell which enq happened first either. Slightly
more formally, if we let α be the sequence of operations leading up to
the two deq operations, we've just shown Cxyαx′y′ is indistinguishable
from Cyxαy′x′ to both processes.
• Two deq operations. If the head of Q is not q (including the case
where Q is empty), then y is a no-op, so the b-valent Cx and the
(¬b)-valent Cyx are indistinguishable to p, and running p alone from
both gives the usual contradiction. If the head is q, then x is the
no-op, and the symmetric argument applies.
In all cases, we find that we can’t escape bivalence. It follows that Q can’t
solve 2-process consensus.
Solution
We’ll use a snapshot object a to control access to an infinite array f
of fetch-and-increments, where each time somebody writes to the imple-
mented object, we switch to a new fetch-and-increment. Each cell in a
holds (timestamp, base), where base is the starting value of the simulated
fetch-and-increment. We’ll also use an extra fetch-and-increment T to hand
out timestamps.
Code is in Algorithm H.1.
Since this is all straight-line code, it’s trivially wait-free.
Proof of linearizability is by grouping all operations by timestamp, us-
ing s[i].timestamp for FetchAndIncrement operations and t for write op-
erations, then putting write before FetchAndIncrement, then ordering
FetchAndIncrement by return value. Each group will consist of a write(v)
for some v followed by zero or more FetchAndIncrement operations, which
will return increasing values starting at v since they are just returning values
1 procedure FetchAndIncrement()
2 s ← snapshot(a)
3 i ← arg maxi (s[i].timestamp)
4 return f [s[i].timestamp] + s[i].base
5 procedure write(v)
6 t ← FetchAndIncrement(T )
7 a[myId] ← (t, v)
Algorithm H.1: Resettable fetch-and-increment
Solution
Let b be the box object. Represent b by a snapshot object a, where a[i] holds
a pair (∆wi , ∆hi ) representing the number of times process i has executed
IncWidth and IncHeight; these operations simply increment the appropriate
value and update the snapshot object. Let GetArea take a snapshot and
return (Σi ∆wi )(Σi ∆hi ); the cost of the snapshot is O(n).
To see that this is optimal, observe that we can use IncWidth and GetArea
to represent inc and read for a standard counter. The Jayanti-Tan-Toueg
bound applies to counters, giving a worst-case cost of Ω(n) for GetArea.
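A minimal Python sketch of this construction, with the usual sequential stand-in for the snapshot object:

class Box:
    def __init__(self, n):
        self.seg = [(0, 0)] * n          # (Δw_i, Δh_i) for each process i

    def inc_width(self, i):
        w, h = self.seg[i]
        self.seg[i] = (w + 1, h)

    def inc_height(self, i):
        w, h = self.seg[i]
        self.seg[i] = (w, h + 1)

    def get_area(self):
        snap = list(self.seg)            # "snapshot" of all segments
        return sum(w for w, _ in snap) * sum(h for _, h in snap)

b = Box(2)
b.inc_width(0); b.inc_width(1); b.inc_height(0)
assert b.get_area() == 2 * 1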
Solution
The consensus number is ∞; a single lockable register solves consensus for
any number of processes. Code is in Algorithm H.2.
1 write(r, input)
2 lock(r)
3 return read(r)
Algorithm H.2: Consensus using a lockable register
Termination and validity are trivial. Agreement follows from the fact
that whatever value is in r when lock(r) is first called will never change,
and thus will be read and returned by all processes.
Solution
It is possible to solve the problem for all n except n = 3. For n = 1, there are
no non-faulty processes, so the specification is satisfied trivially. For n = 2,
there is only one non-faulty process: it can just keep its own counter and
return an increasing sequence of timestamps without talking to the other
process at all.
For n = 3, it is not possible. Consider an execution in which messages
between non-faulty processes p and q are delayed indefinitely. If the Byzantine
process r acts to each of p and q as it would if the other had crashed, this
execution is indistinguishable to p and q from an execution in which r is
correct and the other is faulty. Since there is no communication between
p and q, it is easy to construct an execution in which the specification is
violated.
For n ≥ 4, the protocol given in Algorithm H.3 works.
The idea is similar to the Attiya, Bar-Noy, Dolev distributed shared
memory algorithm [ABND95]. A process that needs a timestamp polls n − 1
other processes for the maximum values they’ve seen and adds 1 to it; before
returning, it sends the new timestamp to all other processes and waits to
receive n − 1 acknowledgments. The Byzantine process may choose not to
answer, but this is not enough to block completion of the protocol.
1 procedure getTimestamp()
2 ci ← ci + 1
3 send probe(ci ) to all processes
4 wait to receive response(ci , vj ) from n − 1 processes
5 vi ← (maxj vj ) + 1
6 send newTimestamp(ci , vi ) to all processes
7 wait to receive ack(ci ) from n − 1 processes
8 return vi
To show the timestamps are increasing, observe that after the completion
of any call by i to getTimestamp, at least n − 2 non-faulty processes j have
a value vj ≥ vi . Any call to getTimestamp that starts later sees at least
n − 3 > 0 of these values, and so computes a max that is at least as big as
vi and then adds 1 to it, giving a larger value.
Solution
Yes. With f < n/2 and ♦S, we can solve consensus using Chandra-
Toueg [CT96]. Since this gives a unique decision value, it solves k-set
Solution
Algorithm H.4 implements a counter from a set object, where the counter
read consists of a single call to size(S). The idea is that each increment is
implemented by inserting a new element into S, so |S| is always equal to the
number of increments.
1 procedure inc(S)
2 nonce ← nonce + 1
3 add(S, ⟨myId, nonce⟩).
4 procedure read(S)
5 return size(S)
Algorithm H.4: Counter from set object
4
Clarification added during exam.
Appendix I
This appendix contains final exams from previous times the course was
offered, and is intended to give a rough guide to the typical format and
content of a final exam. Note that the topics covered in past years were not
necessarily the same as those covered this year.
your choosing, and that the design of the consensus protocol can depend on
the number of processes N .
Solution
The consensus number is 2.
To implement 2-process wait-free consensus, use a single fetch-and-
subtract register initialized to 1 plus two auxiliary read/write registers
to hold the input values of the processes. Each process writes its input to its
own register, then performs a fetch-and-subtract(1) on the fetch-and-subtract
register. Whichever process gets 1 from the fetch-and-subtract returns its
own input; the other process (which gets 0) returns the winning process’s
input (which it can read from the winning process's read/write register).
To show that the consensus number is at most 2, observe that any two
fetch-and-subtract operations commute: starting from state x, after fetch-and-
subtract(k1 ) and fetch-and-subtract(k2 ) the value in the fetch-and-subtract
register is max(0, x − k1 − k2 ) regardless of the order of the operations. So a
third process cannot distinguish the configurations that result from running
the two operations in either order, and the usual bivalence argument rules
out three-process consensus.
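Here is a sequential Python sketch of the two-process protocol, replaying the two orders in which the processes can reach the fetch-and-subtract register; all names are invented:

class FetchAndSubtract:
    def __init__(self, v):
        self.v = v
    def fetch_and_subtract(self, k):
        old, self.v = self.v, max(0, self.v - k)
        return old

def consensus(order, inputs):
    reg, decision = {}, {}
    fas = FetchAndSubtract(1)
    for i in order:                      # processes run in this order
        reg[i] = inputs[i]               # publish own input first
        if fas.fetch_and_subtract(1) == 1:
            decision[i] = reg[i]         # got the 1: decide own input
        else:
            decision[i] = reg[1 - i]     # got 0: winner already wrote its register
    return decision

assert consensus([0, 1], {0: "x", 1: "y"}) == {0: "x", 1: "x"}
assert consensus([1, 0], {0: "x", 1: "y"}) == {0: "y", 1: "y"}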
Solution
Upper bound
Because there are no failures, we can appoint a leader and have it decide.
The natural choice is some process near the middle, say p⌊(N +1)/2⌋ . Upon
receiving an input, either directly through an input event or indirectly from
another process, the process sends the input value along the line toward the
leader. The leader takes the first input it receives and broadcasts it back out
in both directions as the decision value. The worst case is when the protocol
is initiated at pN ; then we pay 2(N − ⌊(N + 1)/2⌋) time to send all messages
out and back, which is N time units when N is even and N − 1 time units
when N is odd.
Lower bound
Proving an almost-matching lower bound of N − 1 time units is trivial: if
p1 is the only initiator and it starts at time t0 , then by an easy induction
argument, in the worst case pi doesn't learn of any input until time t0 + (i − 1),
and in particular pN doesn’t find out until after N − 1 time units. If pN
nonetheless decides early, its decision value will violate validity in some
executions.
But we can actually prove something stronger than this: that N time
units are indeed required when N is even. Consider two slow executions Ξ0
and Ξ1 , where (a) all messages are delivered after exactly one time unit in
each execution; (b) in Ξ0 only p1 receives an input and the input is 0; and (c)
in Ξ1 only pN receives an input and the input is 1. For each of the executions,
construct a causal ordering on events in the usual fashion: a send is ordered
before a receive, two events of the same process are ordered by time, and
other events are partially ordered by the transitive closure of this relation.
Now consider for Ξ0 the set of all events that precede the decide(0) event
of p1 and for Ξ1 the set of all events that precede the decide(1) event of
pN . Consider further the sets of processes S0 and S1 at which these events
occur; if these two sets of processes do not overlap, then we can construct
an execution in which both sets of events occur, violating Agreement.
Because S0 and S1 overlap, we must have |S0 | + |S1 | ≥ N + 1, and so
at least one of the two sets has size at least ⌈(N + 1)/2⌉, which is N/2 + 1
when N is even. Suppose that it is S0 . Then in order for any event to occur
at pN/2+1 at all, some sequence of messages must travel from the initial input
to p1 to process pN/2+1 (taking N/2 time units), and the causal ordering
implies that an additional sequence of messages travels back from pN/2+1 to
p1 before p1 decides (taking an additional N/2 time units). The total time
is thus N .
In either case, the solution should work for arbitrarily many processes—solving
mutual exclusion when N = 1 is not interesting. You are also not required
in either case to guarantee lockout-freedom.
Solution
1. Disproof: With append registers only, it is not possible to solve mutual
exclusion. To prove this, construct a failure-free execution in which
the processes never break symmetry. In the initial configuration, all
processes have the same state and thus execute either the same read
operation or the same append operation; in either case we let all N
operations occur in some arbitrary order. If the operations are all
reads, all processes read the same value and move to the same new
state. If the operations are all appends, then no values are returned and
again all processes enter the same new state. (It’s also the case that
the processes can’t tell from the register’s state which of the identical
append operations went first, but we don’t actually need to use this
fact.)
2. Since the processes are anonymous, any solution that depends on them
having identifiers isn’t going to work. But there is a simple solution
that requires only appending single bits to the register.
Each process trying to enter a critical section repeatedly executes an
append-and-fetch operation with argument 0; if the append-and-fetch
operation returns either a list consisting only of a single 0 or a list
whose second-to-last element is 1, the process enters its critical section.
To leave the critical section, the process does append-and-fetch(1).
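A small Python sketch of this protocol, modeling the append-and-fetch register sequentially; the helper names are invented:

class AppendAndFetch:
    def __init__(self):
        self.log = []
    def append_and_fetch(self, bit):     # append, then return the whole contents
        self.log.append(bit)
        return list(self.log)

R = AppendAndFetch()

def try_enter():
    # one attempt: append a 0, and enter if it is the first value ever
    # or lands immediately after a 1 (some process's release marker)
    view = R.append_and_fetch(0)
    return len(view) == 1 or view[-2] == 1

def leave():
    R.append_and_fetch(1)

assert try_enter()          # first 0 ever: enter the critical section
assert not try_enter()      # a second contender must spin
leave()                     # leaving appends a 1
assert try_enter()          # the next 0 lands right after the 1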
Solution
Pick some leader node to implement the object. To execute an operation,
send the operation to the leader node, then have the leader carry out the
operation (sequentially) on its copy of the object and send the results back.
each i less than k − 1 and a[k − 1] ← v; and (b) returns a snapshot of the
new contents of the array (after the shift).
What is the consensus number of this object as a function of k?
Solution
We can clearly solve consensus for at least k processes: each process calls
shift-and-fetch on its input, and returns the first non-null value in the buffer.
So now we want to show that we can’t solve consensus for k + 1 processes.
Apply the usual FLP-style argument to get to a bivalent configuration C
where each of the k + 1 processes has a pending operation that leads to a
univalent configuration. Let e0 and e1 be particular operations leading to
0-valent and 1-valent configurations, respectively, and let e2 . . . ek be the
remaining k − 1 pending operations.
We need to argue first that no two distinct operations ei and ej are
operations of different objects. Suppose that Cei is 0-valent and Cej is
1-valent; then if ei and ej are on different objects, Cei ej (still 0-valent) is
indistinguishable by all processes from Cej ei (still 1-valent), a contradiction.
Alternatively, if ei and ej are both b-valent, then there is some (1 − b)-valent pending operation eℓ, and by the preceding argument each of ei and ej operates on the same object as eℓ. So all of e0 . . . ek are operations on the same object.
By the usual argument we know that this object can’t be a register. Let’s
show it can't be a ring buffer either. Consider the configurations Ce0 e1 . . . ek (which is 0-valent) and Ce1 . . . ek (which is 1-valent). These are indistinguishable to the process carrying out ek, because it sees only the inputs to e1 through ek in its snapshot: the input to e0, if it was ever written, has been shifted out of the k-element buffer. So the two configurations must have the same valence, a contradiction.
It follows that the consensus number of a k-element ring buffer is exactly
k.
Solution
First observe that each row and column of the torus is a bidirectional ring, so we can run, e.g., the Hirschberg-Sinclair O(n log n)-message protocol within each of these rings to find the smallest identifier in the ring. We'll use this
to construct the following algorithm:
1. Run Hirschberg-Sinclair in each row to get a local leader for each row; this takes n × O(n log n) = O(n² log n) messages. Use an additional n messages per row to distribute the identifier of the row leader to all nodes and initiate the next stage of the protocol.
2. Run Hirschberg-Sinclair in each column, with each node adopting the row leader's identifier as its own. This costs another O(n² log n) messages; at the end, every node knows the minimum identifier of all nodes in the torus.

The total message complexity is O(n² log n). (I suspect this is optimal, but I don't have a proof.)
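The shape of the computation can be summarized in a few lines of Python, assuming a black-box hs_leader that returns the minimum identifier in a ring (a placeholder for the real O(n log n)-message protocol; the function names are invented for the example).

    def hs_leader(ring_ids):
        # Placeholder for Hirschberg-Sinclair; O(n log n) messages per ring.
        return min(ring_ids)

    def torus_leader(ids):                       # ids: n-by-n grid of identifiers
        # Stage 1: each row elects its minimum; every node adopts it.
        row_min = [hs_leader(row) for row in ids]
        # Stage 2: every node in row i now holds row_min[i], so each column
        # election computes the same global minimum.
        return hs_leader(row_min)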
3. Give the best lower bound you can on the total message complexity of
the pre-processing and search algorithms in the case above.
Solution
1. Run depth-first search to find the matching key and return the corre-
sponding value back up the tree. Message complexity is O(|E|) = O(n)
(since each node has only O(1) links).
2. Basic idea: give each node a copy of all key-value pairs, so that searches take zero messages. To distribute the pairs, we could do a convergecast followed by a broadcast (O(n) message complexity) or just flood each pair (O(n²) messages). Either is fine, since we don't care about the message complexity of the pre-processing stage.
Solution
No protocol for two: turn an anti-consensus protocol with outputs in {0, 1}
into a consensus protocol by having one of the processes always negate its
output.
A protocol for three: Use a splitter.
Solution
Here is an impossibility proof. Suppose there is such an algorithm, and let
it correctly decide “odd” on a ring of size 2k + 1 for some k and some set
of leader inputs. Now construct a ring of size 4k + 2 by pasting two such
rings together (assigning the same values to the leader bits in each copy)
and run the algorithm on this ring. By the usual symmetry argument, each process in the ring of size 4k + 2 sends the same messages and makes the same decisions as the corresponding process in the ring of size 2k + 1, implying that the processes incorrectly decide that the ring of size 4k + 2 is odd.
Solution
Disproof: Let s1 and s2 be processes carrying out snapshots and let w1 and w2 be processes carrying out writes. Suppose that each wi initiates a write of 1 to a[wi], but all of its messages to other processes are delayed after it updates its own copy of a[wi]. Now let each si receive responses from 3n/4 − 1 processes not otherwise mentioned, plus wi. Then s1 will return a vector with a[w1] = 1 and a[w2] = 0, while s2 will return a vector with a[w1] = 0 and a[w2] = 1, which is inconsistent. The fact that these vectors are also disseminated to at least 3n/4 other processes is a red herring.
the queue, and deq() removes and returns the smallest value in the queue,
or returns null if the queue is empty. (If there is more than one copy of the
smallest value, only one copy is removed.)
What is the consensus number of this object?
Solution
The consensus number is 2. The proof is similar to that for a queue.
To show we can do consensus for n = 2, start with a priority queue with
a single value in it, and have each process attempt to dequeue this value. If
a process gets the value, it decides on its own input; if it gets null, it decides
on the other process’s input.
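Here is a minimal sketch of the two-process protocol in Python. The shared priority queue is simulated with a heap protected by a lock (the lock only stands in for the object's atomicity), and, as an assumption beyond the problem statement, each process publishes its input in an ordinary shared register before dequeuing, so that the loser can find the winner's input.

    import heapq, threading

    class SharedPQ:
        def __init__(self, initial):
            self._h = list(initial)
            heapq.heapify(self._h)
            self._lock = threading.Lock()        # stands in for atomicity

        def deq(self):
            with self._lock:
                return heapq.heappop(self._h) if self._h else None

    q = SharedPQ(initial=[0])                    # queue starts with one value
    inputs = [None, None]                        # ordinary read/write registers

    def decide(my_id, my_input):
        inputs[my_id] = my_input                 # publish input first
        if q.deq() is not None:                  # won the race for the value
            return my_input
        return inputs[1 - my_id]                 # lost: adopt winner's input

The loser is guaranteed to see the winner's input, because the winner writes its input before its deq, which in turn precedes the loser's deq.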
To show we can't do consensus for n = 3, observe first that starting from any state C of the queue, given any two operations x and y that are both enqueues or both dequeues, the queue's state is the same in Cxy and Cyx, and only the processes performing x and y can distinguish the two configurations. This means that a third process can't tell which operation went first, so a pair of enqueues or a pair of dequeues can't get us out of a bivalent configuration in the FLP argument. We can also exclude any split involving two operations on different queues (or other objects). But we still need to consider the case of a dequeue operation d and an enqueue operation e on the same queue Q, with (say) Cd 0-valent and Ce 1-valent. This splits into several subcases, depending on the state C of the queue in some bivalent configuration:
1. C = {}. Then a deq on the empty queue returns null and leaves the queue unchanged, so Cd differs from C only in the internal state of pd, the process performing d. It follows that Cde and Ce have the same queue state and differ only in the state of pd; kill pd, and no surviving process can tell which of d or e went first, even though Cde is 0-valent and Ce is 1-valent.
2. C is nonempty and e = enq(v), where v is greater than or equal to the smallest value in C. Then d removes the smallest value of C in either order, so Cde and Ced are identical configurations, and no process at all can tell which of d or e went first.
3. C is nonempty and e = enq(v), where v is less than any value in C. Consider the configurations Ced and Cde. Here the process pd that performs d can tell which operation went first, because it obtains either v or some other value v′ ≠ v. Kill this process. No other process in Ced or Cde can distinguish the two configurations without dequeuing whichever of v or v′ was not dequeued by pd. So consider two parallel executions Cedσ and Cdeσ, where σ consists of a sequence of operations by the surviving processes ending with the first deq on Q that returns v or v′, performed by some process p. (If no deq on Q ever returns v or v′, then we have already won, since the survivors never distinguish Ced from Cde.) Now the state of all objects is the same after Cedσ and Cdeσ, and only pd and p have different states in these two configurations. So any third process is out of luck.
Appendix J
I/O automata
All output actions of the components are also output actions of the composition. An input action of a component is an input of the composition only if no other component supplies it as an output; if some component does supply it as an output, the action becomes an output action of the composition. (Note that infinite, but countable, compositions are permitted.)
J.1.5 Fairness
I/O automata come with a built-in definition of fair executions, where an
execution of A is fair if, for each equivalence class C of actions in task(A), at least one of the following holds:

1. the execution is finite and no action in C is enabled in the final state;

2. the execution is infinite and there are infinitely many occurrences of actions in C; or

3. the execution is infinite and there are infinitely many states in which no action in C is enabled.
J.2.1 Example
A property we might demand of the spambot above (or some other abstraction
of a message channel) is that it only delivers messages that have previously
been given to it. As a trace property this says that in any trace t, if
tk = spam(m), then tj = setMessage(m) for some j < k. (As a set, this
is just the set of all sequences of external spambot-actions that have this
property.) Call this property P .
To prove that the spambot automaton given above satisfies P, we might argue that in any execution s0 a0 s1 a1 . . . , each si equals the argument m of the last setMessage action preceding si, or ⊥ if there is no such action. This is easily proved by induction on i. Since spam(m) can only transmit the current state, it then follows that any spam(m) occurs only when si = m, and hence follows some earlier setMessage(m), as claimed.
However, there are traces that satisfy P that don't correspond to executions of the spambot; for example, consider the trace setMessage(0) setMessage(1) spam(0). This satisfies P (0 was previously given to the automaton, so spam(0) is permitted), but the automaton won't generate it, because the 0 was overwritten by the later setMessage(1) action. Whether this indicates a problem with our automaton not being nondeterministic enough, or with our trace property being too weak, is a question of what we really want the automaton to do.
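To make the example concrete, here is a small Python model of the spambot and a checker for P; the representation (None for ⊥, traces as lists of (action, argument) pairs) is invented for the illustration.

    class Spambot:
        def __init__(self):
            self.state = None                   # ⊥: no message stored yet

        def set_message(self, m):               # input action setMessage(m)
            self.state = m

        def spam(self):                         # output action spam(m):
            return self.state                   # m is always the current state

    def satisfies_P(trace):
        """P: every spam(m) is preceded by some setMessage(m)."""
        seen = set()
        for action, m in trace:
            if action == "setMessage":
                seen.add(m)
            elif action == "spam" and m not in seen:
                return False
        return True

    # The trace from the text satisfies P even though no execution of the
    # spambot generates it (the 0 is overwritten before it can be spammed):
    assert satisfies_P([("setMessage", 0), ("setMessage", 1), ("spam", 0)])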
J.2.3 Safety properties

A trace property P is a safety property if:

1. P is nonempty.

2. P is prefix-closed: if a trace is in P, then so is every finite prefix of it.

3. P is limit-closed: if every finite prefix of a trace is in P, then the trace itself is in P.

Because of the last two restrictions, it's enough to prove that P holds for all finite traces of A to show that it holds for all traces (and thus for all fair traces), since any trace is a limit of finite traces. Conversely, if there is some trace or fair trace for which P fails, limit-closure says that P already fails on some finite prefix of that trace, so again looking at only finite traces is enough.
The spambot property mentioned above is a safety property.
Safety properties are typically proved using invariants, properties that
are shown by induction to hold in all reachable states.
J.2.3.1 Example
Consider two spambots A1 and A2 where we identify the spam(m) operation
of A1 with the setMessage(m) operation of A2 ; we’ll call this combined
action spam1 (m) to distinguish it from the output actions of A2 . We’d like
to argue that the composite automaton A1 + A2 satisfies the safety property
(call it Pm ) that any occurrence of spam(m) is preceded by an occurrence
of setMessage(m), where the signature of Pm includes setMessage(m) and
spam(m) for some specific m but no other operations. (This is an example
of where trace property signatures can be useful without being limited to
actions of any specific component automaton.)
To do so, we'll prove a stronger property P′m, which is Pm modified to include the spam1(m) action in its signature. Observe that P′m is the product of the corresponding properties for A1 and for A2: the former says that any trace that includes spam1(m) has a previous setMessage(m), and the latter says that any trace that includes spam(m) has a previous spam1(m). Since these properties hold for the individual A1 and A2, their product, and thus the restriction P′m, holds for A1 + A2, and so Pm (as a further restriction) holds for A1 + A2 as well.
Now let’s prove the liveness property for A1 + A2 , that at least one
occurrence of setMessage yields infinitely many spam actions. Here we
let L1 = {at least one setMessage action ⇒ infinitely many spam1 actions}
and L2 = {at least one spam1 action ⇒ infinitely many spam actions}. The
product of these properties is all sequences with (a) no setMessage actions or
(b) infinitely many spam actions, which is what we want. This product holds
if the individual properties L1 and L2 hold for A1 + A2 , which will be the
case if we set task(A1 ) and task(A2 ) correctly.
J.2.4.1 Example
A single spambot A can simulate the conjoined spambots A1 + A2. Proof: Let f(s) = (s, s). Then f(⊥) = (⊥, ⊥) is a start state of A1 + A2. Now consider a transition (s, a, s′) of A; the action a is either (a) setMessage(m), giving s′ = m; here we let x = setMessage(m) spam1(m), which has trace(x) = trace(a) (since spam1(m) is internal) and ends in f(s′) = (m, m); or (b) a = spam(m), which does not change s or f(s); the matching x is spam(m), which also does not change f(s) and has the same trace.
A different proof could take advantage of f being a relation by defining f(s) = {(s, s′) | s′ ∈ states(A2)}. Now we don't care about the state of A2: we treat a setMessage(m) action of A as the sequence setMessage(m) in A1 + A2 (which updates the first component of the state correctly), and treat a spam(m) action as spam1(m) spam(m), which updates the second component (which we don't care about) and has the correct trace. In some cases an approach of this sort is necessary, because we don't know which simulated state we are heading for until we get an action from A.
Note that the converse doesn't work: A1 + A2 doesn't simulate A, since there are traces of A1 + A2 (e.g., setMessage(0) spam1(0) setMessage(1) spam(0)) that don't restrict to traces of A. See [Lyn96, §8.5.5] for a more complicated example of how one FIFO queue can simulate two FIFO queues and vice versa (a situation called bisimulation).
Since we are looking at traces rather than fair traces, this kind of simulation doesn't help much with liveness properties, but sometimes the connection between states plus a liveness proof for B can be used to get a liveness proof for A: essentially, we have to argue that A can't take infinitely many steps without triggering a B-action in an appropriate task class. Again see [Lyn96, §8.5.5].
Bibliography
[AAB+ 11] Yehuda Afek, Noga Alon, Omer Barad, Eran Hornstein, Naama
Barkai, and Ziv Bar-Joseph. A biological solution to a funda-
mental distributed computing problem. Science, 331(6014):183–
185, 2011.
[AABJ+ 11] Yehuda Afek, Noga Alon, Ziv Bar-Joseph, Alejandro Cornejo,
Bernhard Haeupler, and Fabian Kuhn. Beeping a maximal in-
dependent set. In Proceedings of the 25th International Confer-
ence on Distributed Computing, DISC’11, pages 32–50, Berlin,
Heidelberg, 2011. Springer-Verlag.
[AACH+ 11] Dan Alistarh, James Aspnes, Keren Censor-Hillel, Seth Gilbert,
and Morteza Zadimoghaddam. Optimal-time adaptive tight
renaming, with applications to counting. In Proceedings of
the Thirtieth Annual ACM SIGACT-SIGOPS Symposium on
Principles of Distributed Computing, pages 239–248, June 2011.
[AACV17] Yehuda Afek, James Aspnes, Edo Cohen, and Danny Vain-
stein. Brief announcement: Object oriented consensus. In
Elad Michael Schiller and Alexander A. Schwarzmann, editors,
Proceedings of the ACM Symposium on Principles of Distributed
Computing, PODC 2017, Washington, DC, USA, July 25-27,
2017, pages 367–369. ACM, 2017.
[AAD+ 93] Yehuda Afek, Hagit Attiya, Danny Dolev, Eli Gafni, Michael
Merritt, and Nir Shavit. Atomic snapshots of shared memory.
J. ACM, 40(4):873–890, 1993.
[AAD+ 06] Dana Angluin, James Aspnes, Zoë Diamadi, Michael J. Fischer,
and René Peralta. Computation in networks of passively mobile
finite-state sensors. Distributed Computing, 18(4):235–253, March 2006.
[AAE08a] Dana Angluin, James Aspnes, and David Eisenstat. Fast com-
putation by population protocols with a leader. Distributed
Computing, 21(3):183–199, September 2008.
[AAE+ 23] Dan Alistarh, James Aspnes, Faith Ellen, Rati Gelashvili, and
Leqi Zhu. Why extension-based proofs fail. SIAM Journal on
Computing, 52(4):913–944, 2023.
[AAG+ 10] Dan Alistarh, Hagit Attiya, Seth Gilbert, Andrei Giurgiu, and
Rachid Guerraoui. Fast randomized test-and-set and renaming.
In Nancy A. Lynch and Alexander A. Shvartsman, editors,
Distributed Computing, 24th International Symposium, DISC 2010, Cambridge, MA, USA, September 13–15, 2010. Proceedings, volume 6343 of Lecture Notes in Computer Science, pages 94–108. Springer, 2010.
[AAGG11] Dan Alistarh, James Aspnes, Seth Gilbert, and Rachid Guer-
raoui. The complexity of renaming. In Fifty-Second Annual
IEEE Symposium on Foundations of Computer Science, pages
718–727, October 2011.
[ABHMT20] Mirza Ahad Baig, Danny Hendler, Alessia Milani, and Corentin
Travers. Long-lived snapshots with polylogarithmic amortized
step complexity. In Proceedings of the 39th Symposium on Prin-
ciples of Distributed Computing, PODC ’20, pages 31–40, New
York, NY, USA, 2020. Association for Computing Machinery.
[ABND+ 90] Hagit Attiya, Amotz Bar-Noy, Danny Dolev, David Peleg, and
Rüdiger Reischuk. Renaming in an asynchronous environment.
J. ACM, 37(3):524–548, 1990.
[AC08] Hagit Attiya and Keren Censor. Tight bounds for asynchronous
randomized consensus. Journal of the ACM, 55(5):20, October
2008.
[AHW08] Hagit Attiya, Danny Hendler, and Philipp Woelfel. Tight RMR
lower bounds for mutual exclusion and other problems. In
Proceedings of the 40th annual ACM symposium on Theory of
computing, STOC ’08, pages 217–226, New York, NY, USA,
2008. ACM.
[AKM+ 93] Baruch Awerbuch, Shay Kutten, Yishay Mansour, Boaz Patt-
Shamir, and George Varghese. Time optimal self-stabilizing
synchronization. In Proceedings of the twenty-fifth annual ACM
symposium on Theory of computing, pages 652–661. ACM, 1993.
[AKM+ 07] Baruch Awerbuch, Shay Kutten, Yishay Mansour, Boaz Patt-
Shamir, and George Varghese. A time-optimal self-stabilizing
synchronizer using a phase clock. IEEE Transactions on De-
pendable and Secure Computing, 4(3):180–190, July–September
2007.
[AKP+ 06] Hagit Attiya, Fabian Kuhn, C. Greg Plaxton, Mirjam Watten-
hofer, and Roger Wattenhofer. Efficient adaptive collect using
randomization. Distributed Computing, 18(3):179–188, 2006.
[AM99] Yehuda Afek and Michael Merritt. Fast, wait-free (2k − 1)-
renaming. In PODC, pages 105–112, 1999.
[CCN12] Luca Cardelli and Attila Csikász-Nagy. The cell cycle switch
computes approximate majority. Scientific Reports, 2, 2012.
[Cha93] Soma Chaudhuri. More choices allow more faults: Set consen-
sus problems in totally asynchronous systems. Inf. Comput.,
105(1):132–158, 1993.
[CIL94] Benny Chor, Amos Israeli, and Ming Li. Wait-free consensus
using asynchronous hardware. SIAM J. Comput., 23(4):701–
712, 1994.
[CLM+ 16] Michael B. Cohen, Yin Tat Lee, Gary L. Miller, Jakub Pachocki,
and Aaron Sidford. Geometric median in nearly linear time.
In Daniel Wichs and Yishay Mansour, editors, Proceedings
of the 48th Annual ACM SIGACT Symposium on Theory of
Computing, STOC 2016, Cambridge, MA, USA, June 18-21,
2016, pages 9–21. ACM, 2016.
[CR93] Ran Canetti and Tal Rabin. Fast asynchronous Byzantine agree-
ment with optimal resilience. In S. Rao Kosaraju, David S. John-
son, and Alok Aggarwal, editors, Proceedings of the Twenty-
Fifth Annual ACM Symposium on Theory of Computing, May
16-18, 1993, San Diego, CA, USA, pages 42–51. ACM, 1993.
[CV86] Richard Cole and Uzi Vishkin. Deterministic coin tossing with
applications to optimal parallel list ranking. Information and
Control, 70(1):32–53, 1986.
[DN93] Cynthia Dwork and Moni Naor. Pricing via processing or com-
batting junk mail. In Ernest F. Brickell, editor, Advances in
Cryptology - CRYPTO ’92, 12th Annual International Cryptol-
ogy Conference, Santa Barbara, California, USA, August 16-20,
1992, Proceedings, volume 740 of Lecture Notes in Computer
Science, pages 139–147. Springer, 1993.
[EGSZ20] Faith Ellen, Rati Gelashvili, Nir Shavit, and Leqi Zhu. A
complexity-based classification for multiprocessor synchroniza-
tion. Distributed Computing, 33(2):125–144, Apr 2020.
[EHS12] Faith Ellen, Danny Hendler, and Nir Shavit. On the inherent se-
quentiality of concurrent objects. SIAM Journal on Computing,
41(3):519–536, 2012.
[ER+ 18] Robert Elsässer, Tomasz Radzik, et al. Recent results in popu-
lation protocols for exact majority and leader election. Bulletin
of EATCS, 3(126), 2018.
[FHS98] Faith Ellen Fich, Maurice Herlihy, and Nir Shavit. On the
space complexity of randomized synchronization. J. ACM,
45(5):843–862, 1998.
[FHS05] Faith Ellen Fich, Danny Hendler, and Nir Shavit. Linear lower
bounds on real-world implementations of concurrent objects. In
Foundations of Computer Science, Annual IEEE Symposium on,
pages 165–173, Los Alamitos, CA, USA, 2005. IEEE Computer
Society.
[FL06] Rui Fan and Nancy A. Lynch. An Ω(n log n) lower bound on
the cost of mutual exclusion. In Eric Ruppert and Dahlia
Malkhi, editors, Proceedings of the Twenty-Fifth Annual ACM
Symposium on Principles of Distributed Computing, PODC
2006, Denver, CO, USA, July 23-26, 2006, pages 275–284.
ACM, 2006.
[FLMS05] Faith Ellen Fich, Victor Luchangco, Mark Moir, and Nir Shavit.
Obstruction-free algorithms can be practically wait-free. In
Pierre Fraigniaud, editor, Distributed Computing, 19th Inter-
national Conference, DISC 2005, Cracow, Poland, September
26-29, 2005, Proceedings, volume 3724 of Lecture Notes in
Computer Science, pages 78–92. Springer, 2005.
[GKL15] Juan A. Garay, Aggelos Kiayias, and Nikos Leonardos. The bit-
coin backbone protocol: Analysis and applications. In Elisabeth
Oswald and Marc Fischlin, editors, Advances in Cryptology -
EUROCRYPT 2015 - 34th Annual International Conference on
the Theory and Applications of Cryptographic Techniques, Sofia,
Bulgaria, April 26-30, 2015, Proceedings, Part II, volume 9057
of Lecture Notes in Computer Science, pages 281–310. Springer,
2015.
[GW12a] George Giakkoupis and Philipp Woelfel. On the time and space
complexity of randomized test-and-set. In Darek Kowalski and Alessandro Panconesi, editors, ACM Symposium on Principles of Distributed Computing, PODC ’12, Funchal, Madeira, Portugal, July 16–18, 2012, pages 19–28. ACM, 2012.
[LPS23] Jacob Leshno, Rafael Pass, and Elaine Shi. Can open decen-
tralized ledgers be economically secure? Cryptology ePrint
Archive, Paper 2023/1516, 2023. https://eprint.iacr.org/
2023/1516.
[NT87] Gil Neiger and Sam Toueg. Substituting for real time and
common knowledge in asynchronous distributed systems. In
Proceedings of the sixth annual ACM Symposium on Principles
of distributed computing, PODC ’87, pages 281–293, New York,
NY, USA, 1987. ACM.
[NW98] Moni Naor and Avishai Wool. The load, capacity, and avail-
ability of quorum systems. SIAM J. Comput., 27(2):423–447,
1998.
[RST01] Yaron Riany, Nir Shavit, and Dan Touitou. Towards a practical
snapshot algorithm. Theor. Comput. Sci., 269(1-2):163–201,
2001.
[vABHL03] Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John
Langford. CAPTCHA: Using hard AI problems for security. In
Eli Biham, editor, Advances in Cryptology — EUROCRYPT
2003, pages 294–311, Berlin, Heidelberg, 2003. Springer Berlin
Heidelberg.
[Zhu16] Leqi Zhu. A tight space bound for consensus. In Daniel Wichs
and Yishay Mansour, editors, Proceedings of the 48th Annual
ACM SIGACT Symposium on Theory of Computing, STOC
2016, Cambridge, MA, USA, June 18-21, 2016, pages 345–350.
ACM, 2016.
Index