LTCI, Télécom Paris, Institut Polytechnique de Paris, France. joaopba01@gmail.com
Nomadic Labs, Paris, France. lucianofdz@gmail.com
LTCI, Télécom Paris, Institut Polytechnique de Paris, France. petr.kuznetsov@telecom-paris.fr
LTCI, Télécom Paris, Institut Polytechnique de Paris, France. matthieu.rambaud@telecom-paris.fr
Copyright: J. P. Bezerra, L. Freitas, P. Kuznetsov, and M. Rambaud
CCS concepts: Theory of computation → Design and analysis of algorithms → Distributed algorithms

Asynchronous Latency and Fast Atomic Snapshot

João Paulo Bezerra    Luciano Freitas    Petr Kuznetsov    Matthieu Rambaud
Abstract

This paper introduces a novel, fast atomic-snapshot protocol for asynchronous message-passing systems. In the process of defining what “fast” means exactly, we spot a few interesting issues that arise when conventional time metrics are applied to long-lived asynchronous algorithms. We reveal some gaps in latency claims made in earlier work on snapshot algorithms, which hamper their comparative time-complexity analysis. We then come up with a new unifying time-complexity metric that captures the latency of an operation in an asynchronous, long-lived implementation. This allows us to formally grasp latency improvements of our atomic-snapshot algorithm with respect to the state-of-the-art protocols: optimal latency in fault-free runs without contention, short constant latency in fault-free runs with contention, the worst-case latency proportional to the number of active concurrent failures, and constant, amortized latency.

keywords:
Asynchronous systems, time complexity, atomic snapshot, crash faults

1 Introduction

The distributed snapshot abstraction [13, 25] allows us to determine a consistent view of the global system state. Originally proposed in the asynchronous fault-free message-passing context, it was later cast to shared-memory models [3] as a vector of shared variables, exporting an update operation that writes to one of them and a snapshot operation that returns the current vector state. Atomic snapshot can be implemented from conventional read-write registers in a wait-free manner, i.e., tolerating unpredictable delays or failures of any number of processes. By applying the reduction from shared memory to message-passing [6], one can get an asynchronous distributed atomic-snapshot implementation that tolerates up to a minority of faulty processes. The atomic-snapshot object (ASO) is, in a strong sense, equivalent to lattice agreement (LA) [8, 16]: one can implement the other with no time overhead. (Lattice agreement can be seen as a weak version of consensus, where decided values form totally ordered joins of proposed values in a join semi-lattice.) A long line of results has improved the time and space complexities of ASO and LA algorithms in shared-memory [5, 4, 7] and message-passing [16, 20, 18, 15, 17] models.

In this paper, we focus on the latency of operations in message-passing ASO implementations. We propose an LA (and, thus, ASO) algorithm that is faster than (or matches) state-of-the-art solutions in all execution scenarios: with or without failures and with or without contention. The comparative analysis of our algorithm with respect to the existing work appeared to be challenging: as we show below, earlier work considered diverging metrics and execution scenarios, and sometimes used over-simplified reasoning. We observed that conventional metrics [12, 6, 2] are not always suitable for long-lived asynchronous algorithms. Besides, prior latency analyses of ASO and LA algorithms [16, 20, 18, 15, 17] used different ways to measure time, which complicated the comparison. We therefore propose a unifying time-complexity analysis of prior asynchronous ASO and LA algorithms with respect to a new metric, which we take as a contribution on its own.

| Algorithm | Fault-free, w/o contention | Fault-free, w/ contention | Worst-case | Amortized constant |
|---|---|---|---|---|
| Faleiro et al. [16] | $2$ | $16$ | $O(k)$ | yes |
| Imbs et al. [20] | $2$ | $O(n)$ | $O(n)$ | no |
| Garg et al. [17] | $\geq 6$ | $\geq 8$ | $O(k)$ | yes |
| Garg et al. [17] + Zheng et al. [26] | $O(\log n)$ | $O(\log n)$ | $O(\log n)$ | no |
| Delporte et al. [15] | $2$ | $O(n)$ | $O(n)$ | no |
| This paper | $2$ | $8$ | $O(k)$ | yes |

Table 1: Comparative time complexity of atomic-snapshot algorithms in asynchronous message-passing models. The table shows results for Single-Writer Multi-Reader (SWMR) implementations.

Lamport [24] proposed to measure time in asynchronous systems as the length of the longest chain of causally related messages, the metric used to determine the best-case latency of consensus [24] and Crusader Agreement [1]. However, as we show in this paper, the metric may produce counter-intuitive results for protocols involving all-to-all communication. For instance, in the failure-free case, the $n$-process reliable broadcast [11] exhibits a causal chain of $n$ hops, even though, intuitively, one expects it to terminate in one.

Building upon the classical approach by Canetti and Rabin [12], Abraham et al. [2] recently proposed an elegant metric to grasp the good-case latency of broadcast protocols. We observe, however, that the metric does not really apply to executions of long-lived abstractions, which may contain holes – periods of inactivity when no protocol messages are in transit. Moreover, we get diverging results when applying [2] and [12] to operation latency, i.e., the time between invocation and response events of a given operation.

We therefore extend the round-based approach to long-lived abstractions (such as ASO and LA) and establish a framework to measure the time between arbitrary events, subsequently showing that the results align with those from earlier classical metrics [6, 12].

To summarize, our main contribution is a novel LA (and, thus, ASO) protocol that is generally faster than prior solutions, i.e., it exhibits shorter latency of its operations in various scenarios. In our complexity analysis, we compared our protocol to the original long-lived LA algorithm by Faleiro et al. [16] (we consider the ASO protocol built atop the lattice agreement protocol proposed in [16]), the first direct message-passing ASO implementation by Delporte et al. [15], the ASO algorithm based on the set-constraint broadcast by Imbs et al. [20], and the ASO algorithms by Garg et al. based on a generic construction of ASO from one-shot LA with constant latency in fault-free runs [17] or $\log n$ worst-case latency by Zheng et al. [26] (where $n$ is the number of processes).

As shown in Table 1, in a fault-free run, the latency of an operation of our protocol is the optimal two rounds if there is no contention and eight rounds in the presence of contention (four rounds if we ignore the “buffering” period when a value is submitted but not yet proposed), regardless of the number of contending operations. Moreover, the worst-case latency of our algorithm is proportional to the number of active failures $k$, i.e., the number of faulty processes whose messages are received within the operation’s interval. Therefore, the amortized latency (averaged over a large number of operations in a long-lived execution) converges to the fault-free constant.

Our protocol can be seen as a novel combination of techniques employed separately in prior work. These include the use of generalized (long-lived) lattice agreement as a basis for ASO [22], the helping mechanism where all the learned lattice values are shared [22], relaying of messages to all replicas instead of quorum-based rounds [20, 17, 21, 14], and buffering proposed values until previous proposals get committed [16]. Similar to earlier proposals [16], our algorithm involves $O(n^{2})$ (all-to-all) communication, which is compensated by its constant (amortized) latency. An interesting open question is whether one can reduce the communication cost in good runs, while maintaining constant amortized latency.

The paper is organized as follows. In Section 2, we present our model assumptions, and in Section 3, we state the problem of atomic snapshot and relate it to generalized lattice agreement. In Section 4, we present our protocol and analyze its correctness. In Section 5, we discuss several gaps in the complexity analyses of earlier work. In Section 6, we present a comparative analysis of time metrics. Certain proofs and a detailed discussion of time complexity of earlier protocols are delegated to the appendix.

2 System Model

Processes and Channels. We consider a system of $n$ processes (or nodes). Processes communicate by exchanging messages $m=(s,r,\textit{data})$ with a sender $s$, a receiver $r$, and a message content $\textit{data}$.

A process is an automaton modeled as a tuple $(\mathcal{I},\mathcal{O},\mathcal{Q},q_{0},\pi)$, where $\mathcal{I}$ is a set of inputs (messages and application calls) it can receive, $\mathcal{O}$ is a set of outputs (messages and application responses), $\mathcal{Q}$ is a (potentially infinite) set of possible internal states, $q_{0}\in\mathcal{Q}$ is an initial state, and $\pi:2^{\mathcal{I}}\times\mathcal{Q}\rightarrow 2^{\mathcal{O}}\times\mathcal{Q}$ is a transition function mapping a set of inputs and a state to a set of outputs and a new state. Each process $i$ is assigned an algorithm $A_{i}$, which defines $(\mathcal{I},\mathcal{O},\mathcal{Q},q_{0},\pi)$; a distributed algorithm is an array $[A_{1},\ldots,A_{n}]$.

Events and Configurations. Application calls and responses are tuples $(i,\textit{aReq})$ and $(i,\textit{aRep})$, each consisting of a process identifier and, respectively, a request or a reply.

An event $e$ is a tuple $(R,P,S)$, where $R$ is a set of received messages and/or application calls, $P$ is the set of nodes producing the event, and $S$ is a set of messages sent and/or application responses. We denote by $\textsf{receive}(e)$ the set of messages received in the event; conversely, $\textsf{send}(e)$ is the set of messages sent. A message hop is a pair $(e,e')$ in which $e'$ receives at least one message that was sent in $e$.

Messages in transit are stored in the message buffer. (We assume that every message in the message buffer is unique.) A configuration $C$ is an $(n+1)$-array $[M,s_{1},\ldots,s_{n}]$ with the buffer’s state $M=C[0]$ and the local state $s_{i}=C[i]$ of each node $i$ ($i=1,\ldots,n$). Let $C_{0}$ denote the initial configuration in which every $s_{i}$ is an initial state and the buffer $M$ is empty.

Executions. An execution (or run) is an alternating sequence $C_{0}e_{1}C_{1}e_{2}\ldots$ of configurations and events, where for each $j>0$ and $i=1,\ldots,n$:

  1. $\textsf{receive}(e_{j})\subseteq C_{j-1}[0]$;

  2. $e_{j}.S$ consists of messages and application outputs that the nodes in $e_{j}.P$ produce, given their algorithms, their states in $C_{j-1}$ and their inputs in $e_{j}.R$; the nodes in $e_{j}.P$ carry their states from $C_{j-1}$ to $C_{j}$, accordingly;

  3. for the nodes $i\notin e_{j}.P$, $C_{j-1}[i]=C_{j}[i]$.

Each triple $C_{j-1}e_{j}C_{j}$ is called a step. In this paper, we consider algorithms defined by deterministic automata, and we assume a default initial state. Thus, we sometimes skip configurations and simply write $e_{1}e_{2}\ldots$.

In an infinite execution, a process is correct if it takes part in infinitely many steps, and faulty otherwise. We only consider infinite executions in which $f<n/2$, where $f$ is the number of faulty processes and $n$ is the total number of processes. Moreover, in an infinite execution, messages exchanged among correct processes are eventually received, i.e., if there is an event $e$ from a correct process sending a message $m$ to another correct process, then there is an event $e'$ succeeding $e$ such that $m\in\textsf{receive}(e')$.

We also assume that the communication channels neither alter nor create messages. Finally, we assume that the channels are FIFO: messages from a given source to a given destination arrive in the order they were sent. A FIFO channel can be implemented by attaching sequence numbers to messages, without extra communication or time overhead.
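The sequence-number construction mentioned above can be sketched as follows. This is a minimal illustration, not part of the paper's protocol: the class names `FifoSender` and `FifoReceiver` and the method names are hypothetical, and the sketch covers a single (source, destination) channel in which messages may arrive out of order but are neither lost nor duplicated.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class FifoSender:
    """Sender side of one channel (hypothetical helper): stamps messages."""
    next_seq: int = 0

    def stamp(self, payload: Any) -> Tuple[int, Any]:
        """Attach the next sequence number to the payload."""
        msg = (self.next_seq, payload)
        self.next_seq += 1
        return msg


@dataclass
class FifoReceiver:
    """Receiver side of one channel (hypothetical helper): reorders arrivals."""
    next_seq: int = 0                                       # next number to deliver
    pending: Dict[int, Any] = field(default_factory=dict)   # early arrivals

    def on_arrival(self, seq: int, payload: Any) -> List[Any]:
        """Buffer the message; return all payloads now deliverable, in send order."""
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

The receiver only buffers locally, so the construction adds no messages and no extra communication steps, which matches the claim that FIFO comes without communication or time overhead.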

3 Lattice Agreement and Atomic Snapshot

3.1 Lattice Agreement

A join semi-lattice is defined as a tuple $(\mathcal{L},\sqsubseteq)$, where $\sqsubseteq$ is a partial order on a set $\mathcal{L}$, such that for any pair of values $u$ and $v$ in $\mathcal{L}$, there exists a unique least upper bound $u\sqcup v\in\mathcal{L}$ ($\sqcup$ is called the join operator). Also, $u$ and $v$ in $\mathcal{L}$ are said to be comparable if $u\sqsubseteq v\vee v\sqsubseteq u$.
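As a concrete instance of this definition, consider finite sets ordered by inclusion, with set union as the join. The sketch below is illustrative only; the function names `leq`, `join`, `comparable`, and `join_all` are chosen here and do not come from the paper.

```python
from functools import reduce
from typing import FrozenSet, Iterable

Value = FrozenSet[str]


def leq(u: Value, v: Value) -> bool:
    """The partial order: u is below v iff u is a subset of v."""
    return u <= v


def join(u: Value, v: Value) -> Value:
    """Least upper bound of u and v under inclusion: their union."""
    return u | v


def comparable(u: Value, v: Value) -> bool:
    """u and v are comparable iff one is below the other."""
    return leq(u, v) or leq(v, u)


def join_all(values: Iterable[Value]) -> Value:
    """Join of a finite collection; the empty set is the bottom element."""
    return reduce(join, values, frozenset())
```

For example, `{"a"}` and `{"b"}` are incomparable, while each of them is comparable with their join `{"a", "b"}`.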

The (generalized) Lattice Agreement abstraction LA [16] defined over $(\mathcal{L},\sqsubseteq)$ can be accessed by every node with operation $\textsf{Propose}(v)$, $v\in\mathcal{L}$ (we say that the node proposes $v$), which triggers the reply event $\textsf{Learn}(w)$ (we say that the node learns $w$). Each node may invoke Propose any number of times but does so sequentially, that is, it initiates a new operation only after the previous one has returned. (Following [22], without loss of generality, we slightly modified the conventional LA interface [8, 16] by introducing the explicit Propose operation that combines proposing and learning the values; the properties of the abstraction are adjusted accordingly.) The abstraction must satisfy:

Definition 3.1 (Lattice Agreement (LA)).
  • Validity. Any value learned by a node is the join of some set of proposed values that includes its last proposal.

  • Stability. The values learned by any node increase monotonically, with respect to $\sqsubseteq$.

  • Consistency. All values learned are comparable, with respect to $\sqsubseteq$.

  • Liveness. If a correct node proposes $v$, it eventually learns a value $w$.

3.2 Atomic Snapshot Object (ASO)

An atomic snapshot object (ASO) stores a vector of values $R=[r_{1},\ldots,r_{m}]$ and exports two operations: $\textsf{update}(i,v)$ and $\textsf{snapshot}()$. The $\textsf{update}(i,v)$ operation writes the value $v$ in $R[i]$ and returns OK, and $\textsf{snapshot}()$ returns the entire vector $R$. An ASO implementation guarantees that every operation invoked by a correct process eventually completes. It also ensures that each of its operations appears to take effect at a single instant of time within its interval, i.e., it is linearizable [19].

Linearizable executions. The history of an execution $E$ is the subsequence of $E$ consisting of invocations and responses of ASO operations (update and snapshot). A history is sequential if each of its invocations is followed by a matching response. An execution is linearizable if, to each of its operations (update or snapshot, except, possibly, for incomplete ones), we can assign an indivisible point within its interval (called a linearization point), so that the operations put in the order of their linearization points constitute a legal sequential history of ASO (called a linearization), i.e., every snapshot operation returns a vector where every position contains the last value written to it (using an update operation), or the initial value if there are no such prior updates. Equivalently, a linearizable execution $E$ with history $H$ should have a linearization $S$, a legal sequential history such that (1) no node can locally distinguish a completion of $H$ and $S$, and (2) $S$ respects the real-time order of $H$, i.e., if operation $op$ completes before operation $op'$ in $H$, then $op'$ cannot precede $op$ in $S$.

We say that an ASO is single-writer (SW) (resp. multi-writer, MW) if, for each of its registers $R[i]$, only a single process can call $\textsf{update}(i,v)$ (resp. every process can call $\textsf{update}(i,v)$). In this paper, we focus mostly on SWMR atomic snapshot objects, and Table 1 gives results only for SWMR. An MWMR ASO can be devised from an SWMR one by adding a “read” phase when updating values (see Section 3.3 for more details).

Next, we show that ASO can be implemented on top of LA with no additional overhead.

3.3 From LA to ASO

To implement an SWMR ASO on top of LA, we consider a partially ordered set $\mathcal{L}^{*}$ of $(m+n)$-vectors (recall that $m$ is the size of the ASO vector and $n$ is the number of nodes), defined as follows.

A vector position $\ell\in\{1,\ldots,m\}$ is defined as a tuple $(w,v)\in R_{\ell}$, where $v$ is an element of a value set $V$ equipped with a total order $\leq^{V}$, and $w\in\mathbb{N}$ is the number of write operations on position $\ell$. A total order on $R_{\ell}$ is defined in the natural way: for any two tuples, $(w_{1},v_{1})\leq^{R_{\ell}}(w_{2},v_{2})\equiv(w_{1}<w_{2})\vee(w_{1}=w_{2}\wedge v_{1}\leq^{V}v_{2})$. For each process $i=1,\ldots,n$, the vector position $m+i$ stores the number of snapshot operations executed by $i$.

The lattice $\mathcal{L}^{*}$ of $(m+n)$-position vectors is then the composition $R_{1}\times\ldots\times R_{m}\times\mathbb{N}^{n}$. The partial order $\sqsubseteq^{*}$ on $\mathcal{L}^{*}$ is naturally defined as the composition $\leq^{R_{1}}\times\ldots\times\leq^{R_{m}}\times\leq^{n}$, i.e., vectors are compared position by position. The composed join operator $\sqcup^{*}$ is the composition of $\max$ operators, one for each position in the $(m+n)$-position vectors. The construction implies a join semi-lattice [22].
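A minimal sketch of this vector lattice, under assumptions made here for illustration: values in $V$ are integers, the initial register entry is `(0, 0)`, and vectors are Python tuples. Python compares tuples lexicographically, which coincides with the order on $R_{\ell}$ defined above. The names `bottom`, `join`, and `leq` are hypothetical.

```python
from typing import Tuple, Union

# A register position holds (w, v); a snapshot-counter position holds an int.
Entry = Union[Tuple[int, int], int]
Vector = Tuple[Entry, ...]


def bottom(m: int, n: int) -> Vector:
    """Initial (m+n)-vector: m empty registers followed by n zero counters."""
    return tuple([(0, 0)] * m + [0] * n)


def join(a: Vector, b: Vector) -> Vector:
    """Position-wise max; (w, v) pairs are compared lexicographically."""
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a: Vector, b: Vector) -> bool:
    """Position-wise order: a is below b iff every entry of a is <= that of b."""
    return all(x <= y for x, y in zip(a, b))
```

With $m=n=2$, the vectors `((1, 5), (0, 0), 0, 0)` and `((0, 0), (2, 7), 1, 0)` are incomparable, and their join `((1, 5), (2, 7), 1, 0)` dominates both.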

In Algorithm 1, we show how to implement an SWMR atomic snapshot on top of LA defined over the semi-lattice $(\mathcal{L}^{*},\sqsubseteq^{*},\sqcup^{*})$. For simplicity, we assume that $m=n$, i.e., the size of the array is the total number of nodes, and that each node $i$ has a dedicated register $i$ where it can write. Elements of $\mathcal{L}^{*}$ are then $2n$-vectors.

When a node $i$ calls $\textsf{update}(i,v)$, it increments its local writing sequence number $w$ and proposes a $2n$-vector with $(w,v)$ in position $i$ and initial values in all other positions to the LA object. The vector learned from this proposal is ignored. When the node $i$ calls $\textsf{snapshot}()$, it increments its local reading sequence number $r$ and proposes a $2n$-vector with $r$ in position $n+i$ and initial values in all other positions to the LA object. The values in the first $n$ positions of the returned vector are then returned as the snapshot outcome.

1:Distributed objects:
2:  LA instance on $(\mathcal{L}^{*},\sqsubseteq^{*},\sqcup^{*})$
3:upon startup
4:  $w\leftarrow 0$
5:  $r\leftarrow 0$
6:operation update($i,v$)
7:  $w\leftarrow w+1$
8:  $V\leftarrow$ $2n$-vector with $(w,v)$ in position $i$ and initial values in all other positions
9:  $\textsc{LA}.\textsf{Propose}(V)$
10:operation snapshot()
11:  $r\leftarrow r+1$
12:  $V\leftarrow$ $2n$-vector with $r$ in position $n+i$ and initial values in all other positions
13:  return $(\textsc{LA}.\textsf{Propose}(V))[1..n]$
Algorithm 1 $\textsc{LA}\to$ SWMR ASO conversion.

Algorithm 1 can be extended to implement an MWMR ASO: to update a position $j$ in the array, a node first takes a snapshot to get the current state, obtains the up-to-date sequence number in position $j$, and proposes its value with a higher sequence number. With this modification, the update operation takes two LA operations instead of one. We refer the reader to [22] for further details.
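The SWMR conversion of Algorithm 1 can be sketched in Python as follows. Everything here is illustrative: `LocalLA` is a hypothetical sequential stand-in that simply accumulates joins in one process (it is not a distributed LA implementation), positions are 0-indexed, and the initial register entry is assumed to be `(0, None)`. Because each register has a single writer, the sketch compares register entries by sequence number only.

```python
class LocalLA:
    """Sequential stand-in for an LA object (assumption, not the paper's protocol)."""

    def __init__(self, n):
        self.n = n
        self.learned = [(0, None)] * n + [0] * n

    def propose(self, vec):
        """Join vec into the state and return the learned 2n-vector."""
        joined = []
        for pos, (old, new) in enumerate(zip(self.learned, vec)):
            if pos < self.n:
                # Register entry (w, v): the higher sequence number wins.
                joined.append(new if new[0] > old[0] else old)
            else:
                # Snapshot counter: plain max.
                joined.append(max(old, new))
        self.learned = joined
        return tuple(joined)


class AsoNode:
    """Node i of Algorithm 1, with a dedicated register at position i."""

    def __init__(self, i, n, la):
        self.i, self.n, self.la = i, n, la
        self.w = 0  # local writing sequence number
        self.r = 0  # local reading sequence number

    def _initial(self):
        return [(0, None)] * self.n + [0] * self.n

    def update(self, v):
        self.w += 1
        vec = self._initial()
        vec[self.i] = (self.w, v)
        self.la.propose(tuple(vec))  # the learned vector is ignored
        return "OK"

    def snapshot(self):
        self.r += 1
        vec = self._initial()
        vec[self.n + self.i] = self.r
        learned = self.la.propose(tuple(vec))
        return [v for (_, v) in learned[: self.n]]
```

With three nodes sharing one `LocalLA`, an update of `"a"` by node 0 and `"c"` by node 2 followed by a snapshot at node 1 yields `["a", None, "c"]`.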

Theorem 3.2.

Algorithm 1 implements ASO.

Proof 3.3.

We show that every execution of Algorithm 1 is linearizable.

Consider an execution of Algorithm 1, and let $H$ be its history. Every operation (snapshot or update) is associated with a unique sequence number and performs a Propose operation on the LA object. If there is an $\textsc{LA}.\textsf{Propose}$ operation that returns $(w,v)$ in position $i$, then, by Validity of LA, there is an operation $\textsf{update}(i,v)$ executed by node $i$ with sequence number $w$ that started before this $\textsc{LA}.\textsf{Propose}$ completed and invoked $\textsc{LA}.\textsf{Propose}$. In this case, we say that the update operation is successful. Notice that, by Validity of LA, the update must have invoked $\textsc{LA}.\textsf{Propose}$ with a vector containing $(w,v)$ in position $i$.

Now we order complete snapshot operations and complete successful update operations in the order of the values returned by their $\textsc{LA}.\textsf{Propose}$ operations (by Consistency of LA, these values are totally ordered). As each of these $\textsc{LA}.\textsf{Propose}$ operations returns a value containing its unique sequence number (Stability of LA), this order respects the real-time order of $H$. A successful update operation performed by node $i$ with $(w,v)$ in position $i$ that has no complete $\textsc{LA}.\textsf{Propose}$ is placed right before the first snapshot whose $\textsc{LA}.\textsf{Propose}$ returns this value. By construction, the resulting sequential history is legal and locally indistinguishable from a completion of $H$.

Finally, Liveness implies that every operation invoked by a correct process eventually completes.

4 LA Protocol

In Algorithm 2, we describe our protocol for solving LA. To guarantee amortized constant complexity, the protocol relies on two basic mechanisms, employed separately in earlier work [16, 22]. First, when a node receives a request (e.g., a value from the application), it first adds the request to a buffer ($\mathit{MPool}$) and then relays it before starting a proposal. This ensures that “idle” nodes also help in committing the request. Second, the node relays every learned value so that nodes that are “stuck” can adopt values from other nodes.

4.1 Overview

The protocol is based on helping: every node tries to commit every proposed value it is aware of. As long as the node has active proposals that are not yet committed, it buffers newly arriving proposals in the local variable $\mathit{MPool}$. Intuitively, in the worst case, an $\textsc{LA}.\textsf{Propose}$ operation has to wait until one of the concurrently invoked $\textsc{LA}.\textsf{Propose}$ operations completes. Once this happens, the currently buffered value is put in the local dictionary $\mathit{Pending}$ and shared with the other nodes (lines 31 and 32) via a PROPOSE message. In turn, the other nodes relay the message to each other (line 38). The dictionary maps a value to the number of times it is “supported” by the nodes (using PROPOSE messages). Once a value $v$ in the dictionary assembles a quorum of $n-f$ $\langle\textbf{PROPOSE},v\rangle$ messages, i.e., $\mathit{Pending}[v]\geq n-f$ (line 39), the value is added to the $\mathit{Validated}$ variable. Once every value currently stored in $\mathit{Pending}$ is in $\mathit{Validated}$ (line 41), the operation completes with $\mathit{Validated}$ as the learned value. As the final element of the helping mechanism, each process broadcasts every value it learns (lines 45 and 51), ensuring that processes that might otherwise remain “stuck” can complete their current proposal.

In summary, the algorithm relies on four main ideas: 1) buffering incoming requests when already proposing, 2) sharing every received proposal so all processes are quickly aware of active ones, 3) initiating a new proposal only after all currently seen proposals have been validated, and 4) broadcasting learned values to help other processes make progress.

Message complexity. The protocol comprises three all-to-all communication phases: processes send and relay requests at lines 23 and 27, proposals at lines 32 and 38, and accepted values at lines 45 and 51. The total number of messages is therefore $O(n^{2})$. However, a value in a PROPOSE message can include up to $n$ distinct requests, and a value in an ACCEPT message may have arbitrary size. Therefore, in Appendix A, we present a refined protocol description in which processes exchange $O(n^{2})$ messages per individual request. This efficiency is achieved by relaying only the differences between current and previously received proposals and the learned values in phases 2 and 3, thus eliminating redundant messages with the same requests.

14:upon Startup
15:  $\mathit{MPool},\mathit{Proposing},\mathit{Validated},\mathit{Learned}\leftarrow\perp$
16:  $\mathit{Pending}\leftarrow\emptyset$
17:operation Propose($v$)
18:  SendRequest($v$)
19:  wait until $v\sqsubseteq\mathit{Learned}$
20:  return $\mathit{Learned}$
21:operation SendRequest($v$)
22:  $\mathit{MPool}\leftarrow\mathit{MPool}\sqcup v$
23:  send $\langle\textbf{REQUEST},v\rangle$ to every other node
24:upon Receive $\langle\textbf{REQUEST},v\rangle$ from a node
25:  if $v\not\sqsubseteq\mathit{MPool}\sqcup\mathit{Proposing}\sqcup\mathit{Learned}$ then
26:   $\mathit{MPool}\leftarrow\mathit{MPool}\sqcup v$
27:   send $\langle\textbf{REQUEST},v\rangle$ to every other node
28:upon event $(\mathit{MPool}\neq\perp)\wedge(\mathit{Proposing}=\perp)$
29:  $\mathit{Proposing}\leftarrow\mathit{MPool}$
30:  $\mathit{MPool}\leftarrow\perp$
31:  $\mathit{Pending}[\mathit{Proposing}]\leftarrow 1$
32:  send $\langle\textbf{PROPOSE},\mathit{Proposing}\rangle$ to every other node
33:upon Receive $\langle\textbf{PROPOSE},v\rangle$ from a node
34:  if $v\in\mathit{Pending}.\textsf{keys}()$ then
35:   $\mathit{Pending}[v]$++
36:  else
37:   $\mathit{Pending}[v]\leftarrow 1$
38:   send $\langle\textbf{PROPOSE},v\rangle$ to every node
39:upon exists $v$ s.t. $\mathit{Pending}[v]=n-f$
40:  $\mathit{Validated}\leftarrow\mathit{Validated}\sqcup v$
41:upon event $\bigsqcup\mathit{Pending}.\textsf{keys}()\sqsubseteq\mathit{Validated}$
42:  if $\mathit{Learned}\sqsubset\mathit{Validated}$ then
43:   $\mathit{Learned}\leftarrow\mathit{Validated}$
44:   $\mathit{Proposing}\leftarrow\perp$
45:   send $\langle\textbf{ACCEPT},\mathit{Learned}\rangle$ to every node
46:upon Receive $\langle\textbf{ACCEPT},w\rangle$ from a node
47:  if $(\mathit{Proposing}\sqcup\mathit{Learned}\sqsubseteq w)$ then
48:   $\mathit{Validated}\leftarrow\mathit{Validated}\sqcup w$
49:   $\mathit{Learned}\leftarrow w$
50:   $\mathit{Proposing}\leftarrow\perp$
51:   send $\langle\textbf{ACCEPT},\mathit{Learned}\rangle$ to every node

Algorithm 2 Long-Lived LA: code for node $x$.
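To make the message flow of Algorithm 2 concrete, the following is a small fault-free simulation sketch. It relies on several assumptions that are not in the pseudocode: lattice values are modeled as finite sets with union as the join and the empty set as $\perp$; the network is a single global FIFO queue (which implies per-channel FIFO); the PROPOSE relay of line 38 is also delivered to the relaying node itself; and ACCEPT messages are sent only to other nodes and relayed only when the learned value strictly grows, so that the simulation quiesces. The names `Node`, `run`, `step`, and `broadcast` are hypothetical.

```python
from collections import deque

BOTTOM = frozenset()  # models the bottom value; join is set union (assumption)


class Node:
    def __init__(self, ident, n, f, net):
        self.id, self.n, self.f, self.net = ident, n, f, net
        self.mpool = self.proposing = self.validated = self.learned = BOTTOM
        self.pending = {}  # value -> number of supporting PROPOSE messages

    def broadcast(self, kind, value, include_self):
        for dst in range(self.n):
            if include_self or dst != self.id:
                self.net.append((dst, kind, value))

    def send_request(self, v):  # lines 21-23
        self.mpool |= v
        self.broadcast("REQUEST", v, include_self=False)
        self.step()

    def receive(self, kind, v):
        if kind == "REQUEST":  # lines 24-27
            if not v <= (self.mpool | self.proposing | self.learned):
                self.mpool |= v
                self.broadcast("REQUEST", v, include_self=False)
        elif kind == "PROPOSE":  # lines 33-40
            if v in self.pending:
                self.pending[v] += 1
            else:
                self.pending[v] = 1
                self.broadcast("PROPOSE", v, include_self=True)
            if self.pending[v] == self.n - self.f:
                self.validated |= v
        elif kind == "ACCEPT":  # lines 46-51
            if (self.proposing | self.learned) <= v:
                grew = v != self.learned
                self.validated |= v
                self.learned = v
                self.proposing = BOTTOM
                if grew:  # assumption: relay only new information
                    self.broadcast("ACCEPT", v, include_self=False)
        self.step()

    def step(self):
        seen = frozenset().union(*self.pending)  # join of Pending.keys()
        if seen <= self.validated and self.learned < self.validated:  # lines 41-45
            self.learned = self.validated
            self.proposing = BOTTOM
            self.broadcast("ACCEPT", self.learned, include_self=False)
        if self.mpool != BOTTOM and self.proposing == BOTTOM:  # lines 28-32
            self.proposing, self.mpool = self.mpool, BOTTOM
            self.pending[self.proposing] = 1
            self.broadcast("PROPOSE", self.proposing, include_self=False)


def run(n, f, requests):
    """Submit (node, value) requests, then deliver messages until quiescence."""
    net = deque()
    nodes = [Node(i, n, f, net) for i in range(n)]
    for ident, v in requests:
        nodes[ident].send_request(frozenset(v))
    while net:
        dst, kind, v = net.popleft()
        nodes[dst].receive(kind, v)
    return nodes
```

For instance, `run(3, 1, [(0, {"a"}), (1, {"b"})])` submits two concurrent requests at nodes 0 and 1; once the queue drains, every node's `learned` value contains both requests. The blocking `wait until` of line 19 is not modeled: the sketch only inspects `learned` after the run.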

4.2 Correctness

Validity and Stability are immediate. We now proceed with Consistency and Liveness.

Lemma 4.1.

If nodes $i$ and $j$ learn, resp., values $w_{i}$ and $w_{j}$, then $w_{i}$ and $w_{j}$ are comparable.

Proof 4.2.

Suppose that $(w_{i}\not\sqsubseteq w_{j})\wedge(w_{j}\not\sqsubseteq w_{i})$. Then there must exist $v_{i}\sqsubseteq w_{i}$ and $v_{j}\sqsubseteq w_{j}$ such that $v_{i}\not\sqsubseteq w_{j}$ and $v_{j}\not\sqsubseteq w_{i}$.

Let $Q_{i}$ (resp. $Q_{j}$) be the quorum that $i$ (resp. $j$) used to include $v_{i}$ (resp. $v_{j}$) in $\mathit{Validated}$ at line 39. Since $Q_{i}\cap Q_{j}\neq\emptyset$, there is a common node $x$ that sent $\langle\textbf{PROPOSE},v_{i}\rangle$ to $i$ and $\langle\textbf{PROPOSE},v_{j}\rangle$ to $j$. Since channels are FIFO, either $i$ received $v_{j}$ or $j$ received $v_{i}$ from $x$ before learning a value, and therefore added the value to $\mathit{Pending}$. Suppose it was $i$ that received $v_{j}$ before $v_{i}$. By the condition of line 41, $i$ could not have learned $w_{i}$ while $v_{j}\not\sqsubseteq\mathit{Validated}$; hence $v_{j}\sqsubseteq w_{i}$, a contradiction.
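The quorum-intersection step used in this proof follows from the model assumption $f<n/2$ by a counting argument, spelled out here for completeness. Each quorum contains at least $n-f$ of the $n$ nodes, so:

```latex
\[
  |Q_i \cap Q_j| \;\geq\; |Q_i| + |Q_j| - n \;\geq\; 2(n-f) - n \;=\; n - 2f \;>\; 0 .
\]
```

For example, with $n=5$ and $f=2$, any two quorums of size $3$ share at least one node.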

Lemma 4.3.

If a correct node $x$ sets $\mathit{Proposing}=v$, $x$ eventually learns a value that includes $v$.

Proof 4.4.

A node $x$ sends a PROPOSE message to every other node whenever it adds a new value to $\mathit{Pending}$ (line 38). If $x$ is correct, it will receive at least $n-f$ PROPOSE messages for every value in $\mathit{Pending}$, adding the value to $\mathit{Validated}$. Therefore, the condition in line 41 is never satisfied from some point on only if $x$ keeps adding a new value to $\mathit{Pending}$ before all the current ones are validated.

Since each node proposes only one value at a time (until it learns a value, lines 28, 44, 50), for $x$ to indefinitely add new values to $\mathit{Pending}$, there must be at least one other node that keeps learning values and proposing new ones. Without loss of generality, let $y$ be one such node. Since faulty nodes eventually crash and stop taking steps, $y$ must be correct. Every time $y$ learns a new value $w$, it sends $\langle\textbf{ACCEPT},w\rangle$ to $x$, and because channels are FIFO, $x$ receives the ACCEPT message before the new value proposed by $y$. Eventually (because $x$ sent its proposal to $y$), one of the received values $w$ contains $x$'s $\mathit{Proposing}$ and the condition on line 47 is satisfied; $x$ then learns $w$.

Lemma 4.5.

If a correct node calls $\textsf{Propose}(v)$, it eventually sets $\mathit{Proposing}=v'$, $v\sqsubseteq v'$.

Proof 4.6.

Let a correct node $x$ call $\textsf{Propose}(v)$; $x$ then includes $v$ in $\mathit{MPool}$ (line 22). If $x$ is not currently proposing, that is, the current value of $\mathit{Proposing}$ is $\perp$, then it meets the condition in line 28 and immediately sets $\mathit{Proposing}=\mathit{MPool}$. Otherwise, by Lemma 4.3, it eventually learns a value and sets $\mathit{Proposing}=\perp$ in line 44 or 50, thus meeting the condition in line 28 and setting $\mathit{Proposing}=\mathit{MPool}$.

Lemmas 4.1, 4.3 and 4.5 imply:

Theorem 4.7.

Algorithm 2 implements Generalized Lattice Agreement.

Corollary 4.8.

Algorithms 1 and 2 implement Atomic Snapshot.

4.3 Time metric

We now define the latency metric we are going to use in evaluating time complexity. Our metric is inspired by the metric proposed by Abraham et al. [2] (which in turn rephrases the original metric by Canetti and Rabin [12]). The distinguishing feature of our approach is that it also applies to long-lived executions and executions with holes (illustrated in Figure 1). (In Section 6, we show that the three metrics are equivalent in “hole-free” executions.)

Algorithm 3 describes the iterative method that assigns rounds to events in an execution. We give an informal description of the metric below.

Definition 4.9 (Iterative Round Assignment - Informal).

Algorithm 3 assigns round $0$ to the initial event, and defines the end of round $i$ as the last event that receives a message sent in round $i-1$. In addition, if there are no more messages to be received (or in transit), the event inherits the round number of its immediate predecessor.

52:$e_{0}^{*}:=e_{0}$
53:$e_{0}$ is assigned round $0$
54:$r:=0$
55:for $i=1,\ldots$ do
56:  if $e_{i}$ does not receive a message then
57:   $e_{i}$ is assigned round $r$
58:  else
59:   Let $e_{j}$ be the oldest event from which $e_{i}$ receives a message
60:   Let $r'$ be the round assigned to $e_{j}$ ($r'\leq r$)
61:   Let $e'$ be the most recent event among $e_{r'}^{*}$ and $e_{j}$
62:   All events after $e'$ and up to $e_{i}$ receive round $r'+1$
63:   $e_{r'+1}^{*}:=e_{i}$
64:   $r:=r'+1$
Algorithm 3 Iterative Round Assignment (IRA)
Figure 1: Example of round assignment using IRA. Arrows represent message transmissions and the number below an event corresponds to its round. A “hole” in communication appears between events $e_{3}$ and $e_{5}$.
Definition 4.10 (IRA - Arbitrary Events).

To measure the latency between two events $e_{i}$ and $e_{j}$, we assign rounds according to Algorithm 3, starting from $e_{i}$, with all events up to and including $e_{i}$ receiving round $0$. The latency between $e_{i}$ and $e_{j}$ is then given by the round assigned to $e_{j}$.
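The round assignment of Algorithm 3 and the latency measurement of Definition 4.10 can be sketched in Python as follows. The input encoding is an assumption made for illustration: an execution is a list `sources`, where `sources[i]` is the set of indices of the events whose messages event $e_i$ receives (empty if $e_i$ receives no message). The function name `assign_rounds` and its `start` parameter are hypothetical.

```python
from typing import List, Set


def assign_rounds(sources: List[Set[int]], start: int = 0) -> List[int]:
    """Assign a round to every event, following Algorithm 3.

    Events up to and including `start` get round 0 (Definition 4.10);
    with start=0 this is the plain assignment from the initial event.
    """
    rounds = [0] * len(sources)
    end_of_round = {0: start}  # e*_r: index of the event that ended round r
    r = 0
    for i in range(start + 1, len(sources)):
        if not sources[i]:
            # No message received: inherit the current round.
            rounds[i] = r
            continue
        j = min(sources[i])                       # oldest sending event
        r_prime = rounds[j]                       # its round
        e_prime = max(end_of_round[r_prime], j)   # most recent of e*_{r'} and e_j
        for k in range(e_prime + 1, i + 1):
            rounds[k] = r_prime + 1
        end_of_round[r_prime + 1] = i
        r = r_prime + 1
    return rounds
```

As a usage example, take `sources = [set(), {0}, {0}, {1}, {2}, set(), {5}]`: event $e_0$ sends to two nodes, whose replies are received in $e_3$ and $e_4$; $e_5$ is a local event after a communication hole, and $e_6$ receives a message sent in $e_5$. The assignment from the initial event is `[0, 1, 1, 2, 2, 2, 3]`, while the latency between $e_5$ and $e_6$, obtained with `start=5`, is one round.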

We say that an application request (or simply request, when there is no ambiguity) completes once the receiving node learns a value that includes the request. For a specific node $i$, we are interested in measuring the latency between the event $e_C$ in which $i$ receives a value $v$ from the application software and an event $e_R$ in which $i$ learns a value $w$ that includes $v$.

4.4 Time complexity of Algorithm 2

We define latency as the number of rounds between the moment a correct process receives an application call and the moment it returns from the operation. In evaluating the latency of our protocol, we consider two types of executions: (1) the fault-free case, when all processes are correct, and (2) the worst case, when only a majority of processes are correct.

A snapshot operation op precedes another operation $\textit{op}'$ if the response event of op happens before the call event of $\textit{op}'$. Two operations are said to be concurrent if neither precedes the other. For ASO protocols, we analyze latency in fault-free runs of an operation op in two distinct scenarios: (a) without contention, i.e., when no other operation overlaps in time with op, and (b) with contention, i.e., when there might be an arbitrary number of concurrent operations.

Garg et al. [18] use the notion of amortized time complexity, i.e., the average operation latency taken over a large number of operations in an execution. In some protocols, including ours, the latency of an operation is only affected by the number of faulty processes whose messages are received during the operation’s interval (we call these processes active-faulty). Intuitively, faulty processes take finitely many steps, so in these protocols a failure can only affect a finite number of operations. In this paper, we also distinguish ASO protocols with constant time complexity.

Next, we establish the optimality of our protocol in the absence of contention. A protocol implementing LA tolerates $k$ faults if it satisfies all the properties of Definition 3.1 in every execution with at most $k$ faulty processes.

Theorem 4.11.

Let $\mathcal{P}$ be a distributed protocol that implements LA and tolerates at least one faulty process. Then, there exists a fault-free run of $\mathcal{P}$ in which an LA operation requires at least two rounds of communication to complete without contention.

Proof 4.12.

Consider an operation op initiated by node $x$, with call event $e_C$ and response event $e_R$. Suppose op completes in at most one round in fault-free, contention-free executions.

We first show that there exists an execution $E = e_1, \ldots, e_C, \ldots, e_R$ such that:

  • $x$ is the only process to take a step in $e_R$,

  • no message sent by $x$ in $e_C, \ldots, e_R$ is received by any other process before $e_R$.

If multiple processes perform steps in the same event $e$, we can conceptually "split" $e$ into a sequence of events $e^1, e^2, \ldots$, where each process takes the step in its own dedicated event. Since their steps are independent, these split events are indistinguishable from the original $e$ from each process’s perspective. This reasoning also applies to $e_R$.

Now, assume for the sake of contradiction that in every fault-free, contention-free execution containing both $e_C$ and $e_R$, there exists some process $y \neq x$ that receives a message $m$, sent by $x$ in the interval $e_C, \ldots, e_R$, before $e_R$ occurs.

Let $e_M$ denote the event where $y$ receives $m$. We define rounds from $e_C$’s perspective:

  • all events up to $e_C$ are in round 0,

  • round $1$ ends at the last event $e_L$ that receives a message originating in round 0.

If $m$ is sent after $e_C$, then we can construct $E$ so that all messages from round 0 are received before $m$. This ensures that $e_M$ occurs after $e_L$, meaning $e_M$ is in round 2. Since $e_R$ occurs after $e_M$, it too is assigned round 2, contradicting our assumption that op completes in one round.

If instead $m$ is sent in $e_C$, we can again construct the execution so that all round 0 messages are received before or at the same time as $m$, making $e_M = e_L$. Since $e_R$ occurs after $e_M$, it is again assigned to round 2, a contradiction.

These contradictions hold regardless of whether op is concurrent with any other operation. Hence, such an execution $E$ must exist. Now consider an extension $E'$ of $E$ where all messages sent by $x$ after (and including) $e_C$ are indefinitely delayed, while messages from other nodes are not.

Suppose a node $z$ invokes a new operation $\textit{op}'$ after $e_R$, making $\textit{op}'$ non-concurrent with op. Since protocol $\mathcal{P}$ tolerates at least one faulty process, and $x$ appears to have crashed in $E'$, node $z$ must eventually complete $\textit{op}'$ without any process receiving any messages from $x$.

Let $v$ and $w$ be the value proposed and the value learned by $x$ in op, and let $v'$ and $w'$ be the corresponding values for $z$ in $\textit{op}'$. By Validity, we know $v \sqsubseteq w$, and by Consistency, we know $w \sqsubseteq w'$, hence $v \sqsubseteq w'$.

However, since no process has received a message sent by $x$ at or after $e_C$, no one could have known about $v$, contradicting the requirement that $w'$ must contain $v$.

Finally, after $\textit{op}'$ completes, we can allow all delayed messages from $x$ to be received, making all processes correct in the final execution $E'$. This completes the proof.

Theorem 4.13.

In a fault-free run without contention, a request takes at most $2$ rounds to complete.

Proof 4.14.

Consider a contention-free request with call event $e_C$ and return event $e_R$ invoked by a node $i$. There are no call events for other nodes between $e_C$ and $e_R$, but some messages from previous proposals may still be in transit.

Suppose $v$ is the value to be proposed for the application call. If $i$ is not proposing (has $\mathit{Proposing} = \perp$) when it receives $v$, then it directly sends $\langle\textbf{PROPOSE}, v\rangle$ to everyone. Let $e_P$ be the last event in which a process receives $\langle\textbf{PROPOSE}, v\rangle$ from $i$; then every process also sends $\langle\textbf{PROPOSE}, v\rangle$ by $e_P$ at the latest. Now take $e_F$ as the final event in which a process receives $\langle\textbf{PROPOSE}, v\rangle$ in the execution, and $e_S$ as the corresponding sending event. It must be that $e_S$ happens between $e_C$ and $e_P$ (potentially including $e_P$). Also, because the channels are FIFO, every previous proposal must have been validated before $e_F$, and $i$ learns a value containing $v$ by $e_F$ at the latest. Let $e_C$ be assigned round 0; then $e_P$ happens at most in round $1$. As a consequence, $e_S$ is assigned either round $0$ or round $1$, and thus $e_F$ can be assigned at most round $2$. Then, by the end of round $2$, $i$ already has $v$ validated.

Now suppose that $i$ is proposing when it receives $v$, so it still has a value $v'$ in $\mathit{Pending}$ that is not validated; w.l.o.g., assume that $v'$ is the only one. This value must be from a call that already finished, and the corresponding node sent $\langle\textbf{ACCEPT}, w\rangle$ containing $v'$ before $e_C$. Consider two pairs of events: $(e_A, e_A')$ and $(e_C, e_C')$. In the first pair, $e_A$ is the event where $\langle\textbf{ACCEPT}, w\rangle$ was first sent, and $e_A'$ is the last event in which $\langle\textbf{ACCEPT}, w\rangle$ is received from $e_A$. In the second, $e_C$ is the usual application call event and $e_C'$ is the last event in which $\langle\textbf{REQUEST}, v\rangle$ is received from $i$. There are two cases to consider: 1) $e_A'$ happens before $e_C'$ and 2) $e_C'$ happens before $e_A'$.

If it is the first case, then at the moment $e_C'$ happens, every node was already able to propose $v$ (since there was no other value to be learned). Take the last event $e_L$ in which a $\langle\textbf{PROPOSE}, v\rangle$ (or a value containing $v$) is received, and $e_S$ as the corresponding sending event; it follows that $i$ validates $v$ by $e_L$ at the latest and can learn a value containing it. Let $e_C$ be assigned round 0; then $e_C'$ and $e_S$ can be assigned at most round $1$, and since $e_L$ receives a message from $e_S$, it can be assigned at most round $2$.

If it is the second case, then all nodes received $\langle\textbf{REQUEST}, v\rangle$ and put $v$ in $\mathit{MPool}$ before $e_A'$. Every node proposes $v$ by $e_A'$ at the latest (since they can adopt $w$ and stop any current proposal). Let $e_L$ be the last event in which a process receives a proposal for $v$ and $e_S$ its corresponding sending event; similarly to the above cases, $e_S$ happens between $e_C$ and $e_A'$. Now, let $e_A$ and $e_C$ be assigned round 0. Then $e_S$ can be assigned at most round $1$ ($e_S$ happens before or at $e_A'$) and $e_L$ at most round $2$, which concludes the proof.

Consider an execution of our algorithm, and let $F$ ($|F| \leq f$) be its set of faulty processes.

Lemma 4.15.

Consider an event in which a correct node sends $\langle\textbf{PROPOSE}, v\rangle$ and the first event in which a correct node learns a value including $v$. If no correct node receives a message from a faulty one between these two events, then there are at most $3$ rounds between them.

Proof 4.16.

A message sent by a correct node is received by every correct node in the execution, and since correct nodes do not receive messages from faulty ones in the interval we are analyzing, we can consider only events originated from correct nodes. Therefore, we only refer to correct nodes in the following.

Let $x$ be the node sending $\langle\textbf{PROPOSE}, v\rangle$, $e_P$ be the corresponding event and $e_P'$ the last event in which a node receives $\langle\textbf{PROPOSE}, v\rangle$ from $x$. Because $x$ also sends $\langle\textbf{REQUEST}, v\rangle$, by $e_P'$ every node has received the request and must be proposing. Any value learned after $e_P'$ contains $v$, since all nodes have $v$ in $\mathit{Pending}$.

Now, at the configuration just after applying $e_P'$, let $V$ be the set of values $w$ for which there exists a (correct) node where $w$ is in $\mathit{Pending}$ but is not yet validated. Consider a value $w \in V$ that is the last whose $\langle\textbf{PROPOSE}, w\rangle$ is received by any node, where $e_L'$ is the event in which $\langle\textbf{PROPOSE}, w\rangle$ is last received and $e_L$ the corresponding sending event. It follows that some node learns a value containing $v$ by $e_L'$ at the latest.

Next, take the first event $e_F$ in which a node sent $\langle\textbf{PROPOSE}, w\rangle$, and $e_F'$ the event in which the last $\langle\textbf{PROPOSE}, w\rangle$ from $e_F$ is received. Note that $e_L$ happens at most at $e_F'$ and $e_F$ at most at $e_P'$. Let $e_P$ be assigned round 0; then $e_P'$ (and thus $e_F$) can be assigned at most round $1$, $e_F'$ (and thus $e_L$) at most round $2$ and, lastly, $e_L'$ can be assigned at most round $3$. Therefore, there are at most $3$ rounds between a propose and the first learn event for $v$.

Theorem 4.17.

An operation op takes at most $8$ rounds to complete if, during its interval, no correct node receives a message from a faulty one.

Proof 4.18.

Let $v$ be the value received from the application call for op, $e$ be the event in which node $i$ proposes $v$ (or a value containing $v$) and $e'$ the event in which a value including $v$ is learned for the first time. From Lemma 4.15, there are at most $3$ rounds between $e$ and $e'$. Since the node that learns $v$ sends $\langle\textbf{ACCEPT}, v\rangle$ to everyone, $i$ receives and adopts it in one extra round. We conclude that in at most $4$ rounds every correct node can learn $v$.

If $i$ is already proposing a value when it receives a call for $v$, it sends $\langle\textbf{REQUEST}, v\rangle$ to everyone and puts $v$ in $\mathit{MPool}$, so that it is proposed next. Let $e_P$ be the event in which $i$ initiated the proposal preceding $v$, and consider the worst case where the application call $e_C$ with $v$ happens just after $e_P$. From $e_P$ to the event $e_P'$ in which $i$ learns its previous proposal (and thus starts proposing $v$), there are at most $4$ rounds, and from $e_P'$ to the learning event of $v$ there are also at most $4$ rounds. Therefore, the operation completes in at most $8$ rounds.

We say that there are $k$ active faulty nodes during an operation op if, between the call and return events of op, messages are received from a total of $k$ distinct faulty nodes.

Theorem 4.19.

An operation op takes $O(k)$ rounds to complete, where $k$ is the number of active faulty nodes during op.

Proof 4.20.

See Appendix B.

Corollary 4.21.

Algorithms 1 and 2 together have an amortized time complexity of $8$ rounds.
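One way to see why the bound of Theorem 4.17 carries over to the amortized setting is the following back-of-the-envelope calculation. It is a sketch under assumptions of ours, not the formal argument: $N$ is the number of operations considered, $B$ is the (finite) number of operations during which a correct node receives a message from a faulty process, and $c \cdot f$ is a hypothetical bound on the latency of such an operation, obtained from Theorem 4.19 with $k \leq f$ and some constant $c$.

```latex
\[
\frac{1}{N}\sum_{j=1}^{N} \textit{latency}(\textit{op}_j)
  \;\leq\; \frac{8\,(N - B) + c\,f\,B}{N}
  \;=\; 8 + \frac{(c\,f - 8)\,B}{N}
  \;\longrightarrow\; 8 \quad \text{as } N \to \infty .
\]
```

Since faulty processes take finitely many steps, $B$ does not grow with $N$, so the contribution of the slow operations vanishes in the average.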

5 Measuring latency of ASO protocols

We conclude the paper with an overview of time complexity of earlier LA and ASO protocols [16, 15, 20, 17, 18]. We highlight certain gaps in their latency analyses and discuss the ways to fix them. Formalities and proofs are delegated to the appendix.

The first message-passing LA protocol. Faleiro et al. [16] came up with the first LA implementation for asynchronous message-passing systems. They use the metric of [6] to measure latency and conclude that it takes $O(n)$ rounds to output from a lattice agreement operation in their protocol.

We show in Appendix E the somewhat surprising result that this protocol has a constant latency of $16$ rounds in fault-free runs. The upper bound holds as long as no message from faulty processes is received during the interval of the operation, implying that their LA protocol has constant amortized time complexity. We conjecture that the protocol has $O(k)$ worst-case latency, where $k$ is the number of actual failures in the execution.

The first direct ASO implementation. Delporte et al. [15] is the first paper to directly implement ASO in message passing systems, instead of using an atomic register implementation [6] and the shared-memory snapshot construction [3].

In fault-free runs without contention, the latency of their protocol is only $2$ rounds. In fault-free runs with contention, we support the claim of a bound of $O(n)$ rounds from [18].

ASO with SCD-Broadcast. Imbs et al. [20] introduce the abstraction of Set Constrained Delivery Broadcast (SCD-Broadcast), and show that it allows for implementing LA and ASO with no complexity overhead. In their complexity analysis, they assume bounded message delays and show that the latency of their ASO algorithm in fault-free and contention-free runs is $2$ rounds. In Appendix E, we show that an operation of their resulting ASO algorithm can take $\Omega(n)$ rounds in fault-free runs with contention. We conjecture that this bound is tight, and so the time complexity of their ASO protocol is $\Theta(n)$.

A generic ASO algorithm. Garg et al. [17, 18] give a generic construction for atomic snapshot which uses any one-shot LA protocol (see definition in Appendix D) as a building block (with constant latency overhead). The protocol thus inherits the asymptotic complexity of the underlying LA algorithm. They also provide a protocol for one-shot LA with a latency of $2$ rounds in fault-free runs (using [12]’s metric). Their protocol requires 2 rounds of communication plus two lattice agreement invocations in the good case without contention and three lattice agreement invocations with contention, making it at least 6 and 8 message delays, respectively.

For the worst-case latency analysis, they assume an additional requirement over communication channels: if a process executes $\textsf{send}(m)$, sending $m$ to a correct process, then $m$ is eventually received (even if the sender is faulty). Using this assumption, they show a worst-case latency of $O(\sqrt{f})$ for their LA protocol.

In this paper, we assume a weaker channel that only guarantees delivery of messages among correct processes. We show that under this model, the LA protocol of [17] has an execution that takes $\Omega(f)$ rounds. We conjecture the upper bound of their protocol to be $O(f)$, and also that when using the stronger assumption, both our (Section 4) and [20]’s protocol have $O(\sqrt{f})$ worst-case latency.

The generic ASO construction may also be combined with the one-shot LA protocol presented in [26], which has a worst-case latency of $O(\log f)$, providing an object whose update and snapshot operations take $O(\log f)$ rounds in both fault-free and fault-prone executions. For the sake of completeness, we also provide the time complexity analysis for the one-shot LA protocols from [16] and [20] in Appendix D.

6 Comparative Analysis of Time Measurement Metrics

In this section, we recall metrics used in the literature [6, 12, 2, 23] for measuring time in asynchronous systems. We exhibit executions where the metrics by Attiya et al. [6] and Canetti and Rabin [12] yield arbitrary results due to the presence of holes – “periods of silence” during which no messages are in transit – which are common in long-lived protocols. We show that in a subset of executions without holes, which we refer to as covered executions, these metrics align with the one proposed by Abraham et al. [2]. This is not surprising, as these metrics were designed for distributed tasks, which assume finite hole-free executions. We also recall Lamport’s longest causal chain metric [23] and show that it is not suitable for comparing the ASO protocols we consider here.

Next, we show that the metric from [2] diverges from [6] and [12] when naïvely applied to measure time between arbitrary events. We then show that, after employing our refined method from Section 4.3, they match when measuring rounds between arbitrary events in covered executions.

Finally, we show that both our metric and that of [2] yield equivalent results in cases where [2] is applicable. Altogether, we establish that our metric generalizes [2] and aligns with classical metrics [6, 12] when applied to distributed tasks. A summary of the comparative analysis is presented in Table 2.

Metric       | Timed | Equivalent to CR (Covered Executions) | Equivalent to CR (Arbitrary Events) | Admits Holes
CR [12]      | Yes   | -                                     | -                                   | No
Round [6, 9] | Yes   | Yes                                   | Yes                                 | No
NTR [2]      | No    | Yes                                   | Yes                                 | No
LCC [24]     | No    | No                                    | No                                  | Yes
IRA          | No    | Yes                                   | Yes                                 | Yes
Table 2: Comparison between asynchronous time metrics. Metrics that are timed make use of time assignments to determine the number of rounds between events. We compare each metric against CR, evaluating the number of rounds resulting from applying them over entire (covered) executions and between arbitrary events. Blue stands for "good" features and red for "bad" ones. The equivalence of NTR to CR holds as long as one uses Definition 4.10.

6.1 Definitions

Timed Executions. We assume a global clock, not accessible to the nodes. A timed event $\overline{e}$ is a pair $(t, e)$ in which $t$ is a non-negative real number; we also say that $\overline{e}$ is a time assignment of $e$. A timed execution is an alternating sequence $C_0 \overline{e}_1 C_1 \dots$ where $\overline{e}_1 = (t_1, e_1), \overline{e}_2 = (t_2, e_2), \dots$, and the events $e_1, e_2, \ldots$ are equipped with monotonically increasing times $t_1, t_2, \ldots$:

  1. $t_m > t_l$ whenever $m > l$;

  2. $t_l \rightarrow \infty$ as $l \rightarrow \infty$. (We require this property to avoid the case where a never-terminating execution has a finite time duration.)

A time assignment of $E$ is a timed execution $\overline{E}$ in which every event $e_i$ in $E$ is matched with a timed event $(t_i, e_i)$ in $\overline{E}$ and the sequences of configurations in $E$ and $\overline{E}$ are the same. Notice that an execution allows for infinitely many time assignments.

Let $m$ be a message sent in $\overline{e}_l$ and received in $\overline{e}_m$; the delay of $m$ is then defined as $t_m - t_l$. For a finite timed execution $\overline{E} = C_0 \overline{e}_1 \ldots \overline{e}_l C_l$, we define $t_{\textit{start}}(\overline{E}) = t_1$, $t_{\textit{end}}(\overline{E}) = t_l$ (we use $t_{\textit{start}}$ and $t_{\textit{end}}$ when there is no ambiguity) and $\textit{duration}(\overline{E}) = t_{\textit{end}} - t_{\textit{start}}$.

In the subsequent discussion, given an execution $E$, let $\mathcal{T}(E)$ denote the set of all timed executions $\overline{E}$ based on $E$.

Time Metrics. It is conventional to measure the execution time by the number of communication rounds, typically calculated using the “longest message delay.” These metrics can be applied to both executions and timed executions. The first metric we consider is defined in Definition 6.1 [6]. When applied to timed executions, this metric assumes a known upper bound on message delays, which can be normalized to one time unit without loss of generality. To apply this metric to an execution, we consider the maximum duration of all possible timed executions that adhere to the upper-bound communication constraint.

Definition 6.1 (Round metric).

Given a timed execution $\overline{E}$ in which the maximum message delay is bounded by one unit of time, $\overline{E}$ takes $\textit{duration}(\overline{E})$ rounds.

By extension, an execution $E$ takes $\sup_{\overline{E} \in \mathcal{T}(E)} \textit{duration}(\overline{E})$ rounds.

In the metric proposed by Attiya and Welch [9, 10], the time assignments are scaled so that the maximum message delay is always $1$; thus, the metric produces the same results for executions as Definition 6.1. A more general metric introduced by Canetti and Rabin [12] captures the time complexity of any finite execution. Let $\overline{E}$ be a timed execution, and let $\delta_{\overline{E}}$ be the maximum message delay in it. Then $\overline{E}$ takes $\textit{duration}(\overline{E}) / \delta_{\overline{E}}$ CR rounds.

Definition 6.2 (CR metric).

A finite execution $E$ takes $\sup_{\overline{E} \in \mathcal{T}(E)} \textit{duration}(\overline{E}) / \delta_{\overline{E}}$ rounds, where $\delta_{\overline{E}}$ is the maximum message delay of each corresponding timed execution.

Figure 2: Example of an execution with 2 rounds in the Round, CR and NTR metrics.
Example 6.3.

Figure 2 shows an execution with four events, where we assign a delay of $\delta$ to the message exchanges $(e_1, e_3)$ and $(e_2, e_4)$, and a delay of $\delta - \epsilon$ ($\epsilon > 0$) to $(e_1, e_2)$. By making $\epsilon$ arbitrarily small, the number of rounds in this execution converges to $2$ in the CR metric. The same result is obtained in the Round metric by setting $\delta = 1$.
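One concrete time assignment realizing Example 6.3 is spelled out below. Placing $e_1$ at time $0$ is our own normalization, and we take $0 < \epsilon < \delta$ so that the four times are increasing.

```latex
\[
t_1 = 0,\qquad t_2 = \delta - \epsilon,\qquad t_3 = \delta,\qquad
t_4 = t_2 + \delta = 2\delta - \epsilon,
\]
\[
\frac{\textit{duration}(\overline{E})}{\delta_{\overline{E}}}
  = \frac{t_4 - t_1}{\delta}
  = 2 - \frac{\epsilon}{\delta}
  \;\longrightarrow\; 2 \quad \text{as } \epsilon \to 0 .
\]
```

The value $2$ is a supremum that is approached but not attained, because $t_2 < t_3$ forces $\epsilon > 0$.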

Recently, Abraham et al. [2] proposed an elegant approach that can be directly applied to executions without relying on time assignments. We call this metric non-timed rounds (NTR):

Definition 6.4 (NTR metric).

Given an execution $E$, each event in $E$ is assigned a round number as follows:

  • The first event $e_0$ is assigned round 0. We also write $e_0^* = e_0$;

  • For any $r \geq 1$, let $e_r^*$ be the last event where a message of round $r-1$ is delivered. All events after $e_{r-1}^*$ until (and including) event $e_r^*$ are in round $r$.

The number of rounds in $E$ is the round assigned to its last event.

Example 6.5.

Coming back to Figure 2, if we assign a round to each event based on Definition 6.4, then $e_1$ gets round 0, $e_2$ and $e_3$ get round $1$ and $e_4$ is assigned round $2$. The execution therefore has $2$ rounds according to NTR.

Lamport [24] proposed a metric for latency based on the causal chain of messages. The Longest Causal Chain (LCC) was used to show best-case latency of protocols such as consensus [24] and Crusader Agreement [1].

Definition 6.6 (Longest Causal Chain).

Let $e$ be an event in $E$ and $M$ the set of messages received by $e$; then $e$ is assigned round $k+1$, where $k$ is the maximum round of an event originating a message in $M$. If $M = \emptyset$, then $k = 0$. The number of rounds in an execution is the highest round assigned to one of its events.

This metric, however, diverges from CR and NTR.

Example 6.7 (Reliable Broadcast).

In the reliable broadcast primitive [11], a dedicated source broadcasts a message and, if the source is correct, then all correct nodes should deliver the message. Furthermore, if a correct process delivers a message, then every correct process eventually delivers it. The following protocol satisfies this property:

  • When the source invokes broadcast($m$), it delivers $m$ and sends it to everyone;

  • When a process receives $m$ for the first time, it delivers $m$ and sends it to everyone.

In Figure 3, we depict an execution of this protocol with four processes: $p_1$, $p_2$, $p_3$ and $p_4$. Here, $p_1$ is the source and broadcasts $m$; the message is received by $p_2$, which then sends $m$ to everyone. Process $p_3$ receives $m$ from $p_2$ before receiving it from $p_1$, and finally, $p_4$ receives $m$ from $p_1$ in the last event. This execution has $2$ LCC rounds, while having $1$ round according to CR and NTR.

Figure 3: Example of a reliable broadcast protocol execution.
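Definition 6.6 is easy to evaluate mechanically. The following Python sketch does so for the execution of Example 6.7, under two assumptions of ours: the execution is encoded as a list `recv` in which `recv[i]` holds the indices of the events whose messages event `i` receives, and an event that receives no message is given round 0, which is the reading under which the example yields 2 LCC rounds. The names `lcc_rounds` and `recv` are hypothetical, and the event list is reconstructed from the textual description of Figure 3.

```python
def lcc_rounds(recv):
    """Longest-causal-chain round of every event (sketch of Definition 6.6).

    Assumption: an event that receives no message is placed in round 0.
    """
    rounds = []
    for senders in recv:
        if senders:
            # One more than the highest round among the sending events.
            rounds.append(1 + max(rounds[j] for j in senders))
        else:
            rounds.append(0)
    return rounds


# Events of Example 6.7, in order of occurrence:
#   0: p1 broadcasts m
#   1: p2 receives m from p1 and relays it
#   2: p3 receives m from p2 and relays it
#   3: p4 receives m from p1
recv = [[], [0], [1], [0]]
rounds = lcc_rounds(recv)
print(rounds)       # [0, 1, 2, 1]
print(max(rounds))  # 2
```

The chain from $p_1$ through $p_2$ to $p_3$ has two hops, hence 2 LCC rounds, even though every reception happens no later than the slow delivery from $p_1$ to $p_4$, which is why CR and NTR count a single round.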

Example 6.7 shows that the LCC metric diverges from the others in cases where a fast exchange of messages happens within the interval of one (or more) slow messages. This is the case for several ASO protocols in the literature (including ours), which heavily rely on relaying values to speed up the validation phase, making the metric unsuitable for our use case. On the other hand, CR and NTR provide equivalent results in covered executions, described next. The Round and CR metrics also provide equivalent results in covered executions (Appendix C).

6.2 Covered executions and holes

Consider an execution $E = C_0 e_1 C_1 \ldots e_l C_l$, illustrated in Figure 4(a), where no process receives a message from another process, i.e., events may add messages to the buffer but no event removes a message from it. Then $\delta_{\overline{E}}$ is not defined in any time assignment $\overline{E}$.

Now consider an execution $E' = C_0 e_1 C_1 \ldots e_l C_l \ldots e_m C_m$ in which:

  • A message $m$ is sent in $e_1$ and received in $e_l$;

  • A message $m'$ is sent in $e_{l+1}$ and received in $e_m$;

  • No message from $e_1 \ldots e_l$ is received in $e_{l+1} \ldots e_m$.

In this example, illustrated in Figure 4(b) with $5$ events, $\delta_{\overline{E'}}$ exists for any time assignment of $E'$, but we can still assign an arbitrary time difference to $e_l$ and $e_{l+1}$ without affecting $\delta_{\overline{E'}}$, so the number of CR rounds is unbounded.
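The unboundedness can be made explicit with one family of time assignments. As a simplifying assumption of ours, $m$ and $m'$ are the only messages received in $E'$, both are given delay $\delta$, and $T > 0$ is the time gap placed between $e_l$ and $e_{l+1}$:

```latex
\[
t_1 = 0,\qquad t_l = \delta,\qquad t_{l+1} = \delta + T,\qquad t_m = 2\delta + T,
\]
\[
\frac{\textit{duration}(\overline{E'})}{\delta_{\overline{E'}}}
  = \frac{2\delta + T}{\delta}
  = 2 + \frac{T}{\delta}
  \;\longrightarrow\; \infty \quad \text{as } T \to \infty .
\]
```

The maximum delay stays at $\delta$ for every $T$, so the supremum in Definition 6.2 is infinite.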

(a) Execution with undefined $\delta_{\overline{E}}$.
(b) Execution where the number of rounds is unbounded according to Round and CR.
(c) Covered execution.
Figure 4: Examples of non-covered and covered executions.

The two executions in the examples above have events whose time difference is unrelated to message delays. As a consequence, the duration of these executions can grow irrespective of any bound imposed by message exchanges. Similarly, in Figure 4(b), since no message from $e_1 e_2$ is received in $e_3 e_4 e_5$, NTR defines no round assignment for $e_3$, $e_4$ and $e_5$.

We then restrict the analysis of these metrics to executions that are covered. Formally:

Definition 6.8 (Covered Execution).

A hole in an execution is a pair $(e_l, e_{l+1})$ in which no event in $e_{l+1} \ldots$ receives a message from $\ldots e_l$; in other words, there are no message hops between the two sequences of events. An execution is covered iff it has no holes.
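Definition 6.8 translates directly into a check over a finite execution. The sketch below is illustrative only and uses an encoding of ours: `recv[i]` holds the indices of the events whose messages event `i` receives. The names `holes`, `is_covered` and `recv` are hypothetical, and the first sample list follows the textual description of Figure 4(b).

```python
def holes(recv):
    """Return the pairs (cut, cut + 1) that are holes per Definition 6.8."""
    n = len(recv)
    found = []
    for cut in range(n - 1):
        # Does some event after `cut` receive a message sent at or before it?
        crossing = any(j <= cut for i in range(cut + 1, n) for j in recv[i])
        if not crossing:
            found.append((cut, cut + 1))
    return found


def is_covered(recv):
    """An execution is covered iff it has no holes."""
    return not holes(recv)


# Five events: one message from event 0 to event 1,
# another from event 2 to event 4.
with_hole = [[], [0], [], [], [2]]
print(holes(with_hole))       # [(1, 2)]
print(is_covered(with_hole))  # False

# Adding a message from event 1 to event 3 bridges the gap.
covered = [[], [0], [], [1], [2]]
print(is_covered(covered))    # True
```

With 0-based indices, the reported pair `(1, 2)` corresponds to the hole $(e_2, e_3)$ of Figure 4(b).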

Abraham et al. [2] introduce NTR as an equivalent to CR; however, no formal proof is provided. The next result corroborates this claim in covered executions. Later, in Example 6.12, we show that naively using NTR to measure time between events may not match CR.

Theorem 6.9.

A finite covered execution $E$ has $k$ CR rounds iff it has $\lceil k \rceil$ NTR rounds.

Proof 6.10.

See Appendix C.1.

6.3 Time between arbitrary events

In long-lived executions (such as those of atomic snapshot algorithms) we are interested in measuring time between two events, for instance, between an application call and response. Definition 6.2 can easily be adapted to measure the number of rounds between two events as follows:

Definition 6.11 (Generalized CR metric).

Let $E$ be an execution, let $\mathcal{T}(E)$ denote the set of all timed executions $\overline{E}$ based on $E$, and let $\delta_{\overline{E}}$ be the maximum message delay in $\overline{E}$. Let $e_i$ and $e_j$ ($j > i$) be events in $E$, and let $t_i$ and $t_j$ be their respective time assignments in $\overline{E}$. Then we say that between $e_i$ and $e_j$ there are $\sup_{\overline{E} \in \mathcal{T}(E)} (t_j - t_i) / \delta_{\overline{E}}$ CR rounds.

An appealing way of defining time between two events $e_i$ and $e_j$ using a non-timed metric is to assign rounds according to NTR, and then take the difference of the rounds assigned to $e_i$ and $e_j$. As illustrated in Example 6.12, this definition can diverge from generalized CR.

Figure 5: An execution in which there are $2$ CR rounds between $e_2$ and $e_4$. However, the difference of the rounds assigned to $e_2$ and $e_4$ using NTR is $1$.
Example 6.12.

Consider the execution shown in Figure 5. We can assign times to $e_1$, $e_3$ and $e_4$ such that the two message hops have a delay of $\delta$. Now consider the number of rounds between $e_2$ and $e_4$: since we can assign a time to $e_2$ that is arbitrarily close to $e_1$’s assignment, there are $2$ CR rounds between $e_2$ and $e_4$. However, the rounds that NTR assigns to $e_2$ and $e_4$ are $1$ and $2$ respectively, so simply taking the difference between them leads to a value that diverges from CR.

We then give the following definition, using the approach described in Section 4.3:

Definition 6.13 (Generalized NTR).

Given an execution $E$, let $e_i$ and $e_j$ ($j > i$) be events in $E$. The number of rounds between $e_i$ and $e_j$ is given by the round assigned to $e_j$ according to the following:

  • All events up to (and including) $e_i$ are assigned round 0. We also write $e_0^* = e_i$;

  • For any $r \geq 1$, let $e_r^*$ be the last event where a message of round $r-1$ is delivered. All events after $e_{r-1}^*$ until (and including) event $e_r^*$ are in round $r$.
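Definition 6.13 can be checked against Example 6.12 with a short Python sketch. The encoding is an assumption of ours: `recv[i]` holds the indices of the events whose messages event `i` receives, and the four-event list below is our reading of the description of Figure 5 (two message hops, from $e_1$ to $e_3$ and from $e_3$ to $e_4$, with $e_2$ in between). The names `ntr_rounds`, `recv` and `start` are hypothetical; `start=0` gives the plain NTR assignment of Definition 6.4.

```python
def ntr_rounds(recv, start=0):
    """Round assignment of Definition 6.13 (Definition 6.4 when start == 0).

    Events up to and including `start` are in round 0. Events that cannot
    be assigned a round (after a hole) are left as None.
    """
    n = len(recv)
    rounds = [0] * (start + 1) + [None] * (n - start - 1)
    prev_end, r = start, 1  # prev_end is the index of e_{r-1}^*
    while prev_end < n - 1:
        # Later events that receive a message sent in round r - 1.
        ends = [i for i in range(prev_end + 1, n)
                if any(rounds[j] == r - 1 for j in recv[i])]
        if not ends:
            break  # no such reception: the remaining events stay unassigned
        end = ends[-1]  # e_r^*
        for k in range(prev_end + 1, end + 1):
            rounds[k] = r
        prev_end, r = end, r + 1
    return rounds


# 0-based indices: event 0 is e_1, event 1 is e_2, and so on.
recv = [[], [], [0], [2]]

plain = ntr_rounds(recv)
print(plain)                          # [0, 1, 1, 2]
print(plain[3] - plain[1])            # 1  (naive difference for e_2, e_4)
print(ntr_rounds(recv, start=1)[3])   # 2  (Definition 6.13 for e_2, e_4)
```

Restarting the assignment at $e_2$ turns the message sent in $e_1$ into a round-0 message, which is what recovers the two rounds counted by generalized CR.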

Theorem 6.14.

Let $E$ be a covered execution and $e_i$ and $e_j$ ($j > i$) be events of $E$. There are $k$ rounds between $e_i$ and $e_j$ according to CR (Definition 6.11) iff there are $\lceil k \rceil$ rounds between them according to NTR (Definition 6.13).

Proof 6.15.

See Appendix C.2.

6.4 Relating IRA to NTR

Theorem 6.16.

Let $E$ be a finite covered execution and suppose that all events of $E$ are assigned rounds according to IRA after all iterations of the algorithm. It holds that:

  1. Round 0 is composed only of $e_0$ (the initial event).

  2. The final event of round $i+1$ is the last event to receive a message from round $i$.

Proof 6.17.

See Appendix C.3.

Corollary 6.18.

IRA and NTR assign the same rounds to events in covered executions.

References

  • [1] I. Abraham, N. Ben-David, G. Stern, and S. Yandamuri. On the round complexity of asynchronous crusader agreement. Cryptology ePrint Archive, 2023.
  • [2] I. Abraham, K. Nayak, L. Ren, and Z. Xiang. Good-case latency of byzantine broadcast: a complete categorization. CoRR, abs/2102.07240, 2021.
  • [3] Y. Afek, H. Attiya, D. Dolev, E. Gafni, M. Merritt, and N. Shavit. Atomic snapshots of shared memory. J. ACM, 40(4):873–890, 1993.
  • [4] J. Aspnes, H. Attiya, K. Censor-Hillel, and F. Ellen. Limited-use atomic snapshots with polylogarithmic step complexity. J. ACM, 62(1):3:1–3:22, 2015.
  • [5] J. Aspnes and K. Censor-Hillel. Atomic snapshots in $O(\log^3 n)$ steps using randomized helping. In Y. Afek, editor, Distributed Computing - 27th International Symposium, DISC 2013, Jerusalem, Israel, October 14-18, 2013. Proceedings, volume 8205 of Lecture Notes in Computer Science, pages 254–268. Springer, 2013.
  • [6] H. Attiya, A. Bar-Noy, and D. Dolev. Sharing memory robustly in message-passing systems. J. ACM, 42(1):124–142, jan 1995.
  • [7] H. Attiya, F. Ellen, and P. Fatourou. The complexity of updating snapshot objects. J. Parallel Distributed Comput., 71(12):1570–1577, 2011.
  • [8] H. Attiya, M. Herlihy, and O. Rachman. Atomic snapshots using lattice agreement. Distributed Comput., 8(3):121–132, 1995.
  • [9] H. Attiya and J. Welch. Distributed computing: fundamentals, simulations, and advanced topics, volume 19. John Wiley & Sons, 2004.
  • [10] H. Attiya and J. L. Welch. Multi-valued connected consensus: A new perspective on crusader agreement and adopt-commit. In 27th International Conference on Principles of Distributed Systems, 2024.
  • [11] C. Cachin, R. Guerraoui, and L. Rodrigues. Introduction to reliable and secure distributed programming. Springer Science & Business Media, 2011.
  • [12] R. Canetti and T. Rabin. Fast asynchronous byzantine agreement with optimal resilience. In Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, pages 42–51, 1993.
  • [13] K. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst., 3(1):63–75, 1985.
  • [14] G. Danezis, L. Kokoris-Kogias, A. Sonnino, and A. Spiegelman. Narwhal and tusk: a dag-based mempool and efficient BFT consensus. In EuroSys, pages 34–50. ACM, 2022.
  • [15] C. Delporte-Gallet, H. Fauconnier, S. Rajsbaum, and M. Raynal. Implementing snapshot objects on top of crash-prone asynchronous message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 29(9):2033–2045, 2018.
  • [16] J. M. Faleiro, S. Rajamani, K. Rajan, G. Ramalingam, and K. Vaswani. Generalized lattice agreement. In Proceedings of the 2012 ACM Symposium on Principles of Distributed Computing, PODC ’12, page 125–134, New York, NY, USA, 2012. Association for Computing Machinery.
  • [17] V. Garg, S. Kumar, L. Tseng, and X. Zheng. Amortized constant round atomic snapshot in message-passing systems. arXiv preprint arXiv:2008.11837, 2020.
  • [18] V. K. Garg, S. Kumar, L. Tseng, and X. Zheng. Fault-tolerant snapshot objects in message passing systems. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1129–1139, 2022.
  • [19] M. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463–492, 1990.
  • [20] D. Imbs, A. Mostéfaoui, M. Perrin, and M. Raynal. Set-constrained delivery broadcast: Definition, abstraction power, and computability limits. In Proceedings of the 19th International Conference on Distributed Computing and Networking, pages 1–10, 2018.
  • [21] I. Keidar, E. Kokoris-Kogias, O. Naor, and A. Spiegelman. All you need is DAG. In PODC, pages 165–175. ACM, 2021.
  • [22] P. Kuznetsov, T. Rieutord, and S. Tucci-Piergiovanni. Reconfigurable Lattice Agreement and Applications. In P. Felber, R. Friedman, S. Gilbert, and A. Miller, editors, 23rd International Conference on Principles of Distributed Systems (OPODIS 2019), volume 153 of Leibniz International Proceedings in Informatics (LIPIcs), pages 31:1–31:17, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [23] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications, 1978.
  • [24] L. Lamport. Lower bounds for asynchronous consensus. Distributed Computing, 19:104–125, 2006.
  • [25] F. Mattern. Efficient algorithms for distributed snapshots and global virtual time approximation. J. Parallel Distributed Comput., 18(4):423–434, 1993.
  • [26] X. Zheng, V. K. Garg, and J. Kaippallimalil. Linearizable Replicated State Machines With Lattice Agreement. In P. Felber, R. Friedman, S. Gilbert, and A. Miller, editors, 23rd International Conference on Principles of Distributed Systems (OPODIS 2019), volume 153 of Leibniz International Proceedings in Informatics (LIPIcs), pages 29:1–29:16, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

Appendix A Protocol with $O(n^2)$ message complexity per request

Algorithm 2 has a complexity of $O(n^2)$ messages per proposed value, where each proposal might contain an arbitrary number of requests. As a consequence, processes are required to exchange messages that can grow indefinitely in size, resulting in high network bandwidth usage. We address this problem in Algorithm 4, with a few small modifications to Algorithm 2.

Instead of waiting for validation from a quorum and relaying entire proposals, processes keep track of each individual request, and relay only the difference between a received proposal and the current values waiting for validation. The same occurs in the ACCEPT phase of the protocol, where processes only send the difference between newly learned values and previous ones. With these modifications, an individual request is relayed once by every process before proposing, once more in the proposal and validation phase, and one last time in the ACCEPT message, for a total of $3n^2$ messages per request.

65: upon Startup
66:   $\mathit{Proposing}, \mathit{MPool}, \mathit{Pending}, \mathit{Relaying}, \mathit{Validated}, \mathit{Learned}, \mathit{ToAdopt} \leftarrow \emptyset$
67: operation Propose($v$)
68:   SendRequest($v$)
69:   wait until $v \sqsubseteq \bigsqcup \mathit{Learned}$
70:   return $\bigsqcup \mathit{Learned}$
71: operation SendRequest($v$)
72:   $\mathit{MPool} \leftarrow \mathit{MPool} \cup \{v\}$
73:   send $\langle\textbf{REQUEST}, v\rangle$ to every other node
74: upon Receive $\langle\textbf{REQUEST}, v\rangle$ from a node
75:   if $v \notin \mathit{MPool} \cup \mathit{Proposing} \cup \mathit{Learned}$ then
76:     $\mathit{MPool} \leftarrow \mathit{MPool} \cup \{v\}$
77:     send $\langle\textbf{REQUEST}, v\rangle$ to every other node
78: upon event $(\mathit{MPool} \neq \emptyset) \wedge (\mathit{Proposing} = \emptyset)$
79:   $\mathit{Proposing} \leftarrow \mathit{MPool}$
80:   for $v \in \mathit{MPool}$ do
81:     $\mathit{Pending}[v] \leftarrow 1$
82:   $\mathit{MPool} \leftarrow \emptyset$
83:   send $\langle\textbf{PROPOSE}, \mathit{Proposing}\rangle$ to every other node
84: upon Receive $\langle\textbf{PROPOSE}, V\rangle$ from a node
85:   $\mathit{Relaying} \leftarrow \emptyset$
86:   for $v \in V$ do
87:     if $v \in \mathit{Pending}.\textsf{keys}()$ then
88:       $\mathit{Pending}[v]++$
89:     else
90:       $\mathit{Pending}[v] \leftarrow 1$
91:       $\mathit{Relaying} \leftarrow \mathit{Relaying} \cup \{v\}$
92:   if $\mathit{Relaying} \neq \emptyset$ then
93:     send $\langle\textbf{PROPOSE}, \mathit{Relaying}\rangle$ to every node
94: upon exists $v$ s.t. $\mathit{Pending}[v] = n-f$
95:   $\mathit{Validated} \leftarrow \mathit{Validated} \cup \{v\}$
96: upon event $\mathit{Pending}.\textsf{keys}() \subseteq \mathit{Validated}$
97:   if $\mathit{Learned} \subset \mathit{Validated}$ then
98:     $\Delta\mathit{Learned} \leftarrow \mathit{Validated} - \mathit{Learned}$
99:     $\mathit{ToAdopt} \leftarrow \mathit{ToAdopt} - \Delta\mathit{Learned}$
100:     $\mathit{Learned} \leftarrow \mathit{Validated}$
101:     $\mathit{Proposing} \leftarrow \emptyset$
102:     send $\langle\textbf{ACCEPT}, \Delta\mathit{Learned}\rangle$ to every node
103: upon Receive $\langle\textbf{ACCEPT}, W\rangle$ from a node
104:   $\mathit{ToAdopt} \leftarrow \mathit{ToAdopt} \cup W$
105:   if $(\mathit{Proposing} \subseteq \mathit{ToAdopt})$ then
106:     $\mathit{Validated} \leftarrow \mathit{Validated} \cup \mathit{ToAdopt}$
107:     $\mathit{Learned} \leftarrow \mathit{Learned} \cup \mathit{ToAdopt}$
108:     $\mathit{Proposing} \leftarrow \emptyset$
109:     send $\langle\textbf{ACCEPT}, \mathit{ToAdopt}\rangle$ to every node
Algorithm 4 Refined Long-Lived LA: code for node $x$.
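To make the delta-relaying idea concrete, the PROPOSE handler (lines 84–93) and the validation rule (line 94) of Algorithm 4 can be sketched in Python. This is a minimal, single-node illustration and not the authors' implementation: the class name `ProposeTracker`, its methods, and the use of plain hashable values in place of lattice elements are all assumptions, and networking is replaced by returning the set that would be broadcast.

```python
class ProposeTracker:
    """Hypothetical per-node state for the PROPOSE phase of Algorithm 4."""

    def __init__(self):
        # Pending: value -> number of PROPOSE messages counted for it.
        self.pending = {}

    def on_propose(self, values):
        """Handle <PROPOSE, V>; return the values this node must relay."""
        relaying = set()
        for v in values:
            if v in self.pending:
                self.pending[v] += 1       # already tracked: only count it
            else:
                self.pending[v] = 1        # first sighting: track it...
                relaying.add(v)            # ...and relay it exactly once
        return relaying                    # broadcast only if non-empty

    def validated(self, n, f):
        """Values whose counter reached n - f (the listing tests equality
        and accumulates; >= yields the same set for monotone counters)."""
        return {v for v, count in self.pending.items() if count >= n - f}


node = ProposeTracker()
print(node.on_propose({"a", "b"}))   # both values are new, so both are relayed
print(node.on_propose({"b", "c"}))   # only "c" is new
print(node.validated(n=3, f=1))
```

Because a value is placed in the relay set only on its first sighting, each node forwards any given request at most once in this phase, which is what keeps the per-request message count quadratic.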

Appendix B Time Complexity of Algorithm 2

We show that an operation op in Algorithm 2 takes $O(k)$ rounds to complete, where $k$ is the number of active faulty nodes during op.

Messages from and to faulty nodes may not arrive; however, a message sent by (or to) a faulty node in round $r$ is received, if at all, by round $r+1$ at the latest. Moreover, since channels are FIFO, when a node $i$ receives a message from another node $j$, $i$ must also have received all previous messages $j$ sent to $i$, irrespective of whether the nodes are correct or faulty.

If a correct node receives $\langle\textbf{PROPOSE}, v'\rangle$ (even from a faulty node) in round $r$, every correct node will have $v'$ added to $\mathit{Pending}$ by the end of round $r+1$, and will have $v'$ validated by the end of round $r+2$. Also, a faulty node waits for its current proposal to finish before starting a new one, in which case it sends an ACCEPT message for the last learned value before sending the new proposal.

We say that a node introduces a new value $w$ during the operation if it is the first node to send a $\langle\textbf{PROPOSE}, w\rangle$ for $w$ in the interval of the operation. A node can introduce a new value coming from an internal source, i.e., the value was buffered and proposed when the node had already finished its previous proposal, or from an external source, i.e., after receiving a proposal originating from another node before the operation started.

Let $v$ be the value received from the application call for op and let $e_C$ (as well as all previous events) be assigned round $0$. If there are no active faulty nodes, a correct node learns a value containing $v$ by round $7$ at the latest (by Lemma 4.15; here, we include the time $v$ can remain buffered). Also, by the end of round $5$, every correct node has sent a PROPOSE message for $v$, and has $v$ validated by the end of round $6$ (including buffering time, a correct node proposes $v$ in round $4$ at the latest). By that point, all correct nodes are waiting for their proposals to complete and, therefore, cannot introduce a value from an internal source. In order to delay a correct node from learning a value containing $v$ by round $7$, every correct node should first receive a new value in a PROPOSE message, which is added to $\mathit{Pending}$ but is not validated. Using a simple inductive argument, $2k+1$ new proposals originating from faulty nodes are necessary to delay a correct node from learning a value from round $7$ to round $7+2k$.

Suppose that there is an execution where it takes $8+2k+1$ rounds for node $i$ to complete an operation. But there are only $k$ active faulty nodes, which means that at least $k+1$ extra proposals were introduced by active faulty nodes.

Let $f_0$ be an active faulty node that introduced more than one of the $2k+1$ values that delayed the operation (assuming w.l.o.g. that there are exactly $2k+1$ new proposals). Let $w$ and $w'$ be the first and the second values introduced by $f_0$ respectively. If $w'$ was received by $f_0$ from an internal source, $f_0$ should have finished its previous proposal (and learned a value containing $w$) before proposing $w'$. But because $w$ was one of the values that delayed the operation, and since channels are FIFO, $f_0$ needs to add $v$ to $\mathit{Pending}$ before validating $w$ (at least a majority of correct nodes sent a PROPOSE for $v$ before sending a PROPOSE for $w$). $f_0$ then learns a value containing $v$ and sends ACCEPT with that value to everyone. The ACCEPT message is received by correct processes before $\langle\textbf{PROPOSE}, w'\rangle$, and they would be able to adopt it.

So $f_0$ must have received $\langle\textbf{PROPOSE}, w'\rangle$ from an external source by round $1$ at the latest, which means it issued proposals for $w'$ that can be received by round $2$ at the latest. We can also conclude that at least $k+1$ values were introduced by active faulty nodes from external sources. Now let $w_{k+1}$ be the $(k+1)$th such value used to delay correct nodes from learning $v$. The earliest round $w_{k+1}$ can delay is $7+k$, which means that by round $7+k$ all correct nodes have already sent a PROPOSE for $w_{k+1}$, but by the end of round $5+k$ no correct node has done so (otherwise $w_{k+1}$ would have been validated in round $7+k$ by every correct process). Take the first active faulty node $f_1$ from which a correct node received $\langle\textbf{PROPOSE}, w_{k+1}\rangle$. Since the earliest this message is received is in round $6+k$, the earliest it could be sent is in round $5+k$, so $f_1$ first received $\langle\textbf{PROPOSE}, w_{k+1}\rangle$ from another distinct active faulty node, $f_2$, which sent it in round $4+k$ at the earliest. But $w_{k+1}$ was introduced from an external source and it needs to be received by a faulty node in round $1$. Following the chain above, for the node $f_{k+6}$ to receive it in round $1$, a chain of $k+6$ active faulty nodes would be necessary, although there are only $k$.

Therefore, an operation takes fewer than $8+2k+1$ rounds to complete.

Appendix C Equivalence Proofs for Time Measurement Metrics

In this section, we present detailed proofs for the equivalence between CR and NTR in covered executions. The proofs are written with respect to a new (non-timed) method for interpreting latency: the minimum number of message hops that can cover an execution. Before proceeding, we establish the equivalence between the Round and CR metrics.

Theorem C.1.

Round and CR assign the same number of rounds to finite covered executions.

Proof C.2.

Let $E$ be a finite covered execution and $\overline{E}$ a time assignment for $E$, with maximum message delay $\delta_{\overline{E}}$. Since we consider algorithms that do not make use of clocks, we can “shrink” or “stretch” time assignments without altering the steps in the underlying execution. Consider the time assignment $\overline{E}'$ built as follows:

  1. $t_{\textit{start}}(\overline{E}') = t_{\textit{start}}(\overline{E})$;

  2. For every event $\overline{e}_l$ in $\overline{E}$ with time $t_l$, have $\overline{e}_l'$ in $\overline{E}'$ with time $t_l' = t_{\textit{start}} + (t_l - t_{\textit{start}})\frac{1}{\delta_{\overline{E}}}$.

We call $\overline{E}'$ a normalization of $\overline{E}$. By construction, the maximum message delay in $\overline{E}'$ is $1$ and $\overline{E}'$ has the same number of CR rounds as $\overline{E}$.
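The normalization step can be illustrated with a short Python sketch. The representation is an assumption made for the example only: a time assignment is a list of timestamps indexed by event, with the first entry taken as $t_{\textit{start}}$, and message hops are hypothetical `(send_index, receive_index)` pairs.

```python
def normalize(times, hops):
    """Rescale a time assignment so that the maximum message delay becomes 1.

    times: times[l] is the time assigned to event e_l; times[0] is t_start.
    hops:  list of (send_index, receive_index) pairs.
    """
    t_start = times[0]
    # Maximum message delay of the given time assignment.
    delta = max(times[recv] - times[send] for send, recv in hops)
    if delta <= 0:
        raise ValueError("maximum message delay must be positive")
    # t_l' = t_start + (t_l - t_start) / delta, as in step 2 of the construction.
    return [t_start + (t - t_start) / delta for t in times]


times = [0.0, 1.0, 2.0, 4.0]
hops = [(0, 2), (2, 3)]          # both hops have delay 2 before rescaling
print(normalize(times, hops))    # every delay is divided by 2
```

The order of events is unchanged by the rescaling, which is why the underlying (untimed) execution is the same before and after.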

Now let $\mathcal{T}(E)$ be the set of valid executions for the Round metric and let $\overline{E} \in \mathcal{T}(E)$ have $k$ rounds (using Round). If $\delta_{\overline{E}} < 1$, then the normalization $\overline{E}'$ of $\overline{E}$ has $k' > k$ rounds: $\textit{duration}(\overline{E}') = t_{\textit{end}}(\overline{E}') - t_{\textit{start}}(\overline{E}') = (t_{\textit{end}}(\overline{E}) - t_{\textit{start}}(\overline{E}))/\delta_{\overline{E}}$.

Consider the set $\mathcal{T}'(E)$ of valid executions for the Round metric where for all $\overline{E}' \in \mathcal{T}'(E)$, $\delta_{\overline{E}'} = 1$. Since for every timed execution $\overline{E} \in \mathcal{T}(E)$ with $k$ rounds, there is a timed execution $\overline{E}' \in \mathcal{T}'(E)$ with $k'$ rounds where $k' \geq k$, we have: $\sup_{\overline{E}' \in \mathcal{T}'(E)} \textit{duration}(\overline{E}') = \sup_{\overline{E} \in \mathcal{T}(E)} \textit{duration}(\overline{E})$.

It is easy to see that every execution $\overline{E}' \in \mathcal{T}'(E)$ has the same number of rounds according to both the CR and Round metrics. So if the Round metric assigns $k$ rounds to $E$ and CR assigns $k'$, then $k' \geq k$. But we also know that for any time assignment $\overline{E}$ of $E$, the normalization of $\overline{E}$ is a valid timed execution for the Round metric and has the same number of CR rounds as $\overline{E}$. This means that $k \geq k'$, since for any time assignment $\overline{E}$, there is a time assignment with the same number of rounds in both the Round and CR metrics. Therefore, $k = k'$.

C.1 A new look at execution latency

In covered executions, it seems natural to relate the number of message hops to the number of communication rounds. Next, we define the concept of covering executions and events with message hops.

Consider the finite execution $E = e_0, \ldots, e_l$. We can visualize these events as points on a real line, where their positions correspond to their indices, that is, $e_0$ at $0$, $e_1$ at $1$ and so on. Each pair of events $(e_i, e_j)$ defines an interval $[i,j]$, and we denote this by $\textsf{interval}((e_i,e_j)) = [i,j]$. Likewise, $E$ defines the interval $[0,l]$, which we represent as $\textsf{interval}(E) = [0,l]$.

Since a message hop consists of a pair of events $(e_i, e_j)$, it also specifies an interval $[i,j]$. For a set $M$ of message hops, we define $\textsf{interval}(M) = \bigcup_{m \in M} \textsf{interval}(m)$.

Definition C.3 (Execution cover).

Let $E$ be a finite execution and $M$ a set of message hops from $E$. We say that $M$ covers $E$ if $\textsf{interval}(M) = \textsf{interval}(E)$. Analogously, we say that $E$ can be covered by $k$ message hops if there is such a set $M$ with $|M| = k$.
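Definition C.3 amounts to checking that a union of closed intervals equals $[0,l]$, which a left-to-right sweep can decide. The following Python sketch assumes, for illustration only, that hops are given as `(send_index, receive_index)` pairs taken from an execution with events indexed $0$ to $l$; the function name `covers` is hypothetical.

```python
def covers(hops, l):
    """Return True if the intervals of the hops together equal [0, l].

    hops: iterable of (send_index, receive_index) pairs with
          0 <= send < receive <= l.
    """
    hops = sorted(hops)
    if not hops:
        return False                 # an empty set of hops covers nothing
    reach = 0                        # [0, reach] is covered so far
    for send, recv in hops:
        if send > reach:
            return False             # gap between reach and the next hop
        reach = max(reach, recv)
    return reach >= l


print(covers([(0, 2), (2, 3)], 3))   # intervals [0,2] and [2,3] meet at 2
print(covers([(0, 1), (2, 3)], 3))   # nothing spans the open gap (1, 2)
```

Note that two hops whose intervals only touch consecutive indices, such as $[0,1]$ and $[2,3]$, do not cover $[0,3]$: the points strictly between $1$ and $2$ on the real line are left out.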

Theorem C.4.

If a covered execution $E$ has $k$ rounds according to the Round metric, then $\lceil k\rceil$ message hops are necessary and sufficient to cover $E$.

Proof C.5.

Let $E$ have $k$ rounds according to Round. There is a time assignment $\overline{E}$ with $\textit{duration}(\overline{E}) = k - \epsilon$, where $\epsilon > 0$ can be arbitrarily small. Starting from $t_{\textit{start}}$, assume that there exists a set of message hops where each hop can cover the maximum amount of time, that is, an interval of one unit; then at least $\lceil\textit{duration}(\overline{E})\rceil = \lceil k\rceil$ message hops are necessary to cover the whole duration.

Now, we proceed to build a set $M$ that covers $E$ with $\lceil k'\rceil$ message hops, and show next that there exists a timed execution with $k''$ rounds, where $\lceil k''\rceil = k'$.

For the first element of $M$, we take the pair $p_1 = (e_1', e_1^*)$ where $e_1' = e_0$ (the initial event) and $e_1^*$ is the last event in $E$ where a message from $e_1'$ is received; we also define $e_0^* = e_0$. Now, we inductively take the pair $p_i = (e_i', e_i^*)$ where $e_i^*$ is the last event to receive a message originated in $e_{i-2}^* \ldots e_{i-1}^*$ and $e_i'$ is the first corresponding event to have sent such a message ($e_i^*$ may receive more than one). We continue to select pairs until the pair $p_{k'} = (e_{k'}', e_{k'}^*)$ where $e_{k'}^*$ is the last event of $E$ (note that this construction is possible since $E$ is covered).

The set $M = \{(e_1', e_1^*), \ldots, (e_{k'}', e_{k'}^*)\}$ clearly covers $E$, implying that $E$ can be covered by $k'$ message hops. We now show that there exists a time assignment $\overline{E}$ with $k''$ rounds in which $\lceil k''\rceil = k'$.
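The inductive selection of pairs described above is essentially a greedy procedure, sketched below in Python. The encoding is an assumption for illustration: events are indices $0,\ldots,l$, hops are hypothetical `(send_index, receive_index)` pairs, and the sender window $e_{i-2}^* \ldots e_{i-1}^*$ is taken to include both endpoints.

```python
def greedy_cover(hops, l):
    """Select hops p_1, ..., p_k' following the construction in the proof.

    Returns a list of (send_index, receive_index) pairs; raises ValueError
    if the execution (events 0..l) is not covered.
    """
    cover = []
    lo, hi = 0, 0                    # e_{i-2}^* and e_{i-1}^*; both e_0 at first
    while hi < l:
        # Hops sent within the window and received strictly after it.
        candidates = [(s, r) for s, r in hops if lo <= s <= hi and r > hi]
        if not candidates:
            raise ValueError("execution is not covered")
        last_recv = max(r for _, r in candidates)                    # e_i^*
        first_send = min(s for s, r in candidates if r == last_recv) # e_i'
        cover.append((first_send, last_recv))
        lo, hi = hi, last_recv
    return cover


hops = [(0, 2), (1, 3), (2, 3), (3, 5), (4, 5)]
print(greedy_cover(hops, 5))
```

On this six-event example the sketch selects three hops, one per iteration, each ending at the last delivery of a message sent in the previous window.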

Consider the following time assignment:

  • $t_0 = t_1' = t_0^* = 0$

  • $t_1^* = 1$

Take the sub-sequence $E_1$ containing all events in $e_0^* \ldots e_1^*$ except for $e_0^*$. Note that $e_2'$ appears in $E_1$ and, by construction, every message originated in $E_1$ that is received in the execution is received before $e_2^*$ or at $e_2^*$.

We now enumerate the events in $E_1$ in reverse order: $e_1^*$ is assigned $0$, the event preceding $e_1^*$ receives $1$ and so on until the first event of $E_1$ receives $n_1$. Assign time to these events according to their enumeration $j_1$ as follows:

  • $t_{j_1} = t_1^* - \epsilon_1 j_1$, where $0 < \epsilon_1 n_1 < t_1^*$ if $n_1 > 0$ and $\epsilon_1 = 0$ otherwise.

We set $t_2^* = t_1^* - \epsilon_1 n_1 + 1$, so that every message hop originated from $E_1$ satisfies the upper bound on message delay.

In general, let $E_i$ be the sub-sequence containing all events in $e_{i-1}^* \ldots e_i^*$ except for $e_{i-1}^*$. Enumerate the events in $E_i$ in reverse order: $e_i^*$ receives $0$, the event preceding it receives $1$ and so on until the first event in $E_i$ receives $n_i$. Assign time to these events according to their enumeration $j_i$ as follows:

  • $t_{j_i} = t_i^* - \epsilon_i j_i$, where $0 < \epsilon_i n_i < (t_i^* - t_{i-1}^*)$ if $n_i > 0$ and $\epsilon_i = 0$ otherwise.

We set $t_i^* = t_{i-1}^* - \epsilon_{i-1} n_{i-1} + 1$. For simplicity, assume that every $n_i > 0$ (in the case where some $n_i = 0$, we just set $t_i^* = t_{i-1}^* + 1$ and the following analysis works analogously). From the time assignments above:

$$\textit{duration}(\overline{E}) = k'' = t_{k'}^* - t_0^* \tag{1}$$
$$t_{k'}^* = k' - (\epsilon_1 n_1 + \ldots + \epsilon_{k'-1} n_{k'-1}) \tag{2}$$

With the following constraint, for all $i = 1, \ldots, k'$ (we set $\epsilon_0 n_0 = 0$):

$$0 < \epsilon_i n_i < 1 - \epsilon_{i-1} n_{i-1} \tag{3}$$

Let us set $\epsilon = \epsilon_1 = \epsilon_2 = \ldots = \epsilon_{k'-1}$, and let $n_{max} = \max(n_1, \ldots, n_{k'-1})$. The conditions in (3) can be satisfied by making:

$$\epsilon n_{max} < 1 - \epsilon n_{max} \tag{4}$$
$$\epsilon < \frac{1}{2 n_{max}} \tag{5}$$

In order to make $\lceil k''\rceil = k'$, the difference between $k'$ and $k''$ needs to be in the interval:

$$0 \leq (k' - k'') < 1 \tag{6}$$

Thus,

$$k' - k'' = k' - k' + (\epsilon_1 n_1 + \ldots + \epsilon_{k'-1} n_{k'-1}) < 1 \tag{7}$$
$$(\epsilon n_1 + \ldots + \epsilon n_{k'-1}) < 1 \tag{8}$$
$$(\epsilon n_1 + \ldots + \epsilon n_{k'-1}) < (k'-1)\epsilon n_{max} \tag{9}$$

To satisfy (8), we can require that:

$$(k'-1)\epsilon n_{max} < 1 \tag{10}$$
$$\epsilon < \frac{1}{(k'-1) n_{max}} \tag{11}$$

From (5) and (11):

$$\epsilon < \min\left(\frac{1}{2 n_{max}}, \frac{1}{(k'-1) n_{max}}\right) \tag{12}$$

As long as inequality (12) is satisfied, the time assignments we have chosen guarantee that $\lceil k''\rceil = k'$. Because $k'' \leq k$ and $k'$ message hops cover $E$ for any time assignment, it also follows that $\lceil k\rceil = k'$.
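As a numeric sanity check of the bound in (12), the Python sketch below picks an $\epsilon$ strictly below the bound and evaluates $k''$ from equations (1) and (2) with $t_0^* = 0$. The function name and the choice of half the bound for $\epsilon$ are assumptions made for the example; exact rational arithmetic is used to avoid rounding effects.

```python
from fractions import Fraction
from math import ceil


def duration_for_counts(ns):
    """Given ns = [n_1, ..., n_{k'-1}] (all positive), return (k', k'')."""
    if not ns or any(n <= 0 for n in ns):
        raise ValueError("need at least one count, all positive")
    k_prime = len(ns) + 1
    n_max = max(ns)
    # Inequality (12): epsilon must stay strictly below both bounds.
    bound = min(Fraction(1, 2 * n_max), Fraction(1, (k_prime - 1) * n_max))
    eps = bound / 2                          # any value in (0, bound) works
    # Equations (1) and (2) with the same epsilon for every i.
    k_second = k_prime - eps * sum(ns)
    return k_prime, k_second


k_prime, k_second = duration_for_counts([3, 1, 2])
print(k_prime, k_second, ceil(k_second) == k_prime)
```

For the counts $n_1 = 3$, $n_2 = 1$, $n_3 = 2$ we get $k' = 4$, the bound in (12) is $1/9$, and with $\epsilon = 1/18$ the duration is $k'' = 4 - 6/18 = 11/3$, whose ceiling is indeed $4$.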

Corollary C.6.

If a covered execution $E$ has $k$ CR rounds, then $\lceil k\rceil$ message hops are necessary and sufficient to cover $E$.

The next results corroborate the equivalence between NTR and CR.

Theorem C.7.

Let $E$ be a finite covered execution. $E$ has $k$ rounds in the NTR metric iff $k$ message hops are necessary and sufficient to cover it.

Proof C.8.

Let events in $E$ be assigned rounds according to NTR, resulting in $k$ rounds. We can select a set $M$ of $k$ message hops as follows (sufficiency):

  • Take the first pair $p_1 = (e_0^*, e_1^*)$;

  • Take the pair $p_i = (e_i', e_i^*)$, where $e_i^*$ is the last event of round $i$, and $e_i'$ is the first event from which a message is received in $e_i^*$ ($e_i'$ has to be an event of round $i-1$).

Since there are $k$ rounds, $M$ has $k$ message hops. It is also easy to see that $M$ covers $E$.

Suppose that a sequence $M$ of $k'$ message hops can cover $E$ with $k' < k$. If we assume that in each pair $(e_l, e_m)$ the two events are either assigned the same round or $e_m$ is assigned one round higher than $e_l$, then since $k' < k$, there would be an entire round that is not covered by any message hop. On the other hand, a pair $(e_l', e_m')$ cannot have $e_m'$ assigned two (or more) rounds higher than $e_l'$ by definition, since $e_m'$ receives a message from $e_l'$ (necessity).

Now let $k$ message hops be necessary and sufficient to cover $E$, and assume that $E$ has $k'$ NTR rounds. Then $k = k'$, since $k'$ message hops are necessary and sufficient to cover $E$.

Corollary C.9.

A finite covered execution has $k$ CR rounds iff it has $\lceil k\rceil$ NTR rounds.

C.2 Latency between arbitrary events

We generalize Definition C.3 to account for the time between any two events in an execution.

Definition C.10 (Event cover).

Let $E$ be a finite execution, $e_i$ and $e_j$ ($j>i$) events in $E$ and $M$ a set of message hops from $E$. We say that $M$ covers $(e_i,e_j)$ if $\textsf{interval}(e_i \ldots e_j) \subseteq \textsf{interval}(M)$. Analogously, we say that $(e_i,e_j)$ can be covered by $k$ message hops if there is such a set $M$ with $|M| = k$.

Theorem C.11.

Let $E$ be a covered execution and $e_i$ and $e_j$ be events in $E$. There are $k$ CR rounds in between $e_i$ and $e_j$ iff $\lceil k\rceil$ message hops are necessary and sufficient to cover them.

The proof of Theorem C.11 is similar to that of Theorem C.4 and is omitted (we can consider a covered sub-sequence of $E$ with $k$ rounds as a covered execution).

Theorem C.12.

Let $E$ be a covered execution and $e$ and $e'$ be events in $E$. If there are $k$ rounds in between $e$ and $e'$ according to NTR (Definition 6.13) then $k$ message hops are necessary and sufficient to cover $e$ and $e'$.

Proof C.13.

Let $e$ be assigned round $0$ (as well as all previous events) and $e'$ round $k$. Take $e_1^*$, the last event of round $1$, and the earliest event $e_1'$ from which $e_1^*$ received a message. Since $e_1^*$ receives a message from round $0$, $e_1'$ must be assigned round $0$.

Inductively, take $e_i^*$, the last event of round $i$, and the earliest event $e_i'$ from which $e_i^*$ receives a message. Since $e_i^*$ receives a message from round $i-1$ (by definition), $e_i'$ must be assigned round $i-1$.

Consider the set $M = \{(e_1', e_1^*), \ldots, (e_k', e_k^*)\}$. $M$ clearly covers $e$ and $e'$ (sufficiency).

Now consider a set $M'$ with $k'$ message hops such that $M'$ covers $e$ and $e'$. Since $M'$ covers the two events, there must be a message hop whose first event (the sender event) is in round $0$. This is true for any round up to $k-1$: suppose that there is a round $i$ where no message hop in $M'$ has the first event in round $i$, then since $e, \ldots, e'$ is covered, there exists a message originated from a previous round $j < i$ that is received in a round $l > i$. But then $l \leq i+1$ by definition of the metric, a contradiction. Thus, $M'$ includes at least one message hop for each round from $0$ to $k-1$, so $k' \geq k$ (necessity).

Corollary C.14.

Let $E$ be a covered execution and $e_i$ and $e_j$ be events in $E$. There are $k$ CR rounds in between $e_i$ and $e_j$ iff there are $\lceil k\rceil$ rounds in between them according to NTR.

Finally, we prove Theorem 6.16, relating IRA to NTR.

C.3 Proof of Theorem 6.16

Let $E$ be a finite covered execution and suppose that all events of $E$ are assigned rounds according to IRA after all iterations of the algorithm. It holds that:

  1. Round $0$ is composed only of $e_0$ (the initial event).

  2. The final event of round $i+1$ is the last event to receive a message from round $i$.

Proof C.15.

1. The case where $E$ has a single event is immediate; next, we consider executions with more than one event. From the algorithm, $e_0^* = e_0$ ($e_0^*$ does not change). Since $E$ is covered, there is at least one event which receives a message from $e_0$. Let $e'$ be the last such event. When the algorithm arrives at the iteration for $e'$, since the oldest message is from round $0$ (from $e_0$), all events after $e_0^*$ until $e'$ are assigned round $1$ (line 62). Since no event can receive round $0$ in later iterations, $e_0$ is the only event remaining with round $0$ assigned.

2. As shown above, there is a single event in round $0$. Let $e'$ be the last event to receive a message from $e_0$; in its iteration, $e'$ then receives round $1$. The events following $e'$ (assuming $e'$ is not the last event) might momentarily be assigned to round $1$ (if they do not receive any message, line 57), but since the execution is covered, there must be an event after $e'$ that receives a message from $e_1 \ldots e'$. Let $e''$ be the last such event; in its iteration, $e''$ is assigned round $2$, and all events after $e'$ (which is the last event $e_1^*$ is assigned to, line 63) also receive round $2$. No later iteration can assign round $1$ to those events since no other event receives a message from round $0$; thus $e'$ is the last event to remain with round $1$ assigned.

Now assume that the final event $e_i^*$ of round $i$ is the last to receive a message from round $i-1$, and that there is an event assigned to round $i+1$. Suppose that $e^*$, the last event to receive a message from round $i$, is not the final event of round $i+1$. Since $e^*$ receives a message from round $i$ but not from an event before round $i$, it has to be assigned round $i+1$ and $e_{i+1}^*$ receives $e^*$ in line 63 of the algorithm. It follows that the final event of round $i+1$ comes after $e^*$ and receives no message from $e_0, \ldots, e_i^*$. Because the execution is covered, there must be at least one event after the final event of round $i+1$ that receives a message from round $i+1$. Once more, consider the last such event, so all events after $e_{i+1}^*$ until this event are assigned round $i+2$, leaving $e^*$ as the final event of round $i+1$.

Appendix D One-Shot Lattice Agreement

In the One-Shot Lattice Agreement problem, every process starts by proposing an initial value and terminates when it learns a value, such that Validity, Consistency and Liveness are satisfied (Section 3). In this section, we analyze the time complexity of one-shot LA protocols, as the abstraction can be used as a building block for implementing ASO [8, 17].

In every protocol execution, all processes start proposing a value simultaneously, i.e., in the initial event. We measure the time for all correct processes to learn a value, both in fault-free executions and in the worst case. In fault-free executions, all processes are correct and every message sent in the execution must arrive. In the worst case, on the other hand, there is a set of correct processes $P$ and a set of potentially faulty processes $F$, where $P$ has $f+1$ processes and $F$ has $f$ processes. All messages exchanged within $P$ arrive, but this is not necessarily the case for exchanges within $F$ or between $P$ and $F$.

We show that: 1) the protocol presented in [16] has constant latency in fault-free runs, as opposed to the $O(n)$ complexity claimed in the paper; 2) with the conventional model of reliable channels assumed in this paper, the protocol of [17] has $\Omega(f)$ time complexity in the worst case, as opposed to $O(\sqrt{f})$ when assuming their model; 3) the protocol of [20] has $\Omega(f)$ time complexity in the worst case, which is not analyzed in their paper.

D.1 One-Shot Lattice Agreement by Faleiro et al. [16]

Figure 6 shows the one-shot LA description which was extracted from [16]. The protocol describes the roles of proposers and acceptors, but we assume that all processes perform both roles.

A proposer proceeds in rounds, where each round consists of sending a proposed value to every acceptor and waiting for the reply from a majority of them. If all replies are acknowledgments, the process can learn the current proposed value. On the other hand, if there is a NACK with an unseen value, the proposer joins it with the previously proposed value and re-sends the result.

An acceptor stores the join of every proposed value it receives. When a proposal is received such that it contains all the stored values, the acceptor replies with an acknowledgment, otherwise it sends the stored values back to the proposer in a NACK message.
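The proposer and acceptor roles described above can be sketched as follows. This is a minimal illustration and not the code of [16]: it assumes that the lattice is the set of finite sets ordered by inclusion (so the join is set union), and it replaces message passing by direct calls to the first `majority` acceptors. The names `Acceptor`, `on_proposal` and `proposer_round` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Acceptor:
    # Join of every proposed value received so far (assumed lattice: sets under union).
    accepted: frozenset = frozenset()

    def on_proposal(self, proposed: frozenset):
        """Reply ("ACK", proposal) if the proposal contains all stored values,
        otherwise ("NACK", stored values)."""
        if self.accepted <= proposed:
            self.accepted = proposed
            return ("ACK", proposed)
        # The proposal misses a stored value: store the join and send it back.
        self.accepted = self.accepted | proposed
        return ("NACK", self.accepted)


def proposer_round(proposed: frozenset, acceptors, majority: int):
    """Run one proposer round against `majority` acceptors.

    Returns (learned, value): if every reply is an ACK the proposal is learned,
    otherwise `value` is the refined proposal to send in the next round.
    Contacting acceptors by direct call is a simplification of this sketch.
    """
    replies = [a.on_proposal(proposed) for a in acceptors[:majority]]
    if all(kind == "ACK" for kind, _ in replies):
        return True, proposed
    refined = proposed
    for kind, value in replies:
        if kind == "NACK":
            refined = refined | value  # join the unseen values into the proposal
    return False, refined
```

In this sketch, a proposer whose first proposal is rejected obtains a refined proposal that contains every value reported in a NACK, which is the step the round-counting argument below relies on.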

Figure 6: One-shot LA algorithm as presented in [16].
Theorem D.1.

The one-shot LA protocol of [16] takes at most 6 rounds in fault-free runs.

Proof D.2.

All processes start their proposal in the initial event, which is in round 0. Regardless of the order of messages, by the last event of round 1, every process will have received everyone’s first proposal, and its local $\mathit{acceptedValue}$ is a join of all initial values. So every reply made in round 2 onward will contain all values.

Consider a process $p$. Every process receives $p$’s proposal in round 1 and replies, so $p$ refines its proposal (and proposes again) in round 2 at the latest. If $p$ re-proposes in round 1, let $Q$ be the set of processes from which $p$ receives the replies for this refined proposal. Then either no reply from $Q$ is made in round 2 (only in round 1), or some reply is from round 2.

In the first case, since all replies come from round 1, the refinement and new proposal must happen in either round 1 (in which case we come back to the situation above) or round 2. In the second, $p$ receives all values and re-proposes in at most round 3, and since the proposal contains all values, $p$ learns a value by at most round 5.

Now the only remaining case is when $p$ initiates a new proposal in round 2 with some value missing. In this case all replies will be a join of all values, and by at most round 4, $p$ will refine the proposal, learning a value by at most round 6.

D.2 One-Shot Lattice Agreement by Garg et al. [17, 18]

Garg et al. [17, 18] assume a stronger underlying reliable channel for communication than in this paper. In their papers, the channel is responsible for delivering a message sent from one process to another, thus messages sent by faulty processes (to correct ones) are guaranteed to arrive in an infinite execution. In the following, we analyze their protocol under the more conventional assumption that messages from faulty processes may never be received.

Figure 7 (extracted from [17]) shows a description of their one-shot LA. Every process $i$ maintains a local view array, where each position $j$ in the array contains the values $i$ received from $j$. At the start of the protocol, every process sends its initial value to everyone. Processes relay (execute the block from lines 5 to 7) every new value they receive from other processes, and can learn a value once their local view satisfies a predicate called equivalence quorum. Intuitively, the local view $V$ of process $i$ satisfies the predicate if there is a quorum in which $V[i]=V[j]$ for every process $j$ in the quorum.
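The equivalence-quorum predicate can be illustrated with a short check. This is only a sketch of the intuition stated above, not the code of [17]: the view is modeled as a list of sets, and the quorum size is passed in as a parameter (`quorum_size` is a hypothetical name; its value depends on the fault model).

```python
def has_equivalence_quorum(view, i, quorum_size):
    """Return True if at least `quorum_size` positions j satisfy view[j] == view[i].

    view[j] is the set of values that process i has received from process j;
    view[i] is process i's own position, so i always counts towards the quorum.
    """
    matching = sum(1 for j in range(len(view)) if view[j] == view[i])
    return matching >= quorum_size
```

For example, with three processes and a quorum of size 2, the view `[{1, 2}, {1, 2}, {1}]` of process 0 satisfies the predicate, while `[{1, 2}, {1}, {2}]` does not.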

Figure 7: LA algorithm as presented in [17].
Theorem D.3.

The one-shot LA protocol of [17] takes at most 2 rounds in fault-free runs.

Proof D.4.

Every process sends its initial value in round 0. By the end of round 1, every process has already received and relayed all other values, so that by the end of round 2, all the local views contain every value, resulting in all processes learning a value.

Theorem D.5.

The one-shot LA protocol of [17] has $\Omega(f)$ worst-case latency.

Proof D.6.

We proceed to build an execution that takes at least $f/2$ rounds to complete. Assume w.l.o.g. that the number of faulty processes is even. Split $F$ into two groups $A=\{l_{1},\ldots,l_{f/2}\}$ and $B=\{l_{f/2+1},\ldots,l_{f}\}$ with $f/2$ processes each. In round 0, every process sends its initial value to everyone.

[Round 1] At the beginning of the round, the value $(x_{1},l_{1})$ from $l_{1}$ is received and relayed by every process in $B$, as well as by a single correct process $l_{c}$. All remaining values $(x_{i},l_{i})$ from processes in $A$ are received by a single process $l_{f/2+1}\in B$, which relays them. Processes in $A$ crash just after $l_{f/2+1}$ receives their values, and no other process receives any message from them. At the end of the round, initial values from every non-crashed process (including those in $B$) are received and relayed by every non-crashed process.

[Round 2] At the beginning, the first of the remaining values $(x_{2},l_{2})$ that $l_{f/2+1}$ relayed is received (and relayed) by every non-crashed process in $B$, as well as by $l_{c}$. A single process $l_{f/2+2}$ receives all other $f/2-2$ values from $l_{f/2+1}$ and relays them; then $l_{f/2+1}$ crashes and no other process receives messages from it. Finally, every non-crashed process receives the values relayed by other non-crashed processes in the previous round. By the end of round 2, any non-crashed process has in its view $V[j]$ all initial values sent by non-crashed processes, but for a single correct process $l_{c}$ and all non-crashed processes in $B$, their position in the view also contains $(x_{1},l_{1})$. Therefore, no equivalence quorum exists in any local view.

[Round $i+1$] At the beginning, the first of the remaining values $(x_{i+1},l_{i+1})$ that $l_{f/2+i}$ relayed is received (and relayed) by every non-crashed process in $B$, as well as by $l_{c}$. A single process $l_{f/2+i+1}$ receives all other $f/2-(i+1)$ values from $l_{f/2+i}$ and relays them; then $l_{f/2+i}$ crashes and no other process receives messages from it. At the end of the round, every non-crashed process receives the values relayed by other non-crashed processes in the previous round, but messages from $l_{c}$ are received before any other. Any non-crashed process has in its view $V[j]$ all initial values sent by non-crashed processes, with the addition of $(x_{1},l_{1}),\ldots,(x_{i-1},l_{i-1})$, but for a single correct process $l_{c}$ and all remaining non-crashed processes in $B$, their position in the view also contains $(x_{i},l_{i})$. Note that, since messages from $l_{c}$ are received first, non-crashed processes receive and relay $(x_{i},l_{i})$ before forming an equivalence quorum for previous values, and no process is able to learn a value in this round.

We can use the above method to delay the execution by $f/2$ rounds.

D.3 One-Shot Lattice Agreement by Imbs et al. [20]

The protocol displayed in Figure 8 (extracted from [20]) solves the problem of Set-Constrained Delivery Broadcast (SCD-Broadcast). It can easily be adapted to solve one-shot LA by adding to the condition of line 17 that the initial value must be in the output.

The authors use a FIFO broadcast primitive for forwarding messages, so in the following proofs we will assume message channels to be FIFO. We say that a process relays a value when it executes line 11 of the algorithm (it sends a newly received value to everyone). The fundamental blocks of the protocol include:

  • Each process has a logical clock which ticks every time a new value is received and relayed (including its own initial value). The current clock value is attached to the relaying message (called a forward message).

  • Each process stores a set of value views: an array of logical clock values, one position for each process.

  • The following predicate needs to hold in order to output a set of values $O$: Let $A$ be the set of all received values and $V$ be the set of values received by a quorum. An output is a non-empty set $O\subseteq V$ satisfying: $\forall w\in O,\forall v\in A-O$, there is a quorum in which each individual clock value for $w$ is smaller than the corresponding value for $v$.
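The output predicate can be illustrated with a small check for a given candidate set. This is a sketch under stated assumptions and not the code of [20]: forward messages are modeled as a dictionary `clocks[v][p]` holding the clock value that process `p` attached to value `v`, a process that has not yet forwarded `v` is assumed to count as having clock value infinity for `v`, and `quorum_size` is a hypothetical parameter.

```python
import math


def can_output(candidate, clocks, quorum_size):
    """Check whether the non-empty set `candidate` (O) may be output.

    clocks maps each received value v (the set A) to a dict {process: clock}
    built from the forward messages received for v.
    """
    received = set(clocks)  # A: all values received so far
    if not candidate or not candidate <= received:
        return False
    # O must be a subset of V: each value in O was forwarded by a quorum.
    if any(len(clocks[w]) < quorum_size for w in candidate):
        return False
    for w in candidate:
        for v in received - candidate:
            # Processes whose clock for w is smaller than their clock for v.
            # Assumption of this sketch: a missing forward for v counts as infinity.
            earlier = sum(
                1 for p, c in clocks[w].items() if c < clocks[v].get(p, math.inf)
            )
            if earlier < quorum_size:
                return False
    return True
```

With three processes, a quorum of size 2, and `clocks = {"a": {0: 1, 1: 1, 2: 2}, "b": {0: 2, 1: 2, 2: 1}}`, the set `{"a"}` can be output on its own, whereas `{"b"}` cannot be output without `"a"`. The lower-bound executions below exploit this kind of dependency between values.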

Figure 8: SCD algorithm as presented in [20].

Note that each process starts by sending its initial value to itself before relaying it to everyone. For simplicity, we consider these two actions to be in a single event (the initial event), where the first message sent to itself is ignored.

Theorem D.7.

The SCD-Broadcast protocol in [20] takes at most 2 rounds in fault-free runs.

Proof D.8.

In the first event (round 0), every process forwards its own initial value to everyone. At the end of round 1, all processes have already received all initial values and relayed them. At the end of round 2, regardless of the order, all processes have received all values from everyone. As a consequence, $A-V=\emptyset$ in their local view, so all processes can output $V$.

Theorem D.9.

The SCD-Broadcast protocol in [20] has $\Omega(f)$ worst-case latency.

Proof D.10.

We proceed to build an execution that takes at least $f/4$ rounds to complete. When we say that a process crashes at some point in the execution, we mean that the process takes no further steps and that no further messages are received from it, unless explicitly stated.

Assume w.l.o.g. that the number of faulty processes is a multiple of four. Split $F$ into four groups with $f/4$ processes each: $A$, $B$, $C$ and $C^{\prime}$. In addition, split $P$ into two groups $D$ and $D^{\prime}$ with $f/2$ and $f/2+1$ processes respectively. In round 0, every process FIFO broadcasts its initial value to everyone.

[Round 1] At the beginning of the round, a single process $f_{1}\in C$ receives and relays every value $v_{1},\ldots,v_{f/4}$ (in this order) from processes in $A$. All processes in $A$ then crash. Similarly, a single process $f_{1}^{\prime}\in C^{\prime}$ receives and relays every value $v_{1}^{\prime},\ldots,v_{f/4}^{\prime}$ (in this order) from processes in $B$, which then crash.

Subsequently, processes in $D$ and all remaining non-crashed processes in $C$ receive $v_{1}$ from $f_{1}$. Moreover, processes in $D^{\prime}$ and all remaining non-crashed processes in $C^{\prime}$ receive $v_{1}^{\prime}$ from $f_{1}^{\prime}$. Note that no process in $C\cup D$ received $v_{1}^{\prime}$ and no process in $C^{\prime}\cup D^{\prime}$ received $v_{1}$. Both $f_{1}$ and $f_{1}^{\prime}$ then crash.

At the end of the round, every initial value from non-crashed processes (sent in round 0) is received and relayed by non-crashed processes.

[Round $i$ ($i\geq 2$)] At the beginning of the round, a single process $f_{i}\in C$ receives and relays $v_{i},\ldots,v_{f/4}$ from $f_{i-1}$ (resp. $f_{i}^{\prime}\in C^{\prime}$ receives and relays $v_{i}^{\prime},\ldots,v_{f/4}^{\prime}$ from $f_{i-1}^{\prime}$).

Subsequently, processes in $D$ and all remaining non-crashed processes in $C$ receive $v_{i}$ from $f_{i}$ (but not $v_{i}^{\prime}$). Processes in $D^{\prime}$ and all remaining non-crashed processes in $C^{\prime}$ receive $v_{i}^{\prime}$ from $f_{i}^{\prime}$ (but not $v_{i}$). Both $f_{i}$ and $f_{i}^{\prime}$ then crash. Finally, every remaining value sent in the previous round by non-crashed processes is received (and relayed if applicable).

Output conditions. We use $C\cup D$ (resp. $C^{\prime}\cup D^{\prime}$) to refer to the processes in $C\cup D$ (resp. $C^{\prime}\cup D^{\prime}$). By construction, $|C\cup D|\leq|C^{\prime}\cup D^{\prime}|<f+1$. When $C\cup D$ receives $v_{1}$ in round 1, it gives a clock value of 2 to $v_{1}$ (similarly with $C^{\prime}\cup D^{\prime}$ and $v_{1}^{\prime}$). At the end of the round, $C\cup D$ receives other initial values from non-crashed processes, but not $v_{1}^{\prime}$, so $v_{1}^{\prime}$ is attributed a higher clock value later. This ensures that $v_{1}^{\prime}$ cannot be in the output without $v_{1}$, since $C\cup D$ never receives a quorum of forward messages for $v_{1}$ with clock value smaller than that of $v_{1}^{\prime}$. But all the forward messages received later for $v_{1}^{\prime}$ from $C^{\prime}\cup D^{\prime}$ have their clock values smaller for $v_{1}^{\prime}$ than for $v_{1}$, so $v_{1}$ also cannot be in the output without $v_{1}^{\prime}$.

In addition, $C\cup D$ receives $v_{i}$ before receiving $v_{i-1}^{\prime}$, attributing a smaller clock value to $v_{i}$. By the end of round $i$, $C\cup D$ has no quorum for which clock values are smaller for $v_{i-1}^{\prime}$ than for $v_{i}$, and thus cannot output $v_{i-1}^{\prime}$. This creates a chain of dependencies where $v_{i-1}$ cannot be in the output without $v_{i-1}^{\prime}$ (and vice versa), $v_{i-1}^{\prime}$ cannot be in the output without $v_{i}$, and $v_{i}$ cannot be in the output because not enough forward messages for $v_{i}$ are received in round $i$. Therefore, at the end of round $i$, $C\cup D$ (and $C^{\prime}\cup D^{\prime}$) is unable to output a value.

The execution described above can be extended for $f/4$ rounds.

Appendix E Atomic Snapshot Operations

The papers [16] and [20] present long-lived forms of the algorithms in Appendix D, which one can use to implement ASO. In the following, we show that an ASO operation using [16] has constant amortized time complexity, and thus conjecture that it has $O(k)$ time complexity in the worst case. On the other hand, the ASO operation latency of [20] is $O(n)$ even in fault-free runs.

E.1 Atomic Snapshot by Faleiro et al. [16]

The Generalized Lattice Agreement (GLA) described in Figure 9 splits the roles of the processes into proposers, acceptors and learners. For our purpose, we assume that every process performs the three roles. In addition, we add Algorithm 5 on top of the GLA protocol to match the interface used in Algorithm 1.

Figure 9: LA algorithm as presented in [16].
110:Distributed objects:
111:  GLA instance (Figure 9)
112:operation Propose(v)
113:  ReceiveValue(v)
114:  wait until $v\sqsubseteq$ LearntValue()
115:  return LearntValue()
Algorithm 5 Bridge protocol for Generalized Lattice Agreement [16].
Theorem E.1.

Consider the ASO protocol built from the composition of Algorithms 1 and 5. An operation takes at most 16 rounds to complete if, during its interval, no correct process receives a message from a faulty one.

Proof E.2.

A message sent by a correct process is received by every correct process, and if a message sent in round $r$ is received, it must be received in at most round $r+1$ (from the definition of the metric). Since no message from faulty processes is received in the interval of the operation, we consider only events performed by correct ones.

First, we show that once a process sends a proposal (line 24 in Figure 9) for a value $v$, all learners learn a value containing $v$ in at most 8 rounds.

Let $e_{P}$ be the event where process $i$ first sends a proposal for $v$, and let 0 be the round assigned to it. By the end of round 1, every (correct) process will have received $v$ and joined it in acceptedValue, so that every NACK reply will now include $v$. As a consequence, every value learned from a proposal (or refinement) made after round 1 must contain $v$.

Suppose that some process already learned a value containing $v$ by the end of round 2; then it received ACKs for this value from a majority of processes (which are correct). Every learner (thus, every process) receives the same ACKs within one round at most and is able to learn the same value.

Now, if no process has already learned a value containing $v$, consider the InternalReceive($v$) message which is sent before the proposal. By the end of round 1, every process has received the message and added $v$ to its buffer, and since no process had learned $v$ by the end of round 2, every process must be proposing (i.e., status = active).

Let $V$ be the set of all active proposals at the end of round 2; then by the end of round 3 every acceptor will have received every value in $V$ and added it to acceptedValue. So every reply made in round 4 onward will contain all current values. If a process refines its proposal in round 5, then it must have received at least one reply containing all values for the previous proposal, so by at most round 6 all acceptors would reply with ACK and all learners would learn a value by at most round 7.

Now consider the case where a process $j$ refines its proposal in round 4. It may happen that the refined proposal still misses a value, in which case $j$ refines again in round 6 (at the latest) and this next proposal is guaranteed to include all values. Thus, all processes acknowledge the proposal by at most round 7 and all learners are able to learn a value by round 8.

Let $e_{C}$ be the application call event received at a process $i$, $e_{R}$ its return event, and $v$ the value received for the operation. If $i$ is already active, it first buffers $v$ and waits until the current active proposal finishes before sending a proposal for $v$. Consider the worst case where $e_{C}$ happens just after $i$ started a new active proposal. As previously shown, it takes at most 8 rounds until $i$ can propose a new value from bufferedValues again, and once it proposes $v$ it can take another 8 rounds at most to learn a value with it. In total, from the call event to the return event, there can be at most 16 rounds.

Corollary E.3.

Algorithms 1 and 5 together have an amortized time complexity of 16 rounds.

E.2 Atomic Snapshot by Imbs et al. [20]

Imbs et al. [20] use operations of the SCD-Broadcast protocol to implement atomic snapshot. As such, in the proof of Theorem E.4 we build an execution that takes $\Omega(n)$ rounds for a process to output a value in the SCD-Broadcast protocol, implying the same time complexity for a snapshot operation.

Figure 10 (extracted from [20]) shows the algorithm for MWMR ASO using SCD-Broadcast. The main difference from the SWMR implementation is the addition of line 3, which includes a “read” phase before updating the array and thus requires two SCD-Broadcast operations instead of one. As we only consider SWMR ASO implementations, we assume that the only operations in the executions are snapshots (which are unchanged and require a single SCD-Broadcast operation).

Figure 10: AS algorithm as presented in [20].
Theorem E.4.

A snapshot operation in [20]’s protocol can take $\Omega(n)$ rounds in fault-free runs.

Proof E.5.

First consider an execution of SCD-Broadcast with an even number of processes. We proceed to build an execution where an operation takes $n$ rounds to complete. We split the system into two groups $A$ and $B$ with $n/2$ processes each. Note that neither $A$ nor $B$ alone forms a quorum. We also say $A$ or $B$ to refer to all processes in $A$ or $B$. In the execution below, every time a process in $A$ (resp. $B$) relays a value (sends a forward message for it), all the processes in $A$ receive it immediately after (similarly for $B$).

[Round 0] A single process $a_{0}\in A$ sends a forward message with $v_{0}^{A}$ to everyone.

[Round 1] At the beginning of the round, $A$ receives $v_{0}^{A}$ and forwards the value. Subsequently, a process $b_{0}\in B$ sends a new forward message for value $v_{0}^{B}$, which is received and relayed right away by $B$ (before $v_{0}^{A}$). At the end of the round, $B$ then receives $v_{0}^{A}$ from $A$ and relays it, but although there is a quorum for $v_{0}^{A}$, no quorum has each clock assignment for $v_{0}^{A}$ smaller than that of $v_{0}^{B}$; thus $B$ cannot output $v_{0}^{A}$ without $v_{0}^{B}$. Since there is no quorum of replies for $v_{0}^{B}$, $B$ cannot output.

[Round $2\cdot i$] At the beginning of the round, a new process $a_{i}\in A$ sends a forward message with $v_{i}^{A}$, received right away (before $v_{i-1}^{B}$ from $B$) by $A$, which relays it. Subsequently, $A$ receives $v_{i-1}^{B}$ and relays it. $A$ is unable to output $v_{i-1}^{A}$ without $v_{i-1}^{B}$ since $B$ assigned a smaller clock value to $v_{i-1}^{B}$, and is unable to output $v_{i-1}^{B}$ without $v_{i}^{A}$ since it assigned a smaller clock value to $v_{i}^{A}$, and there is no quorum of replies received for $v_{i}^{A}$.

[Round $2\cdot i+1$] At the beginning, a process $b_{i}\in B$ sends a forward message with $v_{i}^{B}$, received right away by $B$ (before $v_{i}^{A}$), which relays it. Subsequently, $B$ receives $v_{i}^{A}$ and the reply for $v_{i-1}^{B}$ from $A$, in this order. But $B$ cannot output $v_{i-1}^{B}$ without $v_{i}^{A}$, since there is no quorum assigning a smaller clock value to $v_{i-1}^{B}$ than to $v_{i}^{A}$. But $v_{i}^{A}$ cannot be in the output without $v_{i}^{B}$ either, for which $B$ does not have a quorum of replies. $B$ is therefore unable to output.

Using the steps above we can delay the execution up to $n$ rounds. When the number of processes is odd, we split the system into 3 groups: $A$, $B$ and $C$, and proceed in a similar fashion as above for $A$ and $B$, but a new process from $C$ now has to initiate a new value at the beginning of every turn in order to delay the execution. This construction can delay the execution up to $n/3$ rounds. The execution proceeds as follows:

[Round 0] A single process $a_{0}\in A$ sends a forward message with $v_{0}^{A}$ to everyone. A single process $c_{0}\in C$ sends a forward message with $v_{0}^{C}$ to everyone.

[Round 1] At the beginning of the round, $A$ receives $v_{0}^{A}$ and forwards the value, and $C$ receives $v_{0}^{C}$ and relays it. Subsequently, a process $b_{0}\in B$ sends a new forward message for value $v_{0}^{B}$, which is received and relayed right away by $B$ (before $v_{0}^{A}$ or $v_{0}^{C}$). Also, another process $c_{1}\in C$ sends a forward message with $v_{1}^{C}$ to everyone, which $C$ receives and forwards immediately after. At the end of the round, $B$ receives $v_{0}^{A}$ from $A$ and $v_{0}^{C}$ from $C$ and relays them, but although there is a quorum for $v_{0}^{A}$ and $v_{0}^{C}$, no quorum has each clock assignment for $v_{0}^{A}$ or $v_{0}^{C}$ smaller than that of $v_{0}^{B}$; thus $B$ cannot output $v_{0}^{A}$ and $v_{0}^{C}$ without $v_{0}^{B}$. Since there is no quorum of replies for $v_{0}^{B}$, $B$ cannot output. Similarly, $A$ receives $v_{0}^{C}$ and $C$ receives $v_{0}^{A}$, but they cannot output.

[Round $2\cdot i$] At the beginning of the round, a new process $a_{i}\in A$ sends a forward message with $v_{i}^{A}$, received right away (before $v_{i-1}^{B}$ from $B$ and $v_{2\cdot i-1}^{C}$ from $C$) by $A$, which relays it. Similarly, a process $c_{2\cdot i}\in C$ sends a forward message with $v_{2\cdot i}^{C}$ before $C$ receives $v_{i-1}^{B}$ from $B$.

Subsequently, $A$ receives $v_{i-1}^{B}$ and $v_{2\cdot i-1}^{C}$ and relays them. $A$ is unable to output $v_{i-1}^{A}$ without $v_{i-1}^{B}$ or $v_{2\cdot i-1}^{C}$ since $B$ assigned a smaller clock value to $v_{i-1}^{B}$ and $C$ assigned a smaller clock value to $v_{2\cdot i-1}^{C}$, and is unable to output $v_{i-1}^{B}$ and $v_{2\cdot i-1}^{C}$ without $v_{i}^{A}$ since it assigned a smaller clock value to $v_{i}^{A}$, and there is no quorum of replies received for $v_{i}^{A}$. Moreover, $C$ receives $v_{i-1}^{B}$ from $B$. $C$ cannot output $v_{2\cdot i-2}^{C}$ without either $v_{i-1}^{A}$ or $v_{i-1}^{B}$ because $A$ assigned a smaller value to $v_{i-1}^{A}$ and $B$ assigned a smaller value to $v_{i-1}^{B}$. But neither can be output without $v_{2\cdot i-1}^{C}$, which was assigned a smaller value, and there is no quorum for $v_{2\cdot i-1}^{C}$ at $C$.

[Round $2\cdot i+1$] At the beginning, a process $b_{i}\in B$ sends a forward message with $v_{i}^{B}$, received right away by $B$ (before $v_{i}^{A}$ or $v_{2\cdot i}^{C}$), which relays it. A process $c_{2\cdot i+1}\in C$ sends a forward message with $v_{2\cdot i+1}^{C}$ before receiving $v_{i}^{A}$.

Subsequently, $B$ receives $v_{i}^{A}$ and $v_{2\cdot i}^{C}$ as well as the replies for $v_{i-1}^{B}$ from $A$ and $C$, in this order. But $B$ cannot output $v_{i-1}^{B}$ without $v_{i}^{A}$ or $v_{2\cdot i}^{C}$, since $A$ and $C$ assigned smaller clock values to $v_{i}^{A}$ and $v_{2\cdot i}^{C}$ respectively. But neither $v_{i}^{A}$ nor $v_{2\cdot i}^{C}$ can be in the output without $v_{i}^{B}$, for which $B$ does not have a quorum of replies. Now, $C$ receives $v_{i}^{A}$ and both replies for $v_{2\cdot i-1}^{C}$ from $B$ and $A$, in this order. But $A$ assigned a smaller value to $v_{i}^{A}$ and $B$ also to $v_{i-1}^{B}$, and neither can be output without $v_{2\cdot i}^{C}$, for which $C$ has no quorum.

Corollary E.6.

A snapshot operation in [20]’s protocol can take $\Omega(n)$ rounds in the worst case.