# Raft consensus algorithm implementation for Seastar
Seastar is a high performance server-side application framework
written in C++. Please read more about Seastar at http://seastar.io/
This library provides an efficient, extensible implementation of
the Raft consensus algorithm for Seastar.
For more details about Raft see https://raft.github.io/
## Terminology
The Raft PhD thesis uses a set of terms which are widely adopted in
the industry, including in this library and its documentation. The
library provides replication facilities for **state machines**.
Thus the **state machine** here and in the source is a user
application, distributed by means of Raft.
The library is implemented in a way which allows its key
components to be replaced or plugged in:
- communication between peers, by implementing the **rpc** API
- persisting the library's private state on disk, via the
**persistence** class
- shared failure detection, by supplying a custom
**failure detector** class
- the user state machine, by passing an instance of the **state
machine** class.
Please note that the library internally implements its own
finite state machine for the protocol state - class `fsm`. This class
shouldn't be confused with the user state machine.
## Implementation status
- log replication, including throttling for unresponsive
servers
- managing the user's state machine snapshots
- leader election, including the pre-voting algorithm
- non-voting members (learners) support
- configuration changes using joint consensus
- read barriers
- forwarding commands to the leader
## Usage
In order to use the library, the application has to provide implementations
for the RPC, persistence and state machine APIs, defined in `raft/raft.hh`,
namely:
- `class rpc` provides a way to communicate between Raft protocol instances,
- `class persistence` persists the required protocol state on disk,
- `class state_machine` is the actual state machine being replicated.
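The exact interfaces are declared in `raft/raft.hh` and are asynchronous
(they use Seastar futures). Purely as an illustration of the division of
responsibilities, simplified, synchronous stand-ins might look like the
sketch below; the names and signatures here are assumptions for
illustration, not the library's real API.

```cpp
// Simplified, synchronous stand-ins for the pluggable components.
// See raft/raft.hh for the real, future-based interfaces.
#include <cstdint>
#include <string>
#include <vector>

using term_t = uint64_t;
using index_t = uint64_t;
using server_id_t = uint64_t;   // the real library uses UUIDs

struct log_entry {
    term_t term;
    index_t idx;
    std::string command;        // opaque user command
};

// Communication between Raft protocol instances.
struct rpc_stub {
    virtual void send_message(server_id_t to, const std::string& raft_message) = 0;
    virtual ~rpc_stub() = default;
};

// Durable storage for the protocol's private state.
struct persistence_stub {
    virtual void store_term_and_vote(term_t term, server_id_t vote) = 0;
    virtual void store_log_entries(const std::vector<log_entry>& entries) = 0;
    virtual ~persistence_stub() = default;
};

// The user application being replicated.
struct state_machine_stub {
    virtual void apply_entry(const std::string& command) = 0;
    virtual ~state_machine_stub() = default;
};
```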
A complete description of expected semantics and guarantees
is maintained in the comments for these classes and in sample
implementations. Key aspects the implementor should bear in mind:
- RPC should implement a model of an asynchronous, unreliable network,
in which messages can be lost, reordered, or retransmitted more than
once, but not corrupted. Specifically, it's an error to
deliver a message to the wrong Raft server.
- persistence should provide durable storage which survives
state machine restarts and does not corrupt its state. The
storage should contain an efficient, mostly-appended-to part
holding the Raft log, which may grow to hundreds of thousands of
entries, and a small register-like memory area to hold the Raft
term, vote and the most recent snapshot descriptor.
- the Raft library calls `state_machine::apply_entry()` for entries
reliably committed to the replication log on a majority of
servers. While `apply_entry()` is called in the order
the entries were serialized in the distributed log, there is
no guarantee that `apply_entry()` is called exactly once.
E.g. when a server restarts from the persistent state,
it may re-apply some already applied log entries.
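Because of this at-least-once delivery, one common approach is for the
user state machine to persist the index of the last applied entry
alongside its own data and skip duplicates. A minimal hypothetical
sketch (the key-value layout and the explicit index argument are
assumptions for illustration, not the library's API):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical key-value state machine that tolerates at-least-once
// delivery of log entries by remembering the highest index it has applied.
struct kv_state_machine {
    std::map<std::string, std::string> data;  // replicated state
    uint64_t last_applied = 0;                // persisted together with `data`

    // `idx` is the entry's position in the replicated log.
    void apply_entry(uint64_t idx, const std::string& key, const std::string& value) {
        if (idx <= last_applied) {
            return;  // already applied before a restart; skip the duplicate
        }
        data[key] = value;
        last_applied = idx;
        // In a real implementation, `data` and `last_applied` would be
        // persisted atomically, e.g. as part of a snapshot.
    }
};
```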
Seastar's execution model is that every object is safe to use
within a given shard (physical OS thread). The Raft library follows
the same pattern. Calls to the Raft API are safe when they are local
to a single shard. Moving instances of the library between shards
is not supported.
### First usage
For an example of first usage see `replication_test.cc` in test/raft/.
In a nutshell:
- create instances of RPC, persistence, and state machine
- pass them to an instance of the Raft server - the facade to the Raft cluster
on this node
- call `server::start()` to start the server
- repeat the above for every node in the cluster
- use `server::add_entry()` to submit new entries;
`state_machine::apply_entries()` is called after the added
entry is committed by the cluster.
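The sequence above, in a deliberately simplified, self-contained toy form
(the real server is asynchronous and is constructed as shown in
`replication_test.cc`; everything below is a stand-in for illustration):

```cpp
// Toy illustration of the startup flow only; the real classes and the
// server construction live in raft/raft.hh and use Seastar futures.
#include <iostream>
#include <string>
#include <vector>

// Stand-ins for the application-provided components (see raft/raft.hh).
struct toy_rpc {};
struct toy_persistence {};
struct toy_state_machine {
    std::vector<std::string> applied;
    void apply_entry(const std::string& cmd) { applied.push_back(cmd); }
};

// Stand-in for the Raft server facade on one node.
struct toy_server {
    toy_rpc rpc;
    toy_persistence persistence;
    toy_state_machine sm;

    void start() { /* load persisted state, start the protocol */ }

    void add_entry(const std::string& cmd) {
        // This toy applies immediately; the real server replicates the
        // entry and calls the state machine only after a cluster
        // majority commits it.
        sm.apply_entry(cmd);
    }
};

int main() {
    toy_server server;      // one per cluster node
    server.start();
    server.add_entry("set x=1");
    std::cout << server.sm.applied.size() << " entries applied\n";
}
```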
### Subsequent usages
Similar to the first usage, but internally `start()` calls
`persistence::load_term_and_vote()`, `persistence::load_log()` and
`persistence::load_snapshot()` to load the protocol and state
machine state persisted by the previous incarnation of this
server instance.
## Architecture bits
### Joint consensus based configuration changes
The Seastar Raft implementation supports arbitrary configuration
changes: it is possible to add and remove one or multiple
nodes in a single transition, or even move the Raft group to an
entirely different set of servers. The implementation adopts
the two-step algorithm described in the original Raft paper:
- first, a log entry with the joint configuration is
committed. The "joint" configuration contains both old and
new sets of servers. Once a server learns about a new
configuration, it immediately adopts it, so as soon as
the joint configuration is committed, the leader will require two
majorities - the old one and the new one - to commit new entries.
- once a majority of servers persists the joint
entry, a final entry with the new configuration is appended
to the log.
If a leader is deposed during a configuration change,
a new leader carries out the transition from joint
to the final configuration.
No two configuration changes can happen concurrently: the leader
refuses a new change if the previous one is still in progress.
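To make the two-majorities rule concrete, here is a small self-contained
sketch (not library code) of the commit check during a joint configuration:
an entry counts as committed only if it has been acknowledged by a majority
of the old server set and a majority of the new one.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

using server_id = uint64_t;

// True if `acks` covers a majority of `cfg`.
static bool majority(const std::set<server_id>& cfg, const std::set<server_id>& acks) {
    size_t count = 0;
    for (auto id : cfg) {
        count += acks.count(id);
    }
    return count * 2 > cfg.size();
}

// During joint consensus an entry is committed only when both the old and
// the new configuration have a majority of acknowledgements.
static bool joint_committed(const std::set<server_id>& old_cfg,
                            const std::set<server_id>& new_cfg,
                            const std::set<server_id>& acks) {
    return majority(old_cfg, acks) && majority(new_cfg, acks);
}

int main() {
    std::set<server_id> old_cfg = {1, 2, 3};
    std::set<server_id> new_cfg = {4, 5, 6};
    // Acks from {1, 2, 4}: a majority of the old set but not of the new
    // one, so the entry is not yet committed under the joint configuration.
    return joint_committed(old_cfg, new_cfg, {1, 2, 4}) ? 1 : 0;
}
```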
### Multi-Raft
One of the design goals of Seastar Raft was to support multiple Raft
protocol instances. The library takes the following steps to address
this:
- `class server_address`, used to identify a Raft server instance (one
participant of a Raft cluster), uses globally unique identifiers, and
provides an extra `server_info` field which can store a network
address or connection credentials.
This makes it possible to share the same transport (RPC) layer
among multiple instances of Raft. It is then the responsibility
of this shared RPC layer to correctly route messages received from
a shared network channel to the correct Raft server, using the server
UUID.
- Raft group failure detection, instead of sending a Raft RPC every 0.1 second
to each follower, relies on external input. It is assumed
that a single physical server may host multiple Raft
groups, hence failure detection RPC can run once at the network peer level,
not separately for each Raft instance. The library expects an accurate
`failure_detector` instance from a complying implementation.
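A shared transport along these lines would typically keep a routing table
from server UUID to the local Raft instance. A hypothetical sketch (the
identifier and message types below are placeholders, not the library's):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Placeholders: the real library identifies servers by UUID and has its
// own message types; a shared transport would use those instead.
using raft_server_uuid = uint64_t;
struct raft_message { std::string payload; };

struct raft_server_stub {
    void receive(const raft_message&) { /* hand off to the protocol instance */ }
};

// One transport shared by many Raft groups on the same physical server:
// incoming messages carry the destination server UUID and are dispatched
// to the matching local instance.
struct shared_rpc {
    std::unordered_map<raft_server_uuid, raft_server_stub*> local_servers;

    void deliver(raft_server_uuid destination, const raft_message& msg) {
        auto it = local_servers.find(destination);
        if (it != local_servers.end()) {
            it->second->receive(msg);
        }
        // Messages for unknown UUIDs are dropped: delivering to the wrong
        // Raft server would violate the RPC contract.
    }
};
```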
### Pre-voting and protection against disruptive leaders
tl;dr: do not turn pre-voting OFF
The library implements the pre-voting algorithm described in the Raft PhD.
This algorithm adds an extra voting step, requiring each candidate to
collect votes from followers before updating its term. This prevents
"term races" and unnecessary leader step-downs when e.g. a follower that
has been isolated from the cluster increases its term, becomes a candidate
and then disrupts an existing leader.
The pre-voting extension is ON by default. Do not turn it OFF unless
testing or debugging the library itself.
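As a rough illustration of the follower-side check (a simplification for
this README, not the library's `fsm` logic): a peer grants a pre-vote only
if the candidate's term is not behind its own and the candidate's log is
at least as up to date as the peer's, and no persistent term state is
updated during this step.

```cpp
#include <cstdint>

// Simplified per-peer state used only for this sketch.
struct peer_state {
    uint64_t current_term;
    uint64_t last_log_term;
    uint64_t last_log_index;
};

// A peer grants a pre-vote if the candidate's term is not behind its own
// and the candidate's log is at least as up to date as the peer's log.
static bool grant_prevote(const peer_state& peer,
                          uint64_t candidate_term,
                          uint64_t candidate_last_log_term,
                          uint64_t candidate_last_log_index) {
    if (candidate_term < peer.current_term) {
        return false;
    }
    if (candidate_last_log_term != peer.last_log_term) {
        return candidate_last_log_term > peer.last_log_term;
    }
    return candidate_last_log_index >= peer.last_log_index;
}
```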
Another extension suggested in the PhD is protection against disruptive
leaders. It requires followers to withhold their vote within an election
timeout of hearing from a valid leader. With pre-voting ON and a shared
failure detector in use we found this extension unnecessary, and even found
it to reduce liveness, so it was removed from the implementation.
As a downside, with pre-voting *OFF*, servers outside the current
configuration can disrupt cluster liveness if they stay around after having
been removed.
### RPC module address mappings
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.
New nodes are added to the RPC configuration after the
configuration change is committed but before the instance
sends messages to them.
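A sketch of this bookkeeping on the RPC side (the map type and the
`apply_configuration` name are hypothetical, not the library's interface):

```cpp
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>

using server_id = uint64_t;
using network_address = std::string;

// Hypothetical address book kept by the RPC layer: entries are added for
// servers that join the configuration and removed for servers that are
// no longer part of the most recent configuration.
struct rpc_address_map {
    std::unordered_map<server_id, network_address> addresses;

    void apply_configuration(const std::unordered_map<server_id, network_address>& new_cfg) {
        // Add mappings for new nodes so messages can reach them.
        for (const auto& [id, addr] : new_cfg) {
            addresses[id] = addr;
        }
        // Dispose of mappings for nodes that left the configuration.
        for (auto it = addresses.begin(); it != addresses.end();) {
            it = new_cfg.count(it->first) ? std::next(it) : addresses.erase(it);
        }
    }
};
```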