3. Week 03: Raft, FLP, CAP, and Byzantine Fault Tolerance

3.1. Alternative Consensus Protocols: Raft

Another frequently levelled critique of Paxos is that its architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order.

As an alternative design built around such a log, the Raft consensus protocol has been proposed.
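To make the log-centric view concrete, here is a minimal sketch in Python (the names are ours and hypothetical, not taken from any Raft implementation) of the structure the protocol is organised around: an append-only log of entries, each tagged with the term in which the leader created it.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class LogEntry:
    term: int       # election term in which the leader created this entry
    command: Any    # opaque state-machine command supplied by a client

@dataclass
class ReplicatedLog:
    entries: List[LogEntry] = field(default_factory=list)

    def append(self, term: int, command: Any) -> int:
        """Append at the tail only; Raft never inserts entries in the middle of the log."""
        self.entries.append(LogEntry(term, command))
        return len(self.entries)   # 1-based index of the freshly appended entry

    def last_term_and_index(self) -> Tuple[int, int]:
        """Summary used when logs are compared, e.g. during leader election."""
        if not self.entries:
            return (0, 0)
        return (self.entries[-1].term, len(self.entries))
```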

Mandatory Reading

Check out the original paper on Raft, entitled In Search of an Understandable Consensus Algorithm. I suggest that, before reading the paper, you first watch this video on Raft and try the Raft Visualisation. Also, the slides on Raft are very helpful.

Highlights

  • Raft ensures progress (liveness) through bounded delays and the “heartbeat” mechanism; its safety does not depend on timing.
  • A split vote when choosing the leader is resolved by simply timing out in the current term and proceeding to the next one.
  • The protocol deals with inconsistencies (missing and extraneous entries) by making the leader override the logs of the followers, assuming its own log is always correct.
  • Leader candidates with a stale log simply do not get elected (see the sketch after this list).
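To illustrate the last two points, here is a hedged sketch (hypothetical helper names; the precise rules are in Section 5 of the Raft paper) of how a follower decides whether a candidate's log is up to date enough to deserve its vote, and how a split vote is resolved simply by waiting out a fresh randomised timeout and moving to the next term.

```python
import random

def log_is_up_to_date(candidate_last_term: int, candidate_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    """Voting rule sketch: a follower grants its vote only if the candidate's log is at
    least as up to date as its own, compared first by the term of the last entry and
    then by log length. Candidates with a stale log therefore cannot win an election."""
    if candidate_last_term != my_last_term:
        return candidate_last_term > my_last_term
    return candidate_last_index >= my_last_index

def next_election_timeout(base_ms: float = 150.0, spread_ms: float = 150.0) -> float:
    """Randomised election timeout: after a split vote nobody wins the term, so every
    candidate waits out a fresh random timeout and retries in the next term, which
    makes a repeated split unlikely."""
    return base_ms + random.uniform(0.0, spread_ms)
```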

3.2. FLP Theorem

The FLP theorem states that in an asynchronous network where messages may be delayed but not lost, there is no deterministic consensus algorithm that is guaranteed to terminate in every execution, for all starting conditions, if even a single node may fail by crashing.

Mandatory Reading

The paper by Fischer, Lynch, and Paterson (hence the name) provides the formal statement of this result and its (very elegant) proof.

Highlights:

  • The formulation in the paper is somewhat misleading: “No completely asynchronous consensus protocol can tolerate even a single unannounced process death.” We’ll discuss below what the FLP result is in fact about (and it is not quite that).

  • The proof builds on three lemmas:

    • Lemma 1: Commutativity (“diamond” property) of schedules executed by disjoint sets of processes
    • Lemma 2: Existence of a bivalent initial configuration
    • Lemma 3: “Preservation”: from a bivalent state, one can always reach another bivalent state (by delaying the message that would bring the system to a univalent state).

    Altogether, this means that the system can remain in a bivalent state forever (the valency notions used here are restated more formally after this list).

  • Key idea of Lemma 3: a bivalent system can be forced to do some work and yet remain in a bivalent state. We can “pump” this to generate indefinite runs that never decide. Interesting insight: no failures actually occur (just delays). FLP attacks a fault-tolerant protocol using fault-free runs!

  • Intuition behind this result:

    • Think of a real system trying to agree on something in which process p plays a key role.
    • But the system is fault-tolerant: if p crashes, it adapts and moves on. This is a crucial part of the proof of Lemma 2: even without one process, we should be able to run the protocol.
    • In Lemma 3, the proof “tricks” the system into treating p as if it had failed, but then lets p resume execution and “rejoin”.
    • This takes time and no real progress occurs.
  • But what does “impossibility” mean?

    • In formal proofs, an algorithm is totally correct if (a) it computes the right thing and (b) it always terminates.
    • When we say something is possible, we mean “there is a totally correct algorithm” solving the problem.
    • FLP proves that any fault-tolerant algorithm solving consensus has runs that never terminate. These runs are extremely unlikely. Yet, they imply that we can’t find a totally correct solution.
    • “Consensus is impossible” thus means “consensus is not always possible”.
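For reference, here is a compact restatement of the valency vocabulary used above (our own informal notation, not the paper's verbatim definitions).

```latex
% Valency notions from FLP, restated informally (notation is ours).
\begin{itemize}
  \item A \emph{configuration} $C$ consists of the local states of all processes
        together with the set of messages in transit.
  \item $C$ is \emph{$v$-valent} ($v \in \{0,1\}$) if $v$ is the only decision value
        reachable from $C$; it is \emph{univalent} if it is $0$-valent or $1$-valent,
        and \emph{bivalent} if both decision values remain reachable.
  \item Lemma~2: some \emph{initial} configuration is bivalent.
  \item Lemma~3: if $C$ is bivalent and the event $e$ is applicable to $C$, then there is
        a finite schedule from $C$ ending with $e$ whose final configuration is again bivalent.
\end{itemize}
```

Iterating Lemma 3 forever produces an admissible run in which no process crashes and every message is eventually delivered, yet no decision is ever reached; this is exactly the “pumping” described above.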

3.3. The CAP Theorem and its variants

The CAP theorem states that in an asynchronous network where messages may be lost, it is impossible to implement an atomic (linearizable) read/write register that responds eventually to every request under every pattern of message loss.

Mandatory Reading

The result is described formally in this manuscript by Gilbert and Lynch.

Fun fact: Seth Gilbert is currently an associate professor at NUS School of Computing.

Highlights:

  • What is what in CAP?
    • Consistency: Atomic (linearizable) consistency (a data item behaves as if there is a single copy)
    • Availability: Node failures do not prevent survivors from continuing to operate
    • Partition-tolerance: The system continues to operate despite network partitions
  • The CAP Theorem says that “A distributed system can satisfy any two of these guarantees at the same time but not all three” (illustrated by the sketch below).
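The following toy sketch (hypothetical names, ours rather than Gilbert and Lynch's) shows the choice a single-register replica faces once a partition occurs: a CP system refuses requests it cannot serve consistently, while an AP system keeps answering from a possibly stale local copy.

```python
from enum import Enum

class Mode(Enum):
    CP = "consistency over availability"
    AP = "availability over consistency"

class ToyRegisterReplica:
    """A single-value replica that must pick a side of CAP once a partition occurs."""

    def __init__(self, mode: Mode):
        self.mode = mode
        self.value = None
        self.partitioned = False   # True when this replica cannot reach a quorum of peers

    def read(self):
        if self.partitioned and self.mode is Mode.CP:
            # CP choice: refuse to answer rather than risk returning a stale value.
            raise TimeoutError("cannot reach a quorum; try again later")
        # AP choice (or a healthy network): answer from the local copy, possibly stale.
        return self.value

    def write(self, value):
        if self.partitioned and self.mode is Mode.CP:
            # CP choice: reject the write, preserving the single-copy illusion.
            raise TimeoutError("cannot reach a quorum; write rejected")
        # AP choice: accept locally; divergent copies are reconciled after the partition heals.
        self.value = value
```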

Mandatory Reading

This paper generalises the idea behind the CAP result to more realistic settings where applications, in the normal (non-partitioned) case, must choose between consistency and latency.

Highlights:

  • Partitions are rare, and there is no need to sacrifice consistency all the time. One needs to decide what to keep (A or C) only for the case when a partition occurs.
  • A more interesting choice is what to prioritise under normal operation: consistency or latency? PACELC provides a framework for stating this choice.
  • Different consistency/latency trade-offs arise from the choice of replication strategy (see the sketch below).
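As a toy illustration of the last point (the function and parameter names are ours, not from the paper), the number of replica acknowledgements a coordinator waits for before answering the client is one concrete knob that moves a system along the latency/consistency spectrum.

```python
import time
from typing import Callable, List

def replicate_write(value: str,
                    replicas: List[str],
                    send_and_ack: Callable[[str, str], None],
                    acks_required: int) -> float:
    """The coordinator answers the client once `acks_required` replicas have acknowledged
    the write; the remaining replicas catch up in the background.
        acks_required == len(replicas):          strongest consistency, highest latency
        acks_required == len(replicas) // 2 + 1: typical quorum configuration
        acks_required <= 1:                      lowest latency, replicas may serve stale reads
    """
    start = time.monotonic()
    acked = 0
    for replica in replicas:
        send_and_ack(replica, value)   # a real system would contact replicas in parallel
        acked += 1
        if acked >= acks_required:
            break                      # reply to the client now; stragglers lag behind
    return time.monotonic() - start    # client-perceived write latency
```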

3.4. Byzantine Generals Problem

The seminal paper by Lamport, Shostak, and Pease addresses the problem of reaching consensus in an environment where some participants can send arbitrary (faulty) messages, the so-called Byzantine Generals Problem (BGP). The main results are summarised in the following slides used in the lecture:

Mandatory Reading

Please read The Byzantine Generals Problem paper for the definitions and the details of the protocols. You may skip the proofs of the theorems.

Optional Reading

The paper Practical Byzantine Fault Tolerance explains a de facto state-of-the-art solution to BGP that uses digital signatures and achieves consensus in the presence of partial synchrony.

Remark: PBFT makes it possible to achieve Byzantine consensus with `3f + 1` participants, of which at most `f` may be faulty, under the assumption of partial synchrony, i.e., the setting in which fixed bounds on processor speeds and message delays exist but are not known a priori.
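As a quick sanity check of the `3f + 1` bound, here is the standard quorum-intersection argument, stated in our own notation rather than quoted from the paper.

```latex
% Any two quorums of size 2f + 1 out of n = 3f + 1 replicas intersect in at least one
% correct replica, which is what rules out two conflicting decisions.
\[
  n = 3f + 1, \qquad |Q_1| = |Q_2| = 2f + 1,
\]
\[
  |Q_1 \cap Q_2| \;\ge\; |Q_1| + |Q_2| - n \;=\; (2f + 1) + (2f + 1) - (3f + 1) \;=\; f + 1,
\]
```

So the two quorums share at least f + 1 replicas, and since at most f replicas are Byzantine, at least one replica in the intersection is correct.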