Friday 7 November 2014

Unreliable failure detectors for reliable distributed systems

Paper here

This is a seminal paper about failure detectors. A failure detector is a subsystem responsible for detecting node failures or crashes in a distributed system. Failure detectors are only interesting if you can actually build them. In a fully asynchronous system you cannot, but with timeouts it is not hard: have each process ping each other process from time to time, and suspect the other process if it does not respond within twice the maximum round-trip time observed for any previous ping.
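The timeout-based construction just described can be sketched in code. This is my own illustrative version, not the paper's; all names are invented:

```python
# Sketch of the timeout-based detector described above: ping each peer,
# track the worst round-trip time seen so far, and suspect any peer that
# fails to answer within twice that maximum.

class PingDetector:
    def __init__(self):
        self.max_rtt = 0.001      # worst round-trip time observed so far
        self.pending = {}         # peer -> time we pinged it
        self.suspects = set()

    def ping(self, peer, now):
        self.pending[peer] = now

    def pong(self, peer, now):                    # reply arrived in time
        rtt = now - self.pending.pop(peer)
        self.max_rtt = max(self.max_rtt, rtt)
        self.suspects.discard(peer)

    def check(self, now):
        for peer, sent in self.pending.items():
            if now - sent > 2 * self.max_rtt:     # overdue: suspect it
                self.suspects.add(peer)

d = PingDetector()
d.ping("q", 0.0); d.pong("q", 0.1)   # observed round-trip time: 0.1
d.ping("q", 1.0); d.check(1.5)       # 0.5 elapsed > 2 * 0.1, so suspect q
print(d.suspects)                    # {'q'}
```

Note that a merely slow q would also be suspected here, which is exactly why such detectors are unreliable.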

Classification of failure detectors

The paper defines eight classes of failure detectors, based on when they suspect faulty processes and when they suspect non-faulty processes. Suspicion of faulty processes comes under the heading of completeness; suspicion of non-faulty processes, under accuracy.

Degrees of completeness

  1. Strong completeness: Every faulty process is eventually permanently suspected by every non-faulty process (everybody suspects a dead process).
  2. Weak completeness: Every faulty process is eventually permanently suspected by some non-faulty process (somebody suspects a dead process).
"Eventually permanently" means that there is some time $t_0$ such that for all subsequent times $t ≥ t_0$, the process is suspected.
Note that completeness says nothing about suspecting non-faulty processes: a paranoid failure detector that permanently suspects everybody has strong completeness.

Degrees of accuracy

These describe what happens with non-faulty processes, and with faulty processes that haven't crashed yet.
  1. Strong accuracy: No process is suspected (by anybody) before it crashes.
  2. Weak accuracy: Some non-faulty process is never suspected (other non-faulty processes may be mistakenly suspected).
  3. Eventual strong accuracy: After some initial period of confusion or chaos, no process is suspected before it crashes. This can be simplified to say that no non-faulty process is suspected after some time.
  4. Eventual weak accuracy: After some initial period of confusion or chaos, some non-faulty process is never suspected.
Strong and weak mean different things for accuracy vs completeness:
  1. for accuracy, we are quantifying over suspects,
  2. for completeness, we are quantifying over suspectors (the processes that do the suspecting).
Even a weakly-accurate failure detector guarantees that all processes trust the one visibly good process.
Below is an algorithm that boosts weak completeness to strong completeness: if some process p crashes, somebody's weak detector eventually suspects it, and that process notifies everybody. The suspicion sticks because the dead p can never send a message that contradicts it.

initially suspects = ø

do forever:
    for each process p:
        if my weak-detector suspects p, then send p to all processes

upon receiving p from some process q:
    suspects := suspects + p - q

This transformation also preserves accuracy: if there is some non-faulty process c that nobody's weak detector ever suspects (as in weak accuracy), then nobody ever reports c, and c is never added to anyone's suspects set. The "- q" step handles the eventual variants: once the weak detectors stop suspecting a non-faulty process, any lingering suspicion of it is cleared the next time it sends a message. So eventual weak accuracy carries over (though we may have to wait for everybody to hear from the trusted process before they all stop suspecting it), and likewise eventual strong accuracy, where everybody eventually stops suspecting every non-faulty process.
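As a concrete illustration, here is a small simulation of the boosting algorithm (my own sketch; one broadcast round, applying the "suspects + p - q" update on receipt):

```python
# Simulation sketch (illustrative) of boosting weak completeness to
# strong completeness: any process whose weak detector suspects p
# broadcasts p; each receiver adds p to its suspects and removes the
# sender (who is evidently alive) from its suspects.

def boost(n, weak_suspects_of):
    """weak_suspects_of[i] = set of processes that i's weak detector suspects."""
    suspects = [set() for _ in range(n)]
    for sender in range(n):
        for p in weak_suspects_of[sender]:
            for receiver in range(n):
                suspects[receiver].add(p)       # suspects := suspects + p
                suspects[receiver].discard(sender)  # ... - q
    return suspects

# Only process 1's weak detector suspects the crashed process 2;
# after one broadcast round every process suspects 2.
print(boost(3, [set(), {2}, set()]))  # [{2}, {2}, {2}]
```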

Failure detector classes


There are two degrees of completeness and four degrees of accuracy, which gives eight classes of failure detectors:

                      strong acc.   weak acc.   ⋄strong acc.   ⋄weak acc.
strong completeness   P             S           ⋄P             ⋄S
weak completeness     Q             W           ⋄Q             ⋄W

The distinction between strong and weak completeness turns out to be spurious: a weakly-complete failure detector can simulate a strongly-complete one (though this requires a proof). We can use this as an excuse to consider only the strongly-complete classes:
  1. P (perfect): Strongly complete and strongly accurate: non-faulty processes are never suspected; faulty processes are eventually suspected by everybody. Easily achieved in synchronous systems.
  2. S (strong): Strongly complete and weakly accurate. The name is misleading if we've already forgotten about weak completeness, but the corresponding W (weak) class is only weakly complete and weakly accurate, so it's the strong completeness that the S is referring to.
  3. ⋄P (eventually perfect): Strongly complete and eventually strongly accurate.
  4. ⋄S (eventually strong): Strongly complete and eventually weakly accurate.
Jumping to the punch line:
  1. P can simulate any of the others,
  2. S and ⋄P can both simulate ⋄S but can't simulate P or each other,
  3. ⋄S can't simulate any of the others.
Thus, ⋄S is the weakest class of failure detectors in this list, but it is still strong enough to solve consensus. In fact, any failure detector (whatever its properties) that can solve consensus is strong enough to simulate ⋄S, which makes ⋄S the "weakest failure detector for solving consensus". We can treat ⋄W and ⋄S as equivalent, because strong completeness is easy to obtain from weak completeness by running the boosting algorithm given above.

Consensus

Consensus is the problem of having all processes in a group agree on a single value. Failure detectors help solve consensus in asynchronous systems with crash failures: consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes. The paper determines which classes can be used to solve consensus despite any number of crashes, and which require a majority of correct processes. Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures.

Consensus with S

Here, there is some non-faulty process c that nobody ever suspects, and this is enough to solve consensus. The protocol runs in three phases:
  1. Processes gossip about the input values for n-1 rounds.
  2. They prune out any that are not universally known.
  3. Everybody agrees on the lowest-id input that has not been pruned.
This protocol satisfies validity because the decision value is always somebody's input: gossip only spreads input values, pruning only removes values, and in phase 3 every process picks one of the surviving inputs.

To get termination, we must ensure that nobody waits forever for a message. A non-faulty process that gets stuck waiting is eventually informed by the S-detector that the process it is waiting for is dead, and can move on.

For agreement, we must guarantee that all processes end up with the same set of values; this is achieved by the intersection in phase 2.
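A failure-free run of the three phases can be sketched as follows (my own illustration; with actual failures, processes would also rely on the S-detector to avoid waiting forever, as described above):

```python
# Sketch of the S-based consensus protocol in a failure-free run:
# n-1 gossip rounds, a pruning (intersection) phase, then everyone
# decides the surviving value with the lowest process id.

def s_consensus(inputs):
    n = len(inputs)
    # Each process starts knowing only its own (id, input) pair.
    known = [{(i, v)} for i, v in enumerate(inputs)]
    # Phase 1: gossip for n-1 rounds (failure-free: everyone hears everyone).
    for _ in range(n - 1):
        everything = set().union(*known)
        known = [k | everything for k in known]
    # Phase 2: prune values that are not universally known.
    universal = set.intersection(*known)
    # Phase 3: decide the input of the lowest-id process that survived.
    return min(universal)[1]

print(s_consensus([7, 3, 9]))  # all processes decide 7, process 0's input
```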

Consensus with ⋄S and f<n/2

The consensus protocol for S depends on some process c never being suspected; if c is suspected throughout the execution, as can happen with ⋄S, then it is possible that no process waits to hear from c (or from anybody else), and the processes all decide their own inputs.
To solve consensus with ⋄S we need to assume fewer than $n/2$ failures ($f < n/2$), so that any process can wait to hear from a majority no matter what lies its failure detector tells it.
The protocol uses reliable broadcast to guarantee that any message sent is received by either all non-faulty processes or by none.
  • Each process keeps track of a preference (initially its own input value) and a timestamp, the round number in which it last updated its preference.
  • The processes go through a sequence of asynchronous rounds, each divided into four phases:
    1. All processes send (round, preference, timestamp) to the coordinator for the round.
    2. The coordinator waits to hear from a majority of the processes (possibly including itself), sets its own estimate to an estimate with the largest timestamp among those received, and sends (round, estimate) to all processes.
    3. Each process waits for the new proposal from the coordinator or for the failure detector to suspect the coordinator. If it receives a new estimate, it adopts it as its own, sets timestamp ← round, and sends (round, ack) to the coordinator. Otherwise, it sends (round, nack) to the coordinator.
    4. The coordinator waits to receive ack or nack from a majority of processes. If it receives ack from a majority, it announces the current estimate as the protocol decision value using ReliableBroadcast.
  • Any process that receives a value in a ReliableBroadcast decides on it immediately.
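The four phases of a round can be sketched as follows for a failure-free run (my own illustration; the real protocol adds failure-detector queries, nacks, and ReliableBroadcast, which a crash-free run never exercises):

```python
# Failure-free sketch of one round of the rotating-coordinator protocol.

def ct_round(round_no, prefs, stamps):
    n = len(prefs)
    coord = round_no % n                       # rotating coordinator
    # Phase 1: everybody sends (timestamp, preference) to the coordinator.
    msgs = [(stamps[i], prefs[i]) for i in range(n)]
    # Phase 2: the coordinator waits for a majority and adopts an
    # estimate carried by the largest timestamp it received.
    majority = sorted(msgs, reverse=True)[: n // 2 + 1]
    estimate = max(majority)[1]
    # Phase 3: every process adopts the estimate, stamps it, and acks.
    prefs = [estimate] * n
    stamps = [round_no] * n
    # Phase 4: the coordinator sees a majority of acks and decides.
    return estimate, prefs, stamps

decision, _, _ = ct_round(1, [5, 2, 8], [0, 0, 0])
print(decision)
```

With equal timestamps the tie-break is arbitrary ("some estimate with the largest timestamp"); a preference with a strictly larger timestamp always wins, which is what drives the agreement argument below.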
The properties of ⋄S consensus are:
  1. Termination: Every correct process eventually decides on some value. At any moment, each process is either not waiting or waiting for messages from a majority that includes non-faulty processes. A process that has already decided stops participating in the protocol; but because every non-faulty process retransmits the decision value with ReliableBroadcast, anyone waiting on such a process eventually receives the ReliableBroadcast instead and terminates as well. In phase 3, a process might get stuck waiting for a dead coordinator, but the strong completeness of ⋄S means that it eventually suspects the dead coordinator and escapes; and by eventual weak accuracy there is eventually a round whose correct coordinator is never suspected, in which every process acks and the coordinator decides.
  2. Validity: If a process decides v, then v was proposed by some process. The decision value is an estimate, and all estimates start from inputs.
  3. Agreement: No two correct processes decide differently. Two coordinators might both initiate a ReliableBroadcast, with some processes choosing the value from the first and others the value from the second. But the first coordinator collected acks from a majority of processes in some round r, and every subsequent coordinator collects estimates from an overlapping majority in some round $r' > r$. By the same induction argument as for Paxos, every subsequent coordinator adopts the same estimate as the first, and so we get agreement: the quorums overlap, so only one value can ever be decided.

f<n/2 is still required even with ⋄P


With a majority of failures, $f ≥ n/2$, consensus cannot be solved even with ⋄P (and thus not with ⋄S, which ⋄P trivially simulates).
The reason is that ⋄P can lie to us for an arbitrarily long initial interval, and consensus is still required to terminate despite these lies. For example, start half of the processes with input 0 and half with input 1, and run the two halves independently, with ⋄P suspecting the other half, until the processes in each half decide on their common input; the two halves then decide different values. With $f < n/2$ this cannot happen, because every decision requires responses from a majority, and any two majorities intersect.
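The majority-intersection fact doing the work here is easy to check (illustrative snippet):

```python
# Why f < n/2 saves us: any two majorities of n processes intersect,
# so two partitions cannot both assemble a deciding quorum.

def majorities_intersect(n):
    q = n // 2 + 1                       # minimum majority size
    first = set(range(q))                # one majority
    second = set(range(n - q, n))        # a majority as disjoint as possible
    return bool(first & second)

print(all(majorities_intersect(n) for n in range(1, 100)))  # True
```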

Relationship among classes

Failure detectors can be compared. A failure detector D2 is said to be weaker than a failure detector D1 if there is an asynchronous algorithm, called a reduction algorithm, that can emulate D2 using D1. The existence of a reduction depends on the environment in which it is considered.
P simulates S, and ⋄P simulates ⋄S, without modification. Also, P simulates ⋄P, and S simulates ⋄S. There is no simulation from ⋄P to S, from S to ⋄P, or from ⋄S to any of the other classes.


Partial synchrony

Fischer, Lynch and Paterson showed that Consensus cannot be solved in an asynchronous system subject to crash failures. The fundamental reason why Consensus cannot be solved in completely asynchronous systems is the fact that, in such systems, it is impossible to reliably distinguish a process that has crashed from one that is merely very slow. In other words, Consensus is unsolvable because accurate failure detection is impossible. On the other hand, it is well-known that Consensus is solvable (deterministically) in completely synchronous systems — that is, systems where clocks are perfectly synchronised, all processes take steps at the same rate and each message arrives at its destination a fixed and known amount of time after it is sent. In such a system, we can use timeouts to implement a "perfect" failure detector — i.e., one in which no process is ever wrongly suspected, and every faulty process is eventually suspected. Thus, the capacity to solve consensus is related to the failure detection capabilities.


Between asynchronous and synchronous systems lie the partially synchronous systems, in which bounds on message delay and process speed exist but hold only after some unknown time during the execution. In such systems a ⋄P detector can be built as follows: every process p periodically sends a "p-is-alive" message to all other processes. If a process q does not receive this message within its current timeout for p, q adds p to its list of suspects. If q later receives a message from p after the timeout, q removes p from the list and increases its timeout for p. If p has crashed, it stops sending messages, so after some time t' all correct processes suspect p permanently, satisfying strong completeness. If p is correct, q's timeout for p eventually grows large enough to cover the actual (unknown) delay bound, after which q never wrongly suspects p again.
This construction therefore satisfies strong completeness and eventual strong accuracy.
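A minimal sketch of this adaptive-timeout detector (my own, illustrative; names are invented):

```python
# Adaptive-timeout ⋄P sketch: suspect on timeout; on hearing from a
# suspected process, retract the suspicion and enlarge that process's
# timeout so the same mistake eventually stops happening.

class EventuallyPerfectDetector:
    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heard = {p: 0.0 for p in processes}
        self.suspects = set()

    def on_alive(self, p, now):           # received "p-is-alive"
        self.last_heard[p] = now
        if p in self.suspects:            # we were wrong: back off
            self.suspects.discard(p)
            self.timeout[p] *= 2

    def on_tick(self, now):
        for p, t in self.timeout.items():
            if now - self.last_heard[p] > t:
                self.suspects.add(p)

d = EventuallyPerfectDetector(["q"])
d.on_tick(2.0)             # no heartbeat within 1.0s -> suspect q
print(d.suspects)          # {'q'}
d.on_alive("q", 2.5)       # q was only slow: retract, timeout doubles
print(d.suspects, d.timeout["q"])  # set() 2.0
```

Once the timeout exceeds the true (unknown) delay bound, a correct q is never suspected again, giving eventual strong accuracy; a crashed q stops sending and stays suspected, giving strong completeness.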

Using atomic broadcast to get Consensus


Consensus can easily be implemented using Atomic Broadcast. Atomic Broadcast requires that all correct processes deliver the same messages in the same order. The total order and agreement properties of Atomic Broadcast ensure that all correct processes deliver the same sequence of messages. Consensus and Atomic Broadcast are equivalent in asynchronous systems with crash failures.
  1. Atomic Broadcast cannot be solved by a deterministic algorithm in asynchronous systems, even if we assume that at most one process may fail, and it may only fail by crashing. This is because Consensus has no deterministic solution in such systems.
  2. Atomic Broadcast can be solved using randomisation or unreliable failure detectors in asynchronous systems. This is because Consensus is solvable using these techniques in such systems.
Consensus can be easily reduced to Atomic Broadcast as follows. To propose a value, a process atomically broadcasts it. To decide a value, a process picks the value of the first message that it atomically delivers. By total order of Atomic Broadcast, all correct processes deliver the same first message. Hence they choose the same value and agreement of Consensus is satisfied.
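The reduction in this direction can be sketched in a few lines (my own mock; the shared list stands in for the total delivery order that Atomic Broadcast provides to every correct process):

```python
# Sketch of reducing Consensus to Atomic Broadcast: to propose, A-broadcast
# your value; to decide, take the first value you A-deliver. Total order
# guarantees every correct process sees the same first message.

delivered_order = []                    # the agreed total delivery order

def propose(value):
    delivered_order.append(value)       # A-broadcast: enters the total order

def decide():
    return delivered_order[0]           # first A-delivered message wins

propose(42)                # process A proposes 42
propose(99)                # process B proposes 99
print(decide(), decide())  # every process decides 42
```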

Using Consensus to get atomic broadcast


The atomic broadcast algorithm uses repeated executions of Consensus. Intuitively, the kth execution of Consensus decides on the kth batch of messages to be atomically delivered. Processes disambiguate between these executions by tagging all messages pertaining to the kth execution of Consensus with the counter k.

In this algorithm, a message m is A-broadcast by R-broadcasting it. When a process p R-delivers m, it adds m to its set R_delivered_p. Once m is atomically delivered, p adds it to A_delivered_p.

After m is R-delivered, process p takes part in the next execution of Consensus (task 3) to decide which batch of messages should be delivered. When consensus is reached, the messages in the decided batch are A-delivered in some deterministic order agreed by all processes, e.g., lexicographical order.
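The batch-per-consensus-instance structure can be sketched like this (my own names; the stand-in consensus simply fixes the first proposal each instance sees, which is enough to show that everyone delivers the same batch in the same order):

```python
# Sketch of Atomic Broadcast on top of Consensus: the k-th consensus
# instance decides the k-th batch of messages to A-deliver; within a
# batch, messages go out in a deterministic (lexicographical) order.

def next_batch(consensus, r_delivered, a_delivered, k):
    pending = r_delivered - a_delivered
    # Propose our pending set to consensus instance k; all processes
    # adopt whatever set instance k decides (possibly someone else's).
    decided = consensus(k, pending)
    return sorted(decided)              # deterministic delivery order

# Stand-in consensus: instance k returns the first proposal it saw.
proposals = {}
def mock_consensus(k, proposal):
    return proposals.setdefault(k, frozenset(proposal))

print(next_batch(mock_consensus, {"b", "a"}, set(), 0))  # ['a', 'b']
print(next_batch(mock_consensus, {"c"}, set(), 0))       # ['a', 'b'] again
```

The second call shows agreement: a later proposer with different pending messages still delivers the batch that instance 0 already decided; its own messages wait for a later instance.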
