Thursday 6 November 2014

A Guided Tour on the Theory and Practice of State Machine Replication

Paper here and Paxos paper here

This paper presents the fundamentals and applications of the State Machine Replication (SMR) technique for implementing consistent fault-tolerant services.
Service replication is a technique in which copies of the service are deployed on a set of servers. It is often used to: 1. increase the system's performance and capacity; 2. provide fault tolerance.
Replication can increase the service's performance and capacity because the replicas provide more resources to the service. Fault tolerance is achieved by using the replicas as spatial redundancy, allowing the system to continue operating despite the failure of a fraction of them.
Regarding the maintenance of consistent state in replicated systems, there are two fundamental approaches: primary-backup (or passive) replication and state machine (or active) replication.
In the primary-backup replication model there is one primary replica that executes all operations issued by the clients and periodically pushes state updates to the backup replicas. Furthermore, the backups monitor the primary so that one of them can take over its role if it fails. In the state machine replication model, clients issue commands to all replicas, which execute them in a coordinated way and produce a reply. In this model, no monitoring or synchronization among replicas is necessary.

SMR

State machine replication is usually defined in terms of a set of clients submitting commands to a set of replicas that behaves like a "centralized" service, a replicated state machine (RSM). This implementation relies on three properties:
  1. Initial state: All correct replicas start on the same state;
  2. Determinism: All correct replicas receiving the same input on the same state produce the same output and resulting state;
  3. Coordination: All correct replicas process the same sequence of commands.
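The three properties above can be illustrated with a minimal sketch. This is not from the paper; the `Replica` class and its `apply` method are hypothetical, showing how identical initial state, determinism, and an identical command sequence yield identical replicas:

```python
# Minimal illustrative sketch of the three RSM properties (names are
# hypothetical, not from the paper).

class Replica:
    def __init__(self):
        self.state = {}          # Initial state: identical on all replicas

    def apply(self, command):
        """Determinism: same state + same command -> same output and state."""
        op, key, value = command
        if op == "put":
            self.state[key] = value
            return "ok"
        elif op == "get":
            return self.state.get(key)

# Coordination: all replicas process the same sequence of commands,
# so they end in the same state and produce the same replies.
commands = [("put", "x", 1), ("put", "y", 2), ("get", "x", None)]
r1, r2 = Replica(), Replica()
out1 = [r1.apply(c) for c in commands]
out2 = [r2.apply(c) for c in commands]
assert out1 == out2 and r1.state == r2.state
```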
From a practical point of view, Initial State and Determinism are not easy to ensure. For the former, there is extra complexity when crashed replicas recover and rejoin with outdated state. For the latter, forcing deterministic execution can constrain the service's performance: to satisfy Determinism, the service supported by the replicas is typically single-threaded.
An RSM only answers client requests; it does not initiate communication.
The RSM must satisfy the following safety and liveness properties:
  1. Safety: all correct replicas execute the same sequence of operations (something bad will never happen);
  2. Liveness: all correct clients' operations are executed (something good will eventually happen).
The Safety property allows the implementation of strongly consistent services satisfying the consistency criterion known as Linearizability: single operations on a single object, where writes appear to take effect instantaneously. For reads and writes, linearizability is synonymous with atomic consistency.

Fault Models

Two kinds of faults are considered for the processes in an SMR-based system:
  1. Fail-stop: A faulty (crashed) process halts and does not execute any further operations during the system execution;
  2. Byzantine: A faulty process may behave arbitrarily, i.e., deviating from its algorithm and taking any action.
It is also important to define which kinds of communication channel errors can affect the system. The main fault models for channels are:
  1. Reliable: every message sent will be delivered;
  2. Unreliable: messages can be modified, generated or dropped;
  3. Lossy: messages may be lost, but not modified;
  4. Fair: if a message is resent infinitely often, it will arrive infinitely often.
An important implicit assumption in replicated systems is that correlated failures do not happen, i.e., that the probability of one replica failing is statistically independent of the failure of any other replica in the system.

Synchrony

Asynchronous: The weakest synchrony model is the asynchronous distributed system. In this model, processing and communication have no time bounds and there are no physical clocks; the notion of time does not even exist.
Synchronous: In these systems, there are known bounds on processing times, communication delays, and clock drift across processes. This model represents real-time systems.
Partially Synchronous: This model considers that the system behaves asynchronously until a certain (unknown) instant GST (Global Stabilization Time), in which the system becomes synchronous, respecting some (unknown) processing and communication time bounds. This system does not need to be synchronous forever, just until a defined point.

Cryptography

Byzantine fault-tolerant SMR protocols usually assume authenticated communication channels. These can be implemented either with a public-key infrastructure supporting asymmetric cryptography for message signatures, or with shared secrets between each pair of processes (distributed, for example, by a key distribution center such as Kerberos) enabling the use of Message Authentication Codes (MACs).

Fundamental results of distributed computing

There are five fundamental results from distributed computing theory that constrain the design of SMR protocols.
  1. (R1) Impossibility of reliable communication on top of unreliable channels;
  2. (R2) Equivalence between total order multicast and consensus;
  3. (R3) Impossibility of fault-tolerant consensus in asynchronous systems;
  4. (R4) Minimum synchrony required for fault-tolerant consensus;
  5. (R5) Fault thresholds.
For R1, note that TCP/IP is built on top of a best-effort protocol (IP), which gives no reliability guarantees. However, with the fair link assumption defined above, we can build a reliable channel by repeatedly sending a message until it is acknowledged (ACKed).
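The retransmit-until-ACKed construction for R1 can be sketched as follows. The fair-loss link is simulated, and the names (`FairLink`, `reliable_send`) are illustrative, not from the paper:

```python
# Sketch of R1: a reliable channel on top of a fair lossy link,
# built by retransmitting until acknowledged. The link is simulated;
# names are hypothetical.
import random

class FairLink:
    """Drops messages with some probability, but a message resent
    infinitely often is eventually delivered (fair-loss property)."""
    def __init__(self, loss_rate=0.5, seed=42):
        self.rng = random.Random(seed)
        self.loss_rate = loss_rate
        self.delivered = []

    def send(self, msg):
        if self.rng.random() >= self.loss_rate:
            self.delivered.append(msg)
            return True   # stands in for the receiver's ACK
        return False      # message (or its ACK) was lost

def reliable_send(link, msg):
    # Retransmit until an ACK arrives; fairness guarantees this
    # terminates with probability 1.
    attempts = 1
    while not link.send(msg):
        attempts += 1
    return attempts

link = FairLink()
reliable_send(link, "m1")
assert "m1" in link.delivered
```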
For R2, a fundamental requirement for implementing SMR is the coordination of replicas, which requires that all correct replicas receive the same messages in the same order. Conceptually, satisfying this requirement requires a total order multicast (or atomic multicast) protocol. A fundamental result in distributed computing is the equivalence between implementing this kind of communication primitive and solving consensus: atomic broadcast can be used to solve consensus, and consensus can be used to build atomic broadcast. In the consensus problem, processes propose values and try to reach agreement on a single value to decide, which must be one of the proposed values. This equivalence is important because it shows that the implementation of an RSM is subject to the same constraints and results as the well-studied consensus problem, which leads us to the next fundamental result.
For R3, one of the most well-known results in distributed computing (the FLP impossibility result) is that deterministic consensus cannot be achieved in an asynchronous system in which even a single process can crash, because we cannot distinguish a process that has crashed from one that is merely slow. As a consequence, either the consensus may never terminate or, if termination is forced, different processes may decide different values.
For R4, by separating safety (which must always hold) from liveness (which is only guaranteed once the system becomes synchronous), it is possible to solve consensus in the partially synchronous system model.
For R5, different system models require different numbers of processes to implement consensus and total order broadcast tolerating up to f failures.
The table below considers the synchronous and partially synchronous system models for fail-stop, authenticated Byzantine, and Byzantine failures. These results give the exact number of replicas an f-resilient RSM must contain in order to implement replica coordination. The last column gives the minimal number of replicas required to execute a command after it has been delivered in total order: once the total order is established, f fewer replicas are required to execute the operations in both fault models.
Failure type      Synchronous   Partially Synch.   Replicated Execution
Fail-stop         f+1           2f+1               f+1
Auth. Byzantine   f+1           3f+1               2f+1
Byzantine         3f+1          3f+1               2f+1
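The fault-threshold table can be encoded as a small lookup, which makes the f-to-n relationships easy to check. This helper is illustrative, not from the paper:

```python
# Helper encoding the fault-threshold table above: minimum number of
# replicas needed for replica coordination (and execution), per fault
# and synchrony model. Names are illustrative.

MIN_REPLICAS = {
    # (failure type, synchrony model) -> minimum n as a function of f
    ("fail-stop", "synchronous"):      lambda f: f + 1,
    ("fail-stop", "partial"):          lambda f: 2 * f + 1,
    ("auth-byzantine", "synchronous"): lambda f: f + 1,
    ("auth-byzantine", "partial"):     lambda f: 3 * f + 1,
    ("byzantine", "synchronous"):      lambda f: 3 * f + 1,
    ("byzantine", "partial"):          lambda f: 3 * f + 1,
}

f = 1
assert MIN_REPLICAS[("fail-stop", "partial")](f) == 3    # Paxos/VR: 2f+1
assert MIN_REPLICAS[("byzantine", "partial")](f) == 4    # BFT-style: 3f+1
```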

Baseline protocols

This section describes Viewstamped Replication (VR) and Paxos for implementing RSMs. The main challenge in implementing RSMs is to ensure that the operations requested by clients are executed in the same order at all replicas, in spite of concurrency and failures.

Paxos and Viewstamped Replication (VR)

VR was devised at about the same time as Paxos, and the two are similar. VR works in an asynchronous network like the Internet but, deep down, it needs the system to become synchronous in some way so that all processes vote for the same value.
The relevance of Paxos/VR comes from the fact that this algorithm has served as the foundation for many recent replicated (consistent and fault-tolerant) data stores, coordination services, and configuration management systems.
Paxos/VR requires n = 2f+1 replicas, of which up to f can be subject to crash faults. The system must be able to execute a request without waiting for the f replicas that may have crashed.
To ensure safety, the protocol must guarantee that, even if these f replicas fail, there will be at least one replica that processed the request and that will participate in later protocol executions. This implies that any two quorums of replicas accessed in the protocol must intersect in at least one correct replica. Consequently, each step of the protocol must be processed by a quorum of at least f+1 replicas that, together with the other f that may or may not reply, compose the n replicas of the system.
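The quorum-intersection argument can be verified by brute force for small f. This check is illustrative (the function name is made up), but it confirms that quorums of f+1 out of n = 2f+1 always intersect, while quorums of only f may not:

```python
# Brute-force check of the quorum-intersection property behind Paxos/VR:
# any two quorums of size f+1 out of n = 2f+1 replicas share at least
# one replica, so information survives from one protocol step to the next.
from itertools import combinations

def quorums_intersect(n, quorum_size):
    replicas = range(n)
    return all(set(q1) & set(q2)
               for q1 in combinations(replicas, quorum_size)
               for q2 in combinations(replicas, quorum_size))

for f in range(1, 4):
    n = 2 * f + 1
    assert quorums_intersect(n, f + 1)    # majority quorums always intersect
    assert not quorums_intersect(n, f)    # quorums of size f may be disjoint
```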

Normal operation



  1. The client sends a Request with the operation to be executed to the leader replica;
  2. The leader chooses a sequence number i for the received request, writes it to its log, and disseminates the request and its sequence number to all other replicas in a PREPARE message;
  3. If the replicas did not assign any other request to sequence number i, they accept the leader proposal and write the update (request plus sequence number) to their log, replying with a PREPARE-OK message to the leader;
  4. The leader waits for f confirmations from other replicas and then executes the request and sends the Reply to the client. This ensures that a majority of the replicas have the request in their logs and, consequently, the request will be visible even if the leader fails;
  5. Normally, the leader informs the other replicas about this request execution in the next PREPARE message, making them also execute the client request for updating their states, without sending a reply to the client.
If a client does not receive a reply for a request within a given time period, it resends the request to all replicas, ensuring that it reaches a new leader.
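The normal-operation steps above can be sketched from the leader's side. This is a minimal sketch assuming crash faults only; message passing is abstracted away and the `Leader` class and message tuples are illustrative, not from the VR protocol specification:

```python
# Sketch of VR normal operation from the leader's perspective.
# Names and message formats are illustrative.

class Leader:
    def __init__(self, f):
        self.f = f
        self.next_seq = 0
        self.log = {}              # sequence number -> request
        self.prepare_oks = {}      # sequence number -> PREPARE-OK count

    def on_request(self, request):
        """Step 2: assign a sequence number, log, and broadcast PREPARE."""
        i = self.next_seq
        self.next_seq += 1
        self.log[i] = request
        self.prepare_oks[i] = 0
        return ("PREPARE", i, request)   # sent to all other replicas

    def on_prepare_ok(self, i):
        """Step 4: after f PREPARE-OKs from other replicas, a majority
        has logged the request, so it is safe to execute and reply."""
        self.prepare_oks[i] += 1
        if self.prepare_oks[i] == self.f:
            return ("EXECUTE", i, self.log[i])
        return None

leader = Leader(f=1)
msg = leader.on_request("put x 1")
assert msg == ("PREPARE", 0, "put x 1")
assert leader.on_prepare_ok(0) == ("EXECUTE", 0, "put x 1")
```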
VR shows two improvements over Paxos (Multi-Paxos) for implementing RSMs:
  1. In Paxos, the leader sends an explicit COMMIT message to make the other replicas learn that a request can be executed; VR piggybacks this information on the next PREPARE message.
  2. Paxos explicitly considers the durability of the service: all accepted PREPARE messages are logged to stable storage, whereas VR avoids this by recovering crashed replicas from the logs of the other replicas.



We can further optimize VR in two ways:
  1. Read-only operations can be executed directly by the leader, without contacting the other replicas.
  2. If the replicas need to learn that a client request was executed, instead of using the explicit COMMIT message of classical Paxos, they can send their PREPARE-OK messages to all other replicas, and each replica can execute the request once it receives f of these messages.

View change

If the leader fails, messages will not be ordered and the system will stop executing client requests. To recover from this situation, non-leader replicas monitor the leader and if they suspect that it is faulty, a new leader is elected to ensure the algorithm makes progress.




The rationale behind this protocol is that, in order for a request to be executed, it must be in the log of at least f+1 replicas. Consequently, the order assigned to a request is preserved by the view change protocol, since the new leader collects f+1 logs.

Recovery

When a replica restarts after a crash, it cannot participate in request processing or view changes until its state is at least as recent as when it failed. Recovery works as follows:
  1. The recovering replica sends a message to all other replicas asking for the current state;
  2. Each replica sends a reply with its log, among other information;
  3. The recovering replica waits for f+1 such messages, including one from the current leader, which it discovers from the received messages. It then updates its state from the leader's log. At this point the replica is recovered and can resume execution.
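The recovery steps above can be sketched as a single function over the collected replies. The reply format and the `recover` name are illustrative, not from the protocol specification:

```python
# Sketch of VR recovery: wait for f+1 replies, require one from the
# current leader, and adopt the leader's log. Names are illustrative.

def recover(replies, f):
    """replies: list of {'replica': id, 'is_leader': bool, 'log': [...]}."""
    assert len(replies) >= f + 1, "need f+1 replies to recover"
    leader_replies = [r for r in replies if r["is_leader"]]
    assert leader_replies, "must include a reply from the current leader"
    return leader_replies[0]["log"]   # the leader's log is the freshest

replies = [
    {"replica": 1, "is_leader": False, "log": ["op1"]},
    {"replica": 2, "is_leader": True,  "log": ["op1", "op2"]},
]
assert recover(replies, f=1) == ["op1", "op2"]
```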


Paxos

This summary comes from Lamport's article "Paxos Made Simple" (linked at the top of the post). Paxos is a family of protocols for solving consensus in a network of unreliable processors.
The Synod algorithm is the centerpiece of Paxos. It is executed separately for each log index: one Synod execution for the first item in the log, another for the second item, and so on. Multiple Synod executions can run concurrently without affecting each other's results.
In a nutshell, the Synod algorithm is an agreement (consensus) algorithm which guarantees that:
  1. all the nodes participating have the same output value,
  2. the output value is proposed by some node (not just an arbitrary value),
  3. if enough nodes can communicate with each other, eventually the algorithm will finish.
The algorithm is constructed such that any node can propose a value, and at the end it is assured that only one of those values is chosen. However, different nodes may keep proposing values at different times such that the algorithm runs forever. To ensure termination of the Synod algorithm, only the leader node proposes values.
In order to guarantee correctness, Paxos defines three properties and ensures they always hold, regardless of the pattern of failures:
  1. Non-triviality: Only proposed values can be learned.
  2. Safety: At most one value can be learned (i.e., two different learners cannot learn different values).
  3. Liveness: If value C has been proposed, then eventually learner L will learn some value (if sufficient processors remain non-faulty).
To sum up, Paxos always guarantees "safety", and ensures "liveness" if a majority of the nodes are communicating with each other.
Processes can play three roles: proposers, acceptors, and learners.



This protocol is the most basic of the Paxos family. Each instance of the Basic Paxos protocol decides on a single output value. The protocol proceeds over several rounds. A successful round has two phases. A Proposer should not initiate Paxos if it cannot communicate with at least a Quorum of Acceptors.




Phase 1

Phase 1a: Prepare

A Proposer (the leader) creates a proposal identified with a number N. This number must be greater than any previous proposal number used by this Proposer. Then, it sends a Prepare message containing this proposal to a Quorum of Acceptors. The Proposer decides who is in the Quorum.

Phase 1b: Promise

If the proposal's number N is higher than any previous proposal number received from any Proposer by the Acceptor, then the Acceptor must return a promise to ignore all future proposals having a number less than N. If the Acceptor accepted a proposal at some point in the past, it must include the previous proposal number and previous value in its response to the Proposer. Otherwise, the Acceptor can ignore the received proposal. It does not have to answer in this case for Paxos to work. However, for the sake of optimization, sending a denial (Nack) response would tell the Proposer that it can stop its attempt to create consensus with proposal N.

Phase 2

Phase 2a: Accept Request

If a Proposer receives enough promises from a Quorum of Acceptors, it needs to set a value to its proposal. If any Acceptors had previously accepted any proposal, then they'll have sent their values to the Proposer, who now must set the value of its proposal to the value associated with the highest proposal number reported by the Acceptors. If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose any value for its proposal. The Proposer sends an Accept Request message to a Quorum of Acceptors with the chosen value for its proposal.

Phase 2b: Accepted

If an Acceptor receives an Accept Request message for a proposal N, it must accept it if and only if it has not already promised to only consider proposals having an identifier greater than N. In this case, it should register the corresponding value v and send an Accepted message to the Proposer and every Learner. Else, it can ignore the Accept Request. Note that an Acceptor can accept multiple proposals. These proposals may even have different values in the presence of certain failures. However, the Paxos protocol will guarantee that the Acceptors will ultimately agree on a single value. Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number. Notice that when Acceptors accept a request, they also acknowledge the leadership of the Proposer. Hence, Paxos can be used to select a leader in a cluster of nodes.
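The two phases can be condensed into a short sketch of the Acceptor's state machine plus the Proposer's value-selection rule from Phase 2a. The class and function names are illustrative, not from Lamport's paper:

```python
# Condensed sketch of Basic Paxos: the Acceptor's promise/accept logic
# and the Proposer's value-selection rule. Names are illustrative.

class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number promised
        self.accepted = None     # (number, value) of last accepted proposal

    def on_prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n,
        reporting any previously accepted proposal."""
        if n > self.promised:
            self.promised = n
            return ("PROMISE", n, self.accepted)
        return None              # may ignore (or Nack) lower-numbered proposals

    def on_accept_request(self, n, value):
        """Phase 2b: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("ACCEPTED", n, value)
        return None

def choose_value(promises, my_value):
    """Phase 2a: adopt the value of the highest-numbered accepted
    proposal reported in the promises; otherwise propose freely."""
    accepted = [p[2] for p in promises if p[2] is not None]
    if accepted:
        return max(accepted)[1]  # value paired with the highest number
    return my_value

a = Acceptor()
p = a.on_prepare(1)
assert p == ("PROMISE", 1, None)
assert choose_value([p], "v") == "v"             # free to choose any value
assert a.on_accept_request(1, "v") == ("ACCEPTED", 1, "v")
assert a.on_prepare(0) is None                   # lower-numbered prepare ignored
```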

Key to the efficiency of this approach is that, in the Paxos consensus algorithm, the value to be proposed is not chosen until phase 2. After completing phase 1 of the proposer's algorithm, either the value to be proposed is determined or else the proposer is free to propose any value.
