Tuesday 30 June 2015

Implementing Fault-Tolerant Services Using the State Machine Approach

A state machine consists of state variables, which encode its state, and commands, which transform its state. The execution of a command is atomic with respect to other commands: it modifies the state variables and produces some output.
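
To make the model concrete, here is a minimal Python sketch of a state machine (my own illustration, not taken from the paper); the class and command names are hypothetical, and a lock stands in for whatever mechanism makes each command atomic with respect to the others.

```python
import threading

class MemoryStateMachine:
    """Hypothetical state machine: state variables plus commands.

    Each command runs atomically with respect to the others and
    returns some output.
    """

    def __init__(self):
        self._store = {}               # state variables
        self._lock = threading.Lock()  # makes each command atomic

    def put(self, key, value):         # command: transforms the state
        with self._lock:
            self._store[key] = value
            return "ok"

    def get(self, key):                # command: reads the state
        with self._lock:
            return self._store.get(key)
```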

Byzantine Failures - the component can exhibit arbitrary and malicious behaviour, perhaps involving collusion with other faulty components. A faulty process can overwrite memory locations and destroy information; if several machines collude, the faulty processes can all write the same wrong result in memory.

Fail-stop Failures - in response to a failure, the component changes to a state that permits other processes to detect that a failure has occurred, and then stops.

When processors can experience Byzantine failures, an ensemble implementing a t fault-tolerant state machine must have at least 2t+1 replicas, and the output of the ensemble is the output produced by a majority of the replicas. With 2t+1 replicas, a majority of the outputs remains correct even after as many as t failures.

With fail-stop failures, t+1 replicas suffice, because correct processes produce only correct outputs, and after t failures there is still at least one process left to produce the output.
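
A hedged sketch of how the ensemble output could be combined under the two failure models (the function names are mine, not the paper's): with Byzantine failures, the value produced by a majority of the 2t+1 replicas is taken; with fail-stop failures, any single output is already correct.

```python
from collections import Counter

def byzantine_ensemble_output(outputs, t):
    """outputs: one value per replica from a 2t+1-replica ensemble.

    With at most t Byzantine replicas, at least t+1 correct replicas
    produce the same (correct) value, so a majority vote is safe.
    """
    assert len(outputs) >= 2 * t + 1
    value, count = Counter(outputs).most_common(1)[0]
    if count >= t + 1:
        return value
    raise RuntimeError("no value reached a majority")

def fail_stop_ensemble_output(outputs):
    """With fail-stop failures every produced output is correct, so
    t+1 replicas suffice and the first available output can be used."""
    return outputs[0]
```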

To implement a fault-tolerant state machine it is necessary to ensure the following:

  • Replica Coordination. All replicas receive and process the same sequence of requests. This can be decomposed into two requirements concerning dissemination of requests to replicas in an ensemble.
  • Agreement. Every nonfaulty state machine replica receives every request.
  • Order. Every nonfaulty state machine replica processes the requests it receives in the same relative order.

Agreement concerns the client's interaction with the state machine replicas, while Order concerns the behaviour of a replica with respect to the requests it has received. Depending on the use case, the implementation of Agreement and Order can sometimes be relaxed.

Agreement can be relaxed for read-only requests when fail-stop processors are assumed. Such a request need only be sent to a single nonfaulty state machine replica, because the response from that replica is guaranteed to be correct.

Another example: Order can be relaxed for two requests r and rʹ that commute in a state machine, because the result is the same regardless of whether r is processed before or after rʹ.

It only makes sense to guarantee the Order property if the Agreement property is also guaranteed.

The Agreement requirement is guaranteed if:

  • IC1. All nonfaulty processors agree on the same value.
  • IC2. If the transmitter is nonfaulty, then all nonfaulty processors use its value as the one on which they agree.

The Order of the requests can be guaranteed if the algorithm uses logical clocks or synchronized real-time clocks to timestamp requests.
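
For instance, requests can be timestamped with a Lamport logical clock and processed in timestamp order, with ties broken by client id. A minimal sketch, with hypothetical names:

```python
class LamportClock:
    """Minimal Lamport logical clock for timestamping requests."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event (e.g. issuing a request)."""
        self.time += 1
        return self.time

    def update(self, received_time):
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, received_time) + 1
        return self.time

# Requests tagged (timestamp, client_id) have a total order, so every
# replica that processes stable requests in this order sees the same
# sequence.
def request_order_key(request):
    return (request["timestamp"], request["client_id"])
```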

Depending on whether the output of the state machine implemented by the ensemble is to be used within the system or outside the system, different architectures must be provided.

If the outputs of the state machine are used by external components, then that output device is a single component whose failure cannot be tolerated. The usual solution is to replicate the output device and the voter. Each voter receives the outputs of all state machine replicas, votes on the result, and produces a signal that drives one output device. E.g., a flap on an airplane wing might be designed so that when the 2t+1 actuators that control it do not agree, the flap always moves in the direction of the majority (rather than twisting); each voter sends its result to one actuator.

If output devices exhibit only fail-stop failures, then it suffices for the output device to receive a single result, since any result it receives is assumed to be correct.

If the outputs of the state machine are used by the client, then the client itself can combine the outputs of the state machine replicas in the ensemble. When Byzantine failures are possible, the client waits until it has received t+1 identical responses, each from a different replica, and takes that as the response of the t fault-tolerant state machine. When only fail-stop failures are possible, the client can proceed as soon as it has received any response.
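
A small sketch of this client-side rule, assuming responses arrive one at a time as (replica id, value) pairs (the function name and interface are mine):

```python
from collections import Counter

def combine_responses(responses_iter, t, byzantine=True):
    """Collect replica responses until the client can trust a result.

    responses_iter yields (replica_id, value) pairs, one per replica.
    With Byzantine failures the client waits for t+1 identical values
    from different replicas; with fail-stop failures the first value
    received is already correct.
    """
    seen = set()       # replicas that have already answered
    counts = Counter()
    for replica_id, value in responses_iter:
        if not byzantine:
            return value
        if replica_id in seen:
            continue                 # ignore duplicate responses
        seen.add(replica_id)
        counts[value] += 1
        if counts[value] >= t + 1:
            return value
    raise RuntimeError("not enough matching responses")
```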

Thursday 25 June 2015

ADAPT: Availability-aware MapReduce Data Placement for Non-Dedicated Distributed Computing

In this paper, the authors propose an Availability-aware Data Placement (ADAPT) strategy to improve application performance without extra storage cost. The objective of the data placement algorithm is to find an optimized mapping from data blocks to nodes such that all nodes complete their assigned blocks at the same time. They propose an analytical model to estimate the execution time of MapReduce tasks in non-dedicated distributed computing environments, which lets them mitigate the impact of the volatility and heterogeneity of the nodes. ADAPT dynamically dispatches data blocks onto participating hosts based on their availabilities.
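
The authors' analytical model is more elaborate, but the core idea of dispatching blocks in proportion to estimated node availability can be sketched roughly as follows; the availability estimates and the function name are assumptions of this sketch, not the paper's API.

```python
def place_blocks(num_blocks, node_availability):
    """Assign data blocks to nodes roughly in proportion to their
    estimated availability, so that all nodes are expected to finish
    their share at about the same time.

    node_availability: dict node_id -> availability in (0, 1].
    Returns dict node_id -> number of blocks to host.
    """
    total = sum(node_availability.values())
    placement = {
        node: int(num_blocks * avail / total)
        for node, avail in node_availability.items()
    }
    # Hand out any blocks lost to rounding to the most available nodes.
    leftover = num_blocks - sum(placement.values())
    for node in sorted(node_availability, key=node_availability.get, reverse=True):
        if leftover == 0:
            break
        placement[node] += 1
        leftover -= 1
    return placement

# Example: a node that is up 90% of the time receives roughly three
# times as many blocks as one that is up 30% of the time.
print(place_blocks(100, {"n1": 0.9, "n2": 0.6, "n3": 0.3}))
```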

ADAPT was implemented within the Hadoop MapReduce platform and incurs only minor overhead on the existing Hadoop framework.

They perform extensive experiments and simulations to evaluate the feasibility and payoffs of ADAPT. The experimental results show that ADAPT improves application performance by more than 30%.

Sunday 21 June 2015

Consensus with Byzantine Failures and Little System Synchrony

In distributed systems, consensus is the act of having all parts agree on a single value. For example, in the binary consensus problem, every correct process proposes some value in {0,1} and must make an irrevocable decision on a value so that the following properties hold:

  • (Agreement) No two correct processes decide differently;
  • (Validity) If some correct process decides v, then v is proposed by some correct process;
  • (Termination) Every correct process eventually decides some value.

The consensus problem is at the core of fault-tolerant distributed systems. Solving consensus is impossible in asynchronous systems subject to process failures. A well-known way to overcome this impossibility is to make partial synchrony assumptions about the system, e.g., that the relative speeds of processes are bounded and that all links are eventually timely (a message sent at time T is delayed by the link for at most Δ).

This possibility result holds for a system Scrash with crash failures and a system Sbyz with Byzantine failures, with resiliencies of n ≥ 2f+1 and n ≥ 3f+1 respectively, where n is the number of processes and f is the maximum number that can fail.

To solve consensus, is it really necessary that all links be eventually timely? What if only some links are eventually timely while other links can be arbitrarily slow; can consensus still be solved? And how do these possibilities differ between crash-failure and Byzantine systems?

Previous work has shown that consensus is possible in a weaker system in which there is at least one unknown non-faulty process whose f outgoing direct links are eventually timely. This result only holds for systems with crash failures.

In this paper, the authors evaluate these possibilities for Byzantine systems by trying to solve consensus in a system where processes have only some eventually timely links. They define two systems: (i) Sbyz, where all the links are eventually timely, and (ii) Sʹbyz, where only the links to and from the correct processes are eventually timely.

Their consensus algorithm uses consistent unique broadcast as a subroutine, where messages carry a tag to ensure that (1) correct processes deliver the same set of messages, and (2) a correct process delivers at most one message with a given tag. Tags are used to ensure that a Byzantine process cannot broadcast two different messages in the same round. They also use provable reliable send, which guarantees that a message m sent by a correct process p is delivered to q and that every other process r gets a proof of the delivery.

Subroutines

Consistent unique broadcast

Consistent unique broadcast is guaranteed by the following properties:

  • (Validity) If a correct process p broadcasts (X,k,v), then all correct processes eventually cudeliver (X,k,v) from p ((X,k) is the tag and v is the value);
  • (Unforgeability) If a correct process p does not broadcast (X,k,v), then no correct process will ever deliver it.
  • (Uniqueness) For each tag (X,k) and process q, a correct process delivers at most one message with tag (X,k) from q. In other words, no correct process delivers two different messages with the same tag from the same sender (a toy sketch of this bookkeeping follows the list).
  • (Relay) If a correct process delivers (X,k,v) from p then all correct processes will eventually deliver the message.
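
The Uniqueness bookkeeping mentioned above can be pictured with a toy Python sketch (my own, not the paper's echo-based protocol): the delivering process remembers which (sender, tag) pairs it has already accepted and refuses a second value for the same pair.

```python
class UniqueDeliverer:
    """Tracks, per sender, the tags already delivered so that at most
    one value is ever delivered for a given (sender, tag) pair
    (the Uniqueness property). The full protocol also needs echoing
    among processes to obtain Validity, Unforgeability, and Relay;
    that part is omitted here.
    """

    def __init__(self):
        self.delivered = {}   # (sender, tag) -> value

    def try_deliver(self, sender, tag, value):
        key = (sender, tag)
        if key in self.delivered:
            return None       # a value with this tag from this sender was already delivered
        self.delivered[key] = value
        return value
```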

Provable reliable send

Provable reliable send guarantees that if p is correct, then every correct process r gets a proof that m is in transit, and that if a correct process r gets the proof that m is in transit and q is correct, then q receives m. In other words, if a correct process p sends m, every correct process r gets a proof that m is in transit, and if q is also correct it eventually receives m. Provable reliable send has the following properties:

  • (Integrity) A correct process q receives m from a correct process p at most once, and only if p has previously sent m to q;
  • (Validity) If some correct process p sends m to some correct process q then eventually q receives m from p;
  • (Proof-Integrity) If some correct process r gets the proof of m from some process p to some correct process q then q receives m from p;
  • (Proof-Validity) If some correct process p sends m to some process q then every correct process r gets the proof of m from p to q.
  • (Eventual timeliness) If process q is a bisource (a process whose incoming and outgoing links are all eventually timely), then there exist Δʹ and Tʹ such that if some correct process r gets the proof of m from some process p to process q at time t, then q receives m from p by time max{t,Tʹ}+Δʹ. In other words, if r has the proof at time t that m was sent from p to q, then m must be delivered by time max{t,Tʹ}+Δʹ.

Consensus

They use the previous subroutines to implement consensus. Their consensus algorithm is derived from Ben-Or's randomized algorithm. Each process pi, i = 1...n, keeps a current estimate of the decision value, which is initially the value that pi proposes to consensus. The algorithm proceeds in rounds, where each round has four phases: certification, reporting, proposing, and consulting the coordinator; in these phases the processes exchange their estimates until all agree on the same value.

With this algorithm they conclude that consensus is possible in system Sbyz, in which every non-faulty process has all of its incoming and outgoing links eventually timely. This means that consensus is not possible in an Sʹbyz system (where some correct process does not have all of its links eventually timely). In other words, it is not possible to tolerate Byzantine failures when there is only a non-faulty process s whose n−1 or f outgoing links are eventually timely (in that situation, consensus is only possible for crash faults).

Sunday 14 June 2015

Several types of synchrony

The concept of partial synchrony in a distributed system lies between the cases of a synchronous system and an asynchronous system. In a synchronous system, there is a known fixed upper bound Δ on the time required for a message to be sent from one processor to another and a known fixed upper bound Φ on the relative speeds of different processors. In an asynchronous system, no fixed upper bounds Δ and Φ exist. In a partially synchronous system, in one version, the fixed bounds Δ and Φ exist but are not known a priori. In the other version, the bounds are known but are only guaranteed to hold starting at some unknown time T.

One way to compare the several models of synchrony is with the problem of agreement. Consider a collection of N processors, p1, …, pN, which communicate by sending messages to one another. Initially each processor pi has a value vi drawn from some domain V of values, and the correct processors must all decide on the same value; moreover, if the initial values are all the same, say v, then v must be the common decision. In addition, the consensus protocol should operate correctly even if some of the processors are faulty, for example, if they crash (fail-stop faults), fail to send or receive messages when they should (omission faults), or send erroneous messages (Byzantine faults).

Given the assumptions about the synchronism of message delivery and of processor speeds, we can characterize a consensus model by its resiliency - the maximum number of faults that can be tolerated. For example, it might be assumed that there is a fixed upper bound Δ on the time for messages to be delivered (communication is synchronous) and a fixed upper bound Φ on the speed of one processor relative to another (processors are synchronous), and that these bounds are known a priori and can be "built into" the protocol.

When there are no bounds on message delivery time or processor speed (asynchronous systems), consensus is impossible in the presence of process failures, because we cannot tell whether a process has failed or whether its messages are merely being delivered slowly.

A well-known way to overcome this impossibility is to make partial synchrony assumptions about the system. The consensus problem becomes solvable in this model if the speeds of the processes are bounded and all links are eventually timely. If a message from a process is not delivered by time T+Δ, the process is considered to have failed. This possibility result holds for crash failures (Scrash) and Byzantine failures (Sbyz): consensus is achievable for n ≥ 2f+1 and n ≥ 3f+1 for crash and Byzantine failures, respectively, where n is the number of processes and f is the maximum number of processes that can fail.

Therefore, for the consensus problem, the resilience of the algorithm depends on the system model that we define.

Based on the paper by Cynthia Dwork et al. [1], here is a table that shows how many processes N are necessary to tolerate f faults in each type of system.

Failure type            | Synchronous | Asynchronous | Partially synchronous communication, synchronous processors | Partially synchronous communication and processors | Partially synchronous processors, synchronous communication
Fail-stop               | f           | impossible   | 2f+1 | 2f+1 | f
Omission                | f           | impossible   | 2f+1 | 2f+1 | [2f, 2f+1]
Authenticated Byzantine | f           | impossible   | 3f+1 | 3f+1 | 2f+1
Byzantine               | 3f+1        | impossible   | 3f+1 | 3f+1 | 3f+1

[1] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the Presence of Partial Synchrony. Journal of the ACM, 1988.

Thursday 11 June 2015

Smart redundancy for Distributed Computation

Much software today consists of distributed systems composed of large numbers of autonomous software and hardware participants that interact over untrusted networks. Examples include distributed data stores (e.g., Freenet), peer-to-peer A/V streaming applications (e.g., Skype), and distributed computation architectures (DCAs), which solve massive problems by deploying highly parallelizable computations (i.e., sets of independent tasks) to dynamic networks of potentially faulty and untrusted computing nodes (e.g., Hadoop). These systems use redundancy mechanisms to tolerate faults and achieve acceptable levels of reliability.

In this paper, the authors focus on existing redundancy mechanisms (traditional and progressive redundancy) for DCAs and present a new method, called iterative redundancy, which ensures efficient replication of computation and data given finite processing and storage resources, even in the face of Byzantine faults. They claim that iterative redundancy is more efficient and more adaptive than comparable state-of-the-art techniques (traditional and progressive redundancy) in environments where system resource reliability is unknown.

Traditional redundancy

This k-vote redundancy performs k ∈ {3,5,7,...} independent executions of the same task in parallel and then takes a vote on the correctness of the result. If at least some minimum number of executions agree on a result, a consensus exists, and that result is taken to be the solution. In short, the algorithm just needs a minimum number of matching results (e.g., a majority, (k+1)/2) to reach consensus.
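
A sketch of this scheme in Python (the helper names are mine; in a real DCA the k copies would run in parallel on different nodes):

```python
from collections import Counter

def traditional_redundancy(run_job, k=3):
    """k-vote redundancy sketch: run k copies of the job and accept a
    result only if at least (k+1)//2 of them agree.

    run_job is a callable that executes one copy and returns a result.
    """
    results = [run_job() for _ in range(k)]
    value, count = Counter(results).most_common(1)[0]
    if count >= (k + 1) // 2:
        return value
    return None   # no consensus among the k executions
```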

Progressive redundancy

Sometimes traditional redundancy reaches a consensus quickly but still continues to distribute jobs that cannot affect the task's outcome. Progressive redundancy minimizes the number of jobs needed to produce a consensus. It starts a minimum of (k+1)/2 jobs and checks whether they return the same result. If so, there is a consensus, and the results produced by any subsequent jobs of the same task become irrelevant. If some nodes agree, but not enough to produce a consensus, the task server automatically distributes the minimum number of additional copies of the job needed until a consensus is reached.
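
A rough sketch of the progressive scheme, under the same assumptions as the previous block:

```python
from collections import Counter

def progressive_redundancy(run_job, k=3, max_jobs=50):
    """Progressive redundancy sketch: start with (k+1)//2 copies and
    only launch the minimum number of extra copies needed until some
    result has (k+1)//2 matching votes.
    """
    needed = (k + 1) // 2
    counts = Counter()
    launched = 0
    while launched < max_jobs:
        # Launch just enough additional jobs to make a consensus possible.
        top = counts.most_common(1)[0][1] if counts else 0
        for _ in range(needed - top):
            counts[run_job()] += 1
            launched += 1
        value, votes = counts.most_common(1)[0]
        if votes >= needed:
            return value
    return None
```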

Iterative redundancy

Progressive redundancy is guaranteed to distribute the fewest jobs to achieve a consensus. In contrast, iterative redundancy is guaranteed to distribute the fewest jobs needed to achieve a desired system reliability.

There is a confidence level that varies according to the apparent risk of failure of a task. The user specifies how much improvement is needed, and the system uses the available resources to achieve the highest possible system reliability.

Iterative redundancy distributes the minimum number of jobs required to achieve a desired confidence level in the result. If all jobs agree, the task is complete. However, if some results disagree, the confidence level associated with the majority result is diminished, and the system distributes the minimum number of additional jobs needed to reach the desired level of confidence.
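
A rough sketch of the idea. It assumes each job is correct independently with some estimated probability r > 0.5 and uses the gap between agreeing and disagreeing votes to estimate confidence; the real technique does not need to know r, and the paper derives the exact batch sizes, so treat this only as an illustration.

```python
def iterative_redundancy(run_job, r=0.8, target_confidence=0.999, max_jobs=50):
    """Iterative redundancy sketch.

    Assumes each job independently returns a correct result with
    probability r. Jobs are launched until the majority result is
    confident enough, where confidence is estimated from the gap d
    between agreeing and disagreeing votes:
        confidence = 1 / (1 + ((1 - r) / r) ** d)
    """
    votes = {}
    launched = 0
    while launched < max_jobs:
        result = run_job()
        votes[result] = votes.get(result, 0) + 1
        launched += 1
        ordered = sorted(votes.values(), reverse=True)
        gap = ordered[0] - (ordered[1] if len(ordered) > 1 else 0)
        confidence = 1.0 / (1.0 + ((1.0 - r) / r) ** gap)
        if confidence >= target_confidence:
            return max(votes, key=votes.get)
    return None
```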

This algorithm does not require knowledge of node reliability and can therefore be applied to a wider class of systems than credibility-based fault tolerance and blacklisting. For instance, in volunteer computing, the system has no information about new volunteers, and if the system collects information about node reliability over time, malicious nodes that have developed a bad reputation can simply change their identity. Iterative redundancy does not suffer from this problem.

Comparison between redundancies

Progressive and iterative redundancy need to deploy several jobs and wait for the responses before possibly choosing to deploy more, whereas traditional redundancy launches all the jobs at once and looks for a consensus based on all results. At first sight, the response time seems much larger for progressive and iterative redundancy than for traditional redundancy. But in the realm of DCAs the number of tasks is far larger than the number of nodes, so the increased response time is not a problem: nodes can always execute jobs belonging to other tasks, and only rarely does a full execution have to wait for failed tasks to finish successfully. In other words, no node is ever idle and the nodes' processing capability is fully utilized.

In the evaluation they conclude that the average response time for tasks is larger with iterative redundancy (by factors of 1.4 and 2.8 compared to the progressive and traditional techniques, respectively). On the other hand, for the same cost factor iterative redundancy can guarantee better reliability than the other techniques, peaking at 2.8 times better than traditional redundancy.

Cost factor

In the text above I used the term cost factor. To characterize the behaviour of each technique, the authors derive formulae for two measures of their effect on systems: the system reliability R(r) achieved by, and the cost factor C(r) of, applying the redundancy technique.

Taking traditional redundancy as an example to understand the cost factor: the reliability of k-vote redundancy is the probability that at least a consensus of jobs does not fail, and the cost represents how many jobs must be launched to achieve the desired reliability.
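
As a worked example (my own, not the paper's notation): if each job is correct independently with probability r, the reliability of k-vote redundancy is the probability that at least (k+1)/2 of the k jobs are correct, which is a binomial sum, while the cost factor is roughly k jobs per task.

```python
from math import comb

def k_vote_reliability(k, r):
    """Probability that at least (k+1)//2 of k independent jobs return
    the correct result, when each is correct with probability r."""
    majority = (k + 1) // 2
    return sum(comb(k, i) * r**i * (1 - r)**(k - i) for i in range(majority, k + 1))

# With r = 0.9, a single job is right 90% of the time, while 3-vote
# redundancy is right with probability 0.972 at a cost factor of 3.
print(k_vote_reliability(3, 0.9))   # 0.972
```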

Monday 8 June 2015

Heading off correlated failures through Independence-as-a-Service

Cloud services normally require high reliability and rely on redundancy techniques to ensure it. Contrary to expectations, however, seemingly independent infrastructure components may share deep, hidden dependencies. Failures in these shared dependencies may lead to unexpected correlated failures, undermining redundancy efforts. For instance, Amazon S3 replicates each data object across multiple racks in an S3 region, yet a failure of a main switch can compromise access to the whole infrastructure.

Discovering unexpected common dependencies is extremely challenging, and many of them are diagnosed only after they have caused a failure. These retroactive approaches require human intervention, leading to prolonged failure recovery times.

Worse, correlated failures can be hidden not just by inadequate tools or analysts within one cloud provider, but also by non-transparent business contracts between cloud providers and lower-level services.

In this paper, the authors propose Independence-as-a-Service (INDaaS), a novel architecture that aims to address the above problems proactively. Rather than localizing and tolerating failures after an outage, INDaaS collects and audits structural dependency data to evaluate the independence of redundant systems before failures occur.

This proactive strategy uses acquisition modules that collect dependency data and adapt it into a common format, and an auditing agent that employs a similarly pluggable set of auditing modules to quantify the independence of redundant systems and identify common dependencies that may introduce unexpected correlated failures. At the end, an auditing report quantifies the independence of the various redundancy deployments, optionally computing useful information such as estimates of correlated failure probabilities and ranked lists of potential risk groups.

This is a very thorough paper that details every step of INDaaS. The authors also present the concept of a Risk Group (RG) and a couple of algorithms (the minimal RG algorithm and the failure sampling algorithm) that they use to build and rank RGs for auditing. It is with this data that the service builds the final report.

They evaluated the service in three small "but realistic" case studies. They emulated a real data center topology using 4 servers and 4 switches, with 8 VMs running in total on those servers. The tests showed that a user has only a 14% probability of placing a service on servers that do not suffer from correlated faults. As a result, the INDaaS auditing reports gave hints about the weakest parts of the topology. They also compared the RG algorithms and concluded that the failure sampling algorithm runs much more efficiently than the minimal RG algorithm while still achieving reasonably high accuracy: the failure sampling algorithm took 96 minutes to find 92% of all the RGs, compared to 1046 minutes for the minimal RG algorithm.

In my opinion the evaluation section has a lot of room for improvement. The authors did not show how the topology results relate to a real-world scenario. Overall, they explain INDaaS thoroughly but present a shallow evaluation section.

Risk Group

In redundant systems, a risk group (RG) is a set of components whose simultaneous failures could cause a service outage. Suppose some service A replicates critical state across independent servers B, C, and D located in 3 separate racks. The intent of this 3-way redundancy configuration is that every RG has size 3, i.e., 3 servers must fail simultaneously to cause an outage. Now imagine that these 3 racks share the same switch: if the switch becomes unavailable, the 3 racks also become unavailable. In this case, a common dependency introduced an RG whose failure could disable the whole service despite the redundancy efforts. Correlated failures can also be hidden not just by inadequate tools or analysts within one cloud provider, but also by non-transparent business contracts between cloud providers; a storm in Dublin once took down a local power source and its backup generator, disabling both the Amazon and Microsoft clouds in that region for hours. In this paper, the authors propose a novel architecture called Independence-as-a-Service (INDaaS) that collects and audits structural dependency data to evaluate the independence of redundant systems before failures occur.
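
The audit can be pictured as a search for small sets of shared components whose joint failure takes down every replica. Here is a toy sketch over explicit dependency sets (the data structures and names are mine, not INDaaS's; the real algorithms are the minimal RG and failure sampling algorithms mentioned above):

```python
from itertools import combinations

def risk_groups(replica_deps, max_size=2):
    """replica_deps: dict replica -> set of components it depends on
    (servers, racks, switches, power sources, ...). Returns component
    sets of at most max_size elements whose joint failure takes down
    every replica, i.e. risk groups smaller than the replication factor.
    """
    components = set().union(*replica_deps.values())
    groups = []
    for size in range(1, max_size + 1):
        for candidate in combinations(sorted(components), size):
            cand = set(candidate)
            if all(deps & cand for deps in replica_deps.values()):
                groups.append(cand)
    return groups

# The switch shared by all three racks shows up as a risk group of size 1.
deps = {
    "B": {"serverB", "rack1", "switch1"},
    "C": {"serverC", "rack2", "switch1"},
    "D": {"serverD", "rack3", "switch1"},
}
print(risk_groups(deps, max_size=1))   # [{'switch1'}]
```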

Discovering unexpected common dependencies is challenging. Many diagnostic approaches attempt to tolerate faults only after they have happened, and most of the time they require human intervention. Worse, correlated faults can be hidden not just by inadequate tools or analysts, but also by private contracts between cloud providers.