Monday 1 December 2014

On the Efficiency of Durable State Machine Replication (SMR)

Paper here

This is a paper that tries to evaluate what happens when we apply durability (logging, checkpoints, and state transfer) in a modular way inside a SMR.

In the Paxos algorithm, we have a set of servers that run a Total Order Multicast, e.g. Paxos agreement, and this protocol ensures that all servers run the same sequence of operations. In this paper, they modified the SMR, and make it durable (executions are save in disk). Even if all machines crash, the client will see the operation result after recovery. This kind of system is important because in many papers, people assume in implementing Paxos in some critical components to avoid single point of failure. We can see this is coordination services like Zookeeper. In some sense, Paxos/SMR are a critical component of modern internet-scale infrastructure.

Durability is important because sometimes we need to shutdown the system, and correlated failures1 happen more often than expected. In practice, application, replication and durability code is mixed with the code for efficiency, and can turn the code very complex, and very difficult to make it right.

There are already papers that talks about agreement protocols (PBFT, Zyzzyva), but they care about synchronizing replicas in the normal execution and do not consider durability (logging, checkpoints, and state transfer). In this paper, they talk about SMR, but the problem of tolerating faults is orthogonal to what they show (i.e., it is not really relevant here). They considered a system model of n replicas that tolerates at most f faults (crash faults: n > 2f, or arbitrary faults: n > 3f). The important thing is that they considered non-synchronous systems - systems that work with quorums - and they require that n − f replicas to set an message order and execute them.

In terms of programming model, the SMR on the server side and the service components can set and get the states. They have a key-value store as a memory table, and a disk log as a permanent file. When a read command is sent, you from read the memory table, and when the write command is sent, you write in the memory table and append the operation in the disk log.

If you use stable logging, writing to a disk cost time. The values below show the throughput of writing (4kB/sec) that is possible to do.

Memory 4772 w/sec
Async Disk 4312 w/sec
Sync Disk 63 w/sec
Sync SSD 1017 w/sec

We can see, they it is possible to write 4772 times per second into memory. Writing into an async disk takes up 4312 operations per second. The problem of writing in asynchronously is, if you are really unlucky, you can loose some operations. Buying SSDs cost a lot, and it solves partially the problem. The idea of the paper is to close the value as much as possible to the values of writing into memory.

If you have checkpoints, you need to control the size of the log for two reasons:

  1. We need to control the size of the log to be stored.
  2. It takes time to recover the log.

The solution is to have periodic snapshots. In SMR, the normal way is to take periodic snapshots in some log size.

When the snapshots happen, the service becomes slow and the system can become unavailable to accept client requests. A solution to this is to use copy-on-write 2, but this adds complexity to the code. Another solution is to use Fuzzy snapshots 3, but it requires additional disks.

When it is necessary to do a state transfer because a replica crashed and recovered, another replica is picked up to do the state transfer and the service must be stopped because we need a quorum to execute requests. A way to solve this is to have additional replicas to deal with recoveries, or avoiding answering client requests.

When we integrate the referred problems and solutions to the code, it can make the code very complex.

In the paper they show several solutions to take care of the same problem without changing the SMR code. Three new techniques are proposed, like parallel logging, sequential checkpoints and a collaborative state transfer.

Parallel logging

To have parallel logging, it is necessary that a reply should only be sent after the state and the log are updated. Also the disk bandwidth must be very good (it takes approximately the same time to write 1kB or 1MB).

What it is proposed is to have an agreement layer that keep ordering requests. The requests are batched, and, at same point, they are sent to a thread that appends the values to the memory table and to the disk log. After the thread finishes, the result is replied to the client.

Sequential checkpoints

Normally, checkpoints happen in a synchronized way - all replicas take the same checkpoint at the same time. The benefit from this solution is that it is very easy to recover a machine with the correct checkpoint. The drawback is that it can make the system unavailable to answer client requests. The authors propose the sequential checkpoint - each server do the checkpoint at a time and there are more than one replica doing checkpoint at the same time - in order to avoid perturbations of the system during this time. The drawback of the sequential checkpoint is that it complicates state transfer.

Collaborative State Transfer

The idea is to ask a several replicas sending parts of the current state. One replica will send the checkpoint, and the remaining ones will send different parts of the log. We can see the optimized solution in the figure (b). In the General CST, just one replica must send the state to the recovered host.

Evaluation

They implemented these features in a durability layer on BFT-SMaRt4 replication library.

In figure 9, they show the number of operations that happen in parallel logging. Since they are stressing the system to its maximum, we can see that with parallel logging the results are similar as a pure memory system (red line in Figure 9 (a)). Also, for large writes, the parallel logging in Disk can be slightly better than with SSD, because the disks absorb more data and can reach pure memory throughput. But, SSDs win in terms of latency.

In the Figure 10, we can see the throughput during sequential checkpointing. In these graphs there are 2 execution: one in high load, and another in medium load for 500MB or 1GB state. Just focusing in the high load in Figure 10 (b), we can see 2 perturbations in the logging related to lagging replica synchronization. In this case, there are some replicas that are in front of others, and so the system becomes slow so that the lagging replica start to pick-up the pace and participate on the agreement. In medium load, this problem never happens because there is always time for the replicas to keep up with the system.

In the Figure 11, we can see what happens when a replica is recovering. In this case, they are considering f = 1 (n = 4) with state validation.

When they try collaborative state transfer (CST), they try to spread the workload among the 3 replicas. We can still see perturbations, but it is still a better result than using a single state transfer. The CST is represented with the blue line, and we can see a perturbation between 90s-120s (Figure 10 (b)). The CST perturbation is one third of the single state transfer, and this happens because, in this execution, they are trying to get part of the logs with the same size as the checkpoints.

In the usual state transfer protocol (you ask the state from a single replica), they get 24% of the normal throughput during state transfer, and 60% with CST. Summing up, it is much better to use CST than sequential state transfer, but the protocol is more complex.

In conclusion, durable SMR is a fundamental technique for critical services, and it is complex and difficult to do it right. In this paper, the authors showed a way to deal with the durability overhead without breaking modularity and the SMR model. They also presented 3 techniques like, parallel logging, sequential checkpoints, and collaborative state transfer.


  1. http://psteitz.blogspot.pt/2011/10/correlated-failure-in-distributed.html

  2. Copy-on-write is the name given to the policy that whenever a task attempts to make a change to the shared information, it should first create a separate - private - copy of that information to prevent its changes from becoming visible to all the other tasks.

  3. Zookeeper takes a snapshots at random time while the system is processing. This works in Zookeeper because they do not synchronize checkpoints with logging and it uses idempotent operations. Even when you are recovering a replica and you have an operation on the logging or on the checkpoint, you can execute the operation many times as you want, that the final state will always be the same.

  4. https://code.google.com/p/bft-smart/

No comments:

Post a Comment