Wednesday 23 November 2016

XFT: Practical Fault Tolerance Beyond Crashes

In the OSDI'16, it was introduced a new fault tolerant approach called XFT to build reliable and secure distributed systems to tolerate crash and non-crash (Byzantine) faults. Paper here.

XFT allows to design reliable protocols that tolerate crash machine faults regardless on the number of network faults, and, at the same time, tolerate non-crash machine faults when the number of faulty machines or partitioned are within a threshold.

The intuition behind the XFT protocol is that conditions when there is a total control of the network and the nodes are very rare. Therefore, using most of the BFT protocols in applications are excessive and a little over the top. The XFT is for the cases in which an adversary cannot easily coordinate enough network partitions and non-crash faulty machine actions at the same time. XFT can be used to protect against "accidental" non-crash faults, which happen occasionally. Or, to protect against malicious non-crash faults as long as the attacker do not perform a coordinated attack to compromise Byzantine machines and partition the network under a certain threshold. Or, the attacker cannot compromise the communication among a large number of correct participants.




State-of-the art asynchronous CFT protocols guarantee consistency despite any crash faults or  n-1 partitioned replicas. They also guarantee availability whenever a majority of replicas are correct.

In the case of XFT, guarantee consistency in two modes: (i) without non-crash faults, despite any number of crash-faulty and partitioned replicas, and (ii) with non-crash faults, whenever a majority of replicas are correct and not partitioned, i.e., provided the sum of all kinds of faults (machine or network faults) does not exceed a majority of correct process. Similarly, it also guarantees availability whenever a majority of replicas are correct and not partitioned.

The consistency guarantees of XFT are incomparable to those of asynchronous BFT. On the one hand,  XFT is consistent with the number of non-crash faults in the range [n/3; n/2), whereas asynchronous BFT is not. On the other hand, unlike XFT, asynchronous BFT is consistent when the number of non-crash faults is less than n/3.

In this paper, the authors also present XPaxos, a Paxos-like algorithm designed specifically in the XFT model. I will explain this protocol in another post.