Tuesday 29 January 2019

Byzantine Fault Tolerance, from Theory to Reality

Many people working in the computer business in distinct roles (programmers, architects, and many more) are unfamiliar with the concept of Byzantine failures. And those who have heard the term once most likely forgot it immediately.

In the most general sense, Byzantine failures are defined as arbitrary deviations of a process caused by any anomaly. The term "any" really means any fault, permanent or transient, occurring in the software or the hardware. These faults are pivotal in the distributed systems world, yet they are often neglected both when building systems and when debugging them. This lack of recognition has kept the topic confined to academia and a handful of specialized projects.

Before going further, let's separate the concepts of Byzantine faults and Byzantine failures, two terms that are often used interchangeably:

  1. Byzantine faults: a fault presenting different symptoms to different observers.
  2. Byzantine failure: the loss of a system service due to a Byzantine fault.

The paper by Driscoll et al. [1] revisits the Byzantine problem from the practitioner's perspective and warns that this type of fault should not be neglected.

The term Byzantine failure was coined back in 1982 by Leslie Lamport et al. in "The Byzantine Generals Problem" [2]. They present the scenario of several divisions of an army, each commanded by a general, surrounding an enemy camp. The generals must communicate with each other to agree on a plan of action: attack or retreat. If they all attack at the same time, they will win. If they all retreat, they will live to fight another day. But if only some of the divisions attack, the generals will die.

This sounds like an easy problem, but the generals are spread around the enemy camp and can only communicate with each other through messengers. A messenger can be killed or replaced by the enemy, and the enemy can tamper with a message. Moreover, some generals may be traitors who send inconsistent messages to keep the loyal generals from reaching consensus.

The Lamport paper discusses the problem in the context of oral messages and written, signed messages. For oral messages, it has been proven that tolerating f traitorous generals requires 3f+1 generals and f+1 rounds of information exchange. If signed messages are used to prove the authenticity of the senders, the requirement can be reduced to 2f+1 generals, still with f+1 rounds of information exchange.
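
To make the oral-messages result concrete, here is a minimal sketch (in Python, my own simulation, not code from either paper) of Lamport's recursive OM(m) algorithm. The traitor model is an illustrative assumption: a traitor here simply alternates ATTACK/RETREAT, whereas a real traitor may behave arbitrarily.

    from collections import Counter

    def majority(values):
        # Majority vote; fall back to RETREAT on a tie, as a fixed default.
        top = Counter(values).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            return "RETREAT"
        return top[0][0]

    def om(m, commander, lieutenants, value, traitors):
        """Simulate OM(m); returns {lieutenant: decided value}."""
        # Step 1: the commander sends a value to every lieutenant
        # (a traitorous commander sends inconsistent values).
        sent = {}
        for i, lt in enumerate(lieutenants):
            if commander in traitors:
                sent[lt] = "ATTACK" if i % 2 == 0 else "RETREAT"
            else:
                sent[lt] = value
        if m == 0:
            return sent
        # Step 2: each lieutenant relays what it received to the others,
        # acting as the commander of a recursive OM(m-1) exchange.
        received = {lt: [sent[lt]] for lt in lieutenants}
        for j in lieutenants:
            others = [lt for lt in lieutenants if lt != j]
            for lt, v in om(m - 1, j, others, sent[j], traitors).items():
                received[lt].append(v)
        # Step 3: each lieutenant decides by majority over all values seen.
        return {lt: majority(vals) for lt, vals in received.items()}

    # n = 4 generals, f = 1 traitor (lieutenant G2): OM(1) suffices.
    print(om(1, "G0", ["G1", "G2", "G3"], "ATTACK", traitors={"G2"}))
    # The loyal lieutenants G1 and G3 both decide on ATTACK.

With only n = 3 generals and one traitor, no exchange of oral messages lets a loyal lieutenant distinguish a traitorous commander from a traitorous peer, which is exactly why 3f+1 generals are needed.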

Consensus is an integral part of a Byzantine fault tolerant system. Addressing the consensus problem at the protocol level reduces the software complexity, but leaves the protocol itself vulnerable to Byzantine failures.

The widespread use of COTS components increases the probability that a system will suffer a Byzantine fault. And good hardware alone is not sufficient! Small electrical perturbations can drive a component into emitting a "1/2" signal: for example, after a voltage perturbation the CPU emits an output level for which it is impossible to tell whether it is a 0 or a 1. As it propagates, the error can even be amplified, and the whole system can fail. Worse, the error can become permanent, and the system will never return to a correct state.
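
A toy simulation of the 1/2 signal (my own sketch; the voltages and thresholds are made-up illustrative numbers, not from any datasheet) shows how one marginal output becomes a Byzantine fault:

    LOGIC_HIGH = 3.3  # nominal rail voltage of a logic 1

    def read_bit(voltage, threshold):
        # Each receiver resolves the analog level against its own threshold.
        return 1 if voltage >= threshold else 0

    healthy = LOGIC_HIGH   # a clean logic 1
    marginal = 1.65        # a perturbed driver stuck near the midpoint

    # Four receivers whose input thresholds differ slightly, as real
    # hardware thresholds always do.
    thresholds = [1.55, 1.62, 1.68, 1.75]
    print([read_bit(healthy, t) for t in thresholds])   # [1, 1, 1, 1]
    print([read_bit(marginal, t) for t in thresholds])  # [1, 1, 0, 0]

The marginal signal presents different symptoms to different observers, which is precisely the definition of a Byzantine fault given above.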

The paper presents the Time-Triggered Architecture (TTA) as a generic architecture for fault-tolerant distributed real-time systems. The architecture targets the automotive industry, where communication requirements are demanding. It implements a deterministic fault-tolerant communication protocol (TTP/C) that guarantees synchronization and deterministic message communication, and it provides a membership service that allows a global consensus on message distribution, tolerating this type of fault as long as the errors are transient. During testing, bombarding components with ions proved sufficient to introduce faults. This careful design work led to the introduction of centralized filtering that converts invalid signals (e.g., the 1/2 signal) into valid logic signals.
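
The effect of such a filter can be sketched like this (an illustrative model of mine, not the actual TTP/C central guardian): the filter resolves a possibly marginal input against a single threshold and re-drives it at full swing. The forwarded bit may still be wrong, but every downstream receiver now sees the same value, so the fault is downgraded from Byzantine to symmetric:

    LOGIC_LOW, LOGIC_HIGH = 0.0, 3.3

    def read_bit(voltage, threshold):
        return 1 if voltage >= threshold else 0

    def central_filter(voltage, threshold=1.65):
        # Re-drive the signal at full swing: the output is always a
        # valid logic level, never a marginal one.
        return LOGIC_HIGH if voltage >= threshold else LOGIC_LOW

    marginal = 1.63                         # a "1/2" signal from a bad driver
    thresholds = [1.55, 1.62, 1.68, 1.75]   # receivers with threshold spread

    print([read_bit(marginal, t) for t in thresholds])                  # [1, 1, 0, 0]
    print([read_bit(central_filter(marginal), t) for t in thresholds])  # [0, 0, 0, 0]

After filtering, the value may be arbitrary, but it is consistent, so the usual voting and redundancy mechanisms can handle it.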

The paper also presents the Multi-Microprocessor Flight Control System (MMFCS), developed in the '70s. This system introduced the concept of self-checking pairs and used a dual self-checking pair bus distribution topology. The self-checking pair comparisons of processors, bus transmissions, and receptions enabled precise detection of Byzantine faults. The MMFCS observations showed that Byzantine faults are not as rare as one might suppose, and that they can be permanent or intermittent. It is a misconception to assume that random events are uniformly distributed in time; in reality, the precipitating conditions can make a fault persist. Similarly, it is a myth that Byzantine faults won't propagate to other components: in the worst case, a single Byzantine fault can shut down the whole system.
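
Here is a minimal sketch of the self-checking pair idea (the names and the control law are hypothetical, not taken from the MMFCS design): two replicas compute the same output, and a comparator forwards the result only when they agree; on any mismatch the pair goes fail-silent instead of emitting a possibly Byzantine value.

    class FailSilent(Exception):
        """Raised on disagreement: the pair stops driving the bus."""

    def self_checking_pair(replica_a, replica_b, inputs):
        result_a = replica_a(inputs)
        result_b = replica_b(inputs)
        if result_a != result_b:
            raise FailSilent("output mismatch: pair removed from service")
        return result_a

    def control_law(x):
        return 2 * x + 1   # a stand-in for the real computation

    def faulty(x):
        return 2 * x       # e.g., a dropped "+1" after a bit flip

    # Healthy pair: both replicas agree, the output goes out on the bus.
    print(self_checking_pair(control_law, control_law, 10))  # -> 21

    # One faulty replica: the comparator catches the mismatch and the
    # pair silences itself rather than propagating a bad value.
    try:
        self_checking_pair(control_law, faulty, 10)
    except FailSilent as err:
        print("pair went fail-silent:", err)

The detection is precise because the comparison happens at the pair itself; the rest of the system never sees the ambiguous value.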

Effective methods for dealing with Byzantine faults fall into three types: full exchange, hierarchical exchange, and filtering topologies. These methods can be used together or separately. The first implements the exchanges described in the Lamport paper, as in the OM(m) sketch above. The second uses private exchanges inside subsets of a system's nodes, followed by simplified exchanges between the subsets. The third tries to remove Byzantine faults via filtering, as sketched above for the TTA. The paper presents example implementations of all three methods. Although these examples are implemented in hardware, software solutions also exist.

Overall, Byzantine faults are very real, and the prevalence of COTS components makes them an even more serious concern today. It is very hard, if not impossible, to design a Byzantine fault tolerant system without consensus. Consequently, the term "Byzantine fault" should be on the mind of every software and hardware engineer, and this paper gives credible examples of why. Personally, I hope that companies producing hardware and software start to include this topic in their plans.

[1] K. Driscoll, B. Hall, et al., "Byzantine Fault Tolerance, from Theory to Reality," Computer Safety, Reliability, and Security.
[2] L. Lamport, R. Shostak, M. Pease, "The Byzantine Generals Problem," ACM.