Byzantine failures are defined as arbitrary deviations of a process from its assumed behavior based on the algorithm it is supposed to be running and the inputs it receives. Such failures can occur, e.g., due to a software bug, a (transitional or permanent) hardware malfunction, or a malicious attack.
The term Byzantine refers to the Byzantine Generals' Problem, an agreement problem described by Leslie Lamport, Robert Shostak and Marshall Pease in 1982 paper, "The Byzantine Generals Problem". In this paper, they described a problem between Generals during a battle to attack a city. In its simplest form, every general must agree to attack to win the city, or to retreat. Those general are far away from each other, and the only way to reach an agreement is sending the same message to all of them through the messengers. The problem is complicated by the presence of traitorous generals who may cast different votes to different parts.
This problem can be reduced to solving a "Commander and Lieutenants" problem where Loyal Lieutenants must all act in unison and that their action must correspond to what the Commander ordered in the case that the Commander is Loyal.
- One solution considers scenarios in which messages may be forged, but which will be Byzantine-fault-tolerant as long as the number of traitorous generals does not equal or exceed one third of the generals ($f<\frac{n}{3}$);
- A second solution requires unforgeable message signatures which can provide Byzantine fault tolerance in the presence of an arbitrary number of traitorous generals.
In 1999, Miguel Castro and Barbara Liskov introduced the "Practical Byzantine Fault Tolerance" (PBFT) algorithm, which provides high-performance Byzantine state machine replication very efficiently. PBFT triggered Byzantine fault tolerant replication research with protocols like Zyzzyva, BFT-SMaRt, and others.
State machine replication (SMR) or state machine approach (SMA) is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. SMR can used to tolerate fail-stop failures and also Byzantine failures. The idea is to implement a service using a set of server replicas in such a way that the overall service can continue to behave as specified even if a number of servers is faulty. If the service is designed to tolerate arbitrary faults, which include attacks and intrusions, then the service can be said to be intrusion-tolerant, or Byzantine resilient.
The resilience of a protocol can be defined as the maximum number of faults in the presence of which the protocol still behaves according to its specification. Practical intrusion-tolerant replicated systems based on the state machine approach can handle at most $f$ Byzantine components out of a total of $n=3f+1$, which is the maximum resilience in asynchronous systems.
Although most of the companies are more concerned with fail-stop failures, it does not mean that systems that tolerates arbitrary faults are useless. The Bitcoin network works in parallel to generate a chain of Hashcash style proof-of-work. The proof-of-work chain is the key to overcome Byzantine failures and to reach a coherent global view of the system state.
Any company operating at a scale like IBM/Google/Yahoo/Microsoft/Apple cares, because arbitrary faults happen and when you have 500.000 or more computers, and it is difficult to just have IT guys deal with it. One example, is that these companies implemented the Paxos algorithm in some of their services.
