Computer Science Talks: Failure detectors

In distributed computing, a failure detector is an application or a subsystem that is responsible for detection of node failures or crashes in a distributed system.
There are 2 mechanisms of failure detection:

Heartbeat mechanism: The heartbeat mechanism monitors the connection between a manager and an agent, and are sent by the agent to indicate that it is alive. When the messages are not received, the manager can consider the agent faulty.
Probe mechanism: Each node require to probe another node. Basically, one node sweeps all other nodes, and see who has replied. The probe mechanism is the origin of the failure detectors.

Failure detectors must guarantee safety and liveness.

Safety: safety properties informally require that "something bad will never happen". Safety properties can be violated by a finite execution of a distributed system.
Liveness: liveness properties informally require that "something good eventually happens". A liveness property cannot be violated in a finite execution of a distributed system because the "good" event might only theoretically occur at some time after execution ends. Eventual consistency is an example of a liveness property.

Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. A system that has achieved eventual consistency is often said to have converged.

If "perfect channels" are available, heartbeat exchanges meet strong accuracy and strong completeness. Such detector is called "Perfect" failure detector - if a node crashes, all correct nodes with notice the absence of the heartbeat. But in the internet, there is no perfect channel. If we knew how much packets will be lost in the network, lets say k packets, we just needed to send k + 1 packets to ensure a perfect detector. But this is not possible due to the lack of bounds on the number of communication faults that may happen in a channel. Perfect failure detectors cannot work on the internet, or in asynchronous systems. Therefore, there are weaker failure detectors models.

The explanation of the several types of failure detectors can be viewed in my post called " Unreliable failure detectors for reliable distributed systems"

Computer Science Talks

Friday, 7 November 2014

Failure detectors

No comments:

Post a Comment