Computer Science Talks: Improving availability in distributed systems with failure informers

When a distributed system acts on failure reports, its correctness and availability depend on the granularity and semantics of those reports. Availability also depends on coverage (failures are reported), accuracy (reports are justified), and timeliness (reports come quickly). Availability is a paramount concern for distributed applications, so systems must be able to recover quickly when a failure happens. Existing mechanisms for reporting failures are coarse-grained, lack coverage, lack accuracy, or do not handle latent failures.

The paper presents Pigeon, a service for reporting host and network failures to highly available distributed applications. Pigeon aims to detect failures accurately and report them to the application so that the system can recover quickly. Pigeon classifies failures into four types along two axes: whether the problem has certainly occurred or is only expected and imminent, and whether the target is certainly and permanently stopped or not.
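The two axes above yield four combinations. A minimal sketch of that classification follows; the enum labels and the `classify` function are illustrative assumptions made for this post, not Pigeon's actual client API.

```python
from enum import Enum


class Condition(Enum):
    # Labels are assumptions chosen for illustration.
    STOP = "stop"                                      # certain, permanent
    UNREACHABILITY = "unreachability"                  # certain, not permanent
    STOP_WARNING = "stop warning"                      # expected, permanent
    UNREACHABILITY_WARNING = "unreachability warning"  # expected, not permanent


def classify(certain: bool, permanent: bool) -> Condition:
    """Map the two axes (certainty, permanence) onto one of four types."""
    if certain and permanent:
        return Condition.STOP
    if certain:
        return Condition.UNREACHABILITY
    if permanent:
        return Condition.STOP_WARNING
    return Condition.UNREACHABILITY_WARNING
```

An application could then branch on the result, for example failing over immediately on `Condition.STOP` while only preparing a backup on a warning.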

Pigeon has several limitations and operating assumptions. First, Pigeon assumes a single administrative domain (but there are many such networks, including enterprise networks and data centers). Second, Pigeon requires the ability to install code in the application and network routers (but doing so is viable in single administrative domains). Third, for Pigeon to be most effective, the administrator or operator must perform environment-specific tuning (but this needs to be done only once).

Pigeon uses sensors that must detect faults quickly and confirm critical faults; the latter requirement ensures that Pigeon does not incorrectly report stops. The architecture accommodates pluggable sensors, and the paper's prototype includes four types: a process sensor and an embedded sensor at end-hosts, and a router sensor and an OSPF sensor in routers.

There is also a component called the interpreter, which gathers information about faults and outputs the failure conditions. The interpreter must (1) determine which sensors correspond to the client-specified target process, (2) determine if a condition is implied by a fault, (3) estimate the condition's duration, (4) report the condition to the application via the client library, and (5) never falsely report a stop condition.
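The five responsibilities listed above can be sketched in a few lines. This is a toy model under stated assumptions, not the paper's implementation: the `Fault` and `Report` types, the fault kinds, the rule table, and the duration values are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Fault:
    sensor_id: str
    kind: str        # hypothetical labels, e.g. "process_exit", "link_down"
    confirmed: bool  # True only if the sensor confirmed the fault


@dataclass
class Report:
    target: str
    condition: str
    expected_duration_s: Optional[float]  # None means no estimate available


class Interpreter:
    # Hypothetical rule table: fault kind -> (condition, duration estimate).
    RULES: Dict[str, Tuple[str, Optional[float]]] = {
        "process_exit": ("stop", math.inf),     # permanent
        "link_down": ("unreachability", 30.0),  # made-up estimate
    }

    def __init__(self, sensors_by_target: Dict[str, List[str]],
                 report: Callable[[Report], None]) -> None:
        self.sensors_by_target = sensors_by_target
        self.report = report  # stands in for the client library callback

    def on_fault(self, target: str, fault: Fault) -> None:
        # (1) Ignore faults from sensors that do not monitor this target.
        if fault.sensor_id not in self.sensors_by_target.get(target, []):
            return
        # (2) Check whether the fault implies any condition at all.
        rule = self.RULES.get(fault.kind)
        if rule is None:
            return
        # (3) Take the duration estimate from the rule table.
        condition, duration = rule
        # (5) Never report a stop that was not confirmed; weaken it instead.
        if condition == "stop" and not fault.confirmed:
            condition, duration = "unreachability", None
        # (4) Deliver the condition to the application.
        self.report(Report(target, condition, duration))
```

In this sketch an unconfirmed critical fault is downgraded to the weaker condition rather than dropped, so the application still learns that something is wrong without being told the target has permanently stopped.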

Pigeon is not well suited to applications like DNS, where client recovery is lightweight: because the cost of inaccuracy is low, Pigeon offers little benefit over short end-to-end timeouts. Some applications do not make use of any information about failures; such applications likewise do not gain from Pigeon. For example, NFS (on Linux) has a hard-mount mode, in which the NFS client blocks until it can communicate with its NFS server; this NFS client does not expose failures or act on them.

Wednesday, 25 November 2015
