Tuesday, 26 May 2015

Taming uncertainty in distributed systems with help from the network

In this paper, the authors present Albatross, a service that quickly reports to applications the current status of a remote process—whether it is working and reachable, or not. If Albatross reports a process as "disconnected", it is safe to assume that process cannot affect the world. Albatross is targeted at data centers equipped with software-defined networks (SDNs). Using SDN functionality, Albatross receives notifications about the state of the network, determine which processes are reachable and enforce their determinations by installing drop rules on switches.

The processes are disconnected permanently by the service, and they are only reconnected after they have rollbacked their state to some checkpoint that causally precedes Albatross's disconnected report. By rolling back their state, excluded processes accept their effective deaths, and can be safely reintegrated using standard catch-up techniques (eg, replay).

This service can be useful in different situations. The authors stated that i) Albatross detects network failures an order of magnitude more quickly than the ZooKeeper membership service; ii) integrating RAMCloud with Albatross prevents clients from communicating with servers that have been declared failed, which eliminates a consistency bug in RAMCloud; iii) they have also implemented Zab protocol that uses Albatross - Aab. Aab has a smaller description, fewer phases, fewer round-trips, fewer message types, and fewer counters for ordering messages. Moreover, it tolerates the failure of all but one process. By contrast, Zab tolerates the failure of fewer than half of the processes.

In conclusion, Albatross detects network and processes problems faster than Zookeeper and Falcon, it handles failures at end-hosts, making it a complete solution for failure detection in a data center environment, and it can be also theoretically considered as the first service to apply modern networking techniques to refine the guarantees of distributed systems.

No comments:

Post a Comment