Tuesday, 26 May 2015

Taming uncertainty in distributed systems with help from the network

In this paper, the authors present Albatross, a service that quickly reports to applications the current status of a remote process: whether it is working and reachable, or not. If Albatross reports a process as "disconnected", it is safe to assume that the process cannot affect the world. Albatross is targeted at data centers equipped with software-defined networks (SDNs). Using SDN functionality, Albatross receives notifications about the state of the network, determines which processes are reachable, and enforces its determinations by installing drop rules on switches.

The service disconnects these processes permanently; they are reconnected only after they have rolled back their state to a checkpoint that causally precedes Albatross's disconnected report. By rolling back their state, excluded processes accept their effective deaths and can be safely reintegrated using standard catch-up techniques (e.g., replay).
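The disconnect, roll back, reconnect cycle described above can be sketched as a toy in-memory model. This is not Albatross's implementation or API: the class `ToyMembership`, its method names, and the use of a single logical clock as a stand-in for "causally precedes" are all assumptions made for illustration, and no real switch or SDN controller is involved.

```python
class ToyMembership:
    """In-memory stand-in for the service; drop rules are just dictionary entries."""

    def __init__(self):
        self.clock = 0        # logical time, advanced on every new report (assumption)
        self.drop_rules = {}  # process id -> logical time of its disconnected report

    def report_unreachable(self, pid):
        """Declare pid disconnected and 'install' a drop rule for its traffic."""
        self.clock += 1
        self.drop_rules.setdefault(pid, self.clock)

    def status(self, pid):
        return "disconnected" if pid in self.drop_rules else "connected"

    def can_send(self, pid):
        """A disconnected process cannot affect the world: its traffic is dropped."""
        return pid not in self.drop_rules

    def reconnect(self, pid, checkpoint_time):
        """Lift the drop rule only if the process rolled back to a checkpoint
        taken strictly before its disconnected report."""
        reported_at = self.drop_rules.get(pid)
        if reported_at is None:
            return True  # never disconnected, nothing to do
        if checkpoint_time < reported_at:
            del self.drop_rules[pid]
            return True
        return False


service = ToyMembership()
service.report_unreachable("p1")
print(service.status("p1"))        # disconnected
print(service.reconnect("p1", 0))  # checkpoint precedes the report
print(service.status("p1"))
```

In this sketch a process that tries to rejoin with state newer than the disconnected report is refused, which mirrors the requirement that excluded processes accept their effective deaths before being reintegrated.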

This service can be useful in different situations. The authors state that i) Albatross detects network failures an order of magnitude more quickly than the ZooKeeper membership service; ii) integrating RAMCloud with Albatross prevents clients from communicating with servers that have been declared failed, which eliminates a consistency bug in RAMCloud; iii) they have also implemented Aab, a variant of the Zab protocol that uses Albatross. Aab has a smaller description, fewer phases, fewer round-trips, fewer message types, and fewer counters for ordering messages. Moreover, it tolerates the failure of all but one process. By contrast, Zab tolerates the failure of fewer than half of the processes.
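To make the fault-tolerance comparison concrete, the two bounds stated above can be written as simple arithmetic. The function names are hypothetical; the formulas only restate "fewer than half" and "all but one" for a group of n processes.

```python
def zab_tolerated_failures(n):
    """Largest f with f < n / 2, i.e. fewer than half of the n processes."""
    return (n - 1) // 2


def aab_tolerated_failures(n):
    """All but one of the n processes may fail."""
    return n - 1


for n in (3, 4, 5):
    print(n, zab_tolerated_failures(n), aab_tolerated_failures(n))
```

For example, with five processes the first bound allows two failures while the second allows four.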

In conclusion, Albatross detects network and process problems faster than ZooKeeper and Falcon. It also handles failures at end-hosts, making it a complete solution for failure detection in a data center environment, and it can arguably be considered the first service to apply modern networking techniques to refine the guarantees of distributed systems.

Wednesday, 13 May 2015

C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection

In this article, presented at NSDI'15, Suresh et al. present C3, an adaptive replica selection mechanism that works with Cassandra and is robust to performance variability in the environment.

Systems that respond to user actions very quickly (within 100 milliseconds) feel more fluid and natural to users than those that take longer. Improvements in Internet connectivity and the rise of warehouse-scale computing systems have enabled Web services that provide fluid responsiveness while consulting multi-terabyte datasets that span thousands of servers.

Large online services that need to create a predictably responsive whole out of less predictable parts are called latency tail-tolerant, or tail-tolerant for brevity.

It is challenging to deliver consistently low latency for applications on the Internet. Some web applications retrieve data from databases and still require low, predictable latencies. Others rely on systems such as Hadoop for distributed computing over large datasets. If a web application processes client requests using Hadoop, you don't want to keep the client waiting indefinitely for the result because all jobs were submitted to the same runtime and created a bottleneck. You need a way to deliver a fast response.

A recurring pattern for reducing tail latency is to exploit the redundancy built into each tier of the application architecture, wherein a client node has to select one of multiple replica servers to serve a request.

The replica selection strategy has a direct effect on the tail of the latency distribution. This is particularly so in the context of data stores that rely on replication and partitioning for scalability, such as key-value stores. Replica selection can compensate for performance variability across servers by preferring faster replica servers whenever possible.

C3 is an adaptive replica selection mechanism that is robust in the face of fluctuations in system performance. It combines in-band feedback from servers, used to rank and prefer faster replicas, with distributed rate control and backpressure in order to reduce tail latencies in the presence of service-time fluctuations.
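The ranking half of this idea can be sketched in a few lines. This is a simplified illustration, not the scoring function from the paper: the `ReplicaStats` fields, the smoothing factor `ALPHA`, and the linear score are assumptions, and rate control and backpressure are left out entirely.

```python
from dataclasses import dataclass

ALPHA = 0.5  # hypothetical smoothing factor for the moving averages


@dataclass
class ReplicaStats:
    """Client-side view of one replica, fed by values piggybacked on responses."""
    queue_size: float = 0.0    # smoothed server-reported queue length
    service_time: float = 1.0  # smoothed server-reported service time (ms)
    outstanding: int = 0       # requests this client sent that are still unanswered


def record_feedback(stats, queue_size, service_time):
    """Fold one response's in-band feedback into the moving averages."""
    stats.queue_size = ALPHA * queue_size + (1 - ALPHA) * stats.queue_size
    stats.service_time = ALPHA * service_time + (1 - ALPHA) * stats.service_time


def score(stats):
    """Rough expected wait: requests ahead of ours (plus ours) times service time."""
    return (stats.queue_size + stats.outstanding + 1) * stats.service_time


def pick_replica(replicas):
    """Prefer the replica with the lowest estimated wait."""
    return min(replicas, key=lambda name: score(replicas[name]))


replicas = {"a": ReplicaStats(), "b": ReplicaStats()}
record_feedback(replicas["a"], queue_size=10, service_time=2.0)
print(pick_replica(replicas))  # "b": replica "a" just reported a long queue
```

Counting the client's own outstanding requests keeps it from piling onto a replica that looked fast a moment ago, which is one reason feedback alone is not enough when many clients make the same choice at once.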

Through comprehensive performance evaluations, the authors demonstrate that C3 improves Cassandra’s mean, median, and tail latencies.



Monday, 4 May 2015

List of the most used algorithms

Our world is full of algorithms that we find everywhere, but some algorithms are used more than others. You can check the list here.