Computer Science Talks: Heading off correlated failures through Independence-as-a-Service

Cloud services normally require high reliability and rely on redundancy techniques to achieve it. Contrary to expectations, however, seemingly independent infrastructure components may share deep, hidden dependencies. Failures in these shared dependencies may lead to unexpected correlated failures, undermining redundancy efforts. For instance, Amazon S3 replicates each data object across multiple racks in an S3 region, yet a failure in a main switch can still compromise access to the infrastructure.

Discovering unexpected common dependencies is extremely challenging, and many of them are diagnosed only after a failure has occurred. These retroactive approaches require human intervention, leading to prolonged failure recovery time.

Worse, correlated failures can be hidden not just by inadequate tools or analysts within one cloud provider, but also by non-transparent business contracts between cloud providers and lower-level services.

In this paper, the authors propose Independence-as-a-Service, or INDaaS, a novel architecture that aims to address the above problems proactively. Rather than localizing and tolerating failures after an outage, INDaaS collects and audits structural dependency data to evaluate the independence of redundant systems before failures occur.

This proactive strategy uses pluggable acquisition modules that collect dependency data and adapt it into a common format, and an auditing agent that employs a similarly pluggable set of auditing modules to quantify the independence of redundant systems and identify common dependencies that may introduce unexpected correlated failures. In the end, an auditing report quantifies the independence of various redundancy deployments, optionally including useful information such as estimates of correlated failure probabilities and ranked lists of potential risk groups.

This is a very thorough paper that details every step of INDaaS. The authors also present the concept of a Risk Group (RG) and a couple of algorithms (the minimal RG algorithm and the failure sampling algorithm) that they use to build and rank RGs for auditing. The service builds its final report from this data.
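To give a feel for the failure sampling idea, here is a minimal sketch, not the paper's actual algorithm. It assumes a toy dependency map (DEPENDS_ON, made up for illustration) of three replica servers behind one shared switch, fails each component at random in every round, and records the failure sets that take the whole service down. The names service_down and sample_risk_groups, the fail_prob value and the seed are all my own assumptions.

```python
import random

# Hypothetical dependency data: each replica server and the components it needs.
DEPENDS_ON = {
    "B": {"B", "switch"},
    "C": {"C", "switch"},
    "D": {"D", "switch"},
}

def service_down(failed, depends_on):
    """The service is down when every replica has at least one failed dependency."""
    return all(deps & failed for deps in depends_on.values())

def sample_risk_groups(depends_on, rounds, fail_prob=0.5, seed=0):
    """Randomly fail components and keep the failure sets that cause an outage."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    components = sorted(set().union(*depends_on.values()))
    found = set()
    for _ in range(rounds):
        # Each component fails independently with probability fail_prob.
        failed = frozenset(c for c in components if rng.random() < fail_prob)
        if service_down(failed, depends_on):
            found.add(failed)
    return found

sampled = sample_risk_groups(DEPENDS_ON, rounds=200)
for group in sorted(sampled, key=len):
    print(sorted(group))
```

Note that the sets found this way are risk groups but not necessarily minimal ones, which is the kind of accuracy traded for speed when sampling instead of enumerating.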

They evaluated the service in three small "but realistic" case studies. They emulated a real data center topology using 4 servers and 4 switches, with the 4 servers running 8 VMs in total. The tests showed that a user has only a 14% probability of placing a service on servers that do not suffer from correlated failures. The INDaaS auditing results also gave them hints about the weakest parts of the topology. They also compared the RG algorithms and concluded that the failure sampling algorithm runs much more efficiently than the minimal RG algorithm while still achieving reasonably high accuracy. The failure sampling algorithm took 96 minutes to find 92% of all the RGs, compared with 1046 minutes for the minimal RG algorithm.

In my opinion, the evaluation section leaves a lot to improve. They did not show how the topology results relate to a real-world scenario. Overall, they explained INDaaS thoroughly but presented a shallow evaluation section.

Risk Group

In redundant systems, a risk group (RG) is a set of components whose simultaneous failure could cause a service outage. Suppose some service A replicates critical state across independent servers B, C and D located in 3 separate racks. The intent of this 3-way redundancy configuration is for every RG to have size 3, i.e., 3 servers must fail simultaneously to cause an outage. Now imagine that these 3 racks share the same switch. It is easy to see that, if the switch becomes unavailable, the 3 racks also become unavailable. In this case, a common dependency introduces an RG of size 1 whose failure could disable the whole service despite the redundancy efforts. Also, correlated failures can be hidden not just by inadequate tools or analysts within one cloud provider, but also by non-transparent business contracts between cloud providers. In one incident, a storm in Dublin took down a local power source and its backup generator, disabling both the Amazon and Microsoft clouds in that region for hours. In this paper, the authors propose a novel architecture called Independence-as-a-Service, or INDaaS, that collects and audits structural dependency data to evaluate the independence of redundant systems before failures occur.
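The B, C, D example with a shared switch can be sketched in a few lines. This is a brute-force illustration under my own assumptions, not the minimal RG algorithm from the paper: the dependency map DEPENDS_ON and the function names are hypothetical, and the service is assumed to be down only when every replica has lost at least one component it depends on.

```python
from itertools import combinations

# Hypothetical dependency data: each replica server and the components it needs.
DEPENDS_ON = {
    "B": {"B", "switch"},
    "C": {"C", "switch"},
    "D": {"D", "switch"},
}

def service_down(failed, depends_on):
    """The service is down when every replica has at least one failed dependency."""
    return all(deps & failed for deps in depends_on.values())

def minimal_risk_groups(depends_on):
    """Enumerate failure sets by increasing size, keeping only the minimal ones."""
    components = sorted(set().union(*depends_on.values()))
    found = []
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            # A superset of a known risk group is not minimal, so skip it.
            if any(rg <= candidate for rg in found):
                continue
            if service_down(candidate, depends_on):
                found.append(candidate)
    return found

risk_groups = minimal_risk_groups(DEPENDS_ON)
for rg in risk_groups:
    print(len(rg), sorted(rg))
```

On this toy input the enumeration finds the intended risk group of size 3 (all three servers) and also the unexpected one of size 1 (the switch alone). Exhaustive enumeration grows exponentially with the number of components, which is why a sampling approach becomes attractive on larger topologies.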

Discovering unexpected common dependencies is challenging. Many diagnostic approaches attempt to tolerate faults only after they have happened. Most of the time, this requires human intervention. Worse, correlated faults can be hidden not just by inadequate tools or analysts, but also by private contracts between cloud providers.

Monday, 8 June 2015
