Monday 20 July 2015

Revisiting Memory Errors in Large-Scale Production Data Centers

Several works exist on memory errors in large-scale production data centers. This paper continues a previous study; it covers 14 months of data collected from Facebook's servers. With this study, the authors:

  1. analyze new DRAM failure trends in modern devices and workloads;
  2. develop a model for examining the memory failure rates of systems with different characteristics;
  3. describe and perform the first analysis of a large-scale implementation of a software technique proposed in prior work to reduce DRAM error rate.

During the analysis of the logs, they have noticed that:

  1. The number of memory errors per machine follows a Pareto distribution with a decreasing hazard rate: errors tend to recur on the servers that have already experienced them (see the sketch after this list).
  2. Non-DRAM memory failures, such as those in the memory controller and the memory channel, are the source of the majority of errors. These memory errors can cause a system to crash.
  3. DRAM failure rates increase with newer cell fabrication technologies: the denser the chips, the higher the failure rate. For instance, 4 Gb chips have a 1.8x higher failure rate than 2 Gb chips.
  4. The number of chips on a DIMM and their transfer width affect the error rate.
  5. The failure rate varies with the workload.
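
To make point 1 concrete, here is a minimal sketch in Python (my own illustration, not code from the paper) of why a Pareto distribution implies a decreasing hazard rate: for a Pareto variable the hazard is alpha/x, so the more errors a server has already accumulated, the less likely its error stream is to stop. The shape parameter below is hypothetical, not fitted to the Facebook logs.

```python
from scipy.stats import pareto

alpha = 1.3                    # hypothetical shape parameter, not fitted to real logs
errors_per_server = pareto(alpha)

for n in (1, 10, 100, 1000):
    # hazard rate h(x) = pdf(x) / survival(x); for a Pareto this equals alpha / x
    hazard = errors_per_server.pdf(n) / errors_per_server.sf(n)
    print(f"server with {n:>4} errors so far -> hazard rate {hazard:.4f}")
```

The printed hazard rate shrinks as the error count grows, which matches the observation that errors keep recurring on the same machines.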

They classify memory errors as correctable or uncorrectable. Correctable errors (CE) can be fixed by the ECC mechanism; uncorrectable errors (UCE) cannot. CEs happen more frequently than UCEs and can degrade system performance, while a UCE can crash the system. They have also noticed that the more errors a server has had, the higher the probability that it will get another one, following the Pareto distribution. According to the Schroeder conjecture, if a server has suffered lots of CEs, it may be better to replace the DIMM than to wait for the first UCE; this reduces the likelihood of uncorrectable errors.
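
Here is a minimal sketch of the proactive-replacement policy the Schroeder conjecture suggests; the counter and threshold are my own hypothetical illustration, not Facebook's tooling.

```python
from collections import Counter

CE_REPLACEMENT_THRESHOLD = 1000      # hypothetical threshold, not taken from the paper

correctable_errors = Counter()       # DIMM id -> number of ECC-corrected errors seen

def record_correctable_error(dimm_id: str) -> bool:
    """Log one corrected error and report whether the DIMM should be proactively replaced."""
    correctable_errors[dimm_id] += 1
    return correctable_errors[dimm_id] >= CE_REPLACEMENT_THRESHOLD

# Example: the 1000th corrected error on the same DIMM flags it for replacement,
# before any uncorrectable error has a chance to crash the machine.
for _ in range(CE_REPLACEMENT_THRESHOLD):
    flagged = record_correctable_error("node42/dimm3")
print("replace node42/dimm3:", flagged)
```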

Most of the CEs that occur at the socket and channel level affect a large amount of memory, but they have noticed that these errors happened in only a small fraction of servers. It is spurious errors that affect most of the servers, so more effective techniques are needed to detect and reduce the reliability impact of weak cells in memory.

They have tested the servers with different types of workload, splitting the resource requirements that build a workload into Processor, Memory, and Storage, and rating each requirement as Low, Medium, or High. E.g., the Hadoop workload uses high processor and storage, and medium memory.
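
This classification can be pictured as a small lookup table; in the sketch below only the Hadoop profile comes from the paper, while the "web" entry is a made-up placeholder to show the shape of the table.

```python
from enum import Enum

class Level(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# workload -> (processor, memory, storage) requirement levels
WORKLOAD_PROFILES = {
    "hadoop": (Level.HIGH, Level.MEDIUM, Level.HIGH),  # example given in the paper
    "web":    (Level.HIGH, Level.LOW, Level.LOW),      # hypothetical placeholder
}

processor, memory, storage = WORKLOAD_PROFILES["hadoop"]
print(f"Hadoop: processor={processor.name}, memory={memory.name}, storage={storage.name}")
```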

In the end, they have developed a model for memory failures in which they claim that using lower-density DIMMs and fewer processors can reduce failure rates by up to 57.7%.
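
The paper's actual model and its fitted coefficients are in the publication; the sketch below only shows, with made-up weights, how a logistic-regression-style model can compare a dense, multi-processor configuration against a lower-spec one and report a relative reduction in failure probability.

```python
import math

# hypothetical log-odds weights per server characteristic (not fitted to real data)
WEIGHTS = {
    "chip_density": 0.45,   # denser DRAM chips  -> higher failure rate
    "processors":   0.30,   # more processors    -> higher failure rate
}
INTERCEPT = -4.0            # hypothetical baseline term

def failure_probability(config: dict) -> float:
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum of weight * value)))."""
    score = INTERCEPT + sum(WEIGHTS[name] * value for name, value in config.items())
    return 1.0 / (1.0 + math.exp(-score))

baseline = failure_probability({"chip_density": 4, "processors": 2})
low_spec = failure_probability({"chip_density": 2, "processors": 1})
print(f"relative failure-rate reduction: {100 * (1 - low_spec / baseline):.1f}%")
```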

Based on this paper, we can conclude that these failure rates will also affect the Hadoop framework, but there is one question related to Hadoop they do not answer, and maybe it is out of the scope of this paper: will 1,000 commodity machines running the Hadoop framework have the same failure rate as the framework running on 1,000 big servers? Maybe this is a question that opens the door to an interesting study.

Notes:

DRAM: Dynamic random-access memory (DRAM) is a type of random-access memory that stores each bit of data in a separate capacitor within an integrated circuit. Only one transistor and one capacitor are required per bit, compared to the four or six transistors per bit in SRAM. This allows DRAM to reach much higher capacities.