Wednesday, 13 July 2016

Speculative Execution in MapReduce

MapReduce is a widely used parallel computing framework for large scale data processing. The two major performance metrics in MapReduce are job execution time and cluster throughput.

A job can be seriously impacted by straggler machines — machines on which tasks take an unusually long time to finish. Speculative execution is a common approach for dealing with the straggler problem by simply backing up those slow running tasks on alternative machines.

Hadoop is a widely used open-source implementation of MapReduce. The original speculative execution strategy used in Hadoop-0.20 (we call it Hadoop-Original) simply identifies a task as a straggler when the task’s progress falls behind the average progress of all tasks by a fixed gap.

Multiple speculative execution strategies have been proposed, but they have some pitfalls:
  1. Use average progress rate to identify slow tasks while in reality the progress rate can be unstable and misleading, 
  2. Cannot appropriately handle the situation when there exists data skew among the tasks, 
  3. Do not consider whether backup tasks can finish earlier when choosing backup worker nodes.
 We can categorize the causes for stragglers into internal and external reasons. Internal reasons can be solved by the MapReduce service, while external reasons cannot.

The internal reasons for a straggler are the heterogeneous resources capacity of worker nodes, and the existence of resource competition from other tasks running on the same worker node.

But MapReduce cannot deal with resoure competition due to co-hosted applications, input data skew, faulty hardware, and remote input and output source being too slow.

Hadooop's speculative execution is driven by two factors: SPECULATIVE_LAG and SPECULATIVE_GAP.

Hadoop decides to launch stragglers after a running for a time period greater that SPECULATIVE_LAG and when the progress score is less than the average score minus SPECULATIVE_GAP.

The default value of SPECULATIVE_LAG is 60 seconds and SPECULATIVE_GAP is 0.2.

No comments:

Post a Comment