The authors claim that speculative execution can add extra costs in the cloud (I am talking about money) and can even degrade performance, because in the end all the speculative tasks compete for the same scarce resources.
This new algorithm, ICMR, aims to make all map workers finish at the same time, so that all reducers can start at the same time. The reasoning is that the reduce phase cannot begin until every mapper has finished, so a single slow mapper can stall the start of the whole reduce phase.
In Figure 1, Node 1 and Node 2 represent a node of the private cluster and a compute node from the cloud, respectively. On Node 1, map processing time is longer than shuffling time; on Node 2, shuffling time is longer than map processing time because of the slow wide-area network.
The total running time of node $i$, $(T_i)$, is composed of the total map processing time $(M_i)$, the total shuffling time $(S_i)$, and the overlapped time between $M_i$ and $S_i$. The total map processing time $(M_i)$ is as follows:
- $M_i = \frac{W_i}{v_i}$, where $W_i$ is the input data assigned to node $i$ and $v_i = \min(fs_i, p_i)$ is its data processing speed. (1)
The total shuffling time $(S_i)$ is as follows:
- $S_i = \frac{I_i}{t_i / N_i}$, where $I_i$ is the total intermediate data and $t_i / N_i$ is the data transfer rate per mapper. (2)
From equations (1) and (2), they get the total running time $(T_i)$ of node $i$; a short code sketch of this model follows the equations:
- $T_i = M_i + S_i - \text{overlapped time}$
- $T_i \approx \max(M_i, S_i)$
- $T = \max(T_1, \ldots, T_n)$
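To make the model concrete, here is a minimal sketch of these equations in Python (the function and variable names are mine, not the paper's):

```python
# Minimal sketch of the ICMR cost model above; names are illustrative.

def map_time(W_i, v_i):
    """M_i = W_i / v_i: input data over data processing speed (eq. 1)."""
    return W_i / v_i

def shuffle_time(I_i, t_i, N_i):
    """S_i = I_i / (t_i / N_i): intermediate data over the per-mapper transfer rate (eq. 2)."""
    return I_i / (t_i / N_i)

def node_time(W_i, v_i, I_i, t_i, N_i):
    """T_i ~ max(M_i, S_i): map and shuffle overlap, so the longer phase dominates."""
    return max(map_time(W_i, v_i), shuffle_time(I_i, t_i, N_i))

def job_time(nodes):
    """T = max(T_1, ..., T_n): the slowest node bounds the whole job."""
    return max(node_time(**n) for n in nodes)

# Example: node 1 is map-bound, node 2 is shuffle-bound behind a slow WAN link.
print(job_time([
    {"W_i": 500, "v_i": 10, "I_i": 250, "t_i": 40, "N_i": 4},   # T_1 = max(50, 25)  = 50
    {"W_i": 500, "v_i": 10, "I_i": 250, "t_i": 4,  "N_i": 4},   # T_2 = max(50, 250) = 250
]))  # -> 250
```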
They assume the amount of intermediate data $(I_i)$ is proportional to the amount of input data $(W_i)$.
- $I_i=\alpha \times W_i$
The total amount of input data $(W_{total})$ is the sum of the partial inputs that are split across the mappers. This means that ICMR calculates the amount of data each mapper will process.
If all mappers finish their shuffle phases at the same time, we get the following relation (a small derivation sketch in code follows):
- $\max\left(\frac{W_1}{v_1}, \frac{\alpha W_1}{t_1/N_1}\right) = \cdots = \max\left(\frac{W_n}{v_n}, \frac{\alpha W_n}{t_n/N_n}\right)$, with $\sum_i W_i = W_{total}$
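Because both terms are linear in $W_i$, each node's time can be written as $T_i = W_i \cdot r_i$ with $r_i = \max(1/v_i, \alpha N_i/t_i)$, and equalizing the $T_i$ under $\sum_i W_i = W_{total}$ makes each share proportional to $1/r_i$. Here is a minimal sketch of that split (my own derivation from the relation above, not code from the paper):

```python
# Work split implied by the relation above: W_i proportional to 1 / r_i,
# where r_i = max(1/v_i, alpha * N_i / t_i) is node i's time per unit of input.

def split_input(W_total, alpha, nodes):
    """nodes: list of dicts with data processing speed 'v', transfer rate 't'
    and number of mappers 'N'. Returns the input share W_i for each node."""
    rates = [max(1.0 / n["v"], alpha * n["N"] / n["t"]) for n in nodes]
    weights = [1.0 / r for r in rates]
    return [W_total * w / sum(weights) for w in weights]

# Example: a map-bound private node and a shuffle-bound cloud node.
print(split_input(
    W_total=1000.0, alpha=0.5,
    nodes=[{"v": 10.0, "t": 40.0, "N": 4},     # node 1: map-bound
           {"v": 10.0, "t": 4.0,  "N": 4}]))   # node 2: shuffle-bound, gets a smaller share
# -> [833.33..., 166.66...]
```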
ICMR dynamically adjusts $W_i$ in proportion to the performance of each compute node. At first, ICMR assigns an equal amount of work to all compute nodes, because it has no measured values for the essential metrics $\alpha$, $v_i$ and $t_i$. After several tasks have finished, the central job tracker of Hadoop gathers initial values for these metrics, and ICMR can calculate $W_i$ for all compute nodes. While a job is running, the performance of the network and the CPUs may vary; therefore the metrics $\alpha$, $v_i$ and $t_i$ are measured and updated periodically.
The values of $\alpha$, $v_i$ and $t_i$ are calculated per job.
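A hypothetical sketch of this feedback loop, reusing `split_input` from the previous snippet (the callbacks, the number of rounds and the update interval are placeholders, not details from the paper):

```python
import time

def icmr_schedule(W_total, num_nodes, measure_metrics, assign_work,
                  rounds=10, interval_s=30):
    """measure_metrics() -> (alpha, nodes) with the freshly measured per-job
    metrics alpha, v_i, t_i, N_i; assign_work(shares) pushes the new split to
    the cluster. Both callbacks are hypothetical stand-ins for the job tracker."""
    # Bootstrap: no metrics yet, so every node gets an equal share.
    shares = [W_total / num_nodes] * num_nodes
    assign_work(shares)
    for _ in range(rounds):
        time.sleep(interval_s)
        alpha, nodes = measure_metrics()             # updated alpha, v_i, t_i, N_i
        shares = split_input(W_total, alpha, nodes)  # recompute each node's W_i
        assign_work(shares)                          # rebalance in proportion to performance
```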
They tested the scheduler on two clouds on Amazon EC2, using four instances in US West and four in US East, for a total of eight compute nodes. They claim that the ICMR scheduler reduced total map and shuffling time by about 19%, because all nodes ended up with almost the same performance; the remaining difference was due to the limited network bandwidth between the two clouds.
Moreover, because the node processing speed $v_i$ in the ICMR scheduling model measures not the CPU clock rate but the actual data processing speed, the total map processing time was also reduced.