Tuesday, 1 September 2015

Improving Hadoop Performance in Intercloud Environments

In this short paper published in SIGMETRICS'11, the authors propose a new scheduler that improves performance in intercloud environments without relying on speculative execution. They set up a single MapReduce runtime spanning two clouds, with the input data stored in only one of them (the private cloud). Mappers are launched in both clouds, which means they may read data locally or remotely.
The authors claim that speculative execution can add extra cost to the cloud (I am talking about money), and that it can degrade performance because, near the end of a job, all tasks start competing for scarce resources.
This new algorithm, ICMR, aims to make all map workers finish at the same time, so that all reducers can start at the same time. If all map workers finish simultaneously, no single slow mapper can stall the beginning of the reduce phase.
In Figure 1, Node 1 and Node 2 represent the private cluster and a compute node from the cloud, respectively. In Node 1, map processing time is longer than shuffling time; in Node 2, shuffling time is longer than map processing time due to the slow wide-area network.
The total running time of node $i$, $(T_i)$, is composed of the total map processing time $(M_i)$, the total shuffling time $(S_i)$ and the overlapped time between $M_i$ and $S_i$. The total map processing time $(M_i)$ is as follows:
  (1) $M_i = \frac{W_i}{v_i}, \quad v_i = \min(fs_i, p_i)$
where $W_i$ is the amount of input data for node $i$, and $v_i$ is the processing speed of node $i$ (not the CPU clock). The processing speed $v_i$ is determined by two performance metrics: the data reading rate from storage $(fs_i)$ and the computing power of node $i$ $(p_i)$. In the cloud's compute nodes, whose mappers need to read data remotely, $fs_i$ (data reading rate) has a lower value than $p_i$ (computing power).
The total shuffling time $(S_i)$ is as follows:
  (2) $S_i = \frac{I_i}{t_i/N_i}$
where $I_i$ is the total amount of intermediate data produced at node $i$, $t_i$ is the data transfer rate to reduce workers, and $N_i$ is the number of map workers in node $i$. The term $t_i/N_i$ is therefore the data transfer rate available to each mapper in that node.
From equations (1) and (2), they get the total running time of node $i$ $(T_i)$:
  (3) $T_i = M_i + S_i - \text{overlapped time}$
If the total running time is long enough, the overlapped time is almost the same as the smaller of $M_i$ and $S_i$, depending on the execution environment. Thus, $T_i$ can be approximated as follows:
  (4) $T_i \approx \max(M_i, S_i)$
Finally, we get the total running time for the whole job $(T)$ as:
  (5) $T = \max(T_1, \dots, T_n)$
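As a quick sanity check, the timing model of equations (1) through (5) can be sketched in a few lines of Python. The node parameters below are made-up illustrative numbers, not measurements from the paper:

```python
# Sketch of the paper's timing model, equations (1)-(5).
# All node parameters are illustrative, not taken from the paper.

def node_running_time(W, I, fs, p, t, N):
    """Approximate running time of one node: T_i ~= max(M_i, S_i)."""
    v = min(fs, p)   # (1) processing speed: storage read rate vs. CPU power
    M = W / v        # (1) total map processing time
    S = I / (t / N)  # (2) total shuffling time, t/N is the per-mapper rate
    return max(M, S) # (4) map and shuffle phases overlap

# Node 1: private cluster -- local reads are fast, shuffle is cheap.
T1 = node_running_time(W=100.0, I=50.0, fs=50.0, p=40.0, t=80.0, N=4)
# Node 2: remote cloud node -- reads input and shuffles over the slow WAN.
T2 = node_running_time(W=100.0, I=50.0, fs=10.0, p=40.0, t=20.0, N=4)

T = max(T1, T2)      # (5) the job finishes when the slowest node does
# Here T2 = 10.0 dominates T1 = 2.5, so T = 10.0.
```

With these numbers the remote node is four times slower than the private one and alone determines the job's running time, which is exactly the imbalance ICMR tries to remove.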
The objective of the ICMR is to finish the shuffle phases of all map tasks at the same time. By doing this, idle resources are reduced, and the reduce phase can start as soon as possible.
They assume the amount of intermediate data $(I_i)$ is proportional to the amount of input data $(W_i)$.
  (6) $I_i = \alpha \times W_i$
where $\alpha$ is measured and updated whenever a map task completes. In other words, $\alpha$ is the ratio between the intermediate data generated and the input data consumed.
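A minimal sketch of how such a running estimate of $\alpha$ could be kept, assuming the job tracker reports per-task byte counts (the class and method names here are hypothetical, not the paper's or Hadoop's actual API):

```python
# Hypothetical bookkeeping for alpha (equation (6)): the ratio of
# intermediate bytes produced to input bytes consumed so far.

class AlphaEstimator:
    def __init__(self):
        self.input_bytes = 0
        self.intermediate_bytes = 0

    def on_map_task_done(self, task_input, task_intermediate):
        # Called whenever one map task completes, as the paper describes.
        self.input_bytes += task_input
        self.intermediate_bytes += task_intermediate

    @property
    def alpha(self):
        # I = alpha * W  =>  alpha = I / W
        return self.intermediate_bytes / self.input_bytes

est = AlphaEstimator()
est.on_map_task_done(task_input=64, task_intermediate=32)
est.on_map_task_done(task_input=64, task_intermediate=48)
# alpha = (32 + 48) / (64 + 64) = 0.625
```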
The total amount of input data $(W_{total})$ is the sum of the partial inputs split across the mappers. This means that ICMR calculates the amount of data each mapper will process.
If all mappers finish their shuffle phases at the same time, we get the following relation:
  (7) $\max\left(\frac{W_1}{v_1}, \frac{\alpha W_1}{t_1/N_1}\right) = \dots = \max\left(\frac{W_n}{v_n}, \frac{\alpha W_n}{t_n/N_n}\right)$
Equation (7) just says that each node's finish time, which is the larger of its map processing time and its shuffling time, must be the same across all nodes.
ICMR dynamically adjusts $W_i$ in proportion to the performance of each compute node. At first, ICMR assigns an equal amount of work to all compute nodes, because it has no measured values for the essential metrics $\alpha$, $v_i$ and $t_i$. After several tasks have finished, the central job tracker of Hadoop gathers initial values for the metrics, and ICMR can calculate $W_i$ for all compute nodes. While a job is running, the performance of the network and CPUs may vary; therefore the performance metrics $\alpha$, $v_i$ and $t_i$ are measured and updated periodically.
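Note that both the map time $W_i/v_i$ and the shuffle time $\alpha W_i/(t_i/N_i)$ are linear in $W_i$, so node $i$'s finish time is $W_i \cdot c_i$ with $c_i = \max(1/v_i,\ \alpha N_i/t_i)$. Equalizing finish times under the constraint $\sum_i W_i = W_{total}$ then gives $W_i \propto 1/c_i$. The following sketch is my own reconstruction of that allocation step (not the paper's code), using made-up node metrics:

```python
# Reconstruction of ICMR's input split adjustment (a sketch, not the
# paper's implementation). Each node's finish time is linear in its share:
#   max(W/v, alpha*W/(t/N)) = W * max(1/v, alpha*N/t) = W * c
# so equal finish times with sum(W_i) = W_total means W_i proportional to 1/c_i.

def assign_input(W_total, nodes, alpha):
    """nodes: list of dicts with measured v (processing speed),
    t (transfer rate to reducers) and N (number of map workers)."""
    costs = [max(1.0 / n["v"], alpha * n["N"] / n["t"]) for n in nodes]
    total_speed = sum(1.0 / c for c in costs)
    return [W_total * (1.0 / c) / total_speed for c in costs]

# Fast private node vs. slow remote node (illustrative numbers):
nodes = [{"v": 40.0, "t": 80.0, "N": 4},
         {"v": 10.0, "t": 20.0, "N": 4}]
shares = assign_input(200.0, nodes, alpha=0.5)
# The faster node receives four times the input of the slower one.
```

Recomputing the shares whenever $\alpha$, $v_i$ or $t_i$ is re-measured gives the periodic adjustment the paragraph above describes.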
The values of $\alpha$, $v_i$ and $t_i$ are calculated per job.
They tested the scheduler on two clouds from Amazon EC2, using four instances each from the U.S. west and east regions, for a total of eight compute nodes. They claim that the ICMR scheduler reduced total map and shuffling time by about 19%, because under ICMR all nodes finished at almost the same time. The remaining difference was due to the limited network bandwidth between the two clouds.
Moreover, because the processing speed $v_i$ in the ICMR scheduling model measures data processing speed rather than CPU clock rate, the total map processing time was also decreased.