
doi: 10.1109/icac.2016.52
The Apache Hadoop framework (currently known as YARN) is a widely used open-source implementation of MapReduce (MR). Manual tuning of Hadoop performance is hard and time-consuming, so several self-tuning approaches have been proposed. This paper proposes an approach that avoids the problems of previous self-tuning approaches based on performance models or resource usage, namely 1) the need for a time-consuming training phase, typically offline, 2) unsuitability for Hadoop environments with concurrently running MR jobs, and 3) the need to modify the Hadoop framework itself. The proposed approach uses a fuzzy-prediction controller for self-optimization of the number of concurrent MR jobs. The fuzzy-prediction controller learns from past and current resource usage of MR jobs and from the number of concurrent tasks. It both uses and constructs rules in real time to predict the resource usage and the number of concurrent tasks. It does not require offline training or any modification of either the MR jobs or the Hadoop framework. The predicted values are used to dynamically control the number of concurrent ApplicationMasters (AMs) (i.e., MR jobs in RUNNING state). Experimental evaluation of the proposed approach on a 7-node cluster (1 master node and 6 slave nodes) running 30-job sequences combining three different types of MR jobs (Terasort, Grep and Wordcount) showed up to 29% performance improvement over Hadoop default configurations. The new approach improves the aggregate performance of MR jobs by adjusting a single YARN parameter.
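
The control idea in the abstract (predict resource usage, then raise or lower the number of concurrent AMs) can be illustrated with a minimal sketch. This is not the paper's controller: the rule base here is fixed rather than constructed in real time, the prediction is a plain linear extrapolation, and all names (`tri`, `predict_next`, `next_am_limit`), membership breakpoints, and limits are hypothetical choices made only for illustration. The abstract does not name the YARN parameter it adjusts, so the sketch only computes the next AM limit and does not touch any Hadoop configuration.

```python
def tri(x, a, b, c):
    """Triangular membership function: 0 outside (a, c), 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def predict_next(prev_util, curr_util):
    """Assumed predictor: linear extrapolation of utilization, clamped to [0, 1]."""
    return min(1.0, max(0.0, curr_util + (curr_util - prev_util)))


def next_am_limit(current_limit, prev_util, curr_util,
                  min_limit=1, max_limit=10, max_step=2):
    """Return the next limit on concurrent AMs from two utilization samples.

    Fixed, hypothetical rules:
      predicted utilization LOW    -> allow more concurrent AMs (+1)
      predicted utilization MEDIUM -> keep the limit (0)
      predicted utilization HIGH   -> allow fewer concurrent AMs (-1)
    """
    predicted = predict_next(prev_util, curr_util)

    # Fuzzify the predicted utilization (breakpoints are illustrative).
    low = tri(predicted, -0.5, 0.0, 0.6)
    medium = tri(predicted, 0.2, 0.5, 0.8)
    high = tri(predicted, 0.4, 1.0, 1.5)

    # Defuzzify with a weighted average of the rule outputs (+1, 0, -1).
    total = low + medium + high
    delta = 0.0 if total == 0 else (low - high) / total

    # Scale to a step in whole AMs and keep the result within bounds.
    proposed = current_limit + round(delta * max_step)
    return max(min_limit, min(max_limit, proposed))


if __name__ == "__main__":
    limit = 4
    # Hypothetical utilization samples in [0, 1], e.g. cluster memory usage.
    samples = [0.10, 0.10, 0.50, 0.50, 0.80, 0.90]
    for prev, curr in zip(samples, samples[1:]):
        limit = next_am_limit(limit, prev, curr)
        print(f"util {prev:.2f} -> {curr:.2f}: AM limit = {limit}")
```

In a real deployment the returned limit would have to be mapped onto whatever scheduler setting bounds the number of running applications; that mapping is outside the scope of this sketch.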
