Naive Bayes Classifier Based Partitioner for MapReduce

Publication: Article, 01 May 2018, United States, English
Publisher: Institute of Electronics, Information and Communications Engineers (IEICE)
Journal: IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, volume E101.A, pages 778-786 (ISSN: 0916-8508, eISSN: 1745-1337)

Authors: CHEN, Lei; LU, Wei; BAO, Ergude; WANG, Liqiang; XING, Weiwei; CAI, Yuanyuan

doi: 10.1587/transfun.e101.a.778

Abstract

MapReduce is an effective framework for processing large datasets in parallel over a cluster. Data locality and data skew on the reduce side are two essential issues in MapReduce. Improving data locality decreases network traffic by moving reduce tasks to the nodes where the reducer input data is located, while data skew leads to load imbalance among reducer nodes. Partitioning is an important feature of MapReduce because it determines the reducer nodes to which map output results are sent. An effective partitioner can therefore improve MapReduce performance by increasing data locality and decreasing data skew on the reduce side. Previous studies that consider both issues fall into two categories: those that prioritize data locality, such as LEEN, and those that prioritize load balance, such as CLP. However, all of these studies ignore the fact that, for different types of jobs, prioritizing data locality or data skew on the reduce side may affect execution time differently. In this paper, we propose a naive Bayes classifier based partitioner, BAPM, which achieves better performance because it automatically chooses the proper algorithm (LEEN or CLP) by means of a naive Bayes classifier that uses job type and bandwidth as classification attributes. Our experiments are performed on a Hadoop cluster, and the results show that BAPM boosts the computing performance of MapReduce. The selection accuracy reaches 95.15%. Further, compared with other popular algorithms, the improvement BAPM achieves under specific bandwidths is up to 31.31%.
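The selection step described in the abstract can be illustrated with a minimal sketch. It assumes a categorical naive Bayes classifier with Laplace smoothing over two attributes, job type and bandwidth level, and it is not the authors' implementation. The attribute values (`shuffle_heavy`, `compute_heavy`, `low`, `high`), the function names `train` and `choose_partitioner`, and every training sample are invented for illustration and do not reflect the paper's data or results.

```python
from collections import Counter, defaultdict

# Hypothetical training samples: ((job_type, bandwidth_level), faster_algorithm).
# These values are made up for illustration only.
TRAINING = [
    (("shuffle_heavy", "low"), "LEEN"),
    (("shuffle_heavy", "low"), "LEEN"),
    (("shuffle_heavy", "high"), "CLP"),
    (("compute_heavy", "low"), "LEEN"),
    (("compute_heavy", "high"), "CLP"),
    (("compute_heavy", "high"), "CLP"),
]


def train(samples):
    """Count class frequencies and per-class attribute-value frequencies."""
    label_counts = Counter(label for _, label in samples)
    # feature_counts[label][attribute_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    # values[attribute_index] -> set of distinct values seen for that attribute
    values = defaultdict(set)
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
            values[i].add(value)
    return label_counts, feature_counts, values


def choose_partitioner(model, features):
    """Return the label with the highest naive Bayes score for `features`."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -1.0
    for label, n in label_counts.items():
        score = n / total  # class prior P(label)
        for i, value in enumerate(features):
            # Laplace-smoothed estimate of P(value | label)
            count = feature_counts[label][i][value]
            score *= (count + 1) / (n + len(values[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAINING)
    print(choose_partitioner(model, ("shuffle_heavy", "low")))
    print(choose_partitioner(model, ("compute_heavy", "high")))
```

In a real deployment the chosen label would decide which partitioning algorithm runs for the job. The naive Bayes independence assumption keeps the model small enough to train from a handful of profiled runs.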

Country

United States

Related Organizations

Beijing Technology and Business University, China (People's Republic of)
Beijing Jiaotong University, China (People's Republic of)
University of Central Florida, United States

Keywords

bandwidth, job type, data skew, MapReduce, Hadoop, data locality, naive Bayes

Impact by BIP!

selected citations: 2. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact or attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
