CoLoc: Distributed data and container colocation for data-intensive applications

Publication type: Article, Conference object
Date: 01 Dec 2016
Publisher: IEEE
Journal: 2016 IEEE International Conference on Big Data (Big Data)

Authors: Thomas Renner; Lauritz Thamsen; Odej Kao

doi: 10.1109/bigdata.2016.7840954


Abstract

The performance of scalable analytic frameworks supporting data-intensive parallel applications often depends significantly on the time it takes to read input data. Therefore, existing frameworks like Spark and Flink try to achieve a high degree of data locality by scheduling tasks on nodes where the input data resides. However, the set of nodes running a job and its tasks is chosen by a cluster resource management system like YARN, which schedules containers without taking the location of data into account. As a result, the frameworks can only schedule tasks on the set of nodes their containers are running on. At the same time, many jobs in production clusters are recurring with predictable characteristics. For these jobs, it is possible to plan in advance on which nodes to place a job's input data and execution containers. In this paper we present CoLoc, a lightweight data and container scheduling assistant for recurring data-intensive analytic jobs. CoLoc allows users to define related files that serve as input for the same job. It colocates related files on a set of nodes and offers this set as a scheduling hint to the cluster manager, so that the job's containers are also placed on these nodes. The main advantage of CoLoc is a reduction of network transfers due to higher data locality and locally performed operators like grouping or joining two or more datasets. We implement CoLoc on Hadoop YARN and HDFS, then evaluate it on a 40-node cluster using workloads based on Apache Flink and the TPC-H benchmark suite. Compared to YARN's default scheduler and HDFS's block placement scheduler, CoLoc reduces the execution time by up to 35% for the tested data-intensive workloads.
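
The colocation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use any YARN or HDFS API. The function names (`plan_colocation`, `local_read_fraction`), the capacity-based choice of the node group, and the round-robin block assignment are all assumptions made for illustration only.

```python
# Hypothetical sketch of data and container colocation.
# All names and the node-selection heuristic are illustrative assumptions,
# not taken from CoLoc, YARN, or HDFS.

def plan_colocation(related_files, free_capacity, group_size):
    """Choose a node group and place all blocks of related files on it.

    related_files: dict mapping file name -> number of blocks
    free_capacity: dict mapping node name -> free storage (arbitrary units)
    group_size:    number of nodes that should hold the related files
    Returns (group, placement), where placement maps (file, block) -> node.
    """
    # Assumed heuristic: prefer nodes with the most free capacity,
    # breaking ties by node name so the result is deterministic.
    group = sorted(free_capacity, key=lambda n: (-free_capacity[n], n))[:group_size]

    placement = {}
    i = 0
    for name, num_blocks in sorted(related_files.items()):
        for block in range(num_blocks):
            # Spread blocks round-robin over the chosen group only.
            placement[(name, block)] = group[i % len(group)]
            i += 1
    return group, placement


def local_read_fraction(placement, container_nodes):
    """Fraction of blocks stored on a node that also runs a container."""
    containers = set(container_nodes)
    local = sum(1 for node in placement.values() if node in containers)
    return local / len(placement)
```

In this sketch, the returned `group` plays the role of the scheduling hint: if the containers are started on exactly these nodes, every block can be read locally, whereas containers placed without regard to the data may read little or nothing locally.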

Related Organizations

Technical University of Berlin
Germany

Impact by BIP!

- selected citations: 8. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

