Solving k-center clustering (with outliers) in MapReduce and streaming, almost as accurately as sequentially

Publication: Article, Preprint

Date: 01 Mar 2019

Embargo end date: 01 Jan 2018

Language: English

Publisher: Association for Computing Machinery (ACM)

Journal: Proceedings of the VLDB Endowment, volume 12, pages 766-778 (ISSN: 2150-8097)

Authors: Ceccarello, Matteo; Pietracaprina, Andrea; Pucci, Geppino

doi: 10.14778/3317315.3317319, 10.48550/arxiv.1802.09205

arXiv: 1802.09205

handle: 11577/3299392


Abstract

Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-center variant which, given a set S of points from some metric space and a parameter k < |S|, requires identifying a subset of k centers in S minimizing the maximum distance of any point of S from its closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of S (outliers) to be disregarded when computing the maximum distance from the centers. We present coreset-based 2-round MapReduce algorithms for the above two formulations of the problem, and a 1-pass Streaming algorithm for the case with outliers. For any fixed ϵ > 0, the algorithms yield solutions whose approximation ratios are a mere additive term ϵ away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) D. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield better-quality solutions than the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones.
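To make the two problem formulations concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes Euclidean points, and it uses the classic greedy farthest-first traversal as a stand-in sequential algorithm; the abstract does not say which sequential algorithm or coreset construction the paper uses. The function names (`radius`, `farthest_first`, `two_round_sketch`) and the `parts` and `coreset_size` parameters are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def radius(points: Sequence[Point], centers: Sequence[Point], z: int = 0) -> float:
    """k-center objective: the maximum distance of a point from its closest
    center, after disregarding the z farthest points (the outliers)."""
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    kept = dists[: len(dists) - z] if z > 0 else dists
    return kept[-1] if kept else 0.0


def farthest_first(points: Sequence[Point], k: int) -> List[Point]:
    """Greedy farthest-first traversal (assumed stand-in, no outlier handling).
    Starts from the first point and repeatedly adds the point farthest from
    the centers chosen so far. Assumes `points` is non-empty."""
    centers = [points[0]]
    closest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=closest.__getitem__)
        centers.append(points[i])
        closest = [min(d, math.dist(p, points[i])) for d, p in zip(closest, points)]
    return centers


def two_round_sketch(points: Sequence[Point], k: int, parts: int,
                     coreset_size: int) -> List[Point]:
    """Illustrative two-round structure (an assumption about the general shape
    of a coreset-based approach, simulated sequentially):
    round 1 summarizes each partition by coreset_size >= k representatives,
    round 2 solves k-center on the union of those representatives."""
    chunks = [points[i::parts] for i in range(parts)]
    coreset = [c for chunk in chunks if chunk
               for c in farthest_first(chunk, coreset_size)]
    return farthest_first(coreset, k)
```

A possible use, on made-up data: with `pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]`, calling `farthest_first(pts, 2)` picks one center per group, and `radius(pts + [(100, 100)], centers, z=1)` evaluates the outlier formulation by ignoring the single farthest point. Choosing `coreset_size` larger than k is what lets the second round see a finer summary of each partition; how large it must be for a given ϵ and doubling dimension D is the subject of the paper's analysis and is not reproduced here.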

Related Organizations

University of Padua, Italy
IT University of Copenhagen, Denmark
Department of Information Engineering - DEI, University of Padua, Italy
IT University

Keywords

FOS: Computer and information sciences, coreset, streaming algorithms, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, large datasets, Data Structures and Algorithms (cs.DS), MapReduce algorithms, Distributed, Parallel, and Cluster Computing (cs.DC), k-center clustering, 004

Related: coreset-clustering software on GitHub

Impact by BIP!

selected citations: 31. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Access routes: Green, bronze

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
