Publication: Article, Preprint. 01 Oct 2021. Embargo end date: 01 Jan 2021. English. Publisher: Association for Computing Machinery (ACM). Journal: Proceedings of the VLDB Endowment, volume 15, pages 285-298 (issn: 2150-8097)

Authors: Yu, Shangdi; Wang, Yiqiu; Gu, Yan; Dhulipala, Laxman; Shun, Julian

doi: 10.14778/3489496.3489509, 10.48550/arxiv.2106.04727

arXiv: 2106.04727

handle: 1721.1/143883

ParChain

Abstract

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances that are likely to be reused. Experimentally, we show that our highly optimized implementations using 48 cores with two-way hyper-threading achieve 5.8-110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75-54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data sets with tens of millions of points, which existing algorithms are not able to handle.
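For readers unfamiliar with the nearest-neighbor chain algorithm that the abstract builds on, the following is a minimal sequential sketch in Python, assuming Euclidean points and complete linkage. It is not the ParChain implementation: it runs on one thread, merges one pair of clusters at a time, and omits the range query and caching optimizations. The names `nn_chain` and `complete_linkage` are hypothetical and chosen only for this illustration. Linkage distances are recomputed from the points on demand rather than stored in a distance matrix, so memory stays linear in the number of points, at the cost of extra computation.

```python
from itertools import product
from math import dist, inf


def complete_linkage(a, b):
    """Complete linkage: the largest point-to-point distance between two clusters."""
    return max(dist(p, q) for p, q in product(a, b))


def nn_chain(points, linkage=complete_linkage):
    """Sequential nearest-neighbor chain HAC (illustrative sketch only).

    Returns a list of merges (id_a, id_b, distance). Input points get ids
    0..n-1 and each merged cluster gets the next unused id. Merges are
    reported in the order they are found, which is not necessarily sorted
    by distance.
    """
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    merges = []
    chain = []  # stack of cluster ids; each entry is the nearest neighbor of the one before

    while len(clusters) > 1:
        if not chain:
            chain.append(min(clusters))  # arbitrary starting cluster

        while True:
            top = chain[-1]
            prev = chain[-2] if len(chain) > 1 else None

            # Seed the search with the previous chain element so that ties
            # are broken in its favor; this guarantees termination.
            best, best_d = None, inf
            if prev is not None:
                best, best_d = prev, linkage(clusters[top], clusters[prev])
            for c, members in clusters.items():
                if c == top or c == prev:
                    continue
                d = linkage(clusters[top], members)
                if d < best_d:
                    best, best_d = c, d

            if best == prev:
                break  # top and prev are reciprocal nearest neighbors
            chain.append(best)

        # Merge the reciprocal pair and remove both from the chain.
        chain.pop()
        chain.pop()
        clusters[next_id] = clusters.pop(prev) + clusters.pop(top)
        merges.append((prev, top, best_d))
        next_id += 1

    return merges


if __name__ == "__main__":
    for a, b, d in nn_chain([(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]):
        print(f"merge {a} + {b} at distance {d:.3f}")
```

Sorting the returned merges by distance gives the dendrogram order. The chain-based search is what makes parallel variants plausible: any pair of reciprocal nearest neighbors can be merged independently of other such pairs, which is the property the abstract refers to when it says multiple clusters can be merged on every round.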

Related Organizations

Massachusetts Institute of Technology
United States
University of California System
United States
University of California, Riverside
United States

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Databases, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Databases (cs.DB), Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)

Impact by BIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	7
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Top 10%
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Top 10%

Green

Fields of Science

natural sciences

computer and information sciences

Funded by

NSF | CAREER: Parallel Algorithms and Frameworks for Graph and Hypergraph Processing