PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search

Keywords: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)

Yu, Shangdi; Engels, Joshua; Huang, Yihao; Shun, Julian

arXiv.org e-Print Archive

Preprint . 2023

Data sources: arXiv.org e-Print Archive

https://doi.org/10.1137/1.9781...

Part of book or chapter of book . 2025 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2023

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

Publication: Part of book or chapter of book, Article, Preprint
01 Jan 2025
Embargo end date: 01 Jan 2023
English
Publisher: Society for Industrial & Applied Mathematics (SIAM)
Funded by: NSF | CAREER: Parallel Algorith..., NSF | Collaborative Research: P...

Authors: Yu, Shangdi; Engels, Joshua; Huang, Yihao; Shun, Julian

doi: 10.1137/1.9781611978759.1, 10.48550/arxiv.2312.03940

arXiv: 2312.03940

Abstract

This paper studies density-based clustering of point sets. These methods use dense regions of points to detect clusters of arbitrary shapes. In particular, we study variants of density peaks clustering, a popular type of algorithm that has been shown to work well in practice. Our goal is to cluster large high-dimensional datasets, which are prevalent in practice. Prior solutions are either sequential, and cannot scale to large data, or are specialized for low-dimensional data. This paper unifies the different variants of density peaks clustering into a single framework, PECANN, by abstracting out several key steps common to this class of algorithms. One such key step is to find nearest neighbors that satisfy a predicate function, and one of the main contributions of this paper is an efficient way to do this predicate search using graph-based approximate nearest neighbor search (ANNS). To provide ample parallelism, we propose a doubling search technique that enables points to find an approximate nearest neighbor satisfying the predicate in a small number of rounds. Our technique can be applied to many existing graph-based ANNS algorithms, which can all be plugged into PECANN. We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. Compared to the state-of-the-art FASTDP algorithm for high-dimensional density peaks clustering, which is sequential, our best algorithm is 45x-734x faster while achieving competitive ARI scores. Compared to the state-of-the-art parallel DPC-based algorithm, which is optimized for low dimensions, we show that PECANN is two orders of magnitude faster. As far as we know, our work is the first to evaluate DPC variants on large high-dimensional real-world image and text embedding datasets.
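The key steps named in the abstract (a density estimate per point, a predicate search for the nearest neighbor of higher density, and a doubling search that widens the candidate set until the predicate is met) can be illustrated with a minimal sequential sketch. This is not the PECANN implementation: the function names, the k-th-nearest-neighbor density estimate, the `center_dist` threshold, and the brute-force `knn` helper (a stand-in for a graph-based ANNS index) are all assumptions made for illustration, and the loop over points runs sequentially rather than in parallel.

```python
import math

def knn(points, i, k):
    """k nearest neighbors of point i, closest first.
    Brute force; a hypothetical stand-in for a graph-based ANNS query."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def densities(points, k):
    """Assumed density estimate: inverse distance to the k-th nearest neighbor.
    Requires len(points) > k."""
    dens = []
    for i in range(len(points)):
        kth = knn(points, i, k)[-1]
        dens.append(1.0 / max(math.dist(points[i], points[kth]), 1e-12))
    return dens

def dependent_point(points, dens, i, k0=2):
    """Doubling search: query k0, 2*k0, 4*k0, ... neighbors until one
    satisfies the predicate 'has higher density than i'."""
    n = len(points)
    k = k0
    while True:
        k = min(k, n - 1)
        for j in knn(points, i, k):            # scanned in increasing distance
            if (dens[j], j) > (dens[i], i):    # predicate; ties broken by index
                return j
        if k == n - 1:
            return None                        # i is the global density peak
        k *= 2

def cluster(points, k=3, center_dist=2.0):
    """Density-peaks-style clustering: a point whose dependent point is far
    away (or missing) becomes a cluster center; every other point joins the
    cluster of its dependent point."""
    dens = densities(points, k)
    parent = []
    for i in range(len(points)):
        j = dependent_point(points, dens, i)
        if j is None or math.dist(points[i], points[j]) > center_dist:
            parent.append(i)                   # cluster center
        else:
            parent.append(j)
    def root(i):
        # Links always point to strictly higher (density, index), so no cycles.
        while parent[i] != i:
            i = parent[i]
        return i
    return [root(i) for i in range(len(points))]

# Two well-separated blobs of five points each.
blob_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
blob_b = [(x + 10.0, y + 10.0) for x, y in blob_a]
labels = cluster(blob_a + blob_b)
```

On this toy input each blob ends up with its own label, because only the densest point of each blob has a dependent point farther away than `center_dist`. Doubling keeps most queries small: a point in a dense region usually finds a higher-density neighbor among its first few candidates, and only points near a density peak need the wider searches.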

Related Organizations

Massachusetts Institute of Technology
United States
MIT
MIT
Finland

PECANN-DPC software on GitHub
IsRelatedTo

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green

Funded by

NSF | CAREER: Parallel Algorithms and Frameworks for Graph and Hypergraph Processing, NSF | Collaborative Research: PPoSS: LARGE: General-Purpose Scalable Technologies for Fundamental Graph Problems
