descriptionPublicationkeyboard_double_arrow_right Article 04 Apr 2016Publisher:ACMJournal:Proceedings of the 31st Annual ACM Symposium on Applied Computing

Authors: Frank Gouineau; Tom Landry; Thomas Triplet;

doi: 10.1145/2851613.2851643

PatchWork, a scalable density-grid clustering algorithm

- Summary
- Metrics

Abstract

Clustering is a fundamental task in Knowledge Discovery and Data mining. It aims to discover the unknown nature of data by grouping together data objects that are more similar. While hundreds of clustering algorithms have been proposed, many are complex and do not scale well as more data become available, making then inadequate to analyze very large datasets. In addition, many clustering algorithms are sequential, thus inherently difficult to parallelize. We propose PatchWork, a novel clustering algorithm to address those issues. PatchWork is a distributed density clustering algorithm with linear computational complexity and linear horizontal scalability. It presents several desirable characteristics in knowledge discovery, in particular, it does not require a priori the number of clusters to identify, and offers a natural protection against outliers and noise. In addition, PatchWork makes it possible to discover spatially large clusters instead of dense clusters only. PatchWork relies on the map/reduce paradigm to parallelize computations and was implemented using Apache Spark, the distributed computation framework. As a result, PatchWork can cluster a billion points in a few minutes only, a 40x improvement over the distributed implementation of k-means in Spark MLLib.

Related Organizations

Computer Research Institute of Montréal
Canada

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	8
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

Average

bronze

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering