REMOLD: An Efficient Model-Based Clustering Algorithm for Large Datasets with Spark

Authors: Mingfei Liang; Qingyong Li; Yangli-ao Geng; Jianzhu Wang; Zhi Wei

Abstract

Density-based clustering algorithms have the distinctive advantage of discovering arbitrarily shaped clusters, but they usually need to compute the distance between every pair of data points. This step has quadratic computational complexity and is therefore prohibitive for large datasets. In this paper, we propose a new distributed clustering algorithm named REstore MOdel with Local Density estimation (REMOLD). First, REMOLD applies a balanced partitioning method based on Locality Sensitive Hashing (LSH) to divide a large dataset evenly. Then, it clusters each partition locally and represents each local cluster with a Gaussian model, based on the observation that the density distribution of each local cluster has a shape similar to a Gaussian distribution. Finally, these models are aggregated on a server, where REMOLD restores the global clusters from the local Gaussian models. More specifically, a model connection, which measures the density connectivity between two models, is defined and used to merge local models with an optimized procedure. This aggregation incurs a low network transmission cost, since the number of Gaussian models in each partition is often smaller than the number of core objects. We evaluate REMOLD on three synthetic datasets and three real-world datasets on Spark. The experimental results demonstrate that REMOLD finds clusters with complex shapes efficiently and effectively, and that it outperforms the established methods.
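
The aggregation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the definition of model connection, so the `connected` test below (means within a few pooled standard deviations) is a stand-in assumption, and the diagonal-covariance Gaussian is a simplification. The names `fit_gaussian`, `connected`, `merge_models`, and the `threshold` parameter are all hypothetical.

```python
import math
from statistics import fmean, pvariance


def fit_gaussian(points):
    """Summarise one local cluster as a diagonal Gaussian (mean, variance per dimension).

    Assumption: a diagonal covariance is used here only to keep the sketch short.
    """
    dims = list(zip(*points))
    mean = [fmean(d) for d in dims]
    var = [max(pvariance(d), 1e-9) for d in dims]  # floor avoids division by zero
    return mean, var


def connected(model_a, model_b, threshold=3.0):
    """Stand-in connection test (an assumption, not the paper's definition).

    Two models count as connected when their means lie within `threshold`
    pooled standard deviations of each other.
    """
    (mean_a, var_a), (mean_b, var_b) = model_a, model_b
    d2 = sum(
        (x - y) ** 2 / (va + vb)
        for x, y, va, vb in zip(mean_a, mean_b, var_a, var_b)
    )
    return math.sqrt(d2) <= threshold


def merge_models(models, threshold=3.0):
    """Group connected local models with union-find; return one global label per model."""
    parent = list(range(len(models)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            if connected(models[i], models[j], threshold):
                parent[find(j)] = find(i)

    roots = [find(i) for i in range(len(models))]
    relabel = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]
```

Only the `(mean, variance)` pairs would need to be sent to the server in such a scheme, which reflects the abstract's point that transmitting models is cheaper than transmitting core objects. The pairwise loop here is quadratic in the number of models and does not attempt to reproduce the optimized merging procedure the abstract mentions.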

Impact indicators (provided by BIP!)
  • Selected citations: 4. These citations are derived from selected sources. This is an alternative to the "Influence" indicator.
  • Popularity: Average. Reflects the "current" impact or attention of an article in the research community at large, based on the underlying citation network.
  • Influence: Average. Reflects the overall impact of an article in the research community at large, based on the underlying citation network (diachronically).
  • Impulse: Average. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.