Adapting Hartigan & Wong K-Means for the Efficient Clustering of Sets

Keywords: Clustering sets; Hartigan and Wong K-Means; Jaccard distance; Medoids; Seeding methods; Java parallel streams

Libero, Nigro; Franco, Cicirelli

Open Access Biostatistics & Bioinformatics

Article . 2023 . Peer-reviewed

Data sources: Crossref

IRIS Cnr

Article . 2023

Full-Text: https://iris.cnr.it/request-item?handle=20.500.14243/463512&bitstreamId=ada07344-de34-4a6f-b125-6c6f23767e0e

Data sources: IRIS Cnr

CNR ExploRA

Article . 2023

Data sources: CNR ExploRA

Archivio Istituzionale dell'Università della Calabria

Article . 2023

Data sources: Archivio Istituzionale dell'Università della Calabria

Publication: Article, 25 Aug 2023. Publisher: Crimson Publishers. Journal: Open Access Biostatistics & Bioinformatics, volume 3 (eISSN: 2578-0247)

Authors: Libero, Nigro; Franco, Cicirelli

doi: 10.31031/oabb.2023.03.000564

handle: 20.500.14243/463512 , 20.500.11770/356359

Abstract

This paper proposes an algorithm, named HWK-Sets, based on K-Means and suited for clustering data that are variable-sized sets of elementary items. Such data occur, for example, in the analysis of medical diagnoses, where the goal is to detect human subjects who share common diseases, possibly so as to predict future illnesses from previous medical history. Clustering sets is difficult because the data objects do not have numerical attributes, so the classical Euclidean distance on which K-Means is normally based cannot be used. Instead, an adaptation of the Jaccard distance between sets is used, which exploits application-sensitive information. More specifically, the Hartigan and Wong variation of K-Means is adopted, which can favor the fast attainment of an accurate solution. The HWK-Sets algorithm can flexibly use various stochastic seeding techniques. Because it is difficult to calculate a mean among the sets of a cluster, the concept of a medoid is employed as the cluster representative (centroid), which always remains a data object of the application. The paper describes the HWK-Sets clustering algorithm and outlines its current implementation in Java based on parallel streams. The efficiency and accuracy of the proposed algorithm are then demonstrated by applying it to 15 benchmark datasets.
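
To make the two central notions of the abstract concrete, the following is a minimal Java sketch of a set distance and a medoid computation. It is an illustration under stated assumptions, not the authors' implementation: it uses the plain Jaccard distance, 1 - |A ∩ B| / |A ∪ B|, rather than the application-sensitive adaptation described in the paper, and the class name `SetClusteringSketch`, the method names, and the sample diagnosis labels are all hypothetical. The medoid is taken here as the cluster member with the smallest total distance to all members, and the candidates are scanned with a parallel stream.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

public class SetClusteringSketch {

    // Plain Jaccard distance between two sets (assumption: the paper's
    // adapted, application-sensitive variant is NOT reproduced here).
    public static <T> double jaccardDistance(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention assumed here: two empty sets coincide
        }
        long common = a.stream().filter(b::contains).count();
        long union = a.size() + b.size() - common;
        return 1.0 - (double) common / union;
    }

    // Medoid of a cluster: the member whose summed distance to all
    // members is minimal. Candidates are evaluated in parallel.
    public static <T> Set<T> medoid(List<Set<T>> cluster) {
        return cluster.parallelStream()
                .min(Comparator.comparingDouble((Set<T> candidate) ->
                        cluster.stream()
                                .mapToDouble(other -> jaccardDistance(candidate, other))
                                .sum()))
                .orElseThrow(() -> new IllegalArgumentException("empty cluster"));
    }

    public static void main(String[] args) {
        // Hypothetical data: each set lists the diagnoses of one subject.
        List<Set<String>> cluster = List.of(
                Set.of("asthma", "diabetes"),
                Set.of("asthma", "diabetes", "hypertension"),
                Set.of("asthma", "hypertension"));

        System.out.println("d(first, second) = "
                + jaccardDistance(cluster.get(0), cluster.get(1)));
        System.out.println("medoid = " + medoid(cluster));
    }
}
```

In this toy cluster the three-element set is the medoid, since it shares two of three items with each of the other two sets. Because the medoid is always one of the input sets, no averaging over non-numerical items is needed, which is the property the abstract relies on.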

Related Organizations

National Research Council
Italy
Institute for high performance computing and networking
Italy
University of Calabria
Italy
National Research Council
Sri Lanka

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: gold