Fast K-Means Algorithm Clustering

Publication: Article, Preprint, Other literature type
Publication date: 31 Jul 2011
Embargo end date: 01 Aug 2011
Publisher: Academy and Industry Research Collaboration Center (AIRCC)
Journal: International Journal of Computer Networks & Communications, volume 3, pages 17-31 (ISSN: 0975-2293)

Authors: Raied Salman; Vojislav Kecman; Qi Li; Robert Strack; Erik Test

doi: 10.5121/ijcnc.2011.3402, 10.48550/arxiv.1108.1351

arXiv: 1108.1351

Abstract

k-means has recently been recognized as one of the best algorithms for clustering unlabeled data. Since k-means depends mainly on distance calculations between all data points and the centers, the time cost is high when the dataset is large (for example, more than 500 million points). We propose a two-stage algorithm to reduce the time cost of distance calculation for huge datasets. The first stage is a fast distance calculation that uses only a small portion of the data to produce the best possible locations of the centers. The second stage is a slow distance calculation in which the initial centers are taken from the first stage. "Fast" and "slow" refer to the speed at which the centers move. In the slow stage, the whole dataset can be used to get the exact locations of the centers. The time cost of the distance calculation in the fast stage is very low because the chosen training sample is small. The time cost of the distance calculation in the slow stage is also kept low because only a small number of iterations is needed. Different initial locations of the clusters were used when testing the proposed algorithm. For large datasets, experiments show that the two-stage clustering method achieves a better speed-up (1-9 times).
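The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the uniform random subsample, the `sample_fraction` default, the center-shift stopping rule, and the function names `lloyd` and `two_stage_kmeans` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def lloyd(points, centers, max_iter=100, tol=1e-6):
    """Standard k-means (Lloyd) iterations; returns (centers, iterations used)."""
    centers = np.array(centers, dtype=float)  # float copy of the start centers
    for it in range(1, max_iter + 1):
        # Squared Euclidean distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # assumed stopping rule: centers have stopped moving
            break
    return centers, it


def two_stage_kmeans(points, init_centers, sample_fraction=0.1, seed=0):
    """Fast stage on a subsample, then slow stage on the full dataset."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Assumed sampling scheme: a uniform random subset, at least one point per center.
    m = min(n, max(len(init_centers), int(sample_fraction * n)))
    idx = rng.choice(n, size=m, replace=False)

    # Stage 1 (fast): cheap iterations on the small sample.
    fast_centers, fast_iters = lloyd(points[idx], init_centers)

    # Stage 2 (slow): full-data iterations, started from the stage-1 centers.
    final_centers, slow_iters = lloyd(points, fast_centers)
    return final_centers, fast_iters, slow_iters
```

Calling `two_stage_kmeans(points, init_centers)` with an `(n, d)` array of points and a `(k, d)` array of starting centers returns the final centers together with the iteration counts of each stage. When the subsample is representative, the slow stage starts close to its fixed point, so most of the iterations happen on the small sample rather than on the full data.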

16 pages, Wimo2011; International Journal of Computer Networks & Communications (IJCNC) Vol.3, No.4, July 2011

Related Organizations

Virginia Commonwealth University
United States

Keywords

FOS: Computer and information sciences, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS)

Impact by BIP!

- selected citations: 12. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
