
arXiv: 0904.1113
The k-means method is one of the most widely used clustering algorithms, drawing its popularity from its speed in practice. Recently, however, it was shown to have exponential worst-case running time. In order to close the gap between practical performance and theoretical analysis, the k-means method has been studied in the model of smoothed analysis. But even the smoothed analyses so far are unsatisfactory as the bounds are still super-polynomial in the number n of data points. In this paper, we settle the smoothed running time of the k-means method. We show that the smoothed number of iterations is bounded by a polynomial in n and 1/σ, where σis the standard deviation of the Gaussian perturbations. This means that if an arbitrary input data set is randomly perturbed, then the k-means method will run in expected polynomial time on that input set.
Full version of FOCS 2009 paper. The argument has been improved and the restriction to at least three dimensions could be dropped
Probabilistic analysis, k-Means, Computational Geometry (cs.CG), FOS: Computer and information sciences, Smoothed analyis, I.5.3, k-Means clustering, Computational Complexity (cs.CC), Clustering, Computer Science - Computational Complexity, H.3.3, 2023 OA procedure, Computer Science - Data Structures and Algorithms, k-Means method, Computer Science - Computational Geometry, Data Structures and Algorithms (cs.DS), F.2.2, F.2.2; I.5.3; H.3.3
Probabilistic analysis, k-Means, Computational Geometry (cs.CG), FOS: Computer and information sciences, Smoothed analyis, I.5.3, k-Means clustering, Computational Complexity (cs.CC), Clustering, Computer Science - Computational Complexity, H.3.3, 2023 OA procedure, Computer Science - Data Structures and Algorithms, k-Means method, Computer Science - Computational Geometry, Data Structures and Algorithms (cs.DS), F.2.2, F.2.2; I.5.3; H.3.3
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 53 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
