
arXiv: 1104.2930
Inspired by Random Forests (RF) in the context of classification, a new clustering ensemble method, Cluster Forests (CF), is proposed. Geometrically, CF randomly probes a high-dimensional data cloud to obtain "good local clusterings" and then aggregates them via spectral clustering to obtain cluster assignments for the whole dataset. The search for good local clusterings is guided by a cluster quality measure κ. CF progressively improves each local clustering in a fashion that resembles tree growth in RF. Empirical studies on several real-world datasets under two different performance metrics show that CF compares favorably to its competitors. Theoretical analysis reveals that the κ measure makes it possible to grow the local clustering in a desirable way: it is "noise-resistant". A closed-form expression is obtained for the mis-clustering rate of spectral clustering under a perturbation model, which yields new insights into some aspects of spectral clustering.
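The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions, since the abstract gives no formulas. The function names are hypothetical. `quality` is a stand-in for the quality measure (here the ratio of within-cluster to between-cluster scatter, lower is better). Local clusterings are grown by greedily adding randomly chosen features, and aggregation averages co-membership matrices before a standard normalized spectral clustering step.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Lloyd's k-means with farthest-first initialization; returns labels."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def quality(X, labels):
    """Assumed stand-in for the quality measure: within / between scatter."""
    mean = X.mean(axis=0)
    within = between = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        c = pts.mean(axis=0)
        within += ((pts - c) ** 2).sum()
        between += len(pts) * ((c - mean) ** 2).sum()
    return within / max(between, 1e-12)

def grow_local_clustering(X, k, rng, n_tries=10):
    """Start from one random feature; keep a candidate feature only if it
    improves the quality score (assumed reading of 'progressively improves')."""
    n_features = X.shape[1]
    feats = [int(rng.integers(n_features))]
    labels = kmeans(X[:, feats], k, rng)
    best = quality(X[:, feats], labels)
    for _ in range(n_tries):
        cand = int(rng.integers(n_features))
        if cand in feats:
            continue
        trial = feats + [cand]
        trial_labels = kmeans(X[:, trial], k, rng)
        score = quality(X[:, trial], trial_labels)
        if score < best:
            feats, labels, best = trial, trial_labels, score
    return labels

def spectral_clustering(A, k, rng):
    """Normalized spectral clustering on a symmetric affinity matrix A."""
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    U = vecs[:, -k:]                     # top-k eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return kmeans(U, k, rng)

def cluster_forest(X, k, n_trees=20, seed=0):
    """Average co-membership over local clusterings, then cluster spectrally."""
    rng = np.random.default_rng(seed)
    n = len(X)
    affinity = np.zeros((n, n))
    for _ in range(n_trees):
        labels = grow_local_clustering(X, k, rng)
        affinity += labels[:, None] == labels[None, :]
    affinity /= n_trees
    return spectral_clustering(affinity, k, rng)
```

Calling `cluster_forest(X, k=2)` on an `(n_samples, n_features)` array returns one integer label per row. Averaging co-membership matrices avoids having to match cluster labels across the individual local clusterings.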
23 pages, 6 figures
FOS: Computer and information sciences, spectral clustering, Computer Science - Machine Learning, Classification and discrimination; cluster analysis (statistical aspects), Learning and adaptive systems in artificial intelligence, Machine Learning (stat.ML), stochastic block model, Machine Learning (cs.LG), cluster ensemble, Methodology (stat.ME), feature selection, Statistics - Machine Learning, high-dimensional data analysis, Computational methods for problems pertaining to statistics, Statistics - Methodology
| selected citations | 39 |
| popularity | Top 10% |
| influence | Top 10% |
| impulse | Top 10% |
