Cluster Forests

Publication type: Article, Preprint, Other literature type
Date: 01 Oct 2013
Embargo end date: 01 Jan 2011
Language: English
Publisher: Elsevier BV
Journal: Computational Statistics & Data Analysis, volume 66, pages 178-192 (ISSN: 0167-9473)

Authors: Donghui Yan; Aiyou Chen; Michael I. Jordan;

doi: 10.1016/j.csda.2013.04.010 , 10.48550/arxiv.1104.2930

arXiv: 1104.2930


Abstract

Inspired by Random Forests (RF) in the context of classification, a new clustering ensemble method, Cluster Forests (CF), is proposed. Geometrically, CF randomly probes a high-dimensional data cloud to obtain "good local clusterings" and then aggregates them via spectral clustering to obtain cluster assignments for the whole dataset. The search for good local clusterings is guided by a cluster quality measure kappa. CF progressively improves each local clustering in a fashion that resembles tree growth in RF. Empirical studies on several real-world datasets under two different performance metrics show that CF compares favorably to its competitors. Theoretical analysis reveals that the kappa measure makes it possible to grow the local clustering in a desirable way: it is "noise-resistant". A closed-form expression is obtained for the mis-clustering rate of spectral clustering under a perturbation model, which yields new insights into some aspects of spectral clustering.

23 pages, 6 figures
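
The pipeline outlined in the abstract (grow local clusterings on feature subsets, then aggregate with spectral clustering) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the function names, the use of k-means as the base clusterer, the one-feature-at-a-time greedy growth, and in particular the definition of kappa as the ratio of within-cluster to between-cluster sum of squares, which the abstract does not spell out.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=30):
    """Lloyd's algorithm with farthest-point initialisation; returns labels."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[dist.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def kappa(X, labels):
    """ASSUMED quality measure: within / between sum of squares (lower is better)."""
    overall = X.mean(axis=0)
    ss_w = ss_b = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        c = pts.mean(axis=0)
        ss_w += ((pts - c) ** 2).sum()
        ss_b += len(pts) * ((c - overall) ** 2).sum()
    return ss_w / max(ss_b, 1e-12)

def grow_local_clustering(X, k, rng, n_tries=10):
    """Start from one random feature; keep a candidate feature only if kappa drops."""
    p = X.shape[1]
    feats = [int(rng.integers(p))]
    labels = kmeans(X[:, feats], k, rng)
    best = kappa(X[:, feats], labels)
    for _ in range(n_tries):
        cand = int(rng.integers(p))
        if cand in feats:
            continue
        trial = feats + [cand]
        trial_labels = kmeans(X[:, trial], k, rng)
        score = kappa(X[:, trial], trial_labels)
        if score < best:
            feats, labels, best = trial, trial_labels, score
    return labels

def cluster_forest(X, k, n_trees=20, seed=0):
    """Average co-cluster indicators over local clusterings, then cluster spectrally."""
    rng = np.random.default_rng(seed)
    n = len(X)
    A = np.zeros((n, n))
    for _ in range(n_trees):
        labels = grow_local_clustering(X, k, rng)
        A += labels[:, None] == labels[None, :]
    A /= n_trees
    d = A.sum(axis=1)                      # positive: the diagonal of A is 1
    M = A / np.sqrt(np.outer(d, d))        # normalised affinity D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    emb = vecs[:, -k:]                     # top-k eigenvectors
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return kmeans(emb, k, rng)

# Toy data: 2 informative features (shifted by 8) and 6 pure-noise features.
demo_rng = np.random.default_rng(1)
truth = np.repeat([0, 1], 30)
X = demo_rng.normal(size=(60, 8))
X[:, :2] += 8.0 * truth[:, None]
pred = cluster_forest(X, k=2)
agreement = max(np.mean(pred == truth), np.mean(pred != truth))
print("agreement with ground truth (up to label swap):", agreement)
```

In this sketch a noise feature tends to raise the assumed kappa and is rejected, while an informative feature lowers it and is kept, which is one way to picture the "noise-resistant" growth the abstract mentions. The paper's actual procedure may differ in how features are sampled and when growth stops.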

Related Organizations

University of California, San Francisco (United States)
University of California System (United States)
Bell Labs (United States)

Keywords

FOS: Computer and information sciences, spectral clustering, Computer Science - Machine Learning, Classification and discrimination; cluster analysis (statistical aspects), Learning and adaptive systems in artificial intelligence, Machine Learning (stat.ML), stochastic block model, Machine Learning (cs.LG), cluster ensemble, Methodology (stat.ME), feature selection, Statistics - Machine Learning, high-dimensional data analysis, Computational methods for problems pertaining to statistics, Statistics - Methodology

Impact by BIP!

selected citations: 39
popularity: Top 10%
influence: Top 10%
impulse: Top 10%
