Unsupervised random forests

Publication: Article, 05 Feb 2021, English

Publisher: Wiley

Journal: Statistical Analysis and Data Mining: The ASA Data Science Journal, volume 14, pages 144-167 (ISSN: 1932-1864, eISSN: 1932-1872)

Authors: Alejandro Mantero; Hemant Ishwaran;

doi: 10.1002/sam.11498

pmid: 33833846

pmc: PMC8025042

Abstract

sidClustering is a new random forests unsupervised machine learning algorithm. The first step in sidClustering involves what is called sidification of the features: staggering the features to have mutually exclusive ranges (called the staggered interaction data [SID] main features) and then forming all pairwise interactions (called the SID interaction features). Then a multivariate random forest (able to handle both continuous and categorical variables) is used to predict the SID main features. We establish uniqueness of sidification and show how multivariate impurity splitting is able to identify clusters. The proposed sidClustering method is adept at finding clusters arising from categorical and continuous variables and retains all the important advantages of random forests. The method is illustrated using simulated and real data as well as two in-depth case studies, one from a large multi‐institutional study of esophageal cancer, and the other involving hospital charges for cardiovascular patients.
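The sidification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes continuous features only, staggers each column by a simple shift, and uses the elementwise product as the pairwise interaction. The function name `sidify` and the gap parameter `delta` are hypothetical choices made for illustration.

```python
from itertools import combinations

import numpy as np


def sidify(X, delta=1.0):
    """Illustrative sidification for continuous features (assumed scheme).

    Returns the staggered main features, the pairwise interaction
    features, and the list of column index pairs used.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    main = np.empty_like(X)

    # Shift each column so that its range starts `delta` above the
    # maximum of the previous staggered column. Starting at `delta`
    # keeps all staggered values strictly positive.
    lower = delta
    for j in range(d):
        col = X[:, j]
        main[:, j] = col - col.min() + lower
        lower = main[:, j].max() + delta

    # Form all pairwise interactions as products of staggered columns
    # (an assumption of this sketch).
    pairs = list(combinations(range(d), 2))
    if pairs:
        inter = np.column_stack([main[:, a] * main[:, b] for a, b in pairs])
    else:
        inter = np.empty((n, 0))
    return main, inter, pairs


# Small made-up example: three observations, two features.
X = np.array([[0.0, 5.0], [1.0, 7.0], [2.0, 6.0]])
main, inter, pairs = sidify(X)
```

In this sketch, `main` holds columns whose ranges do not overlap, and `inter` holds one column per feature pair. Following the abstract, the SID main features would then serve as the outcomes of a multivariate random forest; that step is not shown here.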

Related Organizations

Miami University
United States
University System of Ohio
United States

Keywords

sidclustering, Statistics, impurity, unsupervised learning, staggered interaction data

Impact by BIP!

- Selected citations: 38
- Popularity: Top 10%
- Influence: Top 10%
- Impulse: Top 10%