Random forest missing data algorithms

Publication type: Article, Preprint

Date: 13 Jun 2017

Embargo end date: 01 Jan 2017

Language: English

Publisher: Wiley

Journal: Statistical Analysis and Data Mining: The ASA Data Science Journal, volume 10, pages 363-377 (issn: 1932-1864, eissn: 1932-1872)

Funded by: NIH | RF-SRC: A Unified Data To...

Authors: Fei Tang; Hemant Ishwaran

doi: 10.1002/sam.11348, 10.48550/arxiv.1701.05305

pmid: 29403567

pmc: PMC5796790

arXiv: 1701.05305

Abstract

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on the fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting—the latter class representing a generalization of a new promising imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data was missing not at random.
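The evaluation idea in the abstract (hide known values under a chosen missing data mechanism, impute them, and score the error) can be illustrated with a small sketch. This is not the authors' code or benchmark: the imputer below is a plain least-squares regression used as a stand-in for a random forest, the data are simulated, and every function name (`make_data`, `mask_mcar`, `mask_mnar`, `regression_impute`, `rmse_on_hidden`, `evaluate`) is hypothetical. It uses only the Python standard library.

```python
import math
import random


def make_data(n, rho, seed=0):
    """Draw n pairs (x, y) of standard normals with correlation rho."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        rows.append((x, y))
    return rows


def mask_mcar(rows, rate, seed=1):
    """MCAR: hide each y with the same probability, regardless of the data."""
    rng = random.Random(seed)
    return [rng.random() < rate for _ in rows]


def mask_mnar(rows, rate):
    """MNAR: hide the largest y values, so missingness depends on y itself."""
    k = int(round(rate * len(rows)))
    largest = sorted(range(len(rows)), key=lambda i: rows[i][1], reverse=True)[:k]
    hidden = set(largest)
    return [i in hidden for i in range(len(rows))]


def regression_impute(rows, mask):
    """Stand-in imputer (NOT a random forest): least-squares fit of y on x
    using the observed rows, then prediction for the hidden rows."""
    obs = [(x, y) for (x, y), m in zip(rows, mask) if not m]
    mx = sum(x for x, _ in obs) / len(obs)
    my = sum(y for _, y in obs) / len(obs)
    sxx = sum((x - mx) ** 2 for x, _ in obs)
    sxy = sum((x - mx) * (y - my) for x, y in obs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * x if m else y for (x, y), m in zip(rows, mask)]


def rmse_on_hidden(rows, mask, imputed):
    """Root mean squared error over the entries that were hidden."""
    sq = [(imp - y) ** 2 for (_, y), m, imp in zip(rows, mask, imputed) if m]
    return math.sqrt(sum(sq) / len(sq))


def evaluate(rho, mechanism, n=500, rate=0.25):
    """Simulate data, hide y under the given mechanism, impute, and score."""
    rows = make_data(n, rho)
    mask = mask_mcar(rows, rate) if mechanism == "mcar" else mask_mnar(rows, rate)
    return rmse_on_hidden(rows, mask, regression_impute(rows, mask))


if __name__ == "__main__":
    for rho in (0.2, 0.9):
        for mechanism in ("mcar", "mnar"):
            print(rho, mechanism, round(evaluate(rho, mechanism), 3))
```

Swapping `regression_impute` for an actual RF imputer would leave the masking and scoring steps unchanged. Results from this toy setup describe only the stand-in imputer and should not be read as evidence about the algorithms compared in the paper.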

Keywords

FOS: Computer and information sciences, Computer science, Statistics, Statistics - Machine Learning, Machine Learning (stat.ML), machine learning, correlation, imputation, missingness, splitting (random, univariate, multivariate, unsupervised)

Impact by BIP!

selected citations: 625

popularity: Top 0.1%

influence: Top 1%

impulse: Top 1%