Bias in random forest variable importance measures: Illustrations, sources and a solution

Publication: Article, Research, Other literature type; 25 Jan 2007; Germany, Austria; English

Publisher: Springer Science and Business Media LLC

Journal: BMC Bioinformatics, volume 8 (eissn: 1471-2105)

Authors: Carolin Strobl; Anne-Laure Boulesteix; Achim Zeileis; Torsten Hothorn

doi: 10.1186/1471-2105-8-25, 10.5281/zenodo.13509996, 10.5282/ubm/epub.1858, 10.5281/zenodo.13509995

pmid: 17254353

pmc: PMC1796903

handle: 10419/31116

Abstract

(Uploaded by Plazi for the Bat Literature Project)

Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.

Conclusion: We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.
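
The first mechanism named in the abstract, biased variable selection in individual trees, can be illustrated with a small null simulation. The following is a minimal sketch in plain Python, not the R implementation the article documents. It assumes a binary response that is independent of every predictor, treats category codes as ordered values for simplicity, and uses hypothetical helper names (`gini`, `best_gini_gain`, `selection_counts`). Because no predictor carries information, an unbiased split criterion would pick each predictor about equally often.

```python
import random


def gini(pos, n):
    """Gini impurity of a node holding `pos` class-1 cases out of `n`."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest impurity decrease over all binary splits of the form x <= c."""
    n = len(y)
    total_pos = sum(y)
    parent = gini(total_pos, n)
    best = 0.0
    # Every distinct value except the largest is a candidate cut point.
    for c in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        n_l, pos_l = len(left), sum(left)
        n_r, pos_r = n - n_l, total_pos - pos_l
        child = (n_l * gini(pos_l, n_l) + n_r * gini(pos_r, n_r)) / n
        best = max(best, parent - child)
    return best


def selection_counts(levels=(2, 4, 20), n=100, reps=500, seed=1):
    """Count how often each uninformative predictor yields the best split.

    `levels` gives the number of categories of each predictor (an
    illustrative choice, not taken from the article).
    """
    rng = random.Random(seed)
    wins = [0] * len(levels)
    for _ in range(reps):
        # Response drawn independently of all predictors (null case).
        y = [rng.randint(0, 1) for _ in range(n)]
        gains = [
            best_gini_gain([rng.randrange(k) for _ in range(n)], y)
            for k in levels
        ]
        # Ties go to the first predictor, i.e. the one with fewest categories.
        wins[gains.index(max(gains))] += 1
    return wins


if __name__ == "__main__":
    print(dict(zip((2, 4, 20), selection_counts())))
```

Under these assumptions the predictor with 20 categories tends to be selected far more often than the binary one, simply because it offers more candidate cut points over which the Gini gain is maximized. This is the kind of artificial preference the abstract describes; it says nothing about the second mechanism, bootstrap sampling with replacement, which this sketch does not model.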

Countries

Germany, Austria

Related Organizations

WU, Austria
Wirtschaftsuniversität Wien (Vienna University of Economics and Business), Austria
University of Erlangen-Nuremberg, Germany
Technical University of Munich (TUM), Germany

Keywords

random forests, QH301-705.5, Computer applications to medicine. Medical informatics, Population Dynamics, R858-859.7, bats, bat, Models, Biological, 510, Bias, Chiroptera, Animalia, Computer Simulation, Biology (General), Chordata, Models, Statistical, Methodology Article, Research Report Series / Department of Statistics and Mathematics, Computational Biology, ddc:519, Genomics, Biodiversity, Data Interpretation, Statistical, variable importance, Mammalia, random forests / variable importance / Gini importance / variable selection bias, Gini importance, variable selection bias, Algorithms

Impact by BIP!

selected citations: 3K. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 0.01%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 0.01%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.