Publication: Article, Other literature type

Date: 01 Sep 2016

Language: English

Publisher: Springer Science and Business Media LLC

Journal: BMC Bioinformatics, volume 17 (eissn: 1471-2105,

Authors: Huang, Barbara F; Boutros, Paul C;

doi: 10.1186/s12859-016-1228-x

pmid: 27586051

pmc: PMC5009551

handle: 1807/83117

The parameter sensitivity of random forests

Abstract

Background: The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process: a critical step in model fitting. Due to numerous assertions regarding the performance reliability of the default parameters, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter-sensitivity of RFs in computational genomic studies. We address this gap here.

Results: We examined the effects of parameter selection on classification performance using the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical in tuning different datasets, and that parameter-optimization significantly enhances upon the default parameters.

Conclusions: Parameter performance demonstrated wide variability on both low and high p/n data. Therefore, there is significant benefit to be gained by model tuning RFs away from their default parameter settings.
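As an illustration of the two ideas in the abstract (the p/n ratio of a dataset, and evaluating parameter settings other than the defaults), here is a minimal Python sketch. It is not the authors' code. The abstract does not name the parameters that were tuned, so `n_trees` and `mtry` (the number of variables tried at each split) are assumptions chosen for illustration, as are the function names and the example numbers. The sketch only enumerates candidate settings; fitting and scoring a forest for each one is left to whichever RF library is in use.

```python
from itertools import product
from math import isqrt


def pn_ratio(n_variables, n_samples):
    """Return p/n: the number of variables divided by the number of samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return n_variables / n_samples


def candidate_grid(n_variables, n_trees_options, mtry_options):
    """Enumerate hypothetical (n_trees, mtry) settings to evaluate.

    mtry values are clipped to the valid range [1, p]. floor(sqrt(p)), a
    common default for classification forests, is always included so that
    tuned settings can be compared against it.
    """
    clipped = {min(max(1, m), n_variables) for m in mtry_options}
    clipped.add(max(1, isqrt(n_variables)))
    return [
        {"n_trees": t, "mtry": m}
        for t, m in product(n_trees_options, sorted(clipped))
    ]


# Hypothetical low p/n dataset: 15 variables measured on 1000 samples.
low_ratio = pn_ratio(15, 1000)
grid = candidate_grid(15, n_trees_options=(100, 500), mtry_options=(1, 5, 15))
```

Each dictionary in `grid` would then be passed to a cross-validated model fit, and the best-scoring setting compared with the default one.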

Related Organizations

University of Toronto (Canada)
University of California, San Francisco (United States)
Ontario Institute for Cancer Research (Canada)

Keywords

Optimization; Ensemble methods; Bioinformatics; 610; Mathematical sciences; Bioengineering; Parameterization; Microarray; Biochemistry; Machine learning; Computational biology; 2.5 Research design and methodologies (aetiology); Models, Theoretical; Information and computing sciences; Aetiology; Molecular Biology; Methodology Article; Reproducibility of Results; Biological sciences; Computer Science Applications; SeqControl; Algorithms; Random forest

18 Research products, page 1 of 2

Additional file 17: of The parameter sensitivity of random forests (2016, IsSupplementedBy)
Additional file 14: of The parameter sensitivity of random forests (2016, IsSupplementedBy)
Additional file 10: of The parameter sensitivity of random forests (2016, IsSupplementedBy)
The parameter sensitivity of random forests (2016, IsSupplementedBy)
Additional file 9: of The parameter sensitivity of random forests (2016, IsSupplementedBy)
Additional file 12: of The parameter sensitivity of random forests (2016, IsSupplementedBy)
Additional file 13: of The parameter sensitivity of random forests (2016, IsSupplementedBy)
Additional file 4: of The parameter sensitivity of random forests (2016, IsSupplementedBy)
Additional file 2: of The parameter sensitivity of random forests (2016, IsSupplementedBy)
Additional file 3: of The parameter sensitivity of random forests (2016, IsSupplementedBy)

Impact by BIP!

citations: 133. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.