
Random forest versus logistic regression: a large-scale benchmark experiment.

Couronné, Raphael; Probst, Philipp; Boulesteix, Anne-Laure
  • Published: 01 Jul 2018
  • Publisher: not available
  • Country: Germany
Abstract
Background and goal: The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. It has meanwhile grown into a standard classification approach, competing with logistic regression (LR) in many innovation-friendly scientific fields. Results: In this context, we present a large-scale benchmark experiment based on 243 real datasets comparing the prediction performance of the original version of RF (with default parameters) and of LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major...
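
A minimal code sketch of the comparison described above may help. The paper's own analysis is R-based (its references cite the randomForest and tuneRanger/ranger packages), so the Python/scikit-learn snippet below is only an assumed, simplified analogue of a single per-dataset comparison: fit RF with its default hyperparameters and a plain logistic regression, estimate out-of-sample accuracy by cross-validation, and record the difference (RF minus LR). The built-in breast-cancer data stand in for one of the 243 real datasets and are not part of the benchmark.

# Hypothetical sketch, not the authors' pipeline: one per-dataset comparison
# of a default-parameter random forest against logistic regression.
from sklearn.datasets import load_breast_cancer   # stand-in dataset only
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(random_state=0)  # defaults only, mirroring the benchmark's RF arm
lr = LogisticRegression(max_iter=1000)       # note: scikit-learn adds L2 regularization by default,
                                             # unlike the unpenalized LR considered in the paper

acc_rf = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
acc_lr = cross_val_score(lr, X, y, cv=5, scoring="accuracy")

# The benchmark aggregates this per-dataset performance difference over all datasets.
print(f"RF mean CV accuracy:  {acc_rf.mean():.3f}")
print(f"LR mean CV accuracy:  {acc_lr.mean():.3f}")
print(f"Difference (RF - LR): {acc_rf.mean() - acc_lr.mean():.3f}")

The same pattern extends to the Brier score mentioned in the keywords by switching the scoring argument (in recent scikit-learn versions, scoring="neg_brier_score").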
Subjects
free text keywords: Technical Reports, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, 510, Research Article, Logistic regression, Classification, Prediction, Comparison study, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5, Random forest, Databases as Topic, Statistics, Binary classification, Brier score, Regression, Benchmarking, Biology, Mean difference, ddc:510, ddc:610
References (first 15 of 41 shown)

Shmueli, G. To explain or to predict? Stat Sci. 2010; 25: 289-310 [OpenAIRE] [DOI]

Breiman, L. Random forests. Mach Learn. 2001; 45 (1): 5-32 [OpenAIRE] [DOI]

Liaw, A, Wiener, M. Classification and regression by randomForest. R News. 2002; 2: 18-22

Probst, P. tuneRanger: Tune Random Forest of the 'ranger' Package. 2018; R package version 0.1

Boulesteix, A-L, Lauer, S, Eugster, MJ. A plea for neutral comparison studies in computational sciences. PLoS ONE. 2013; 8 (4): e61562 [OpenAIRE] [DOI]

De Bin, R, Janitza, S, Sauerbrei, W, Boulesteix, A-L. Subsampling versus bootstrapping in resampling-based model selection for multivariable regression. Biometrics. 2016; 72: 272-80 [OpenAIRE] [PubMed] [DOI]

Boulesteix, A-L, De Bin, R, Jiang, X, Fuchs, M. IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Comput Math Methods Med. 2017; doi:10.1155/2017/7691937 [DOI]

Boulesteix, A-L, Bender, A, Bermejo, JL, Strobl, C. Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations. Brief Bioinform. 2012; 13 (3): 292-304 [OpenAIRE] [PubMed] [DOI]

Boulesteix, A-L, Schmid, M. Machine learning versus statistical modeling. Biom J. 2014; 56 (4): 588-93 [OpenAIRE] [PubMed] [DOI]

Boulesteix, A-L, Janitza, S, Hornung, R, Probst, P, Busen, H, Hapfelmeier, A. Making complex prediction rules applicable for readers: current practice in random forest literature and recommendations. Biom J. 2016; in press

Boulesteix, A-L, Wilson, R, Hapfelmeier, A. Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies. BMC Med Res Methodol. 2017; 17 (1): 138 [OpenAIRE] [PubMed] [DOI]

Friedman, JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001; 29: 1189-232 [OpenAIRE] [DOI]

Hothorn, T, Hornik, K, Zeileis, A. Unbiased recursive partitioning: A conditional inference framework. J Comput Graph Stat. 2006; 15: 651-74 [DOI]

Strobl, C, Boulesteix, A-L, Zeileis, A, Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics. 2007; 8: 25 [OpenAIRE] [PubMed] [DOI]

Geurts, P, Ernst, D, Wehenkel, L. Extremely randomized trees. Mach Learn. 2006; 63 (1): 3-42 [OpenAIRE] [DOI]
