The Multicollinearity Effect on the Performance of Machine Learning Algorithms: Case Examples in Healthcare Modelling

Name: The Multicollinearity Effect on the Performance of Machine Learning Algorithms: Case Examples in Healthcare Modelling
Creator: Hasan Yıldırım
Keywords: Machine Learning Algorithms, Supervised Learning, Machine Learning (Other), machine learning, multicollinearity, feature selection, artificial intelligence, collinearity

Academic Platform Journal of Engineering and Smart Systems

Article . 2024 . Peer-reviewed

Data sources: Crossref

TÜBİTAK ULAKBİM DergiPark

Article . 2023

Data sources: TÜBİTAK ULAKBİM DergiPark

Publication: Article, 25 Sep 2024
Publisher: Academic Platform Journal of Engineering and Smart Systems
Journal: Academic Platform Journal of Engineering and Smart Systems, volume 12, pages 68-80 (eISSN: 2822-2385)

Authors: Hasan Yıldırım;

doi: 10.21541/apjess.1371070

Abstract

Background: Technological advances have driven rapid growth in the volume of data that must be interpreted, and data collected in many fields inherently contain highly correlated measurements. This problem, known as multicollinearity, affects the performance of both statistical and machine learning algorithms. Statistical models proposed as potential remedies have not been sufficiently evaluated in the literature, so a comprehensive comparison of statistical and machine learning models is needed to address the multicollinearity problem.
Methods: Statistical models (including Ridge, Liu, Lasso and Elastic Net regression) and eight of the most widely used machine learning algorithms (CART, KNN, MLP, MARS, Cubist, SVM, Bagging and XGBoost) are compared on two healthcare datasets (Body Fat and Cancer) that exhibit multicollinearity. Model performance is assessed through cross-validation using root mean square error, mean absolute error and R-squared.
Results: The statistical models outperformed the machine learning models on root mean square error, mean absolute error and R-squared in both training and testing. In particular, Liu regression often achieved better relative performance, both among the regression methods and compared with the machine learning algorithms: 7.60% to 46.08% on the Body Fat data set and 1.55% to 21.53% on the Cancer data set in training, and 1.56% to 38.08% on the Body Fat data set and 3.50% to 23.29% on the Cancer data set in testing.
Conclusions: Liu regression is largely disregarded in the machine learning literature. Because it outperformed the most powerful and widely used machine learning algorithms in this study, it appears to be a promising tool in almost all fields, especially for regression-based studies on data with multicollinearity.
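
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code: it uses the textbook form of the Liu estimator, (X'X + I)^{-1}(X'y + d * beta_OLS), and scores it with k-fold cross-validation on the three criteria named above. The synthetic data, the helper names (`liu`, `cv_scores`, `make_collinear_data`), the fold count and the candidate values of `d` are all assumptions made for illustration. Predictors are only centered here; in practice they would usually also be scaled.

```python
import numpy as np


def ols(X, y):
    """Ordinary least squares coefficients via a least-squares solve."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def liu(X, y, d):
    """Liu estimator: (X'X + I)^{-1} (X'y + d * beta_ols).

    d = 1 recovers OLS; smaller d shrinks the coefficients more.
    Assumes X and y are already centered (no intercept column).
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(p), X.T @ y + d * ols(X, y))


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def cv_scores(X, y, d, n_folds=5, seed=0):
    """Mean test-fold RMSE, MAE and R-squared of Liu regression."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    results = {"rmse": [], "mae": [], "r2": []}
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        # Center with training statistics only, to avoid leakage.
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        beta = liu(X[train] - x_mean, y[train] - y_mean, d)
        pred = (X[fold] - x_mean) @ beta + y_mean
        results["rmse"].append(rmse(y[fold], pred))
        results["mae"].append(mae(y[fold], pred))
        results["r2"].append(r_squared(y[fold], pred))
    return {name: float(np.mean(vals)) for name, vals in results.items()}


def make_collinear_data(n=200, seed=0):
    """Synthetic data (hypothetical): three near-duplicate predictors."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)])
    y = X @ np.ones(3) + rng.normal(size=n)
    return X, y


X, y = make_collinear_data()
for d in (1.0, 0.5, 0.1):  # d = 1.0 is plain OLS
    print(d, cv_scores(X, y, d))
```

The biasing parameter `d` would normally be tuned by cross-validation, as is done for the penalty in Ridge regression; the fixed grid above only shows the mechanics.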

Related Organizations

Karamanoğlu Mehmetbey University
Turkey

Impact by BIP!

- Selected citations: 13 (derived from selected sources; an alternative to the influence indicator)
- Popularity: Top 10% (the current attention an article receives in the research community, based on the underlying citation network)
- Influence: Average (the overall impact of an article in the research community, based on the underlying citation network, diachronically)
- Impulse: Top 10% (the initial momentum of an article directly after its publication, based on the underlying citation network)

Related to Research communities

Cancer Research
