Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection

Publication type: Article, Other literature type, Preprint

Date: 22 Nov 2024

Embargo end date: 01 Jan 2023

Language: English

Publisher: Oxford University Press (OUP)

Journal: Briefings in Bioinformatics, volume 26 (issn: 1467-5463, eissn: 1477-4054)

Funded by: AKA | Biomarker Discovery for P..., AKA | Biomarker Discovery for P..., AKA | Biomarker Discovery for P...

Authors: Luca Cattelani; Vittorio Fortino;

doi: 10.1093/bib/bbae674 , 10.48550/arxiv.2312.16624

pmid: 39737563

pmc: PMC11684899

arXiv: 2312.16624

Abstract

Selecting biomarker panels from omics data, where molecular features are numerous and samples are limited, often requires machine learning methods paired with wrapper feature selection techniques such as genetic algorithms. These techniques test various feature sets (potential biomarker solutions) to fine-tune a machine learning model’s performance on supervised tasks, such as classifying cancer subtypes. The optimization uses validation sets to evaluate and identify the most effective feature combinations. These evaluations carry a performance estimation error, measurable as the discrepancy between validation and test set performance, and when the selection involves many models the best ones are almost certainly overestimated. This issue is also relevant in multi-objective feature selection, where several characteristics of the biomarker panels are optimized, such as predictive performance and feature set size. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no algorithm existed that could reduce the overestimation during the optimization, improve model selection, or be applied in the more general multi-objective domain. We propose the Dual-stage Optimizer for Systematic overestimation Adjustment in Multi-Objective problems (DOSA-MO), a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expected performance during the optimization, improving the composition of the solution set. We verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets when predicting cancer subtypes and/or patient overall survival, using three transcriptomics datasets for kidney and breast cancer.
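The adjustment idea in the abstract (learn how the validation estimate, its variance, and the feature set size predict the overestimation, then correct the estimate during optimization) can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of ordinary least squares as the regression model, the function names `fit_overestimation_model` and `adjusted_performance`, and the synthetic data are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def _design_matrix(val_perf, val_var, n_features):
    # Columns: intercept, validation estimate, its variance, feature set size.
    val_perf = np.asarray(val_perf, dtype=float)
    return np.column_stack([
        np.ones_like(val_perf),
        val_perf,
        np.asarray(val_var, dtype=float),
        np.asarray(n_features, dtype=float),
    ])


def fit_overestimation_model(val_perf, val_var, n_features, test_perf):
    """Stage 1 (sketch): regress the overestimation (validation - test)
    on the three predictors named in the abstract.
    Ordinary least squares is an assumption of this illustration."""
    overestimation = (np.asarray(val_perf, dtype=float)
                      - np.asarray(test_perf, dtype=float))
    X = _design_matrix(val_perf, val_var, n_features)
    coef, *_ = np.linalg.lstsq(X, overestimation, rcond=None)
    return coef


def adjusted_performance(coef, val_perf, val_var, n_features):
    """Stage 2 (sketch): subtract the predicted overestimation from the
    validation estimate; an optimizer would rank solutions on this value."""
    X = _design_matrix(val_perf, val_var, n_features)
    return np.asarray(val_perf, dtype=float) - X @ coef


# Synthetic example (hypothetical numbers, not data from the paper).
rng = np.random.default_rng(0)
n = 200
n_features = rng.integers(2, 50, size=n)
val_var = rng.uniform(0.0005, 0.01, size=n)
test_perf = rng.uniform(0.6, 0.9, size=n)
# Assumed ground truth: overestimation grows with variance and panel size.
true_over = 0.01 + 3.0 * val_var + 0.0005 * n_features
val_perf = test_perf + true_over

coef = fit_overestimation_model(val_perf, val_var, n_features, test_perf)
adjusted = adjusted_performance(coef, val_perf, val_var, n_features)
print("max abs error after adjustment:", np.max(np.abs(adjusted - test_perf)))
```

Because the synthetic overestimation is exactly linear in the predictors, the adjusted values match the simulated test performance closely; with real data the fit would be approximate, and a different regression model might be more appropriate.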

Related Organizations

University of Eastern Finland, Finland
Institute for Biomedicine, Italy

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computational Biology, Quantitative Biology - Quantitative Methods, Machine Learning (cs.LG), Machine Learning, Neoplasms, FOS: Biological sciences, Biomarkers, Tumor, Problem Solving Protocol, Humans, Algorithms, Quantitative Methods (q-bio.QM)

Related research

BIODAI software on GitHub (relation: IsRelatedTo)

Impact by BIP!

selected citations: 2. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.