ZENODO
Software . 2025
Data sources: ZENODO, Datacite

A cloud-based proteomics ML workflow for biomarker discovery

Authors: Zhou, Yuhan; Maurya, Anand; Deng, Yun; Fletcher, Micah; Ren, Connia; Taylor, Avigail;

Abstract

Background: High-throughput affinity and mass-spectrometry-based proteomic studies of large clinical cohorts generate high-dimensional proteomic data useful for accelerated disease biomarker discovery. A powerful approach to realizing the potential of these big, complex, and non-linear data, whilst ensuring reproducible results, is to use automated machine learning (ML) and deep learning (DL) pipelines for their analysis. However, there remains a gap in comprehensive ML workflows tailored to proteomic biomarker discovery and designed for biomedical researchers who need pipelines to optimally self-configure and automatically avoid over-fitting.

Findings: We present BiomarkerML, a cloud-based workflow for automated, reproducible, and efficient ML/DL analysis of proteomic data for biomarker discovery, designed for novice-ML users and implemented in Python, R and Workflow Description Language (WDL). BiomarkerML: ingests proteomic and clinical data alongside sample labels; pre-processes data for model fitting and optionally performs dimensionality reduction and visualization; fits a catalogue of ML and DL classification and regression models; and calculates model performance metrics for model comparison. Next, the workflow applies mean SHapley Additive exPlanations (SHAP) to quantify the contribution of each protein to model predictions across all samples. Finally, proteins with high mean SHAP values, and their co-expressed protein network interactors, are identified as candidate biomarkers. Importantly, hyperparameters (configuration variables set prior to training models) are automatically fine-tuned via grid-search, and BiomarkerML employs weighted, nested cross-validation to avoid model over-fitting and data leakage.

Conclusions: BiomarkerML is scalable, provides a standardized, user-friendly interface, and streamlines analyses to ensure reproducibility of results. Overall, BiomarkerML is a significant advancement, enabling novice-ML researchers to use cutting-edge ML/DL tools to identify disease biomarkers in complex proteomic data.

Keywords: machine learning, cloud-based workflow, classification and regression, proteomic biomarker identification
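
The combination of grid-search tuning and nested cross-validation mentioned in the abstract can be illustrated with a small, standard-library-only sketch. It is not BiomarkerML's code: the k-nearest-neighbour classifier, the synthetic two-"protein" data, the fold counts, and every function name below are assumptions chosen for illustration, and the sketch is unweighted, whereas the abstract describes a weighted scheme. The point it shows is that the hyperparameter grid is searched only on the outer-training samples, so the outer-test fold never influences tuning.

```python
import random

def k_fold_indices(n, k, seed):
    """Shuffle range(n) and split it into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)

def accuracy(X, y, train_idx, test_idx, k):
    """Fit on train_idx, score on test_idx; the two index sets never overlap."""
    train_X = [X[i] for i in train_idx]
    train_y = [y[i] for i in train_idx]
    hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)

def nested_cv(X, y, grid, outer_k=5, inner_k=3, seed=0):
    """Return per-outer-fold accuracies and the hyperparameter chosen in each fold."""
    outer_scores, chosen = [], []
    outer_folds = k_fold_indices(len(X), outer_k, seed)
    for o, test_idx in enumerate(outer_folds):
        train_idx = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        # Inner loop: the grid search only ever sees outer-training samples.
        inner_folds = k_fold_indices(len(train_idx), inner_k, seed + 1)

        def inner_score(k):
            scores = []
            for j, val_pos in enumerate(inner_folds):
                val = [train_idx[p] for p in val_pos]
                fit = [train_idx[p]
                       for f, fold in enumerate(inner_folds) if f != j for p in fold]
                scores.append(accuracy(X, y, fit, val, k))
            return sum(scores) / len(scores)

        best_k = max(grid, key=inner_score)  # grid search over candidate values
        chosen.append(best_k)
        # Outer loop: score the tuned model once on the held-out outer fold.
        outer_scores.append(accuracy(X, y, train_idx, test_idx, best_k))
    return outer_scores, chosen

# Synthetic stand-in for a protein-abundance matrix: 60 samples, 2 features, 2 classes.
rng = random.Random(42)
X = [[rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)] for mu in (0.0, 5.0) for _ in range(30)]
y = [label for label in (0, 1) for _ in range(30)]

outer_scores, chosen_k = nested_cv(X, y, grid=[1, 3, 5])
print("outer-fold accuracies:", outer_scores)
print("k chosen per outer fold:", chosen_k)
```

The mean of `outer_scores` estimates how the whole tune-then-fit procedure generalizes, rather than the performance of a single hyperparameter setting, which is why the chosen value may differ from one outer fold to the next.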
