Learning sparse log-ratios for high-throughput sequencing data

Type: Article, Other literature type

Date: 12 Feb 2021

Language: English

Publisher: Oxford University Press (OUP)

Journal: Bioinformatics, volume 38, pages 157-163 (issn: 1367-4803, eissn: 1367-4811)

Funded by: NSF | NeuroNex Theory Team: Col...

Authors: Elliott Gordon-Rodríguez; Thomas P. Quinn; John P. Cunningham

doi: 10.1093/bioinformatics/btab645, 10.1101/2021.02.11.430695

pmid: 34498030

pmc: PMC8696089

Abstract

Motivation: The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.

Results: Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods.

Availability and implementation: The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore.

Supplementary information: Supplementary data are available at Bioinformatics online.
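The log-ratio biomarkers and the continuous relaxation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the CoDaCoRe implementation: the abstract does not give the exact form of the relaxation, so the tanh-weighted surrogate below is an assumption, and the names `log_ratio`, `relaxed_log_ratio`, `discretize`, `theta` and `threshold` are hypothetical. The gradient-descent step that would fit `theta` to an outcome is omitted.

```python
import math


def log_ratio(x, num, den):
    """Log-ratio between the geometric means of two groups of parts.

    x: strictly positive values (e.g. counts with zeros already replaced).
    num, den: index lists for the numerator and denominator groups.
    """
    def mean_log(idx):
        return sum(math.log(x[j]) for j in idx) / len(idx)

    return mean_log(num) - mean_log(den)


def relaxed_log_ratio(x, theta):
    """Differentiable surrogate (an assumed form, for illustration only).

    Each part j gets a soft weight w_j = tanh(theta_j) in (-1, 1):
    a positive weight is soft membership in the numerator, a negative
    weight is soft membership in the denominator.
    Requires at least one positive and one negative weight.
    """
    w = [math.tanh(t) for t in theta]
    pos = [max(v, 0.0) for v in w]    # soft numerator weights
    neg = [max(-v, 0.0) for v in w]   # soft denominator weights
    logs = [math.log(v) for v in x]
    num = sum(p * l for p, l in zip(pos, logs)) / sum(pos)
    den = sum(n * l for n, l in zip(neg, logs)) / sum(neg)
    return num - den


def discretize(theta, threshold=0.5):
    """Turn soft weights into hard numerator/denominator index lists."""
    w = [math.tanh(t) for t in theta]
    num = [j for j, v in enumerate(w) if v > threshold]
    den = [j for j, v in enumerate(w) if v < -threshold]
    return num, den
```

Because only ratios between parts enter the computation, multiplying every entry of `x` by the same constant leaves `log_ratio` unchanged, which is the property that makes log-ratios suitable for compositional data. When the entries of `theta` are pushed far from zero, the surrogate approaches the hard log-ratio over the groups returned by `discretize`.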

Related Organizations

- Columbia University, United States
- Geelong Hospital, Australia
- Barwon Health, Australia
- King’s University, United States

Keywords

Microbiota, High-Throughput Nucleotide Sequencing, Metagenomics, Original Papers, Software, Algorithms

Impact by BIP!

- selected citations: 26. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.