Compression of quantification uncertainty for scRNA-seq counts

Publication: Article, Other literature type, 06 Jul 2020, English

Publisher: Oxford University Press (OUP)

Journal: Bioinformatics, volume 37, pages 1699-1707 (ISSN: 1367-4803, eISSN: 1367-4811)

Funded by: NSF | CSR: Medium: Approximate ..., NIH | Cancer Center Support Gra..., NSF | CAREER: A Comprehensive a... +2 projects

Authors: Scott Van Buren; Hirak Sarkar; Avi Srivastava; Naim U. Rashid; Rob Patro; Michael I. Love;

doi: 10.1093/bioinformatics/btab001 , 10.1101/2020.07.06.189639

pmid: 33471073

pmc: PMC8289386

Abstract

Motivation: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of ‘inferential replicates’, which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increases computation time and memory requirements.

Results: We demonstrate that storing only the mean and variance from a set of inferential replicates (‘compression’) is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate ‘pseudo-inferential’ replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset.

Availability and implementation: makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. Analyses and simulated datasets can be found in the paper’s GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode.

Supplementary information: Supplementary data are available at Bioinformatics online.
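The compression and pseudo-inferential replicate steps described in the Results can be illustrated with a small sketch for a single gene in a single cell. This is not the fishpond `makeInfReps` implementation: the function names `compress`, `poisson_draw`, and `pseudo_replicates` are hypothetical, the method-of-moments mapping from (mean, variance) to negative binomial parameters is an assumption, and so is the Poisson fallback used when the variance does not exceed the mean.

```python
import math
import random
import statistics


def compress(replicates):
    """Reduce a gene's inferential replicates in one cell to (mean, variance)."""
    return statistics.fmean(replicates), statistics.variance(replicates)


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) with Knuth's method (adequate for small lam only)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def pseudo_replicates(mean, var, n, rng):
    """Generate n pseudo-inferential replicates from the stored mean and variance.

    Assumption: negative binomial parameters are obtained by the method of
    moments, var = mean + mean**2 / size, and sampling uses the gamma-Poisson
    mixture. If var <= mean, no overdispersion can be fit, so plain Poisson
    draws are used instead (also an assumption).
    """
    if mean <= 0:
        return [0] * n
    if var <= mean:
        return [poisson_draw(mean, rng) for _ in range(n)]
    size = mean ** 2 / (var - mean)   # NB dispersion ("size") parameter
    scale = mean / size               # gamma scale, so the gamma mean equals `mean`
    return [poisson_draw(rng.gammavariate(size, scale), rng) for _ in range(n)]


# Hypothetical inferential replicates for one gene in one cell.
original = [2, 10, 4, 12, 2]
mean, var = compress(original)        # only these two numbers are stored
rng = random.Random(0)
regenerated = pseudo_replicates(mean, var, 20, rng)
print(mean, var, regenerated)
```

Storing two numbers per gene and cell instead of the full set of replicates is what yields the storage reduction; the regenerated draws match the originals only in their first two moments, not value by value.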

Related Organizations

University of Maryland, College Park
United States
Department of Computer Science University of Maryland
United States
New York Genome Center
United States
New York University
United States
Harvard University
United States

Keywords

Original Papers

Impact by BIP!

selected citations: 7 (derived from selected sources; an alternative to the "Influence" indicator)
popularity: Top 10% (the "current" impact/attention of the article in the research community at large, based on the underlying citation network)
influence: Average (the overall/total impact of the article in the research community at large, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of the article directly after its publication, based on the underlying citation network)