Fast search of thousands of short-read sequencing experiments

Publication: Article, 08 Feb 2016, English

Publisher: Springer Science and Business Media LLC

Journal: Nature Biotechnology, volume 34, pages 300-302 (issn: 1087-0156, eissn: 1546-1696)

Funded by: NIH | Integrated, Interdiscipli..., NIH | Fast k-mer Counting to Qu..., NSF | CAREER: Model-based Recon... +2 projects

Authors: Solomon, Brad; Kingsford, Carl

doi: 10.1038/nbt.3442

pmid: 26854477

pmc: PMC4804353


Abstract

The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.
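The abstract does not spell out how an SBT is organized, so the following is only a minimal sketch under stated assumptions: each experiment is summarized by a Bloom filter over its k-mers, the filters are arranged as leaves of a binary tree, each internal node stores the bitwise union of its children, and a query descends into a subtree only while at least a fraction `theta` of the query's k-mers appear in that node's filter. The k-mer length, filter size, hash scheme, function names, experiment names and sequences are all hypothetical toy choices, not the authors' implementation.

```python
import hashlib

K = 8                # k-mer length; toy value chosen for this sketch
NUM_BITS = 1 << 16   # Bloom filter size in bits; toy value
NUM_HASHES = 2       # hash functions per k-mer; toy value


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_positions(kmer):
    """Map a k-mer to NUM_HASHES bit positions using slices of one SHA-256 digest."""
    digest = hashlib.sha256(kmer.encode("ascii")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % NUM_BITS
            for i in range(NUM_HASHES)]


class Node:
    """Tree node; bits is a Python int used as a bitset. Leaves carry a name."""

    def __init__(self, bits, name=None, left=None, right=None):
        self.bits = bits
        self.name = name
        self.left = left
        self.right = right


def make_leaf(name, reads):
    """Build a leaf whose filter holds every k-mer of every read."""
    bits = 0
    for read in reads:
        for kmer in kmers(read):
            for pos in bit_positions(kmer):
                bits |= 1 << pos
    return Node(bits, name=name)


def merge(left, right):
    """Build an internal node whose filter is the union of its children's filters."""
    return Node(left.bits | right.bits, left=left, right=right)


def may_contain(node, kmer):
    """Bloom filter membership test: false positives possible, no false negatives."""
    return all((node.bits >> pos) & 1 for pos in bit_positions(kmer))


def search(node, seq, theta=0.8):
    """Return names of leaves whose filter matches at least theta of seq's k-mers."""
    query = kmers(seq)
    if not query:
        return []  # sequence shorter than K: nothing to look up
    hits = sum(may_contain(node, kmer) for kmer in query)
    if hits < theta * len(query):
        return []  # prune: no leaf below this node can pass the threshold
    if node.name is not None:
        return [node.name]
    return search(node.left, seq, theta) + search(node.right, seq, theta)


# Toy data: made-up experiment names and reads, for illustration only.
experiments = {
    "exp_blood": ["ACGTTGCAAGGCTTAACCGT"],
    "exp_breast": ["TTGACCATGGCATTCGGAAC"],
    "exp_brain": ["GGCATTACGGATCCATGCAA"],
    "exp_other": ["CCATGGTTAACCGGATTCAG"],
}
leaves = [make_leaf(name, reads) for name, reads in experiments.items()]
root = merge(merge(leaves[0], leaves[1]), merge(leaves[2], leaves[3]))
print(search(root, "ACGTTGCAAGGCTTAACCGT"))
```

Because a union filter contains every bit set in its children, a subtree that fails the threshold can be skipped without missing a true match. Bloom filter false positives mean a reported experiment may still need confirmation against the raw reads.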

Related Organizations

Carnegie Mellon University, United States
University of Pittsburgh, United States

Keywords

Sequence Analysis, RNA; Data Mining; High-Throughput Nucleotide Sequencing; Humans; RNA; Sequence Analysis, DNA; Article; Algorithms

Impact by BIP!

selected citations: 109. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.