Publication: Article, 09 Oct 2017, English
Publisher: Oxford University Press (OUP)
Journal: Bioinformatics, volume 34, pages 558-567 (ISSN: 1367-4803, eISSN: 1367-4811)

Authors: Shubham Chandak; Kedar Tatwawadi; Tsachy Weissman;

doi: 10.1093/bioinformatics/btx639

pmid: 29444237

pmc: PMC5860611

Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis


Abstract

Motivation: New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data.

Results: We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×–2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps explain its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference.

Supplementary information: Supplementary material is available at Bioinformatics online.

The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
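The reordering idea in the abstract can be illustrated with a toy sketch. This is not the HARC implementation: the function names (`reorder_reads`, `_find_next`), the parameters `k` and `max_shift`, and the exact-overlap matching rule are all assumptions made for illustration, since the abstract does not specify how matches are found or scored. The sketch indexes each read by its first `k` bases in a hash table, then greedily chains reads whose prefix matches a slightly shifted suffix of the current read, so that reads from neighbouring genome positions end up adjacent.

```python
from collections import defaultdict


def reorder_reads(reads, k=8, max_shift=4):
    """Toy greedy reordering (illustrative only, not HARC itself).

    Returns a list of indices into `reads` such that consecutive reads
    tend to overlap, approximating their order along the genome.
    """
    # Hash table keyed by the first k bases of each read. A Python dict
    # stands in for the "hashed substring indices" the abstract mentions.
    index = defaultdict(list)
    for i, read in enumerate(reads):
        index[read[:k]].append(i)

    used = [False] * len(reads)
    order = []
    for start in range(len(reads)):
        if used[start]:
            continue
        cur = start
        # Extend the chain forward until no overlapping read is found.
        while cur is not None:
            used[cur] = True
            order.append(cur)
            cur = _find_next(reads, cur, index, used, k, max_shift)
    return order


def _find_next(reads, cur, index, used, k, max_shift):
    """Find an unused read whose prefix equals a shifted suffix of reads[cur]."""
    read = reads[cur]
    for shift in range(max_shift + 1):
        key = read[shift:shift + k]
        if len(key) < k:
            break  # not enough bases left to form a lookup key
        for cand in index.get(key, ()):
            if used[cand]:
                continue
            overlap = read[shift:]
            # Assumption: exact overlap only; no mismatches are tolerated.
            if reads[cand][:len(overlap)] == overlap:
                return cand
    return None
```

For example, calling `reorder_reads(reads, k=4, max_shift=3)` on length-8 reads sampled every two bases from a short sequence chains them back into positional order, starting from whichever read comes first in the input. After such a reordering, each read can in principle be described relative to its predecessor (a shift plus a few new bases), which is the kind of redundancy a general-purpose compressor does not see in the original read order.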

Related Organizations

Stanford University, United States
Department of Electrical Engineering and Computer Science, University of Michigan, United States

Keywords

Genome, Bacteria, Eukaryota, High-Throughput Nucleotide Sequencing, Genomics, Sequence Analysis, DNA, Data Compression, Humans, Algorithms, Software

Related research products (2)

HARC software on GitHub (IsRelatedTo)
libbsd software on GitHub (IsRelatedTo)

Impact by BIP!

selected citations: 34. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.