High-throughput DNA sequence data compression

Publication: Article, 03 Dec 2013, English
Publisher: Oxford University Press (OUP)
Journal: Briefings in Bioinformatics, volume 16, pages 1-15 (ISSN: 1467-5463, eISSN: 1477-4054)

Authors: Zexuan Zhu; Yongpeng Zhang; Zhen Ji; Shan He; Xiao Yang

doi: 10.1093/bib/bbt087

pmid: 24300111

Abstract

The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic data storage, retrieval and transmission. Compression is a critical tool for addressing these challenges, and many methods have been developed to reduce the storage size of genomes and sequencing data (reads, quality scores and metadata). However, genomic data are being generated faster than they can be meaningfully analyzed, leaving considerable scope for developing novel compression algorithms that directly facilitate data analysis, beyond data transfer and storage. In this article, we categorize and provide a comprehensive review of the existing compression methods specialized for genomic data and present experimental results on compression ratio, memory usage, and compression and decompression time. We further present the remaining challenges and potential directions for future research.

Related Organizations

Broad Institute, United States
Shenzhen University, China (People's Republic of)
University of Birmingham, United Kingdom

Keywords

Molecular Sequence Data, High-Throughput Nucleotide Sequencing, Data Compression

Impact by BIP!

selected citations: 90. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.