publication . Article . Other literature type . 2015

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.

Claire Lemaitre; Erwan Drezen; Gaëtan Benoit; Raluca Uricaru; Thibault Dayris; Guillaume Rizk; Dominique Lavenier
Open Access English
  • Published: 14 Sep 2015 Journal: BMC Bioinformatics, volume 16 (eISSN: 1471-2105)
  • Publisher: BioMed Central
  • Country: France
Abstract
Background: Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip method. Results: We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software Leon, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a pat...
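To illustrate the idea of a probabilistic de Bruijn graph stored in a Bloom filter, here is a minimal Python sketch. It is not Leon's implementation: the class `BloomFilter`, the functions `build_graph` and `successors`, the hash choice (BLAKE2b with a per-hash salt), and the default sizes are all assumptions made for illustration. The sketch also ignores reverse complements and k-mer abundance filtering, which a real tool would probably need to handle.

```python
import hashlib


class BloomFilter:
    """Bit array with several hash functions; may report false positives,
    never false negatives. (Hypothetical helper, not Leon's data structure.)"""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with the hash index.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


def build_graph(reads, k, n_bits=1 << 16, n_hashes=4):
    """Insert every k-mer of every read into a Bloom filter.
    The filter alone represents the graph's node set."""
    bloom = BloomFilter(n_bits, n_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
    return bloom


def successors(bloom, kmer):
    """Edges are implicit: try the four possible extensions of the
    (k-1)-suffix and keep those the filter reports as present."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in bloom]


# Toy usage: one read, k = 4.
graph = build_graph(["ACGTACGGA"], k=4)
print(successors(graph, "TACG"))
```

Because membership queries can return false positives, the graph is "probabilistic": a traversal may see a few successors that never occurred in the reads, in exchange for a much smaller memory footprint than an exact k-mer set.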
Subjects
free text keywords: Research Article, Compression, de Bruijn Graph, NGS, Bloom filter, [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], Biochemistry, Applied Mathematics, Molecular Biology, Structural Biology, Computer Science Applications, Lossy compression, Data compression, symbols.namesake, symbols, Theoretical computer science, Probabilistic logic, Metagenomics, Software, business.industry, business, Computer science, File size
Funded by
ANR| COLIB'READ
Project
COLIB'READ
METHODS FOR EFFICIENT DETECTION OF BIOLOGICAL INFORMATION FROM NON ASSEMBLED HTS DATA.
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-12-BS02-0008
ANR| GATB
Project
GATB
GENOMIC ASSEMBLY TOOL BOX
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-12-EMMA-0019