publication . Conference object . Article . Other literature type . Part of book or chapter of book . 2012

Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Rayan Chikhi; Guillaume Rizk;
Open Access English
  • Published: 10 Sep 2012
  • Publisher: HAL CCSD
  • Country: France
Abstract
International audience; Background<br />The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB).<br />Results<br />We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives.<br />Conclusions<br />An assembly software implementing this structure, Minia...
Subjects
free text keywords: [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], [SDV.BBM.BM]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Molecular biology, Computational Theory and Mathematics, Applied Mathematics, Molecular Biology, Structural Biology, Research, de novo assembly, Data structure, Graph, Theoretical computer science, Assemblers, comic_books.character, comic_books, Bloom filter, Algorithm, De Bruijn graph, symbols.namesake, symbols, Computer science, Assembly software, Sequence assembly, Encoding (memory), False positive paradox
Funded by
ANR| GATB
Project
GATB
GENOMIC ASSEMBLY TOOL BOX
  • Funder: French National Research Agency (ANR) (ANR)
  • Project Code: ANR-12-EMMA-0019
,
ANR| MAPPI
Project
MAPPI
NEW SEQUENTIAL AND DISTRIBUTED APPROACHES IN ALGORITHMICS AND COMPUTATIONAL BIOLOGY FOR THE ANALYSIS OF DATA GENERATED BY NEXT GENERATION SEQUENCERS.
  • Funder: French National Research Agency (ANR) (ANR)
  • Project Code: ANR-10-COSI-0004
23 references, page 1 of 2

1. Idury RM, Waterman MS: A new algorithm for DNA sequence assembly. J Comput Biol 1995, 2(2):291-306.

2. Grabherr MG: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech 2011, 29(7):644-652. [http://dx.doi.org/10.1038/nbt.1883] [OpenAIRE]

3. Peng Y, Leung HCM, Yiu SM, Chin FYL: Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics 2011, 27(13):i94-i101. [OpenAIRE]

4. Peterlongo P, Schnel N, Pisanti N, Sagot MF, Lacroix V: Identifying SNPs without a reference genome by comparing raw reads. In String Processing and Information Retrieval. Berlin, Heidelberg: Springer; 2010:147-158.

5. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 2012, 44:226-232. [OpenAIRE]

6. Sacomoto G, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot M, Peterlongo P, Lacroix V: KISSPLICE: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinformatics 2012, 13(Suppl 6):S5. [http://www.biomedcentral.com/1471-2105/13/S6/S5]

7. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2010, 20(2):265.

8. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I: ABySS: A parallel assembler for short read sequence data. Genome Res 2009, 19(6):1117-1123. [http://genome.cshlp.org/content/19/6/1117.abstract]

9. Conway TC, Bromage AJ: Succinct data structures for assembling large genomes. Bioinformatics 2011, 27(4):479.

10. Warren RL, Holt RA: Targeted assembly of short sequence reads. PloS One 2011, 6(5):e19816.

11. Peterlongo P, Chikhi R: Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer. BMC Bioinformatics 2012, 13:48. [OpenAIRE]

12. Ye C, Ma Z, Cannon C, Pop M, Yu D: Exploiting sparseness in de novo genome assembly. BMC Bioinformatics 2012, 13(Suppl 6):S1. [http://www.biomedcentral.com/1471-2105/13/S6/S1] [OpenAIRE]

13. Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Arxiv preprint arXiv:1112.4193 2011.

14. Kirsch A, Mitzenmacher M: Less hashing, same performance: Building a better Bloom filter. Algorithms-ESA 2006, 4168:456-467. [OpenAIRE]

15. Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics 2010, 95(6):315-327.

23 references, page 1 of 2
Abstract
International audience; Background<br />The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB).<br />Results<br />We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives.<br />Conclusions<br />An assembly software implementing this structure, Minia...
Subjects
free text keywords: [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM], [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], [SDV.BBM.BM]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Molecular biology, Computational Theory and Mathematics, Applied Mathematics, Molecular Biology, Structural Biology, Research, de novo assembly, Data structure, Graph, Theoretical computer science, Assemblers, comic_books.character, comic_books, Bloom filter, Algorithm, De Bruijn graph, symbols.namesake, symbols, Computer science, Assembly software, Sequence assembly, Encoding (memory), False positive paradox
Funded by
ANR| GATB
Project
GATB
GENOMIC ASSEMBLY TOOL BOX
  • Funder: French National Research Agency (ANR) (ANR)
  • Project Code: ANR-12-EMMA-0019
,
ANR| MAPPI
Project
MAPPI
NEW SEQUENTIAL AND DISTRIBUTED APPROACHES IN ALGORITHMICS AND COMPUTATIONAL BIOLOGY FOR THE ANALYSIS OF DATA GENERATED BY NEXT GENERATION SEQUENCERS.
  • Funder: French National Research Agency (ANR) (ANR)
  • Project Code: ANR-10-COSI-0004
23 references, page 1 of 2

1. Idury RM, Waterman MS: A new algorithm for DNA sequence assembly. J Comput Biol 1995, 2(2):291-306.

2. Grabherr MG: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech 2011, 29(7):644-652. [http://dx.doi.org/10.1038/nbt.1883] [OpenAIRE]

3. Peng Y, Leung HCM, Yiu SM, Chin FYL: Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics 2011, 27(13):i94-i101. [OpenAIRE]

4. Peterlongo P, Schnel N, Pisanti N, Sagot MF, Lacroix V: Identifying SNPs without a reference genome by comparing raw reads. In String Processing and Information Retrieval. Berlin, Heidelberg: Springer; 2010:147-158.

5. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 2012, 44:226-232. [OpenAIRE]

6. Sacomoto G, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot M, Peterlongo P, Lacroix V: KISSPLICE: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinformatics 2012, 13(Suppl 6):S5. [http://www.biomedcentral.com/1471-2105/13/S6/S5]

7. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2010, 20(2):265.

8. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I: ABySS: A parallel assembler for short read sequence data. Genome Res 2009, 19(6):1117-1123. [http://genome.cshlp.org/content/19/6/1117.abstract]

9. Conway TC, Bromage AJ: Succinct data structures for assembling large genomes. Bioinformatics 2011, 27(4):479.

10. Warren RL, Holt RA: Targeted assembly of short sequence reads. PloS One 2011, 6(5):e19816.

11. Peterlongo P, Chikhi R: Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer. BMC Bioinformatics 2012, 13:48. [OpenAIRE]

12. Ye C, Ma Z, Cannon C, Pop M, Yu D: Exploiting sparseness in de novo genome assembly. BMC Bioinformatics 2012, 13(Suppl 6):S1. [http://www.biomedcentral.com/1471-2105/13/S6/S1] [OpenAIRE]

13. Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Arxiv preprint arXiv:1112.4193 2011.

14. Kirsch A, Mitzenmacher M: Less hashing, same performance: Building a better Bloom filter. Algorithms-ESA 2006, 4168:456-467. [OpenAIRE]

15. Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics 2010, 95(6):315-327.

23 references, page 1 of 2
Any information missing or wrong?Report an Issue