research data . Dataset . 2016

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Benoit, Gaëtan; Lemaitre, Claire; Lavenier, Dominique; Drezen, Erwan; Dayris, Thibault; Uricaru, Raluca; Rizk, Guillaume
  • Published: 14 Dec 2016
  • Publisher: Figshare
Abstract
Background: Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools, such as the widely used gzip. Results: We present a novel reference-free method for compressing data produced by high-throughput sequencing technologies. Our approach, implemented in the software Leon, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded...
Subjects
free text keywords: Genetics, FOS: Biological sciences, Molecular Biology, 69999 Biological Sciences not elsewhere classified, 80699 Information Systems not elsewhere classified, FOS: Computer and information sciences
Funded by
ANR| COLIB'READ
Project
COLIB'READ
METHODS FOR EFFICIENT DETECTION OF BIOLOGICAL INFORMATION FROM NON ASSEMBLED HTS DATA.
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-12-BS02-0008
ANR| GATB
Project
GATB
GENOMIC ASSEMBLY TOOL BOX
  • Funder: French National Research Agency (ANR)
  • Project Code: ANR-12-EMMA-0019
Download from
figshare
Dataset . 2016
Provider: Datacite