publication . Preprint . Other literature type . 2018

Clairvoyante: a multi-task convolutional deep neural network for variant calling in Single Molecule Sequencing

Luo, Ruibang; Sedlazeck, Fritz J.; Lam, Tak-Wah; Schatz, Michael C.
Open Access English
  • Published: 28 Apr 2018
  • Publisher: Cold Spring Harbor Laboratory
Abstract
The accurate identification of DNA sequence variants is an important, but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5%-15%. Meeting this demand, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieved 99.73%, 97.68% and 95.36% precision on known variants, and 98.65%, 92.57%, 87.26% F1-score for whole-genome analysis, using Il...
Subjects
free text keywords: Artificial neural network, Genomics, Indel, Convolutional neural network, Nanopore sequencing, Computer science, Zygosity, Word error rate, Computational biology, DNA sequencing
Related Organizations
Funded by
NIH| Computational Methods for Genome Assembly, Transcript Assembly, and Variant Discovery
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 2R01HG006677-15A1
  • Funding stream: NATIONAL HUMAN GENOME RESEARCH INSTITUTE
NSF| CAREER: Algorithms for single molecule sequence analysis
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 1350041
  • Funding stream: Directorate for Biological Sciences | Division of Biological Infrastructure
NIH| Genomic Architecture of Common Disease in Diverse Populations
Project
  • Funder: National Institutes of Health (NIH)
  • Project Code: 3UM1HG008898-01S3
  • Funding stream: NATIONAL HUMAN GENOME RESEARCH INSTITUTE


