publication . Preprint . 2016

Sensitive protein sequence searching for the analysis of massive data sets

Martin Steinegger; Johannes Söding
Open Access English
  • Published: 07 Oct 2016
Abstract
Sequencing costs have dropped much faster than Moore's law would predict over the past decade, and sensitive sequence searching has become the main bottleneck in the analysis of large (meta)genomic datasets. While previous methods sacrificed sensitivity for speed gains, the parallelized, open-source software MMseqs2 overcomes this trade-off: in three-iteration profile searches, it reaches 50% higher sensitivity than BLAST at 83 times its speed, and the same sensitivity as PSI-BLAST at 270 times its speed. MMseqs2 therefore offers great potential to increase the fraction of annotatable (meta)genomic sequences.
Funded by
EC| Virus-X
Project
Virus-X
Virus-X: Viral Metagenomics for Innovation Value
  • Funder: European Commission (EC)
  • Project Code: 685778
  • Funding stream: H2020 | RIA