Significant speedup of database searches with HMMs by search space reduction with PSSM family models

Publication: Article, Other literature type. 14 Oct 2009. Germany. English.

Publisher: Oxford University Press (OUP)

Journal: Bioinformatics, volume 25, pages 3251-3258 (ISSN: 1367-4803, eISSN: 1367-4811)

Authors: Beckstette, Michael; Homann, Robert; Giegerich, Robert; Kurtz, Stefan

doi: 10.1093/bioinformatics/btp593

pmid: 19828575

pmc: PMC2788931

Abstract

Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive.

Results: We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs, as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrices family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores and fast fragment chaining. The resulting method is well suited to prefilter the set of sequences to be searched in subsequent database searches with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet results were obtained in a fraction of the runtime, with a speedup of >64-fold. In experiments addressing the method's ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduces the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup of factor 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup of factor 92.

Availability: The presented algorithms are implemented in the program PoSSuMsearch2, available for download at http://bibiserv.techfak.uni-bielefeld.de/possumsearch2/.

Contact: beckstette@zbh.uni-hamburg.de

Supplementary information: Supplementary data are available at Bioinformatics online.
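To make the prefiltering idea in the abstract concrete, the following is a minimal Python sketch, not the PoSSuMsearch2 implementation. It assumes a single integer-valued PSSM, an i.i.d. background distribution, and a naive sliding-window scan; the abstract's full-text indexing and fragment chaining are deliberately left out. All function names, the two-letter toy alphabet, and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Exact distribution of match scores for an integer PSSM.

    Assumes residues are drawn independently from `background`
    (an illustrative null model, not taken from the paper).
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for residue, s in column.items():
                nxt[score + s] += prob * background[residue]
        dist = dict(nxt)
    return dist


def threshold_for_pvalue(dist, max_pvalue):
    """Smallest score t with P(score >= t) <= max_pvalue, or None."""
    tail = 0.0
    threshold = None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > max_pvalue:
            break
        threshold = score
    return threshold


def has_match(pssm, sequence, threshold):
    """True if some window of the sequence scores at least `threshold`."""
    m = len(pssm)
    for start in range(len(sequence) - m + 1):
        score = sum(pssm[i][sequence[start + i]] for i in range(m))
        if score >= threshold:
            return True
    return False


def prefilter(pssm, sequences, threshold):
    """Names of sequences that would be passed on to the slower search."""
    return [name for name, seq in sequences.items()
            if has_match(pssm, seq, threshold)]


# Toy data over a two-letter alphabet (hypothetical, for illustration only).
toy_pssm = [{"A": 2, "C": -1}, {"A": -1, "C": 2}, {"A": 2, "C": -1}]
toy_background = {"A": 0.5, "C": 0.5}
toy_sequences = {"s1": "CCACAC", "s2": "CCCCCC"}

toy_dist = score_distribution(toy_pssm, toy_background)
toy_threshold = threshold_for_pvalue(toy_dist, 0.125)
survivors = prefilter(toy_pssm, toy_sequences, toy_threshold)
```

In this toy setup only the top-scoring window `ACA` reaches the threshold, so `s1` survives the filter and `s2` is discarded. A real filter would then run the expensive pHMM search on the surviving sequences only, which is where the reported runtime savings come from.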

Country

Germany

Related Organizations

Bielefeld University, Germany
Universität Hamburg, Germany

Keywords

Computational Biology; Proteins; Original Papers; Markov Chains; Pattern Recognition, Automated; Sequence Analysis, Protein; Position-Specific Scoring Matrices; Databases, Protein; Sequence Alignment; Algorithms; Software

4 Research products

- MOESM3 of The discovery of novel LPMO families with a new Hidden Markov model (2017, IsAmongTopNSimilarDocuments)
- PoSSuMsearch (2016, IsSupplementedBy)
- METABOLIC: high-throughput profiling of microbial genomes for functional traits, metabolism, biogeochemistry, and community-scale functional networks (2021, IsAmongTopNSimilarDocuments)
- Evaluating the use of GPUs in liver image segmentation and HMMER database searches (2009, IsAmongTopNSimilarDocuments)

Impact by BIP!

- selected citations: 10. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.