
Abstract Motivation As a fundamental task in bioinformatics, searching for massive short patterns over a long text has been accelerated by various compressed full-text indexes. These indexes are able to provide similar searching functionalities to classical indexes, e.g. suffix trees and suffix arrays, while requiring less space. For genomic data, a well-known family of compressed full-text indexes, called FM-indexes, presents unmatched performance in practice. One major drawback of FM-indexes is that their locating operations, which report all occurrence positions of patterns in a given text, are not efficient, especially for the patterns with many occurrences. Results In this paper, we introduce a novel locating algorithm, FMtree, to fast retrieve all occurrence positions of any pattern via FM-indexes. When searching for a pattern over a given text, FMtree organizes the search space of the locating operation into a conceptual multiway tree. As a result, multiple occurrence positions of this pattern can be retrieved simultaneously by traversing the multiway tree. Compared with existing locating algorithms, our tree-based algorithm reduces large numbers of redundant operations and presents better data locality. Experimental results show that FMtree is usually one order of magnitude faster than the state-of-the-art algorithms, and still memory-efficient. Availability and implementation FMtree is freely available at https://github.com/chhylp123/FMtree. Supplementary information Supplementary data are available at Bioinformatics online.
FOS: Computer and information sciences, Genome, Genomics, Sequence Analysis, DNA, Mice, Computer Science - Data Structures and Algorithms, Animals, Humans, Data Structures and Algorithms (cs.DS), Algorithms, Software
FOS: Computer and information sciences, Genome, Genomics, Sequence Analysis, DNA, Mice, Computer Science - Data Structures and Algorithms, Animals, Humans, Data Structures and Algorithms (cs.DS), Algorithms, Software
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 8 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
