Simple Runs-Bounded FM-Index Designs Are Fast.

Type: Conference object, Article
Date: 01 Jan 2023
Publisher: Schloss Dagstuhl – Leibniz-Zentrum für Informatik

Authors: Diego Díaz-Domínguez; Saska Dönges; Simon J. Puglisi; Leena Salmela

handle: 10138/565007

Abstract

Given a string X of length n over an alphabet Σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores, for each symbol, the number of its occurrences before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block i/b in which i falls and scanning the block to the appropriate position, counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.
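The two rank designs described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_length_encode`, `scan_block`, `FixedBlockRank`, and `FixedRunsRank` are hypothetical, plain lists and dicts stand in for the compact encodings an engineered index would use, and a binary search via `bisect` stands in for whatever predecessor structure the paper employs. Positions follow the abstract's 1-based convention, so `rank(c, i)` counts c in L[1..i].

```python
from bisect import bisect_right
from itertools import groupby


def run_length_encode(s):
    """Return a list of (symbol, run_length) pairs for s."""
    return [(c, len(list(g))) for c, g in groupby(s)]


def scan_block(runs, c, remaining):
    """Count occurrences of c among the first `remaining` symbols of a block."""
    total = 0
    for sym, length in runs:
        if remaining <= 0:
            break
        take = min(length, remaining)
        if sym == c:
            total += take
        remaining -= take
    return total


class FixedBlockRank:
    """Design 1 (sketch): blocks of b symbols, each run-length encoded separately."""

    def __init__(self, L, b):
        self.b = b
        self.blocks = []  # run-length encoding of each block
        self.before = []  # per block: count of each symbol before the block
        counts = {}
        for start in range(0, len(L), b):
            chunk = L[start:start + b]
            self.before.append(dict(counts))
            self.blocks.append(run_length_encode(chunk))
            for c in chunk:
                counts[c] = counts.get(c, 0) + 1

    def rank(self, c, i):
        if i <= 0 or not self.blocks:
            return 0
        # Block containing position i, found by integer division.
        k = min((i - 1) // self.b, len(self.blocks) - 1)
        partial = scan_block(self.blocks[k], c, i - k * self.b)
        return self.before[k].get(c, 0) + partial


class FixedRunsRank:
    """Design 2 (sketch): blocks holding r runs each of the run-length encoded L."""

    def __init__(self, L, r):
        runs = run_length_encode(L)
        self.blocks = [runs[j:j + r] for j in range(0, len(runs), r)]
        self.starts = []  # 0-based text position at which each block begins
        self.before = []  # per block: count of each symbol before the block
        counts, pos = {}, 0
        for block in self.blocks:
            self.starts.append(pos)
            self.before.append(dict(counts))
            for sym, length in block:
                counts[sym] = counts.get(sym, 0) + length
                pos += length

    def rank(self, c, i):
        if i <= 0 or not self.blocks:
            return 0
        # Predecessor query: last block whose start is at or before position i.
        k = bisect_right(self.starts, i - 1) - 1
        partial = scan_block(self.blocks[k], c, i - self.starts[k])
        return self.before[k].get(c, 0) + partial
```

In the first class the block is located by arithmetic alone, at the cost of splitting runs that straddle block boundaries. In the second, no run is split, but locating the block needs the search over `starts`. Both return the same answers as a naive count such as `L[:i].count(c)`.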

Peer reviewed

Countries

Germany, Finland

Related Organizations

Helsinki Institute for Information Technology, Finland
Schloss Dagstuhl – Leibniz Center for Informatics, Germany
University of Helsinki, Finland
Leibniz Association, Germany

Keywords

data structures, Computer and information sciences, efficient algorithms, ddc:004

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Funded by

AKA | Massively Parallel Algorithms and Analysis for Metagenomics and Pangenomics (MAPAMEPA), AKA | Flavours of de Bruijn graphs: from theory to practice, AKA | Dynamic Succinct Data Structures

Related to Research communities

UArctic