Publication: Article, Preprint, Other literature type | 15 Jan 2020 | Embargo end date: 01 Jan 2018 | English | Publisher: Association for Computing Machinery (ACM) | Journal: Journal of the ACM, volume 67, pages 1-54 (issn: 0004-5411, eissn: 1557-735X,

Authors: GAGIE Travis; NAVARRO Gonzalo; PREZZA Nicola;

doi: 10.1145/3375890, 10.48550/arxiv.1809.02792

arXiv: 1809.02792

handle: 10278/3729797, 11385/192324

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Abstract

Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w / log σ) factor, where σ is the alphabet size and w = Ω(log n) is the RAM machine size in bits, we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log(n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1 to 2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1 to 2 orders of magnitude in space and/or 2 to 3 in time.
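
To make the measure r and the counting operation mentioned in the abstract concrete, here is a minimal Python sketch. It is not the article's data structure: it builds an uncompressed BWT by naively sorting rotations, counts the equal-letter runs, and counts pattern occurrences with textbook backward search using linear-scan rank queries. The function names (`bwt`, `count_runs`, `backward_search_count`) and the `$` sentinel are assumptions made for illustration only.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT via sorted rotations (quadratic; illustration only).

    Assumes the sentinel does not occur in the text and sorts before
    every text symbol (true for '$' versus ASCII letters).
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(bwt_str: str) -> int:
    """Return r, the number of maximal equal-letter runs in the BWT."""
    return sum(
        1 for i, c in enumerate(bwt_str) if i == 0 or c != bwt_str[i - 1]
    )


def backward_search_count(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern via backward search over the BWT.

    rank() is a plain linear scan here, so each step costs O(n); a real
    index would answer it from a compressed structure instead.
    """
    # C[c] = number of symbols in the BWT strictly smaller than c.
    freq = {}
    for c in bwt_str:
        freq[c] = freq.get(c, 0) + 1
    C, total = {}, 0
    for c in sorted(freq):
        C[c] = total
        total += freq[c]

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i].
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted suffixes
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("abracadabra")
    print(b, count_runs(b))                  # BWT string and its run count
    print(backward_search_count(b, "abra"))  # occurrences of "abra"
    # A highly repetitive text: n grows with the repeat count, r does not.
    print(count_runs(bwt("ab" * 50)))
```

On a highly repetitive input such as `"ab" * 50`, the BWT collapses into three runs regardless of the number of repetitions, which is the kind of behaviour that makes space bounds in terms of r attractive for repetitive collections.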

Related Organizations

Dalhousie University (Canada)
Ca Foscari University of Venice (Italy)
Guido Carli Free International University for Social Studies (Italy)
Diego Portales University (Chile)
University of Pisa (Italy)

Keywords

FOS: Computer and information sciences, Theory of computation, Design and analysis of algorithms, Data structures design and analysis, Pattern matching, Computer Science - Data Structures and Algorithms, Compressed text indexes, Data Structures and Algorithms (cs.DS), Repetitive string collections, Compressed suffix trees, Burrows-Wheeler transform

Related research (10 research products)

Indexing compressed text (2005): IsAmongTopNSimilarDocuments
Optimal-Time Dictionary-Compressed Indexes (2020): IsAmongTopNSimilarDocuments
Data Structures for Path Queries (2016): IsAmongTopNSimilarDocuments
Compressed indexes for dynamic text collections (2007): IsAmongTopNSimilarDocuments
r-index software on GitHub: IsRelatedTo
uiHRDC software on GitHub: IsRelatedTo
HydridSelfIndex software on GitHub: IsRelatedTo
sdsl-lite software on GitHub: IsRelatedTo
locate-cdawg software on GitHub: IsRelatedTo
rlcsa software on GitHub: IsRelatedTo

Impact by BIP!

selected citations: 110. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.