Fast, Small, and Simple Document Listing on Repetitive Text Collections

Publication type: Part of book or chapter of book, Article, Preprint, Conference object
Date: 01 Jan 2019 (embargo end date: 01 Jan 2019)
Language: English
Publisher: Springer International Publishing

Authors: Dustin Cobas; Gonzalo Navarro

doi: 10.1007/978-3-030-32686-9_34 , 10.48550/arxiv.1902.07599

arXiv: 1902.07599

Abstract

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. By contrast, few analogous indexes exist for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length $n$ that lists the $\mathit{ndoc}$ distinct documents where a pattern of length $m$ appears in time $\mathcal{O}(m + \mathit{ndoc} \cdot \log n)$. We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store those answers in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
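To make the terms in the abstract concrete, the following is a minimal brute-force sketch of document listing through a document array. It is not the paper's grammar-compressed index: the suffix array is built naively, nothing is compressed, and the query scans every occurrence, so its cost grows with the number of occurrences rather than with $\mathit{ndoc}$. The function names and the `\x00` separator are assumptions made for illustration only.

```python
def build_document_array(docs):
    """Return (text, suffix_array, document_array) for a list of strings."""
    sep = "\x00"  # assumption: never occurs in a document or a pattern
    text = "".join(d + sep for d in docs)
    # Document identifier of every position in the concatenated text.
    doc_of_pos = [i for i, d in enumerate(docs) for _ in range(len(d) + 1)]
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    # Document array: the suffix array coarsened to document identifiers.
    da = [doc_of_pos[p] for p in sa]
    return text, sa, da


def suffix_range(text, sa, pattern):
    """Half-open range [lo, hi) of suffixes that start with pattern."""
    m = len(pattern)

    def first_index(strict):
        # First suffix whose length-m prefix is >= pattern
        # (or > pattern when strict is True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(False), first_index(True)


def list_documents(docs, pattern):
    """Distinct identifiers of the documents that contain pattern."""
    # Rebuilding the arrays per query keeps the sketch short; a real
    # index would build them once and reuse them.
    text, sa, da = build_document_array(docs)
    lo, hi = suffix_range(text, sa, pattern)
    return sorted(set(da[lo:hi]))
```

For example, `list_documents(["banana", "bandana", "cabana"], "nd")` reports only document 1. The scan over `da[lo:hi]` is the step that the paper's index avoids, according to the abstract, by grammar-compressing the document array and precomputing the answers for nonterminals.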

Related Organizations

University of Chile
Chile

Keywords

FOS: Computer and information sciences, Computer Science - Information Theory, Information Theory (cs.IT), Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Information Retrieval (cs.IR), Computer Science - Information Retrieval

Impact (by BIP!)

- selected citations: 2
- popularity: Average
- influence: Average
- impulse: Average

Fields of Science

natural sciences

computer and information sciences
