Universal indexes for highly repetitive document collections

Publication type: Article, Preprint, Other literature type

Date: 01 Oct 2016 (embargo end date: 01 Jan 2016)

Countries: Chile, Spain

Language: English

Publisher: Elsevier BV

Journal: Information Systems, volume 61, pages 1-23 (ISSN: 0306-4379)

Funded by: EC | BIRDS

Authors: Francisco Claude; Antonio Fariña; Miguel A. Martínez-Prieto; Gonzalo Navarro

doi: 10.1016/j.is.2016.04.002 , 10.48550/arxiv.1604.08897

arXiv: 1604.08897

handle: 2183/18163

Abstract

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.
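As a rough illustration of why run-length compression of differential inverted lists helps on near-copy collections, consider a word that survives across many consecutive versions: its posting list contains long stretches of consecutive document identifiers, so the differences (gaps) are long runs of 1. The sketch below is not the authors' implementation; the function names and the sample posting list are hypothetical, and it assumes document identifiers are positive and sorted in increasing order.

```python
from itertools import groupby


def to_gaps(postings):
    """Differential encoding: the first id, then differences between consecutive ids."""
    gaps = []
    prev = 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def run_length_encode(gaps):
    """Collapse each run of equal gaps into a (gap, run_length) pair."""
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]


def run_length_decode(pairs):
    """Rebuild the original posting list from (gap, run_length) pairs."""
    postings = []
    current = 0
    for gap, count in pairs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: the word occurs in versions 3..8 and 20..23.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
gaps = to_gaps(postings)         # [3, 1, 1, 1, 1, 1, 12, 1, 1, 1]
pairs = run_length_encode(gaps)  # [(3, 1), (1, 5), (12, 1), (1, 3)]
```

Ten gaps collapse to four pairs here; the saving grows with the length of the runs. A plain gap encoder would still spend at least one codeword per document, which is the regularity the abstract says classical techniques fail to exploit.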

This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941.

Countries

Chile, Spain

Keywords

Self-index, FOS: Computer and information sciences, Repetitive collections, Computer Science - Digital Libraries, Digital Libraries (cs.DL), Inverted index, Information Retrieval (cs.IR), Computer Science - Information Retrieval

6 Research products

- Efficient content-based image retrieval in digital picture collections using projections: (near)-copy location (1996, IsAmongTopNSimilarDocuments)
- Application of composite invisible image watermarks to simplify detection of a distinct watermark from a large set (2002, IsAmongTopNSimilarDocuments)
- uiHRDC (2018, IsSupplementedBy)
- uiHRDC software on GitHub (IsRelatedTo)
- partitioned_elias_fano software on GitHub (IsRelatedTo)
- SIMDCompressionAndIntersection software on GitHub (IsRelatedTo)

Impact by BIP!

- selected citations: 22. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.