Constructing Antidictionaries in Output-Sensitive Space

Publication: Article, Preprint, Conference object, Contribution for newspaper or weekly magazine; 01 Mar 2019; Embargo end date: 01 Jan 2019; United Kingdom, Italy, France; Publisher: IEEE; Journal: 2019 Data Compression Conference (DCC)

Authors: Lorraine A. K. Ayad; Golnaz Badkobeh; Gabriele Fici; Alice Héliou; Solon P. Pissis

doi: 10.1109/dcc.2019.00062 , 10.48550/arxiv.1902.04785

arXiv: 1902.04785

handle: 10447/372623

Abstract

A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1,y_2,\ldots,y_k$ over an alphabet $\Sigma$, we are asked to compute the set $\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{k}}$ of minimal absent words of length at most $\ell$ of the word $y=y_1\#y_2\#\ldots\#y_k$, where $\#\notin\Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. This computation generally requires $\Omega(n)$ space for $n=|y|$ using any of the many available $\mathcal{O}(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when $||\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{N}}||=o(n)$, for all $N\in[1,k]$. For instance, in the human genome, $n \approx 3\times 10^9$ but $||\mathrm{M}^{12}_{y_{1}\#\ldots\#y_{k}}|| \approx 10^6$. We consider a constant-sized alphabet for stating our results. We show that all $\mathrm{M}^{\ell}_{y_{1}},\ldots,\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{k}}$ can be computed in $\mathcal{O}(kn+\sum^{k}_{N=1}||\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{N}}||)$ total time using $\mathcal{O}(\mathrm{MaxIn}+\mathrm{MaxOut})$ space, where $\mathrm{MaxIn}$ is the length of the longest word in $\{y_1,\ldots,y_{k}\}$ and $\mathrm{MaxOut}=\max\{||\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{N}}||:N\in[1,k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
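To make the definition concrete, the following is a minimal brute-force sketch in Python that computes the minimal absent words of length at most $\ell$ for each prefix collection $y_1$, $y_1\#y_2$, and so on. It is not the output-sensitive algorithm described in the abstract: it stores every factor of length at most $\ell$, so its space usage is not bounded by $\mathcal{O}(\mathrm{MaxIn}+\mathrm{MaxOut})$. The function names `factors_up_to`, `minimal_absent_words`, and `incremental_maws` are hypothetical and chosen only for illustration. The sketch assumes that factors never span the separator $\#$, so the collection is passed as a list of words rather than as one concatenated string.

```python
def factors_up_to(words, max_len):
    """Return all distinct factors (substrings) of length <= max_len of any word.

    The empty word is included, since it is a factor of every word.
    """
    facs = {""}
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                facs.add(w[i:j])
    return facs


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force minimal absent words of length <= max_len (illustration only).

    A word x is minimal absent if x is not a factor but both x[:-1] and x[1:]
    are factors; every other proper factor of x lies inside one of those two.
    """
    facs = factors_up_to(words, max_len)
    maws = set()
    for u in facs:  # u plays the role of x[:-1], which must occur
        if len(u) >= max_len:
            continue
        for b in alphabet:
            x = u + b
            if x not in facs and x[1:] in facs:
                maws.add(x)
    return maws


def incremental_maws(words, alphabet, max_len):
    """Return the sets for y_1, then y_1#y_2, and so on, recomputed from scratch."""
    return [minimal_absent_words(words[:n], alphabet, max_len)
            for n in range(1, len(words) + 1)]


if __name__ == "__main__":
    print(sorted(minimal_absent_words(["abaab"], "ab", 4)))
    # prints ['aaa', 'aaba', 'bab', 'bb']
    for n, m in enumerate(incremental_maws(["abaab", "abb"], "ab", 4), start=1):
        print(n, sorted(m))
```

For the single word `abaab`, the word `bb` is minimal absent because it does not occur while its proper factor `b` does. Once `abb` is added to the collection, `bb` occurs and leaves the set, while longer words such as `bba` and `bbb` enter it. This shows why the set has to be updated, not merely extended, as each new word arrives.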

Version accepted to DCC 2019

Countries

United Kingdom, Italy, France

Related Organizations

University of London, United Kingdom
University of Palermo, Italy
Inria Grenoble - Rhône-Alpes research centre, France
Independent Researcher, United Kingdom
King's College London, United Kingdom

Keywords

Output sensitive algorithms, String algorithms, FOS: Computer and information sciences, [INFO] Computer Science [cs], Absent words, 004, Antidictionaries, Data compression, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS)

6 Research products

Constructing Antidictionaries of Long Texts in Output-Sensitive Space (2020)
Analysis of the Size of Antidictionary in DCA (2008)
Antidictionary Data Compression Using Dynamic Suffix Trees
A New Approach of DCA by using BWT (2005)
Improved antidictionary based compression (2003)
On a two-dimensional antidictionary construction using suffix tries (2015)

Impact by BIP!

selected citations: 2. These citations are derived from selected sources. This is an alternative to the "Influence" indicator.
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.