Substring Complexity in Sublinear Space

Publication type: Article, Preprint, Conference object, Part of book or chapter of book

Published: 01 Jan 2020

Embargo end date: 01 Jan 2020

Countries: Germany, Italy, France, Netherlands

Publisher: arXiv

Journal: CoRR, volume abs/2007.08357

Funded by: EC | PANGAIA, EC | ALPACA

Authors: Giulia Bernardini; Gabriele Fici; Pawel Gawrychowski; Solon P. Pissis

doi: 10.48550/arxiv.2007.08357

arXiv: 2007.08357

handle: 11368/3065698, 1871.1/8f8acba6-906e-424e-9f6b-43debd84bab5, 2434/1131463, 10447/619240

Abstract

Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size $z$ of the Lempel-Ziv parse or the number $r$ of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size $\gamma$ of a smallest string attractor. Let $T$ be a string of length $n$. A string attractor of $T$ is a set of positions of $T$ capturing the occurrences of all the substrings of $T$. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing $\gamma$ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function $S_T(k)$ counting the number of distinct substrings of length $k$ of $T$, also known as the substring complexity of $T$. This new measure is defined as $\delta = \sup\{S_T(k)/k, k \geq 1\}$ and lower bounds all the relevant ad hoc measures previously considered. In particular, $\delta \leq \gamma$ always holds and $\delta$ can be computed in $\mathcal{O}(n)$ time using $\Theta(n)$ working space. Kociumaka et al. showed that one can construct an $\mathcal{O}(\delta \log \frac{n}{\delta})$-sized representation of $T$ supporting efficient direct access and efficient pattern matching queries on $T$. Given that for highly compressible strings, $\delta$ is significantly smaller than $n$, it is natural to pose the following question: Can we compute $\delta$ efficiently using sublinear working space? We address this algorithmic challenge by showing the following bounds to compute $\delta$: $\mathcal{O}(\frac{n^3 \log b}{b^2})$ time using $\mathcal{O}(b)$ space, for any $b \in [1, n]$, in the comparison model; or $\tilde{\mathcal{O}}(n^2/b)$ time using $\tilde{\mathcal{O}}(b)$ space, for any $b \in [\sqrt{n}, n]$, in the word RAM model.
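To make the definition of the measure concrete, the following is a minimal Python sketch that computes it directly from the definition. It is a naive baseline using far more than linear space, and is neither the linear-time algorithm mentioned above nor any of the sublinear-space algorithms of the paper. The function names `substring_complexity` and `delta` are illustrative choices, not names from the source. Since $S_T(k) = 0$ for $k > n$, the supremum is attained for some $k \in [1, n]$, so taking a maximum over that range suffices.

```python
from fractions import Fraction


def substring_complexity(text: str, k: int) -> int:
    """Return S_T(k): the number of distinct length-k substrings of text."""
    # A set of all length-k windows; this alone may take Theta(n * k) space.
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def delta(text: str) -> Fraction:
    """Return max over k in [1, n] of S_T(k) / k as an exact fraction.

    Naive baseline for illustration only (hypothetical helper, not the
    paper's algorithm): it rebuilds a set of substrings for every k.
    """
    n = len(text)
    if n == 0:
        raise ValueError("delta is not defined here for the empty string")
    return max(Fraction(substring_complexity(text, k), k) for k in range(1, n + 1))


if __name__ == "__main__":
    # For "aaababbbaa" all 8 binary strings of length 3 occur, so k = 3
    # gives the ratio 8/3, which beats k = 1 (2/1) and k = 2 (4/2).
    print(delta("aaababbbaa"))
```

The exact `Fraction` arithmetic avoids floating-point ties when comparing the ratios $S_T(k)/k$ for different $k$.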

Accepted to ISAAC 2023. Abstract abridged to satisfy arXiv requirements.

Countries

Germany, Italy, France, Netherlands

Related Organizations

University of Milan (Italy)
Vrije Universiteit Amsterdam (Netherlands)
Leibniz Association (Germany)
University of Wrocław (Poland)
University of Palermo (Italy)

Keywords

FOS: Computer and information sciences, substring complexity, sublinear-space algorithm, string algorithm, Settore INF/01 - Informatica, [INFO] Computer Science [cs], Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), ddc:004

Related research

Streaming Pattern Matching (Invited Talk). 2021

Impact by BIP!

selected citations: 0 (derived from selected sources; an alternative to the "Influence" indicator)
popularity: Average (the "current" impact or attention of the article in the research community at large, based on the underlying citation network)
influence: Average (the overall impact of the article in the research community at large, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of the article directly after its publication, based on the underlying citation network)