
arXiv: 2105.03106
In the classic longest common substring (LCS) problem, we are given two strings \(S\) and \(T\) of total length \(n\) over an alphabet of size \(\sigma\) , and we are asked to find a longest string occurring as a fragment of both \(S\) and \(T\) . Weiner, in his seminal paper that introduced the suffix tree, presented an \(\mathcal{O}(n\log\sigma)\) -time algorithm for the LCS problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an \(\mathcal{O}(n)\) -time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in \(\mathcal{O}(n\log\sigma/\log n)\) space and read in \(\mathcal{O}(n\log\sigma/\log n)\) time. We show that we can compute an LCS of two strings in time \(\mathcal{O}(n\log\sigma/\sqrt{\log n})\) in the word RAM model, which is sublinear in \(n\) if \(\sigma=2^{o(\sqrt{\log n})}\) (in particular, if \(\sigma=\mathcal{O}(1)\) ), using optimal space \(\mathcal{O}(n\log\sigma/\log n)\) . In fact, it was recently shown that this result is conditionally optimal [Kempa and Kociumaka, STOC 2025]. The same complexity can be achieved for computing an LCS of \(\lambda=\mathcal{O}(\sqrt{\log n}/\log\log n)\) input strings of total length \(n\) . We then lift our ideas to the problem of computing a \(k\) -mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of \(S\) that occurs in \(T\) with at most \(k\) mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in \(\mathcal{O}(n\log n)\) time [IPL 2015]. Thankachan et al. extended this result to computing a \(k\) -mismatch LCS in \(\mathcal{O}(n\log^{k}n)\) time for \(k=\mathcal{O}(1)\) [J. Comput. Biol. 2016]. We show an \(\mathcal{O}(n\log^{k-0.5}n)\) -time algorithm, for any constant integer \(k>0\) and irrespective of the alphabet size, using \(\mathcal{O}(n)\) space as the previous approaches. We thus notably break through the well-known \(n\log^{k}n\) barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. for string indexing with \(k\) errors [STOC 2004]. As a by-product, we improve upon the algorithm of Charalampopoulos et al. [CPM 2018] for computing a \(k\) -mismatch LCS in the case when the output \(k\) -mismatch LCS is sufficiently long.
FOS: Computer and information sciences, wavelet tree, Data Structures and Algorithms, K mismatches, [INFO] Computer Science [cs], 004, Longest common substring, Wavelet tree, Data Structures and Algorithms (cs.DS), Theory of computation → Pattern matching, longest common substring, k mismatches, ddc: ddc:004
FOS: Computer and information sciences, wavelet tree, Data Structures and Algorithms, K mismatches, [INFO] Computer Science [cs], 004, Longest common substring, Wavelet tree, Data Structures and Algorithms (cs.DS), Theory of computation → Pattern matching, longest common substring, k mismatches, ddc: ddc:004
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
