Fast and linear-time string matching algorithms based on the distances of $q$-gram occurrences

Given a text $T$ of length $n$ and a pattern $P$ of length $m$, the string matching problem is a task to find all occurrences of $P$ in $T$. In this study, we propose an algorithm that solves this problem in $O((n + m)q)$ time considering the distance between two adjacent occurrences of the same $q$-gram contained in $P$. We also propose a theoretical improvement of it which runs in $O(n + m)$ time, though it is not necessarily faster in practice. We compare the execution times of our and existing algorithms on various kinds of real and artificial datasets such as an English text, a genome sequence and a Fibonacci string. The experimental results show that our algorithm is as fast as the state-of-the-art algorithms in many cases, particularly when a pattern frequently appears in a text.

14 pages, accepted to SEA 2020

Related Organizations

Leibniz Association
Germany
Schloss Dagstuhl – Leibniz Center for Informatics
Germany
Tohoku University
Japan

Keywords

FOS: Computer and information sciences, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), text processing, String matching algorithm, 004, ddc: ddc:004

1 Research products, page 1 of 1

distq software on GitHub
IsRelatedTo

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average