
doi: 10.2139/ssrn.3528533
Approximate substring searching is a common but computationally demanding task in bioinformatics and text analysis. We present a new approach that recasts string search as a multiple convolution problem, then exploits highly efficient fast Fourier convolution techniques. This approach, which we call ffgrep, computes and caches the spectra of a target corpora, drastically reducing the cost of subsequent searches. Like other approaches, this algorithm is embarrassingly parallelizable; unlike other approaches, it is capable of operating on not only raw strings, but also word embeddings. ffgrep is applied to an original corpus of imperfect automatic transcriptions of campaign speeches in the 2012 U.S. presidential election. We contrast our approach with agrep, an industry-standard meta-algorithm that selects the optimal member from a number of highly optimized approximate string matching algorithms. Searching for approximate recurrences of a manually curated set of candidate catchphrases, we show that ffgrep speeds computation by up to a factor of 60x in typical settings, with increasing gains as alignments grow longer or more complex. Moreover, these computational gains come at little cost in performance. Taking agrep search results as ground truth, over a wide range of agrep parameters, we show that ffgrep is capable of recovering highly similar results with accuracies exceeding 0.94 and F1 of 0.84–0.9. Finally, we demonstrate how efficient substring matching enables new substantive research by identifying candidate catchphrases without human supervision. By rapidly computing and organizing 90 billion pairwise string comparisons, our proposed method automatically learns that the phrases “kick children off of Head Start or eliminate health insurance for the poor” and “kick students are [sic] financial aid or get rid of funding for Planned Parenthood or eliminate health care for millions on Medicaid” — along with 32 other campaign appeals — all map onto a single recurring theme, President Barack Obama’s critique of a proposed Medicare reform.
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
