
arXiv: 1807.07596
handle: 10447/337234 , 11568/946091
Due to the increased availability of large datasets of biological sequences, the tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most of the alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of $m$ strings simultaneously, in order to obtain $m$ ACS induced distances. Experimental results confirm the effectiveness of our approach.
Preliminary version of the paper that will be included in the SPIRE 2018 proceedings
Alignment-free methods, Burrows-wheeler transform, FOS: Computer and information sciences, Average common substring, Matching statistics, Computer Science (all), Theoretical Computer Science, Alignment-free methods; Average common substring; Burrows-wheeler transform; Longest common prefix; Matching statistics; Theoretical Computer Science; Computer Science (all), Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Longest common prefix
Alignment-free methods, Burrows-wheeler transform, FOS: Computer and information sciences, Average common substring, Matching statistics, Computer Science (all), Theoretical Computer Science, Alignment-free methods; Average common substring; Burrows-wheeler transform; Longest common prefix; Matching statistics; Theoretical Computer Science; Computer Science (all), Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Longest common prefix
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 3 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
