
handle: 20.500.11850/748331
AbstractMotivationBecause of the rapidly-growing amount of sequencing data, computingsketchesof large textual datasets has become an essential preprocessing task. These sketches are typically much smaller than the input sequences, but preserve sufficient information for downstream analysis.Minimizersare an especially popular sketching technique and used in a wide variety of applications. They sample at least one out of everywconsecutivek-mers. As DNA sequencers are getting more accurate, some applications can afford to use a largerwand hence sparser and smaller sketches. And as sketches get smaller, their analysis becomes faster, so the time spent sketching the full-sized input becomes more of a bottleneck.MethodsOur librarysimd-minimizersimplements a random minimizer algorithm using SIMD instructions. It supports both AVX2 and NEON architectures. Its main novelty is two-fold. First, it splits the input into 8 chunks that are streamed over in parallel through all steps of the algorithm. This is enabled by using the completely deterministictwo-stackssliding window minimum algorithm, which seems not to have been used before for finding minimizers.ResultsOur library is up to 9.5× faster than a scalar implementation of therescanmethod whenw= 5 is small, and 4.5× faster for largerw= 19. Computingcanonicalminimizers is only around 50% slower than computing forward minimizers, and around 16× faster than the existing implementation in theminimizer-itercrate. Our library finds all (canonical) minimizers of a 3.2Gbp human genome in 4.1 (resp. 6.0) seconds.Supplementary MaterialSoftware:https://github.com/rust-seq/simd-minimizersFundingRagnar Groot Koerkamp: ETH Research Grant ETH-1721-1 to Gunnar Rätsch.Igor Martayan: ENS Rennes Doctoral Grant.
Sketching, Hashing, Randomized algorithms, Minimizers; Randomized algorithms; Sketching; Hashing, Minimizers, ddc: ddc:004
Sketching, Hashing, Randomized algorithms, Minimizers; Randomized algorithms; Sketching; Hashing, Minimizers, ddc: ddc:004
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 4 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
