Weighted minimizer sampling improves long read mapping

Publication: Article, 01 Jul 2020, English

Publisher: Oxford University Press (OUP)

Journal: Bioinformatics, volume 36, pages i111-i118 (ISSN: 1367-4803, eISSN: 1367-4811)

Authors: Chirag Jain; Arang Rhie; Haowen Zhang; Claudia Chu; Brian Walenz; Sergey Koren; Adam M. Phillippy

doi: 10.1093/bioinformatics/btaa435

pmid: 32657365

pmc: PMC7355284

Abstract

Motivation: In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions.

Results: We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.

Availability and implementation: Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.

Related Organizations

Georgia Institute of Technology, United States
National Institutes of Health, United States
National Institute of Health, Pakistan
National Human Genome Research Institute, United States
National Institute of Health (NIH/NICHD), United States

Keywords

High-Throughput Nucleotide Sequencing, Humans, Genomics, Sequence Analysis, DNA, Data Compression, Algorithms, Software

Impact by BIP!

selected citations (derived from selected sources): 166
popularity (the "current" impact or attention of the article, based on the underlying citation network): Top 1%
influence (the overall impact of the article, based on the underlying citation network, diachronically): Top 10%
impulse (the initial momentum of the article directly after publication, based on the underlying citation network): Top 1%


Access route: gold

Fields of Science

engineering and technology

medical engineering
