Robust k-mer frequency estimation using gapped k-mers

Publication type: Article
Date: 17 Jul 2013
Language: English
Publisher: Springer Science and Business Media LLC
Journal: Journal of Mathematical Biology, volume 69, pages 469-500 (issn: 0303-6812, eissn: 1432-1416)

Authors: Ghandi, Mahmoud; Mohammad-Noori, Morteza; Beer, Michael A.

doi: 10.1007/s00285-013-0705-3

pmid: 23861010

pmc: PMC3895138

Abstract

Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
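
The sparsity problem described in the abstract can be illustrated with a minimal counting sketch. This is not the authors' estimator and does not implement the minimum norm equation; it only shows, on a toy sequence, that ungapped counts stay near one while gapped counts pool observations. The function names, the `-` gap symbol, the parameters `length` and `informative`, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_kmers(seq, k):
    """Count every contiguous (ungapped) k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_gapped_kmers(seq, length, informative):
    """Count gapped k-mers (hypothetical helper).

    Each window of `length` bases is expanded into every pattern that
    keeps `informative` positions and masks the rest with '-'.
    """
    masks = [set(m) for m in combinations(range(length), informative)]
    counts = Counter()
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        for keep in masks:
            gapped = "".join(c if j in keep else "-" for j, c in enumerate(window))
            counts[gapped] += 1
    return counts


seq = "ACGTACGTTGCA"  # toy sequence, not real data
full = count_kmers(seq, 6)
gapped = count_gapped_kmers(seq, 6, 4)

print(max(full.values()))   # largest ungapped 6-mer count
print(gapped["ACGT--"])     # count of one gapped pattern
```

In this toy case every ungapped 6-mer occurs once, while the gapped pattern `ACGT--` is supported by two different windows, which is the kind of pooling that makes gapped counts less noisy.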

Related Organizations

Johns Hopkins University, United States
University of Tehran, Iran (Islamic Republic of)
Broad Institute, United States
Institute for Research in Fundamental Sciences, Iran (Islamic Republic of)

Keywords

Binding Sites, frequency estimation, Genome, Human, DNA sequence, DNA sequences, DNA, Protein sequences, oligomer, statistical learning, Humans, Theory of matrix inversion and generalized inverses, k-mer, Computational methods for problems pertaining to biology, Transcription Factors

Impact (by BIP!)

selected citations: 51. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: bronze

Fields of Science

medical and health sciences

basic medicine
