Determining population structure from k-mer frequencies

Publication type: Article, Other literature type, Conference object
Date: 27 May 2022
Publisher: ACM
Journal: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Funded by: NSF | CAREER: Facilitating the ...

Authors: Hrytsenko, Yana; Daniels, Noah M.; Schwartz, Rachel S.

doi: 10.1145/3535508.3545100, 10.21203/rs.3.rs-1689838/v2, 10.7717/peerj.18939, 10.21203/rs.3.rs-1689838/v1

pmid: 40061228

pmc: PMC11890038

Abstract

Background: Understanding population structure within species provides information on connections among different populations and how they evolve over time. This knowledge is important for studies ranging from evolutionary biology to large-scale variant-trait association studies. Current approaches to determining population structure include model-based approaches, statistical approaches, and distance-based ancestry inference approaches. In this work, we identify population structure from DNA sequence data using an alignment-free approach. We use the frequencies of short DNA substrings from across the genome (k-mers) with principal component analysis (PCA). K-mer frequencies can be viewed as a summary statistic of a genome and have the advantage of being easily derived from a genome by counting the number of times a k-mer occurred in a sequence. In contrast, most population structure work employing PCA uses multilocus genotype data (SNPs, microsatellites, or haplotypes). No genetic assumptions must be met to generate k-mers, whereas current population structure approaches often depend on several genetic assumptions and can require careful selection of ancestry informative markers to identify populations.

Results: In this work, we show that PCA is able to determine population structure just from the frequency of k-mers found in the genome. The application of PCA and a clustering algorithm to k-mer profiles of genomes provides an easy approach to detecting the number and composition of populations (clusters) present in the dataset. The results are comparable to those found by a model-based approach using genetic markers. We validate our method using 48 human genomes from populations identified by the 1000 Human Genomes Project, as well as simulations.

Conclusions: This study shows that PCA, together with the clustering algorithm, is able to detect population structure from k-mer frequencies and can separate samples of admixed and non-admixed origin. Using k-mer frequencies to determine population structure has the potential to avoid some challenges of existing methods.
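The idea described in the abstract, counting k-mers per genome and applying PCA to the resulting frequency profiles, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the choice of k = 3, the SVD-based PCA, and the toy sequences (random strings with two different base compositions) are all assumptions made for illustration. The clustering step is left out because the abstract does not name the algorithm used.

```python
from collections import Counter
from itertools import product

import numpy as np


def kmer_frequencies(sequence, k):
    """Return relative frequencies of all 4**k DNA k-mers in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of width k across the sequence and count each substring.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only windows made up purely of A, C, G, T contribute to the total.
    total = sum(counts[m] for m in kmers)
    return np.array([counts[m] / total if total else 0.0 for m in kmers])


def pca_scores(profiles, n_components=2):
    """Project row-wise profiles onto their top principal components via SVD."""
    centered = profiles - profiles.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Hypothetical toy data: two "populations" that differ in base composition.
rng = np.random.default_rng(0)


def random_genome(probs, length=5000):
    return "".join(rng.choice(list("ACGT"), size=length, p=probs))


group_a = [random_genome([0.4, 0.1, 0.1, 0.4]) for _ in range(5)]
group_b = [random_genome([0.1, 0.4, 0.4, 0.1]) for _ in range(5)]

# One row per genome, one column per k-mer.
profiles = np.vstack([kmer_frequencies(g, k=3) for g in group_a + group_b])
scores = pca_scores(profiles)
```

In this toy setting the two groups fall on opposite sides of the first principal component, so any standard clustering algorithm applied to `scores` would recover them. Real genomes from the same species differ far more subtly, so the value of k and the number of components retained would need to be chosen with care.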

Related Organizations

University of Rhode Island
United States

Keywords

Principal Component Analysis, QH301-705.5, Bioinformatics, R, Sequence Analysis, DNA, Population structure, Polymorphism, Single Nucleotide, Genetics, Population, Population differentiation, Medicine, Humans, Population stratification, Biology (General), k-mer frequencies, k-mers

Fields of Science

medical and health sciences

basic medicine

Funded by

NSF | CAREER: Facilitating the use of genomic data in evolutionary biology