Data for: PickMe: Sample selection for species tree reconstruction using coalescent weighted quartets

Rusinko, Joseph; Cai, Yu; Crysler, Allison; Thompson, Katherine; Boutte, Julien; Fishbein, Mark; Straub, Shannon

ZENODO

Other ORP type . 2025

License: CC BY

Data sources: ZENODO, Datacite

Other research product . 23 Jun 2025 . Publisher: Zenodo

doi: 10.5281/zenodo.15693099, 10.5281/zenodo.15693100

Abstract

We obtained targeted sequence data for 763 putatively single-copy nuclear loci for samples of 59 North American milkweed species, three African outgroup species (\textit{Asclepias physocarpa}, \textit{A. fruticosa}, and \textit{A. fornicata}), and one additional outgroup, \textit{Pergularia daemia}, using the target enrichment baits of Weitemier et al. (2014) (Supplemental Material~\protect\ref{app:milkweed}). Data for 32 of these samples and orthologs from the genome sequence of \textit{Asclepias syriaca} \citep{weitemier2019draft} were included in the analyses of \cite{BOUTTE2019106534}, and nuclear sequence data for the additional 30 samples were generated using the DNA sequencing and assembly methods described therein. \cite{BOUTTE2019106534} had excluded the 30 newly analyzed samples based on an ad hoc minimum gene recovery criterion of 600 genes (79\%), with the goal of high gene occupancy for all samples in species tree analyses. For the analyses conducted here, we masked assembled sequences with Ns at sites with very low read depth ($\le 2$ reads) and at heterozygous sites (i.e., intra-individual SNPs). For each gene, we aligned masked sequences using MAFFT version 7.245 with default parameters \citep{katoh2013mafft}, and removed sequences covering less than 50\% of the total alignment length \citep[i.e., Type II missing data;][]{hosner2016avoiding} to reduce gene tree error following \cite{sayyari2017fragmentary} and \cite{mirarab2019species}. For further analysis, we selected a subset of 703 genes, which had been identified by \cite{BOUTTE2019106534} as producing the best-resolved milkweed phylogenies based on bootstrap support across the gene trees. For the complete data set of 63 species, we first estimated the 703 gene trees using Neighbor-Joining on uncorrected distances (the proportion of observed differences in the aligned sequences) as implemented in the ape package \citep{paradis2018ape} in R v. 3.5.1 \citep{R}.
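
As an illustration only, the two per-alignment computations described above (dropping sequences that cover less than 50\% of the alignment, and the uncorrected distance as the proportion of observed differences) can be sketched in Python. This is not the ape or MAFFT code used for the data set. The function names are hypothetical, and two choices are assumptions: only gap characters count as missing when measuring coverage, and sites with a gap, N, or ? in either sequence are skipped when computing the distance.

```python
def drop_short_sequences(alignment, min_fraction=0.5, gap_chars="-"):
    """Keep sequences whose non-gap characters cover at least
    min_fraction of the alignment length (assumption: masked Ns
    still count as covered sites)."""
    if not alignment:
        return {}
    length = len(next(iter(alignment.values())))
    return {
        name: seq
        for name, seq in alignment.items()
        if sum(ch not in gap_chars for ch in seq) >= min_fraction * length
    }


def p_distance(a, b, missing="-N?"):
    """Uncorrected distance: proportion of differing sites among the
    sites where both sequences have an unmasked base (assumption:
    pairwise deletion of missing sites)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in missing or y in missing:
            continue  # skip gaps and masked sites
        compared += 1
        diffs += x != y
    return diffs / compared if compared else float("nan")


# Toy alignment (made-up sequences, not from the data set).
toy = {"a": "ACGTACGT", "b": "AC------", "c": "ACGT----"}
kept = drop_short_sequences(toy)
```

In this toy alignment, sequence `b` covers 2 of 8 columns and is dropped, while `c` covers exactly half and is kept because the criterion removes only sequences below 50\%.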
Using these estimated gene trees, we identified the samples to be included in species tree analyses using \emph{PickMe}. To determine whether the gene tree inference method affected the sample selection results, we also used the GTR+Gamma model in RAxML v. 8.2.12 \citep{stamatakis2014raxml} to estimate the initial gene trees. For the set of samples identified as reliable by \emph{PickMe}, we realigned the sequences and then removed small alignments ($< 100$ bp) following \cite{BOUTTE2019106534}. We then used IQ-TREE v. 1.5.4 \citep{nguyen2014iqtree,chernomor2016terrace} to select the best model of molecular evolution for the retained alignments and inferred the gene tree for each locus using the same parameters as \cite{BOUTTE2019106534}. Using ASTRAL-II v. 4.10.12 \citep{mirarab2015astral} with default parameters, we inferred a species tree and calculated local posterior probability support \citep{sayyari2016fast}. We calculated gene concordance factors using the method of \cite{Minh2020new}, implemented in IQ-TREE v. 2.1.2 \citep{nguyen2014iqtree,chernomor2016terrace}. For comparison, we repeated, for the full data set and using identical methods, the gene and species tree analyses done for the subset of samples that \emph{PickMe} identified as reliable.
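
To make the gene concordance factor concrete, here is a minimal Python sketch under a simplifying assumption: every gene tree contains all taxa, so every gene tree is decisive for every branch, and the factor for a species-tree branch reduces to the percentage of gene trees that contain the same bipartition. It is not the IQ-TREE implementation, which also handles gene trees with missing taxa. The function name and the representation of trees as sets of bipartition sides are hypothetical.

```python
def gene_concordance_factor(branch_side, gene_tree_splits, taxa):
    """Percentage of gene trees that contain the bipartition induced by
    one species-tree branch.

    branch_side: taxa on one side of the species-tree branch.
    gene_tree_splits: one collection per gene tree, each holding one
        side of every internal bipartition of that tree.
    taxa: the full taxon set (assumed present in every gene tree).
    """
    taxa = frozenset(taxa)

    def canonical(side):
        # Represent a bipartition by its smaller side (ties broken by
        # sorted names) so that A|B and B|A compare equal.
        side = frozenset(side)
        other = taxa - side
        return min(side, other, key=lambda s: (len(s), sorted(s)))

    target = canonical(branch_side)
    hits = sum(
        1
        for splits in gene_tree_splits
        if target in {canonical(s) for s in splits}
    )
    return 100.0 * hits / len(gene_tree_splits)


# Three toy gene trees on taxa A-E (made-up example).
toy_gene_trees = [
    [frozenset("AB"), frozenset("DE")],   # contains AB|CDE
    [frozenset("CDE"), frozenset("DE")],  # contains AB|CDE, written from the other side
    [frozenset("AC"), frozenset("DE")],   # conflicts with AB|CDE
]
gcf_ab = gene_concordance_factor(frozenset("AB"), toy_gene_trees, "ABCDE")
```

Two of the three toy gene trees contain the split AB|CDE, so `gcf_ab` is two thirds expressed as a percentage.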

After collecting large data sets for phylogenomics studies, researchers must decide which genes or samples to include when reconstructing a species tree. Incomplete or unreliable data sets make the empiricist's decision more difficult. Researchers rely on ad hoc strategies to maximize sampling while ensuring sufficient data for accurate inferences. An algorithm called PickMe formalizes the sample selection process, assuming that the samples evolved under the Tree Multispecies Coalescent model. We propose a Bayesian framework for selecting samples for species tree analysis. Given a collection of gene trees, we compute a posterior probability for each quartet, describing the likelihood that the species tree displays this topology. From this, we assign individual samples reliability scores computed as the average of a scaled version of the posterior probabilities. PickMe uses these weights to recommend which samples to include in a species tree analysis. Analysis of simulated data showed that including the samples suggested by \textit{PickMe} produced species trees closer to the true species trees than both unfiltered data sets and data sets with ad hoc gene occupancy cut-offs applied. To further illustrate the efficacy of this tool, we apply PickMe to gene trees generated from target capture data from milkweeds. PickMe indicates that more samples could have reliably been included in a previous milkweed phylogenomic analysis than the authors analyzed without access to a formal methodology for sample selection. Using simulated and empirical data, we also compare \emph{PickMe} to existing sample selection methods. Inclusion of PickMe will enhance phylogenomics data analysis pipelines by providing a formal structure for sample selection.
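
The idea of turning quartet posterior probabilities into per-sample reliability scores can be sketched as follows. This is not the PickMe algorithm itself, whose prior and scaling are not given in this record. Everything specific here is an assumption made for illustration: the input is a table of counts of the three possible topologies of each quartet across gene trees; under the coalescent the species-tree topology is taken to appear with probability theta of at least 1/3 and each alternative with probability (1 - theta)/2; theta gets a uniform prior on [1/3, 1]; and the largest posterior p of each quartet is rescaled as (p - 1/3)/(2/3) before averaging. All function names are hypothetical.

```python
import math


def log_marginal(z, n, steps=2000):
    """Log of the integral over theta in [1/3, 1] of
    theta^z * ((1 - theta)/2)^(n - z), by the midpoint rule.
    Computed in log space so hundreds of gene trees do not underflow."""
    lo, hi = 1.0 / 3.0, 1.0
    h = (hi - lo) / steps
    logs = []
    for k in range(steps):
        theta = lo + (k + 0.5) * h  # midpoints never reach theta = 1
        logs.append(z * math.log(theta) + (n - z) * math.log((1.0 - theta) / 2.0))
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs) * h)


def quartet_posteriors(counts):
    """Posterior probability of each of the three quartet topologies,
    given how many gene trees display each one (uniform topology prior)."""
    n = sum(counts)
    logs = [log_marginal(z, n) for z in counts]
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


def reliability_scores(quartet_counts):
    """Average, over the quartets containing each sample, of the scaled
    best posterior (assumed scaling: 1/3 maps to 0 and 1 maps to 1)."""
    totals, seen = {}, {}
    for quartet, counts in quartet_counts.items():
        best = max(quartet_posteriors(counts))
        scaled = (best - 1.0 / 3.0) / (2.0 / 3.0)
        for sample in quartet:
            totals[sample] = totals.get(sample, 0.0) + scaled
            seen[sample] = seen.get(sample, 0) + 1
    return {s: totals[s] / seen[s] for s in totals}


# Toy counts (made up): quartets containing sample E are uninformative.
toy_counts = {
    ("A", "B", "C", "D"): (55, 3, 2),
    ("A", "B", "C", "E"): (20, 20, 20),
    ("A", "B", "D", "E"): (20, 20, 20),
    ("A", "C", "D", "E"): (20, 20, 20),
    ("B", "C", "D", "E"): (20, 20, 20),
}
scores = reliability_scores(toy_counts)
```

In the toy input, every quartet that includes sample E splits its gene trees evenly among the three topologies, so E receives the lowest score and would be the first candidate for exclusion under a threshold on these scores.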

Funding provided by: National Science Foundation; ROR ID: https://ror.org/021nxhr62; Award Number: 1616186
Funding provided by: National Science Foundation; ROR ID: https://ror.org/021nxhr62; Award Number: 1457510
Funding provided by: National Science Foundation; ROR ID: https://ror.org/021nxhr62; Award Number: 1457473
Funding provided by: National Science Foundation; ROR ID: https://ror.org/021nxhr62; Award Number: 1929284

Related Organizations

University of Kentucky, United States
Hobart and William Smith Colleges, United States
Oklahoma State University, United States

Keywords

Apocynaceae, Gene tree, milkweed, Phylogenomics, sample selection, Asclepias, Bayes factor
