Data from: How should genes and taxa be sampled for phylogenomic analyses with missing data? An empirical study in iguanian lizards

Table S1
Voucher specimens from which tissues were obtained for molecular data. Standard abbreviations are as follows: Louisiana State University Museum of Natural Science, LSU; Yale Peabody Museum, YPM; California Academy of Sciences, CAS; University of Kansas, KU; Museum of Vertebrate Zoology, University of California Berkeley, MVZ; San Diego State University, SDSU; University of Michigan Museum of Zoology, UMMZ. Non-standard abbreviations include the following: APR (Achille P. Raselimanana field series); ATOL (Assembling the Tree of Life voucher series); BPN (Brice P. Noonan field series); JAC (Jonathan A. Campbell field series); JAM (Jimmy A. McGuire field series); JJK (Jason J. Kolbe field series); JPV (John Pablo Valladares field series); LJA (Luciano J. Avila field series); SDZoo (Oliver Ryder, San Diego Zoo); POE (Steve Poe field series); RAN (Ronald A. Nussbaum field series); REE (Richard E. Etheridge field series); RLB (Robert L. Bezy field series); SBH (S. Blair Hedges field series); SDZoo (San Diego Zoo Series); TJS (Thomas J. Sanger field series); TWR (Tod W. Reeder field series); WL (William Lamar); YPM (Yale Peabody Museum). For each newly sequenced sample, Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) and GenBank numbers are provided.
File: Table_S1_28_July_2015.doc

Table S2
Oligonucleotides used for construction of 43 uniquely barcoded adaptors. Index sequences (i.e., barcodes) are bolded.
File: Table_S2_28_July_2015.doc

Figure S1
Histograms depicting variation across ultraconserved elements (UCEs) generated for this study. (A–C) The number of segregating sites, (D–F) length in base pairs, and (G–I) number of taxa are reported for all three sampling strategies (16 taxa each with more than 3000 UCEs, 29 taxa each with more than 2000 UCEs, and 44 taxa each with more than 120 UCEs) using the 50% missing taxa per locus datasets. Note that datasets allowed for up to 50% missing taxa per locus and contained 2,716, 4,319, and 4,789 UCE loci total (16 taxa, 29 taxa, and 44 taxa, respectively).
File: S1_uce_summary.pdf

Figures S2-S37
Phylogenetic trees from individual analyses performed in this study.
File: Figs_S2-S37.zip

RAxML_phylip_alignments
Zip archive containing the following alignments:
Alignment 1: 16 taxa 20% missing taxa dataset analyzed in RAxML.
Alignment 2: 16 taxa 30% missing taxa dataset analyzed in RAxML.
Alignment 3: 16 taxa 40% missing taxa dataset analyzed in RAxML.
Alignment 4: 16 taxa 50% missing taxa dataset analyzed in RAxML.
Alignment 5: 16 taxa 60% missing taxa dataset analyzed in RAxML.
Alignment 6: 29 taxa 20% missing taxa dataset analyzed in RAxML.
Alignment 7: 29 taxa 30% missing taxa dataset analyzed in RAxML.
Alignment 8: 29 taxa 40% missing taxa dataset analyzed in RAxML.
Alignment 9: 29 taxa 50% missing taxa dataset analyzed in RAxML.
Alignment 10: 29 taxa 60% missing taxa dataset analyzed in RAxML.
Alignment 11: 44 taxa 20% missing taxa dataset analyzed in RAxML.
Alignment 12: 44 taxa 30% missing taxa dataset analyzed in RAxML.
Alignment 13: 44 taxa 40% missing taxa dataset analyzed in RAxML.
Alignment 14: 44 taxa 50% missing taxa dataset analyzed in RAxML.
Alignment 15: 44 taxa 60% missing taxa dataset analyzed in RAxML.

Gene tree sets
Gene tree set 1: 16 taxa 20% missing taxa dataset analyzed in NJst. File: 16_taxa_0.20_bootstrap_zip.zip
Gene tree set 2: 16 taxa 30% missing taxa dataset analyzed in NJst. File: 16_taxa_0.30_bootstrap_zip.zip
Gene tree set 3: 16 taxa 40% missing taxa dataset analyzed in NJst. File: 16_taxa_0.40_bootstrap_zip.zip
Gene tree set 4: 16 taxa 50% missing taxa dataset analyzed in NJst and ASTRAL. File: 16_taxa_0.50_bootstrap_zip.zip
Gene tree set 5: 16 taxa 60% missing taxa dataset analyzed in NJst. File: 16_taxa_0.60_bootstrap_zip.zip
Gene tree set 6: 29 taxa 20% missing taxa dataset analyzed in NJst. File: 29_taxa_0.20_bootstrap.zip
Gene tree set 7: 29 taxa 30% missing taxa dataset analyzed in NJst. File: 29_taxa_0.30_bootstrap.zip
Gene tree set 8: 29 taxa 40% missing taxa dataset analyzed in NJst. File: 29_taxa_0.40_bootstrap.zip
Gene tree set 9: 29 taxa 50% missing taxa dataset analyzed in NJst and ASTRAL. File: 29_taxa_0.50_bootstrap_zip.zip
Gene tree set 10: 29 taxa 60% missing taxa dataset analyzed in NJst. File: 29_taxa_0.60_bootstrap_zip.zip
Gene tree set 11: 44 taxa 20% missing taxa dataset analyzed in NJst. File: 44_taxa_0.20_bootstrap_zip.zip
Gene tree set 12: 44 taxa 30% missing taxa dataset analyzed in NJst. File: 44_taxa_0.30_bootstrap.zip
Gene tree set 13: 44 taxa 40% missing taxa dataset analyzed in NJst. File: 44_taxa_0.40_bootstrap.zip
Gene tree set 14: 44 taxa 50% missing taxa dataset analyzed in NJst and ASTRAL. File: 44_taxa_0.50_bootstrap.zip
Gene tree set 15: 44 taxa 60% missing taxa dataset analyzed in NJst. File: 44_taxa_0.60_bootstrap.zip

Targeted sequence capture is becoming a widespread tool for generating large phylogenomic data sets to address difficult phylogenetic problems. However, this methodology often generates data sets in which increasing the number of taxa and loci increases amounts of missing data. Thus, a fundamental (but still unresolved) question is whether sampling should be designed to maximize sampling of taxa or genes, or to minimize the inclusion of missing data cells. Here, we explore this question for an ancient, rapid radiation of lizards, the pleurodont iguanians. Pleurodonts include many well-known clades (e.g., anoles, basilisks, iguanas, and spiny lizards) but relationships among families have proven difficult to resolve strongly and consistently using traditional sequencing approaches. We generated up to 4921 ultraconserved elements with sampling strategies including 16, 29, and 44 taxa, from 1179 to approximately 2.4 million characters per matrix and approximately 30% to 60% total missing data. We then compared mean branch support for interfamilial relationships under these 15 different sampling strategies for both concatenated (maximum likelihood) and species tree (NJst) approaches (after showing that mean branch support appears to be related to accuracy). We found that both approaches had the highest support when including loci with up to 50% missing taxa (matrices with ∼40–55% missing data overall). Thus, our results show that simply excluding all missing data may be highly problematic as the primary guiding principle for the inclusion or exclusion of taxa and genes. The optimal strategy was somewhat different for each approach, a pattern that has not been shown previously. For concatenated analyses, branch support was maximized when including many taxa (44) but fewer characters (1.1 million). For species-tree analyses, branch support was maximized with minimal taxon sampling (16) but many loci (4789 of 4921). 
We also show that the choice of these sampling strategies can be critically important for phylogenomic analyses, since some strategies lead to demonstrably incorrect inferences (using the same method) that have strong statistical support. Our preferred estimate provides strong support for most interfamilial relationships in this important but phylogenetically challenging group.
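
The abstract distinguishes two quantities that are easy to conflate: the per-locus threshold on missing taxa (e.g., "up to 50% missing taxa") and the resulting proportion of missing data cells in the whole matrix (e.g., ∼40–55%). The following is a minimal Python sketch of that distinction, not the authors' pipeline. The function names `filter_loci` and `total_missing_fraction`, the toy taxa and loci, and the choice to count only `-` and `?` as missing characters are all assumptions made for illustration.

```python
from typing import Dict, List

# A locus alignment maps taxon name -> aligned sequence (hypothetical structure).
Alignment = Dict[str, str]


def filter_loci(loci: Dict[str, Alignment], n_taxa: int,
                max_missing_taxa: float) -> Dict[str, Alignment]:
    """Keep loci whose fraction of absent taxa is at most max_missing_taxa."""
    kept = {}
    for name, aln in loci.items():
        missing_taxa = n_taxa - len(aln)
        if missing_taxa / n_taxa <= max_missing_taxa:
            kept[name] = aln
    return kept


def total_missing_fraction(loci: Dict[str, Alignment], taxa: List[str]) -> float:
    """Fraction of cells in the concatenated matrix that are missing.

    A taxon absent from a locus contributes a full row of missing cells.
    Assumption: only '-' and '?' are counted as missing within a sequence.
    """
    total_cells = 0
    missing_cells = 0
    for aln in loci.values():
        length = len(next(iter(aln.values())))  # all rows share one length
        for taxon in taxa:
            total_cells += length
            seq = aln.get(taxon)
            if seq is None:
                missing_cells += length
            else:
                missing_cells += seq.count("-") + seq.count("?")
    return missing_cells / total_cells


# Toy data (invented for illustration only).
taxa = ["A", "B", "C", "D"]
loci = {
    "uce-1": {"A": "ACGT", "B": "ACGT", "C": "AC-T", "D": "ACGT"},  # 0% missing taxa
    "uce-2": {"A": "ACGTAC", "B": "ACGTAC"},                        # 50% missing taxa
    "uce-3": {"A": "ACG"},                                          # 75% missing taxa
}

kept = filter_loci(loci, n_taxa=len(taxa), max_missing_taxa=0.5)
overall = total_missing_fraction(kept, taxa)
print(sorted(kept), overall)
```

In this toy case the 50% per-locus threshold retains two of the three loci, and the overall missing fraction of the retained matrix is lower than the threshold because fully sampled loci dilute the incomplete ones. This is why a "50% missing taxa" dataset can have a different overall percentage of missing data.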

Related Organizations

University of Arizona
United States
Clarke University
United States

Keywords

missing data, Reptilia, taxon sampling, Squamata, UCEs
