Publication: Article, 01 Feb 2003

Publisher: Oxford University Press (OUP)

Journal: Systematic Biology, volume 52, pages 119-124 (issn: 1063-5157)

Authors: Michael S. Rosenberg; Sudhir Kumar

doi: 10.1080/10635150309344, 10.1080/10635150390132894

pmid: 12554445

pmc: PMC2796430

Taxon Sampling, Bioinformatics, and Phylogenomics

Abstract

Taxon sampling is often thought to be of extreme importance for phylogenetic inference, and increased sampling of taxa is commonly advocated as a solution to resolving problematic phylogenies. Another solution is to increase the number of sites (by sequencing additional genes) sampled for each taxon. In an ideal world, one would like to increase samples of both taxa and genes, but taxon sampling has not kept up with the pace of gene sampling increase because of the increasing ease of, and emphasis on, genome sequencing.

The question of taxon sampling is necessarily driven by resource limitation. The precise scope of “sufficient” taxon sampling always depends on the questions being addressed. If we need to know the complete phylogeny of a genus, we must sample the genus exhaustively. In experimental design, partial sampling is an issue only when certain taxa can stand as proxies for the clades to which they belong (clade-based or stratified sampling; see Hillis, 1998). In bioinformatics studies, taxon sampling is restricted by data availability in genetic databases (database-restricted sampling). Clearly, the nature of the problem in these two research programs is different. In stratified sampling, we are interested in knowing whether to sequence more genes per species or fewer genes for a large number of species per clade. In contrast, in database-restricted sampling it is important to know whether the overall accuracy of inferred phylogenetic trees for small taxa sets is similar to that of trees inferred from larger taxa sets.

We recently addressed the issue of database-restricted sampling (Rosenberg and Kumar, 2001) and concluded that although there was a consistent decrease in error when using more taxa, the decrease was generally minor relative to the number of taxa added to the data set. Pollock et al. (2002) challenged this conclusion by modifying our measure of the phylogenetic error.
This measure, ΔE, differs from ours: we used the difference in error between the subsampled tree (ES) and the fully sampled tree (EP), whereas Pollock et al. (2002) divided this difference by ES to measure the relative reduction in error. ΔE plotted against the number of additional taxa in the fully sampled tree (66 minus the number of taxa in the subsample tree) shows a clear positive effect (Pollock et al., 2002: Figs. 4, 5). Unfortunately, this impressive result brings little biological benefit, as clearly shown by a scatterplot of the average number of additional branches inferred correctly in each case (Fig. 1). In no instance are there more than 1.5 additional branches reconstructed correctly, even though the number of taxa has often increased manyfold. For instance, more than doubling the number of taxa led to an average increase of only 0.7 additional correct branches (points in the middle of the x-axis in Fig. 1). This fact was clearly noted in our original article: “Note that even though ES is greater than EG and EP for very small subsamples ( ...

... 0.7 and >500 sites (after Pollock et al., 2002).

Figure 4. Plot of the percentage of times interordinal branches were reconstructed correctly when the total number of bases was held constant. In each comparison, the data set with fewer taxa (and more sites per taxon) is always plotted on the x-axis. The dotted ...

Zwickl and Hillis (2002) also challenged the conclusions reached by Rosenberg and Kumar (2001) by using the concept of tree diameter (the maximum distance between any pair of taxa) to partition genes with different subsampled sets of taxa for analysis. They showed that four-taxon subsamples with a smaller tree diameter generate more accurate results than those subsamples with larger tree diameters. This result is expected because, with sequence divergence and length kept constant, the larger diameter four-taxon trees will encompass higher average divergence and would thus involve larger estimation errors.
Furthermore, for the simulations involving the model tree in Figure 2a, four-taxon data sets containing sequences with larger diameters would include interordinal relationships (with many small interior branches) more frequently than would small diameter samples (see also Zwickl and Hillis, 2002: Fig. 3a). Therefore, Zwickl and Hillis’s study is an examination of the phylogenetic error at different evolutionary divergence cross sections of the phylogenetic tree specifically simulated. This and the complete absence of resource limitation (a must for any sampling study) clearly establish that Zwickl and Hillis have not evaluated either stratified or database-restricted taxon-sampling problems. Therefore, Zwickl and Hillis were not correct in stating that their results are in contradiction with our previous results (Rosenberg and Kumar, 2001). In fact, Zwickl and Hillis’s results represent another facet of statistical analysis of the same data.

Also, Zwickl and Hillis took issue with our choice of a fast heuristic search used in computer simulations (Rosenberg and Kumar, 2001). We chose this strategy based on results of multiple previous studies, which showed that the most optimal tree is often more optimal than the true tree and that the fast and more exhaustive searches produce trees with comparable phylogenetic errors (Kumar, 1996; Nei et al., 1998; Takahashi and Nei, 2000). Zwickl and Hillis found that with the maximum parsimony (MP) method for the given data set, the TBR (tree bisection-reconnection) searches produced topologies that had less error than those from NNI (nearest-neighbor interchange). This result (based on a single simulation data set) seems to be in conflict with previous studies. We plan to evaluate this result more thoroughly analytically and by computer simulation in the future.

Figure 2. Model tree for the simulations based on the Eutherian mammal tree from Murphy et al. (2001) and Eizirik et al. (2001). (a) Full 66-taxon tree; interordinal relationships are represented by thick branches designated with letters. (b) Phylogenetic relationships ...

Figure 3. Plot of the percentage of times the interordinal branches were reconstructed correctly in 66-taxon trees versus n-taxon trees, where n = 15, 30, and 45. These values are for all genes and all replicates. The dotted lines indicate a 1:1 relationship. Analyses ...

However, we extrapolated our database-restricted sampling and random sampling results to conclude that the phylogenetic trees with fewer taxa but large numbers of genes per taxon may be more accurate than those with many taxa but fewer genes (Rosenberg and Kumar, 2001). Neither Pollock et al. (2002) nor Zwickl and Hillis (2002) addressed that issue, which lies at the heart of the experimental design. Here, we tackle this issue along with the biological relevance of many other assumptions made and conclusions reached by Rosenberg and Kumar (2001) that Zwickl and Hillis (2002) objected to. We show that the conclusions reached by Rosenberg and Kumar (2001) are applicable to both phyloinformatic and phylogenomic studies.
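
The two error measures contrasted in the text (the plain difference ES − EP versus that difference divided by ES) can be sketched as follows. This is only an illustration: the function names and the input values are hypothetical, not taken from either study, and the sketch assumes nothing about the units of error beyond ES and EP being measured on the same scale.

```python
def absolute_error_reduction(e_sub: float, e_full: float) -> float:
    """Difference in error between the subsampled and fully sampled tree (ES - EP)."""
    return e_sub - e_full


def relative_error_reduction(e_sub: float, e_full: float) -> float:
    """The same difference divided by ES, i.e. the relative reduction in error."""
    if e_sub == 0:
        raise ValueError("relative reduction is undefined when ES is zero")
    return (e_sub - e_full) / e_sub


# Hypothetical error values, chosen only for illustration (not from any cited study).
e_sub, e_full = 2.0, 1.5
print(absolute_error_reduction(e_sub, e_full))  # 0.5
print(relative_error_reduction(e_sub, e_full))  # 0.25
```

With these made-up inputs the relative measure reports a 25% reduction while the absolute difference is half a unit of error, so the two measures can give quite different impressions of the same pair of trees.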

Related Organizations

Arizona State University
United States

Keywords

Models, Statistical, Sample Size, Computational Biology, Genomics, Algorithms, Phylogeny

Impact by BIP!

citations: 122
popularity: Top 10%
influence: Top 10%
impulse: Top 1%