
Species limits are traditionally determined based on morphological, behavioral, and ecological traits. In recent years, genetic sequence data have increasingly been used to delimit species, owing to advances in sequencing technologies and the development of statistical methods for data analysis (Wiens 2007; Fujita et al. 2012). Early methods relied on reciprocal monophyly in the reconstructed gene trees, fixed sequence differences between putative species, or simple cut-offs on migration rates or genetic distances between putative species (Sites and Marshall 2004). More recent methods are based on the multispecies coalescent model (Rannala and Yang 2003) and avoid arbitrary cut-offs (Knowles and Carstens 2007). Among the recent methods, the Bayesian method of Yang and Rannala (2010) has a number of advantages over its competitors (Fujita and Leache 2011). The Bayesian method uses Bayesian model selection to compare different species-delimitation models in the multispecies coalescent framework, and uses reversible-jump Markov chain Monte Carlo (rjMCMC) to estimate the posterior probabilities for different delimitation models. The method accommodates multiple loci and does not require reciprocal monophyly of inferred gene trees. The underlying multispecies coalescent model accounts for incomplete lineage sorting and species-tree–gene tree conflicts due to ancestral polymorphism. The likelihood calculation on sequence alignments allows the method to make full use of the information in the data while accounting for the uncertainties in the gene tree topologies and branch lengths. Compared with traditional morphology-based taxonomic practice, which varies widely across taxonomic groups, the Bayesian method infers species status from a genealogical and population genetic perspective and is arguably more objective (Fujita and Leache 2011; Fujita et al. 2012).
In computer simulations, the Bayesian method was found to have good statistical properties (Leache and Fujita 2010; Zhang et al. 2011; Camargo et al. 2012), with low rates of false positives (the error of splitting one species into two) and false negatives (the error of failing to recognize distinct species). Simulations also suggest that the method has good power in identifying distinct species in the presence of small amounts of gene flow, and is not misled to infer geographical populations as distinct species when the migration rate is high (Zhang et al. 2011). To reduce the space of models to be evaluated in the rjMCMC, the implementation of the method (Yang and Rannala 2010; Rannala and Yang 2013) in the program bpp (for Bayesian Phylogenetics and Phylogeography) requires the user to specify a rooted phylogeny for the populations, called the guide tree. The program then evaluates only those models that can be generated by collapsing nodes on the guide tree. The program currently does not change the relationships among the populations, nor does it split a population into different species. As a simple evaluation of the impact of the guide tree on species delimitation by bpp, Leache and Fujita (2010) randomized the populations at the tips of a 10-population guide tree for West African forest geckos and found that the incorrect guide tree caused bpp to over-split. When closely related populations that belong to the same species are incorrectly separated on the guide tree and are grouped with more distant populations, bpp tends to infer all of them as distinct species. However, the analysis of Leache and Fujita (2010) is on a small scale. Furthermore, the random guide trees generated by permutation may be far more wrong than any guide tree likely to be encountered in real data analysis, where the guide tree is estimated from the data.
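The restriction to models "generated by collapsing nodes on the guide tree" can be illustrated with a minimal sketch. It assumes a rooted binary guide tree written as nested tuples and a hypothetical four-population example; the function names are illustrative and are not part of bpp. Each internal node is either collapsed (all populations below it form one species) or kept as a split, in which case the two child subtrees are delimited independently.

```python
from itertools import product

# Hypothetical 4-population guide tree ((A, B), (C, D)); tips are population names.
GUIDE_TREE = (("A", "B"), ("C", "D"))


def tips(node):
    """Return the list of tip populations under a node."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return tips(left) + tips(right)


def delimitations(node):
    """Enumerate the delimitation models obtainable by collapsing nodes under `node`.

    Each model is a list of species; each species is a frozenset of populations.
    """
    if isinstance(node, str):
        return [[frozenset([node])]]
    left, right = node
    # Option 1: collapse this node, so every tip below it belongs to one species.
    models = [[frozenset(tips(node))]]
    # Option 2: keep this split and delimit the two subtrees independently.
    for left_model, right_model in product(delimitations(left), delimitations(right)):
        models.append(left_model + right_model)
    return models


def count_models(node):
    """Count models without enumerating them: 1 + (left count) * (right count)."""
    if isinstance(node, str):
        return 1
    left, right = node
    return 1 + count_models(left) * count_models(right)


if __name__ == "__main__":
    for model in delimitations(GUIDE_TREE):
        print(" | ".join("".join(sorted(species)) for species in model))
```

Under these assumptions the model space depends on the shape of the guide tree as well as on the number of populations, and populations that are not sister groups on the guide tree can be merged only by collapsing their common ancestor together with every node below it. This is consistent with the observation above that a misplaced population tends to produce over-splitting.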
Here, we conduct a simulation study to examine the performance of the method under more realistic scenarios, that is, when the guide tree is inferred from the sequence data. A number of heuristic methods have been used to construct the guide tree, including: (a) clustering algorithms such as structure (Pritchard et al. 2000; Falush et al. 2003), structurama (Huelsenbeck and Andolfatto 2007), or baps (Corander et al. 2004), which can assign individuals to populations and even infer a population tree, and which are often applied to microsatellite data or single-nucleotide polymorphisms (SNPs); (b) phylogenetic methods such as RAxML (Stamatakis 2006) and MrBayes (Ronquist et al. 2012) applied to either a mitochondrial locus or concatenated nuclear loci; (c) species-tree methods such as best (Liu 2008) or *beast (Heled and Drummond 2010) applied to multiple nuclear loci; (d) species-discovery methods such as that of O'Meara (2010); and (e) empirical population phylogeny based on geographical distributions or morphological and ecological characters. A useful review of strategies for generating the guide tree used in recent studies of species delimitation by bpp has been provided by Carstens et al. (2013, table 1). Geographical distributions and morphological and ecological features of the populations are always important to defining putative species. However, it is difficult to consider such information in a simulation. In this study, we examine strategies (b) and (c) for obtaining a guide tree by analyzing DNA/RNA sequence data. The first approach we examine (strategy b) uses phylogenetic analysis of a mitochondrial locus. Note that in vertebrates, the mitochondrial genome has a much higher mutation rate than the nuclear genome so that the sequence data are more variable and more informative (e.g., Zhou et al. 2012).
Furthermore, the effective population size for a mitochondrial locus is only one-fourth that for a nuclear locus, so that incomplete lineage sorting is less likely to occur and the mitochondrial gene tree is more likely to match the species/population phylogeny. This method has been used by Leache and Fujita (2010), Hamback et al. (2013), and Linde et al. (2014), among others. We use the program RAxML (Stamatakis 2006) to infer the unrooted maximum-likelihood (ML) tree and mid-point rooting to generate the rooted tree to be used as the guide tree for bpp. The program is widely used and provides a fast method to infer gene trees using ML. We also used the Bayesian method to infer rooted gene trees for the mitochondrial locus under the molecular clock, using the program beast (Drummond and Rambaut 2007), but we expect the results to be similar to those of the ML method.
Table 1. Parameter values used in simulating sequences at the nuclear loci
The second approach we examine (strategy c) is the use of species-tree methods applied to multiple nuclear loci. We use *beast (Heled and Drummond 2010) for this purpose. We note that it is possible to apply a traditional phylogenetic method such as ML to the concatenated nuclear data, but concatenation is in general inferior to species-tree methods based on the multispecies coalescent model (see Degnan and Rosenberg [2009] and Edwards [2009] for reviews). The strategy of using *beast to infer the guide tree for species delimitation by bpp has been used by Leache and Fujita (2010), Linde et al. (2014), and Satler et al. (2013), among others. To keep the complexity of our simulation manageable, we do not consider the problem of assignment errors in this study and assume that the individuals are correctly assigned to the populations (see discussions later).
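The argument that a mitochondrial locus, with one-fourth the effective population size, is less prone to incomplete lineage sorting can be made concrete with a rough sketch. It relies on an assumption not stated in the text above: the standard three-species coalescent result that, with one sequence per species, the gene tree mismatches the species tree with probability (2/3)exp(-T), where T is the internal branch length in coalescent units (2N generations for a diploid nuclear locus). The function name and the parameter values below are hypothetical.

```python
import math


def mismatch_probability(t_generations, n_diploid, ne_scale=1.0):
    """Probability that a three-species gene tree mismatches the species tree.

    Assumes one sequence per species and the standard result (2/3) * exp(-T),
    with T the internal branch length in coalescent units.

    t_generations: internal branch length in generations.
    n_diploid:     diploid population size N, so the nuclear unit is 2N generations.
    ne_scale:      effective size relative to a nuclear locus
                   (1.0 for nuclear, 0.25 for a mitochondrial locus).
    """
    t_coalescent = t_generations / (2.0 * n_diploid * ne_scale)
    return (2.0 / 3.0) * math.exp(-t_coalescent)


if __name__ == "__main__":
    n = 10_000   # hypothetical population size
    t = 2 * n    # hypothetical internal branch of 2N generations
    print("nuclear:      ", mismatch_probability(t, n))
    print("mitochondrial:", mismatch_probability(t, n, ne_scale=0.25))
```

With the same internal branch, the branch is four times longer in coalescent units for the mitochondrial locus, so under these assumptions the mismatch probability drops from (2/3)exp(-T) to (2/3)exp(-4T).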
Models, Genetic, Genetic Speciation, Bayes Theorem, Classification, Mitochondria, Points of View, Sample Size, Phylogeny
