Simulated datasets used in our paper "CherryML: Scalable Maximum Likelihood Estimation of Phylogenetic Models" to produce Fig. 1b and c, Fig. 1d, and Fig. 2a and b. The data provided in each folder is as follows:

rate_matrices contains the classical LG rate matrix and our estimated 400 x 400 co-evolutionary model Q2.

fig_1bc contains the simulated data used to estimate and evaluate rate matrices with the CherryML method and with EM (using XRATE), as shown in Fig. 1b and c of our paper. The files and sub-directories here are:

- fig_1bc_simulated_data_families_all.txt contains the list of protein family names used to train the model. When only K families are used in Fig. 1b and c, these are the first K families of this list.
- gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These trees were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
- msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree, without site rate variation.
- gt_site_rates_dir contains the site rates used. In this case, they are all 1 (i.e. no site rate variation).
- gt_likelihood_dir contains the log-likelihood of the original phylogenetic tree used for each family (as given by FastTree). This is not needed to use the data but is provided for completeness; you can safely ignore this directory.

fig_1d contains the simulated data used to evaluate the effect of time quantization on the CherryML method, as shown in Fig. 1d of our paper. The files and sub-directories here are:

- gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These trees were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
- msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree, with site rate variation.
- gt_site_rates_dir contains the site rates used.
- gt_likelihood_dir contains the log-likelihood of the original phylogenetic tree used for each family (as given by FastTree). This is not needed to use the data but is provided for completeness; you can safely ignore this directory.

fig_2ab contains the simulated data used for the co-evolutionary model experiments shown in Fig. 2a and b of our paper. The files and sub-directories here are:

- gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These trees were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
- msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree for the non-contacting sites and Q2 for the contacting sites, all without site rate variation.
- gt_site_rates_dir contains the site rates used, in this case all 1 (i.e. no site rate variation).
- gt_likelihood_dir contains the log-likelihood of the original phylogenetic tree used for each family (as given by FastTree). This is not needed to use the data but is provided for completeness; you can safely ignore this directory.
- contact_map_dir contains the simulated contact map for each family. These were obtained by computing a maximal matching on the true contact maps derived from the trRosetta paper, as described in detail in our paper.

The exact end-to-end code that generates these simulated datasets is provided in our GitHub repository: https://github.com/songlab-cal/CherryML

By default, when you reproduce the figures in our paper by running the `reproduce_all_figures.py` script in our repository, the data is simulated automatically if it isn't already present. To skip the simulation, download the data from this Zenodo record and change the top of `reproduce_all_figures.py` to point to these files.
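The "first K families" rule for fig_1bc, together with the per-family directory layout described above, can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the CherryML repository: the helper names `first_k_families` and `family_files` are hypothetical, the family list is assumed to hold one name per line, and the `<family>.txt` file naming inside each sub-directory is a guess that should be checked against the downloaded files.

```python
from pathlib import Path


def first_k_families(list_path, k):
    """Return the first k family names from the family list file.

    Assumption: the file holds one family name per line; blank lines are skipped.
    """
    names = []
    with open(list_path) as handle:
        for line in handle:
            name = line.strip()
            if name:
                names.append(name)
    if not 0 <= k <= len(names):
        raise ValueError(f"requested {k} families, but the list has {len(names)}")
    return names[:k]


def family_files(data_root, family, extension=".txt"):
    """Build the expected path for `family` in each per-family sub-directory.

    Assumption: files are named `<family><extension>`; this naming is a guess
    and is not stated in the dataset description.
    """
    root = Path(data_root)
    subdirs = ["gt_tree_dir", "msa_dir", "gt_site_rates_dir", "gt_likelihood_dir"]
    return {sub: root / sub / f"{family}{extension}" for sub in subdirs}
```

Under these assumptions, calling `first_k_families("fig_1bc/fig_1bc_simulated_data_families_all.txt", 4)` would give the families behind a K = 4 run, and `family_files("fig_1bc", name)` would give the tree, MSA, site-rate, and likelihood paths to inspect for each of them.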

Related Organizations: University of California, Berkeley (United States)

Keywords: Phylogenetics
