ZENODO
Dataset . 2023
License: CC BY
Data sources: Datacite


CherryML: Scalable Maximum Likelihood Estimation of Phylogenetic Models

Authors: Prillo, Sebastian; Deng, Yun; Boyeau, Pierre; Li, Xingyu; Chen, Po-Yen; Song, Yun S.

Abstract

Simulated datasets used in our paper "CherryML: Scalable Maximum Likelihood Estimation of Phylogenetic Models" to produce Figs. 1b-c, 1d, and 2a-b. The data provided in each folder is as follows:

  • rate_matrices contains the classical LG rate matrix and our estimated 400 x 400 co-evolutionary model Q2.
  • fig_1bc contains the simulated data used to estimate and evaluate rate matrices using the CherryML method and EM (with XRATE), as shown in Fig. 1b and c of our paper. The files and sub-directories here are:
    • fig_1bc_simulated_data_families_all.txt contains the list of protein family names used to train the model. When only K families are used in Fig. 1b and c, these are the first K families of this list.
    • gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These trees were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
    • msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree, without site rate variation.
    • gt_site_rates_dir contains the site rates used. In this case, they are all 1.
    • gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant for our purposes but provided for completeness; you can safely ignore this directory.
  • fig_1d contains the simulated data used to evaluate the effect of time quantization on the CherryML method, as shown in Fig. 1d of our paper. The files and sub-directories here are:
    • gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These trees were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
    • msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree, with site rate variation.
    • gt_site_rates_dir contains the site rates used.
    • gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant for our purposes but provided for completeness; you can safely ignore this directory.
  • fig_2ab contains the simulated data used for Fig. 2a and b of our paper. The files and sub-directories here are:
    • gt_tree_dir contains the phylogenetic tree used to simulate data for each protein family. These trees were originally estimated by running FastTree on the MSAs from the trRosetta paper, as described in detail in our paper.
    • msa_dir contains the simulated multiple sequence alignments (MSAs). These were simulated by running the LG rate matrix down each tree for the non-contacting sites and Q2 for the contacting sites, all without site rate variation.
    • gt_site_rates_dir contains the site rates used, in this case all 1 (i.e. no site rate variation).
    • gt_likelihood_dir contains the log-likelihood of the original phylogenetic trees used for each family (as given by FastTree). This is irrelevant for our purposes but provided for completeness; you can safely ignore this directory.
    • contact_map_dir contains the simulated contact maps for each family. These were obtained by computing a maximal matching on the true contact maps derived from the trRosetta paper, as described in detail in our paper.

The exact end-to-end code that generates these simulated datasets is provided in our GitHub repository (https://github.com/songlab-cal/CherryML). By default, when you reproduce the figures in our paper by running the `reproduce_all_figures.py` script in our repository, the data is simulated automatically if it is not already present. To bypass this, download the data here from Zenodo and change the top of `reproduce_all_figures.py` to point to these files.
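Two of the conventions described above can be illustrated with a minimal Python sketch: taking the first K families of the training list, and reducing a contact map to a maximal matching so that each site takes part in at most one contacting pair. This is only an illustration under stated assumptions, not the code from the CherryML repository. The function names, the family names, and the in-memory representation of the contact map (a symmetric 0/1 matrix) are all hypothetical; the greedy procedure shown is one simple way to get a maximal matching and may differ from what was actually used to produce contact_map_dir.

```python
from typing import List, Sequence, Tuple


def first_k_families(family_names: Sequence[str], k: int) -> List[str]:
    """Return the first k family names, preserving the order of the list."""
    return list(family_names[:k])


def greedy_maximal_matching(
    contact_map: Sequence[Sequence[int]],
) -> List[Tuple[int, int]]:
    """Greedily pair contacting sites so that each site is used at most once.

    Assumes contact_map is a symmetric 0/1 matrix (hypothetical in-memory
    representation); only the upper triangle (i < j) is inspected.
    """
    n = len(contact_map)
    used = [False] * n
    pairs: List[Tuple[int, int]] = []
    for i in range(n):
        if used[i]:
            continue
        # Match site i with the first still-unmatched site j > i in contact.
        for j in range(i + 1, n):
            if not used[j] and contact_map[i][j]:
                used[i] = used[j] = True
                pairs.append((i, j))
                break
    return pairs


# Hypothetical family names, standing in for the entries of the families list.
example_families = ["fam_a", "fam_b", "fam_c", "fam_d"]
print(first_k_families(example_families, 2))  # ['fam_a', 'fam_b']

# Toy contact map on 4 sites with contacts (0, 2), (1, 2), and (1, 3).
example_contacts = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
print(greedy_maximal_matching(example_contacts))  # [(0, 2), (1, 3)]
```

In the toy example, site 2 is in contact with both site 0 and site 1, but after matching it appears in only one pair. That is the property which lets each contacting pair be simulated jointly (here with Q2) without any site belonging to two pairs.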

Keywords

Phylogenetics
