<script type="text/javascript">
<!--
document.write('<div id="oa_widget"></div>');
document.write('<script type="text/javascript" src="https://www.openaire.eu/index.php?option=com_openaire&view=widget&format=raw&projectId=undefined&type=result"></script>');
-->
</script>

COPY SCRIPT

For further information contact us at helpdesk@openaire.eu

github.com/aofarrel/tree_nine/tree_nine

Name: github.com/aofarrel/tree_nine/tree_nine
Creator: Ash O'Farrell
Keywords: phylogenetics, usher, taxonium, nextstrain, auspice, pathogen

integration_instructionsResearch softwarekeyboard_double_arrow_right Software 18 Mar 2025Publisher:Zenodo

Authors: Ash O'Farrell;

github.com/aofarrel/tree_nine/tree_nine

- Summary
- Subjects
- Metrics

Abstract

# Tree Nine Put diff files on an existing phylogenetic tree using [UShER](https://www.nature.com/articles/s41588-021-00862-7)'s `usher sampled` task with a bit of help from [SRANWRP](https://www.github.com/aofarrel/SRANWRP), followed by conversion of that tree to Taxonium, Newick, and Nextstrain formats. Samples' SNP distance is calculated and output as a distance matrix, and samples will be placed into clusters based on the distance. Verified on Terra-Cromwell and miniwdl. Make sure to add `--copy-input-files` for miniwdl. Default inputs assume you're working with _Mycobacterium tuberculosis_, be sure to change them if you aren't working with that bacterium. This repo also contains the following subworkflows: * [Annotate](./annotate.md) * [Convert to Nextstrain](./convert_to_nextstrain.md) (for viewing in Auspice, non-clade sample annotations, etc) * [Extract](./extract.md) * [Mask tree](./mask_tree.wdl) * [Mask subtree](./mask_subtree.wdl) * [Summarize](./summarize.md) ## features * Highly scalable, even on lower-end computes * Can input a single pre-combined diff file * Includes a sample input tree created from SRA data if no input tree is specified * Trees automatically converted to UsHER (.pb), Taxonium (.jsonl.gz), Newick (.nwk), and Nextstrain (.json) formats * Automatic clustering based on configurable genetic distance * Nextstrain tree(s) will be annotated by cluster * Clustering can be limited to only samples specified by the user, all newly added samples, or all samples * Clustering is also performed after backmasking * (optional) Create per-cluster Nextstrain subtrees * (optional) Reroot the tree to a specified node * (optional) Backmask newly-added samples against each other to hide positions where any newly-added sample lacks data, then create a new set of trees based on the backmasked diff files * Designed for highly clonal samples which have a plausible direct epidemiological relationship * Backmasking can only be performed on samples which have a sample-level diff files * (optional) Summarize input, reroot, and output trees with matutils * (optional) Filter out positions by coverage at that position and/or entire samples by overall coverage * (optional) Specify your own reference genome if you don't want to work with H37Rv * (optional) Annotate clades via matutils with a specified annotation TSV ## benchmarking Formal benchmarks have not been established, but a full run of placing 60 new TB samples on an existing 7000+ TB sample tree, conversion to taxonium and newick formats, distance matrixing, clustering finding, and creating cluster-specific Nextstrain trees executes in about five minutes on a 2019 Macbook Pro. Backmasking is the least scalable part of the pipeline. The comparison itself theoretically scales n2 and once the comparison is completed, n backmasked disk files must be written to the disk. We have observed that memory problems tend to arise during the file-writing part when n≥55 on a local machine. Runtime attributes are adjustable as task-level variables to aid with scaling on cloud backends, although we have seen the default handle 60 samples at a time without much issue.

Related Organizations

University of California, Santa Cruz
United States

Keywords

phylogenetics, usher, taxonium, nextstrain, auspice, pathogen

Impact byBIP!

	citations This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

Average