Tree Nine ‐ Overview
Tree Nine, named after a Douglas Fir in UCSC's Upper Campus forestland, is a phylogenetics workflow. Its main goals are:
- Take in an array of MAPLE-formatted diff files (discrete files, already concatenated, or both)
- Put your samples onto a phylogenetic tree using UShER
- Run matOptimize on the resulting UShER tree
- Convert the resulting UShER tree into more widely compatible formats, such as Taxonium-compatible JSONL and Auspice-compatible JSON
- Cluster your samples into 20 SNP, 10 SNP, and 5 SNP clusters for pathogen tracking
Tree Nine is preconfigured for tuberculosis. In theory it can be used for other organisms by passing in a different reference genome and base tree, but this isn't officially supported.
This is intended to be detailed documentation -- if you are just looking for single-sentence quick descriptions, please see the parameter_meta arguments on the Tree Nine WDL script itself.
At an absolute minimum you will need Array[File] diffs, which is an array of MAPLE-formatted diff files. Each diff file in this array is expected to be single-sample. Tree Nine will take in these diffs, extract their sample names, and output a single concatenated multi-sample diff file, which will later be passed into UShER for processing.
If you already have a preconcatenated diff file of multiple samples, pass it in as File? existing_diffs, and provide a newline-delimited list of its sample IDs via File? existing_samples.
If File? existing_diffs, File? existing_samples, and Array[File] diffs are all provided, the single-sample diffs in Array[File] diffs will be concatenated to the bottom of File? existing_diffs, and the single-sample diffs' sample IDs will likewise be concatenated to the end of File? existing_samples.
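The concatenation behavior described above can be sketched in a few lines of Python. This is only an illustration, not Tree Nine's actual implementation: the function names `sample_name` and `combine_diffs` are hypothetical, and the sketch assumes each single-sample diff begins with a header row of the form `>sample_id`.

```python
from typing import Iterable, List, Optional, Tuple


def sample_name(diff_text: str) -> str:
    """Return the sample name from a single-sample diff's header row.

    Assumption: the header row is the first non-empty line and starts with '>'.
    """
    for line in diff_text.splitlines():
        if line.strip():
            if not line.startswith(">"):
                raise ValueError(f"expected a '>' header row, got: {line!r}")
            return line[1:].strip()
    raise ValueError("diff is empty")


def combine_diffs(
    single_diffs: Iterable[str],
    existing_diffs: Optional[str] = None,
    existing_samples: Optional[str] = None,
) -> Tuple[str, str]:
    """Append single-sample diffs to the bottom of an existing multi-sample
    diff, and their sample names to the end of the existing sample list.

    Returns (combined_diff_text, newline_delimited_sample_ids).
    """
    diff_parts: List[str] = []
    names: List[str] = []
    if existing_diffs:
        diff_parts.append(existing_diffs.rstrip("\n"))
    if existing_samples:
        names.extend(n for n in existing_samples.splitlines() if n.strip())
    for text in single_diffs:
        names.append(sample_name(text))
        diff_parts.append(text.strip("\n"))
    return "\n".join(diff_parts) + "\n", "\n".join(names) + "\n"
```

Note that the order of the output sample list follows the order of the inputs: existing samples first, then each new single-sample diff in array order.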
File? input_tree is the base tree that will be passed into UShER. If not provided, a rudimentary 7K sample tree of some MTBC samples from SRA will serve as the base tree. This default base tree doesn't represent the genetic diversity of MTBC well, has not undergone pruning, was made with a very old version of myco that skipped some sample cleaning steps, and should not be used for anything besides quick testing.
If you do NOT intend to use H37Rv as your reference genome, you will need to fill in the optional File? ref_genome argument. If it is not provided, Tree Nine uses the copy of H37Rv that's already included in the Docker image.
Note
In the future, Tree Nine will support pulling sample-level metadata directly from a Terra data table and embedding that into phylogenetic trees. For this use case, setting Array[String] entity_ids to the entity:xxxxx_id column will be a hard requirement to avoid the Terra data table misalignment issue. It will additionally be expected that your sample IDs, as defined by the header row of their diff file, perfectly match their entity_id value.
Boolean identify_clusters controls whether the clustering logic will run at all, while Boolean upload_clusters_to_microreact controls whether clusters are uploaded to Microreact. If you are debugging the clustering logic, it is highly recommended to set Boolean upload_clusters_to_microreact to false to prevent any potential mismatch between your local cluster information and what's available on Microreact.
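If you want to check this alignment yourself before a run, a pre-flight comparison could look something like the sketch below. It is not part of Tree Nine: `mismatched_ids` is a hypothetical helper, and it assumes you have already read the header row of each diff (in the same order as entity_ids) and that each header row has the form `>sample_id`.

```python
from typing import List, Sequence, Tuple


def mismatched_ids(
    entity_ids: Sequence[str], diff_headers: Sequence[str]
) -> List[Tuple[int, str, str]]:
    """Return (index, entity_id, sample_id) for every position that disagrees.

    Assumption: diff_headers[i] is the header row of the i-th diff file and
    starts with '>', followed by the sample ID.
    """
    if len(entity_ids) != len(diff_headers):
        raise ValueError(
            f"{len(entity_ids)} entity IDs but {len(diff_headers)} diff headers"
        )
    problems: List[Tuple[int, str, str]] = []
    for i, (eid, header) in enumerate(zip(entity_ids, diff_headers)):
        sid = header.strip()
        if sid.startswith(">"):
            sid = sid[1:].strip()
        if sid != eid:
            problems.append((i, eid, sid))
    return problems
```

An empty list means every sample ID matches its entity_id at the same index; anything else points at the rows that would be misaligned.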
Caution
Uploading to Microreact requires a Microreact API key. Never put this key inside a Docker image. It's also not a good idea to pass it in directly as a raw string, as this will cause it to be visible in Terra's logs.
Boolean cluster_entire_tree is a highly experimental option to attempt to cluster ALL samples on the resulting .pb file from UShER, including samples that were on the base tree (File? input_tree) as opposed to being added via Array[File] diffs or File? existing_diffs. This feature is known to have worked in earlier versions of the pipeline, but because of its sheer time and cost requirements it has not been invoked in quite some time, and it should not be enabled without careful consideration.
The way we cluster things for tuberculosis is very different from how we cluster things for SARS-CoV-2, but they use similar logic for maintaining persistent cluster IDs. Just like the covid pipeline, if you are trying to keep consistent IDs for your clusters, you will need a sample-indexed File? persistent_cluster_meta TSV and a sample-indexed File? persistent_cluster_ids TSV to properly track your samples -- please see previous runs for precise format details. The tuberculosis pipeline additionally requires File? previous_run_cluster_json from your most recent preceding run of Tree Nine.
Tip
Tree Nine currently does not support "starting over" cluster IDs, but you are free to modify File? persistent_cluster_ids, File? persistent_cluster_meta, and File? previous_run_cluster_json to remove all information about the last run's clusters. This will essentially cause Tree Nine to treat everything as a new cluster. Be warned, however, that this will NOT update or delete any existing Microreact projects that may have the old information!