uDANCE version 1.6.3
https://github.com/balabanmetin/uDANCE-data
uDANCE is a highly scalable end-to-end workflow for inferring phylogenomic trees or updating existing ones. The input to uDANCE is a backbone tree, a set of either DNA or amino-acid multiple sequence alignments (MSAs) of backbone sequences, and new (query) sequences. Alternatively, when a backbone tree is not available, uDANCE can select a set of backbone species with high diversity to reconstruct a backbone tree. At a high level, uDANCE inserts the query sequences on the backbone tree independently and then refines the tree locally in different parts. The backbone is allowed to change based on the new information provided by the query sequences, but uDANCE also outputs an incremental tree with the backbone relationships fixed for users who require consistency between updates in their analyses. Since uDANCE aims for automatic analyses of large data, it has many built-in quality control and filtering strategies; it may decide that a set of query sequences cannot be confidently placed in the output tree (i.e., they are unplaceable) and that some backbone sequences need to be removed. When more sequences become available, the output from the previous iteration can be used as the input to the next iteration to incrementally grow the tree.
- Linux or OS X (Intel or Apple chip)
- Anaconda
- Clone the repository
cd uDANCE
bash install.sh
conda activate uDANCE
If you want to use raxml-ng in your workflow, please install raxml-ng manually and make sure the executable raxml-ng is available in your path.
You can run uDANCE on the test dataset with 10 gene multiple sequence alignments, 99 backbone sequences, and 146 query sequences using the following command. The command should complete in several minutes using 4 CPU cores:
snakemake -c 4 --configfile config.yaml --snakefile uDANCE.smk all
The specification of the run is defined in the config file config.yaml. In config.yaml, the working directory is specified using the field workdir. Input multiple sequence alignments should be located under <workdir>/alignments. For example, for the test dataset, the working directory is datasmall and the input alignments are located at datasmall/alignments.
If there is an input backbone tree as well, it should be located at <workdir>/backbone.nwk, and the backbone field in the config file should be set to "tree".
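The layout described above can be checked before launching a run. The following is a minimal sketch, not part of uDANCE: the function name check_layout is hypothetical, and it only tests for the two paths named in the text (<workdir>/alignments, plus <workdir>/backbone.nwk when the backbone field is "tree").

```python
from pathlib import Path


def check_layout(workdir, backbone_mode):
    """Return a list of problems found in a uDANCE working directory.

    Hypothetical helper: it only checks the paths described in the text.
    """
    root = Path(workdir)
    problems = []
    # Input MSAs are expected under <workdir>/alignments.
    if not (root / "alignments").is_dir():
        problems.append(f"missing directory: {root / 'alignments'}")
    # A user-supplied backbone tree is expected at <workdir>/backbone.nwk.
    if backbone_mode == "tree" and not (root / "backbone.nwk").is_file():
        problems.append(f"missing file: {root / 'backbone.nwk'}")
    return problems


if __name__ == "__main__":
    for problem in check_layout("datasmall", "tree"):
        print(problem)
```

An empty list means both expected inputs were found; the check says nothing about whether the alignments themselves are valid.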
The uDANCE workflow is implemented using Snakemake. Below is the diagram of the rule graph showing the relationship between the stages of the workflow.
Brief summary of each rule:
# | Rule | Description | #instances |
---|---|---|---|
1 | trimtaper/trimcollect | Alignment trimming and error correction using TAPER | #genes |
2 | mainlines | Selection of a subset of sequences for backbone using Mainlines algorithm | 1 |
3 | prepbackbonegenes | Creating one partition for the selected backbone sequences | #genes |
4 | genetreeinfer | Gene tree inference (RAxML, IQTree-2, or RAxML-NG) | #genes × #partitions |
5 | refine | ASTRAL species tree inference | #partitions |
6 | placement_prep | Backbone quality control, filtering, and MSA concatenation for phylogenetic placement | 1 |
7 | placement | Phylogenetic placement of query sequences onto backbone tree using APPLES-2 | 1 |
8 | decompose | Placement tree decomposition algorithm and creation of partitions | 1 |
9 | stitch | Stitching algorithm for partition species trees | 1 |
Note that the genetreeinfer and refine rules are used in two separate parts of the workflow: once to obtain the backbone tree (de-novo) and once on every partition created after phylogenetic placement.
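To make the #instances column concrete, the sketch below tabulates the number of job instances per rule for one pass through the workflow. The function name job_counts is hypothetical, and the partition count used in the usage note is an illustrative value, not a property of the test dataset.

```python
def job_counts(n_genes, n_partitions):
    """Job instances per rule for one pass, following the #instances column."""
    return {
        "trimtaper/trimcollect": n_genes,
        "mainlines": 1,
        "prepbackbonegenes": n_genes,
        "genetreeinfer": n_genes * n_partitions,
        "refine": n_partitions,
        "placement_prep": 1,
        "placement": 1,
        "decompose": 1,
        "stitch": 1,
    }


if __name__ == "__main__":
    # 10 genes (as in the test dataset) and a hypothetical 5 partitions.
    for rule, count in job_counts(10, 5).items():
        print(f"{rule}\t{count}")
```

With 10 genes and 5 partitions, genetreeinfer accounts for 50 instances, which is why gene tree inference tends to dominate the job count as the number of partitions grows.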
uDANCE writes its output under the directory <workdir>/output. The workflow outputs three phylogenies: <workdir>/output/uDANCE.maxqs.nwk, <workdir>/output/uDANCE.incremental.nwk, and <workdir>/output/uDANCE.updates.nwk. The incremental tree guarantees that the backbone topology is fixed in the output tree. The maxqs tree is the "best" one inferred by uDANCE, and the location of backbone sequences might change after insertion of query sequences. We will shortly come back to the exact definition of these three output trees.
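As a small illustration of the naming scheme, the sketch below builds the path of each output tree from the working directory. The helper output_tree_path and the constant TREE_KINDS are hypothetical names introduced here; only the file names come from the text.

```python
from pathlib import Path

# The three output phylogenies named in the text.
TREE_KINDS = ("maxqs", "incremental", "updates")


def output_tree_path(workdir, kind):
    """Path of one of the three output trees under <workdir>/output."""
    if kind not in TREE_KINDS:
        raise ValueError(f"unknown tree kind {kind!r}; expected one of {TREE_KINDS}")
    return Path(workdir) / "output" / f"uDANCE.{kind}.nwk"


if __name__ == "__main__":
    for kind in TREE_KINDS:
        print(output_tree_path("datasmall", kind))
```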
The uDANCE workflow also outputs many intermediate files. These files can be useful for debugging as well as for supplementing downstream analyses. All paths below are given relative to the output directory <workdir>/output.
- trimmed contains the trimmed input MSAs.
- backbone.nwk is the backbone tree file. If there is an input backbone tree, it is identical to the input backbone tree. Otherwise, backbone sequences are selected using the Mainlines algorithm and the selected sequence IDs are written to backbone/0/species.txt. Gene trees for the selected sequences are found in the directory backbone/0/<gene>. This directory contains subdirectories named 1 to k, where k is the number of starting trees for the inference of the gene tree for this gene. backbone/0/<gene>/<i>/shrunk.fasta.treefile is the maximum likelihood tree inferred for the gene using starting tree i. The highest-likelihood tree among all k starts is backbone/0/<gene>/bestTree.nwk, and the relative path to the starting tree that yielded the highest likelihood is written to backbone/0/<gene>/bestTreename.txt. All gene trees in newick format are written to backbone/0/astral_input.trees, one line per tree. This file is provided to ASTRAL as the input. No constraint tree is used during the estimation of the backbone tree. ASTRAL's output file (a newick tree) is backbone/0/astral_output.updates.nwk.
- placement contains the backbone and query alignments and the backbone tree used at the placement stage. placement.jplace is the jplace file output by APPLES-2.
- The partitions created by uDANCE are located under the directory uDANCE. This directory contains subdirectories named 0 to p-1, where p is the number of partitions designated for the uDANCE run. datasmall/output/uDANCE/<partition>/species.txt contains the list of backbone, query, and outgroup sequences in the partition. The organization of a partition is almost identical to that of the "partition" backbone/0 described in the second item above, with a few differences. datasmall/output/uDANCE/<partition>/astral_constraint.nwk and datasmall/output/uDANCE/<partition>/raxml_constraint.nwk are the two kinds of constraint trees used in the ASTRAL stage. The former results in an ASTRAL tree (datasmall/output/uDANCE/<partition>/astral_output.incremental.nwk) that retains the backbone tree topology, and the latter allows topological changes among the backbone sequences (datasmall/output/uDANCE/<partition>/astral_output.updates.nwk).
- (continued) The spanning tree of partitions of the placement tree is available at datasmall/output/uDANCE/color_spanning_tree.nwk. This can be regarded as the hierarchy or relative positions of the partitions. datasmall/output/uDANCE/outgroup_map.json is a dictionary where, for each partition, we list the outgroup sequences (stored in keys children and up). These two files are used when stitching the ASTRAL output trees of the partitions.
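The directory layout above can be navigated programmatically. The sketch below is illustrative only: the function names and dictionary labels are hypothetical, the gene name passed in the usage example is a placeholder, and only the file and directory names are taken from the text.

```python
from pathlib import Path


def partition_dirs(workdir, p):
    """Partition directories, named 0 to p-1 under <workdir>/output/uDANCE."""
    base = Path(workdir) / "output" / "uDANCE"
    return [base / str(i) for i in range(p)]


def partition_files(workdir, partition):
    """Documented per-partition files, keyed by short (hypothetical) labels."""
    d = Path(workdir) / "output" / "uDANCE" / str(partition)
    return {
        "species": d / "species.txt",
        "astral_constraint": d / "astral_constraint.nwk",
        "raxml_constraint": d / "raxml_constraint.nwk",
        "incremental_tree": d / "astral_output.incremental.nwk",
        "updates_tree": d / "astral_output.updates.nwk",
    }


def backbone_start_trees(workdir, gene, k):
    """ML trees from starting trees 1 to k for one gene of the backbone."""
    d = Path(workdir) / "output" / "backbone" / "0" / gene
    return [d / str(i) / "shrunk.fasta.treefile" for i in range(1, k + 1)]


if __name__ == "__main__":
    # "gene1" and k=2 are placeholder values for illustration.
    for path in backbone_start_trees("datasmall", "gene1", 2):
        print(path)
    for label, path in partition_files("datasmall", 0).items():
        print(label, path)
```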
Parameter | Description |
---|---|
chartype | Amino-acid or nucleotide characters |
backbone | Backbone tree source. Options are (1) de-novo, (2) user tree, and (3) user list |
resources.large_memory | Large memory jobs memory limit (MB) |
resources.cores | Large memory jobs CPU cores limit |
trim_config.percent_nongap | Sites whose non-gap fraction is below this value are removed |
mainlines_config.n | Target number of backbone taxa in de novo inference |
mainlines_config.length | Concatenated alignment length |
backbone_filtering | Backbone filtering is recommended if backbone contains misplaced or noisy sequences |
apples_config.method | APPLES-2 placement mode |
apples_config.filter | APPLES-2 -f parameter (filter diameter) |
apples_config.base | APPLES-2 -b parameter (minimum observations) |
apples_config.overlap | APPLES-2 minimum alignment overlap fraction |
prep_config.edge_thr | Partition diameter limit |
prep_config.cluster_size | Approximate partition size. Options are (1) auto, (2) fast, (3) user defined integer. |
prep_config.sublength | Minimum partition alignment length. If not satisfied, the partition is discarded. |
prep_config.pruneafter | Maximum partition size |
prep_config.min_placements | Partitions with fewer placements than this number are skipped (the backbone is returned) to save running time |
infer_config.method | Gene tree inference method (RAxML-ng, IQTree-2, or RAxML-8) |
infer_config.numstart | The number of starting trees |
infer_config.num_threads | The number of threads used in gene tree inference |
refine_config.contract | Contract low support (IQTree-2 aBayes test) branches |
refine_config.occupancy | Gene occupancy threshold for sequence inclusion |
refine_config.outlier_sizelimit | Limit for the outlier size fraction. This parameter and outlier_difference control the 1D k-means-based (k=2) outlier gene detection |
refine_config.outlier_difference | Centroid difference must be larger than this value to designate the first cluster as the outlier set |
refine_config.infer_branchlen | Infer branch lengths in substitution units using ASTRAL |
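
The dotted parameter names in the table correspond to nested keys in the config (for example, mainlines_config.n is config["mainlines_config"]["n"]). The sketch below shows one way to look up a parameter by its dotted name once the config has been loaded into a dictionary. The helper get_param is hypothetical, and the example dictionary holds only a few fields with illustrative values, not the defaults shipped with uDANCE.

```python
def get_param(config, dotted_name):
    """Look up a parameter such as 'apples_config.filter' in a nested dict."""
    node = config
    for key in dotted_name.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(dotted_name)
        node = node[key]
    return node


if __name__ == "__main__":
    # Illustrative values only; consult config.yaml for the real settings.
    config = {
        "workdir": "datasmall",
        "backbone": "de-novo",
        "mainlines_config": {"n": 100},
        "prep_config": {"min_placements": 9999999},
    }
    print(get_param(config, "backbone"))
    print(get_param(config, "mainlines_config.n"))
    print(get_param(config, "prep_config.min_placements"))
```

Raising KeyError with the full dotted name, rather than the failing component, keeps error messages aligned with the parameter names used in the table.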
Metin Balaban
Yueyu Jiang
1.6.4
- Do not transfer branch supports after ASTRAL branch length estimation
1.6.3
- Cluster size can be automatically set
- Bug fix
1.6.2
- Backbone filtering can be turned on or off
1.6.1
- Benchmarks directives are used in uDANCE.smk
1.6.0
- Skip a partition (return backbone) if the number of placements is less than a desired number
- Bug fixes
1.5.1:
- Output tree may have branch lengths in substitution units if desired.
1.5.0:
- uDANCE uses ASTRAL version 5.17.2, which supports both multithreading and constrained search.
1.4.1:
- Performance improvements in placement_prep
- Keep placement_prep intermediate files.
1.4.0:
- Stitching algorithm outputs a new tree named maxqs which picks the best one of the two ASTRAL trees for each partition.
1.3.3:
- Default decompose edge threshold changed
- Bug fixes
- Large cluster pruning strategy is changed to serial search (instead of binary)
1.3.2:
- Expose backbone selection strategy to the user
- Expose gene tree filtering parameters at refine stage to the user
1.3.1:
- Getting rid of pruning thresholds (automated finding)
1.3.0:
- Pruning large partitions
- Changes in TreeCluster logic disallowing formation of very small partitions
1.2.1:
- APPLES2 excludes sequences that are placed on internal nodes.
- Filtered backbone sequences are no longer added to the query set.
- Set min_placements: 9999999 to only filter out low quality backbone sequences and return.
- Set config["mainlines_config"]["n"] to number of species in the dataset and config["backbone"] to "de-novo". Then run snakemake with target {outdir}/backbone.nwk. This will do species tree inference without divide and conquer.