Decomposition Into Single-COpy gene trees (DISCO) is a method for decomposing multi-copy gene-family trees into single-copy gene trees while attempting to preserve orthologs and discard paralogs. The resulting single-copy gene trees can then be passed to methods that estimate species trees from single-copy gene trees, such as ASTRAL or ASTRID, to obtain an accurate estimate of the species tree.
Given a list of multi-copy gene trees, DISCO does the following for each tree:
- Root the tree and tag each internal vertex as either a duplication event or a speciation event so as to minimize the total number of duplications and losses. We do this with the ASTRAL-Pro rooting and tagging algorithm (Zhang et al. 2020).
- Decompose the gene tree by splitting off the smallest subtree under every vertex tagged as a duplication, working from the bottom up until all duplication events are resolved, and return the set of single-copy trees produced.
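The decomposition step can be illustrated with a minimal sketch. This is not the code in disco.py: the Node class, the leaves helper, and the decompose function are hypothetical names, the tree is assumed to be binary and already rooted and tagged, and ties between equally sized subtrees are broken arbitrarily.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical gene-tree vertex; leaves carry a species label."""
    label: Optional[str] = None      # species name (leaves only)
    is_dup: bool = False             # True if tagged as a duplication
    children: List["Node"] = field(default_factory=list)


def leaves(node):
    """Return the leaf labels under `node`."""
    if not node.children:
        return [node.label]
    return [lab for child in node.children for lab in leaves(child)]


def decompose(root):
    """Split off the smaller child under each duplication, bottom-up.

    Returns the pruned subtrees followed by the remaining tree.
    """
    pieces = []

    def visit(node):
        # Post-order: resolve duplications below this vertex first.
        node.children = [visit(child) for child in node.children]
        if node.is_dup and len(node.children) == 2:
            small, large = sorted(node.children, key=lambda c: len(leaves(c)))
            pieces.append(small)     # smaller clade becomes its own tree
            return large             # larger clade replaces the duplication vertex
        return node

    pieces.append(visit(root))
    return pieces


# Gene tree (((A,B),A),C) where the vertex joining (A,B) and A is a duplication.
dup = Node(is_dup=True, children=[Node(children=[Node("A"), Node("B")]), Node("A")])
root = Node(children=[dup, Node("C")])
for tree in decompose(root):
    print(sorted(leaves(tree)))
```

In this toy input the lone A copy is split off and the remainder keeps one copy of each species. The sketch does not apply any minimum-taxa filter to the pieces it returns.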
- Python 3
- TreeSwift
TreeSwift can be installed with: pip install treeswift
Input: File containing list of multi-copy trees in newick format
Output: File containing resulting list of single-copy trees after decomposition in newick format
python3 disco.py -i <input_file> -o <output_file> -d <delimiter>
- Required
  - -i: Input newick tree file
- Optional
  - -o: Output newick tree file
  - -d: Delimiter separating species name from rest of leaf label. Default None.
  - -s: Output only single tree (discarding smallest duplicate clades).
  - -m: Minimum number of taxa required for tree to be outputted. Default 4.
  - -n: No decomposition (outputs rooted gene trees).
  - -v: Enable verbose output
  - -rp: Remove in-paralogs before rooting/scoring (does not affect output, only reported score)
  - --outgroups: Write outgroups (including ties) to txt file. (Might make program slower).
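To make the role of the delimiter concrete, here is a small sketch of how a species name might be recovered from a leaf label. The function species_of is a hypothetical helper, not part of disco.py, and it assumes the species name is everything before the first occurrence of the delimiter; with no delimiter, the whole label is taken as the species name.

```python
def species_of(leaf_label, delimiter=None):
    """Return the species part of a leaf label.

    Assumption: the species name precedes the first delimiter.
    With delimiter=None the whole label is the species name.
    """
    if delimiter is None:
        return leaf_label
    return leaf_label.split(delimiter, 1)[0]


print(species_of("human_gene12", "_"))   # human
print(species_of("human"))               # human
```

Under this assumption, two leaves labelled human_gene12 and human_gene7 would count as two copies from the same species.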
python3 tag_decomp.py -i example/gtrees-mult.trees
Input: File containing list of multi-copy trees in newick format and set of alignment files in phylip format corresponding to the gene families.
Output: Concatenated alignment file in the phylip format
python3 ca_disco.py -i <input_trees> -a <alignments_list> -t <taxa_list> -o <output> -d <delimiter> -m <n>
disco.py must be present in the same directory as ca_disco.py in order for it to run. Also, unlike disco.py, ca_disco.py requires the input newick trees to have unique leaf labels in which the taxon name comes first and is separated from the rest of the label by a delimiter.
- Required
  - -i: Input newick tree file
  - -a: Text file containing paths to alignment files (one path per line, each path corresponding to the gene-family tree on the same line in the input tree file)
  - -t: Text file containing taxa list (one taxon per line)
  - -o: Output concatenated alignment file
- Optional
  - -m: Minimum number of taxa required for tree to be outputted. Default 4.
  - -d: Delimiter separating species name from rest of leaf label. Default _.
python3 ca_disco.py -i example/g_100.trees -o example.phy -a example/seq_list.txt -t example/taxa_list.txt
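As a rough picture of what a concatenated alignment looks like, the sketch below joins per-gene alignments into one matrix and pads taxa that are absent from a gene with gap characters. It is a generic illustration rather than the logic of ca_disco.py: the functions concatenate and to_phylip are hypothetical, the gap-padding behaviour is an assumption, and the output is written in a relaxed sequential phylip layout.

```python
def concatenate(alignments, taxa):
    """Concatenate per-gene alignments into one taxon -> sequence mapping.

    alignments: list of dicts mapping taxon name -> aligned sequence
    taxa: list of all taxon names
    Taxa missing from a gene are padded with '-' (an assumption).
    """
    matrix = {taxon: [] for taxon in taxa}
    for aln in alignments:
        if not aln:
            continue                              # skip empty alignments
        length = len(next(iter(aln.values())))    # all rows share one length
        for taxon in taxa:
            matrix[taxon].append(aln.get(taxon, "-" * length))
    return {taxon: "".join(parts) for taxon, parts in matrix.items()}


def to_phylip(matrix):
    """Format a taxon -> sequence mapping as relaxed sequential phylip text."""
    n_sites = len(next(iter(matrix.values())))
    lines = [f"{len(matrix)} {n_sites}"]
    lines += [f"{taxon} {seq}" for taxon, seq in matrix.items()]
    return "\n".join(lines)


genes = [{"A": "ACGT", "B": "ACGA"}, {"A": "TT", "C": "TA"}]
print(to_phylip(concatenate(genes, ["A", "B", "C"])))
```

Each gene contributes one block of columns, so the total number of sites is the sum of the individual alignment lengths.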