GToTree is a user-friendly workflow for phylogenomics intended to give more researchers the capability to create phylogenomic trees. The open-access Bioinformatics Journal publication is available here, and documentation and examples can be found at the wiki here.
See the conda quickstart installation page to have things up and running in just a couple steps!
GToTree is a more structured implementation of a workflow I would put together every time I wanted to make a large-scale phylogenomic tree. What do I mean by large-scale? Anything from a full-blown Tree of Life with all 3 domains, down to, for example, all available genomes of Staphylococcus alongside new isolate genomes. At its heart it just takes in genomes and outputs an alignment and phylogenomic tree based on the specified HMM profiles. But I think its value comes from three main things: 1) its flexibility with regard to input format - it takes fasta files, GenBank files, and/or NCBI accessions (so if you just recovered a bunch of new genomes and want to see where they fit in with references, you can provide the references by accession and your new genomes as fasta files); 2) its automation of required between-tool tasks such as filtering hits by gene length, filtering out genomes with too few hits to the target genes, and swapping genome labels for something more useful; and 3) its scalability - GToTree can turn ~1,700 input genomes into a tree in ~60 minutes on a standard laptop.
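To make the two filtering tasks mentioned above a bit more concrete, here is a minimal Python sketch of the general idea. It is not GToTree's actual code: the function names, the data layout, and the threshold values (`tolerance=0.2`, `min_fraction=0.5`) are all assumptions chosen for illustration, and the hit lengths are made-up toy numbers.

```python
from statistics import median


def filter_hits_by_length(hit_lengths, tolerance=0.2):
    """Keep hits whose length is within `tolerance` of the median for one gene.

    hit_lengths: dict of genome ID -> length of that genome's hit (hypothetical layout).
    tolerance: illustrative cutoff, not necessarily what GToTree uses.
    """
    if not hit_lengths:
        return {}
    med = median(hit_lengths.values())
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    return {genome: n for genome, n in hit_lengths.items() if low <= n <= high}


def filter_genomes_by_hits(hits_per_gene, genomes, min_fraction=0.5):
    """Keep genomes with a retained hit for at least `min_fraction` of target genes.

    hits_per_gene: dict of gene name -> dict of retained hits for that gene.
    min_fraction: illustrative cutoff, not necessarily what GToTree uses.
    """
    n_genes = len(hits_per_gene)
    kept = []
    for genome in genomes:
        n_hits = sum(1 for hits in hits_per_gene.values() if genome in hits)
        if n_genes and n_hits / n_genes >= min_fraction:
            kept.append(genome)
    return kept


# Toy input: hit lengths per target gene, per genome (made-up values).
raw_hits = {
    "geneA": {"genome1": 300, "genome2": 310, "genome3": 120},
    "geneB": {"genome1": 450, "genome2": 455},
    "geneC": {"genome1": 200, "genome2": 198, "genome3": 205},
    "geneD": {"genome1": 500, "genome2": 510},
}

# Step 1: drop hits with unusual lengths, gene by gene.
filtered = {gene: filter_hits_by_length(hits) for gene, hits in raw_hits.items()}

# Step 2: drop genomes left with too few target-gene hits.
kept = filter_genomes_by_hits(filtered, ["genome1", "genome2", "genome3"])
print(kept)
```

In this toy case, genome3's geneA hit is far shorter than the median and is dropped, which leaves genome3 with a hit to only one of four target genes, so it is filtered out in the second step.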
Also included are several newly generated single-copy gene-sets for 13 different taxonomic groupings. These are presented in the wiki, along with an explanation and the example code/steps used to generate them.
GToTree uses helper scripts written in Python, but is primarily implemented in Bash. Every attempt is being made to make it portable across all variations of GNU/Unix, including on Macs, so if you run into any issues, it'd be appreciated if you could report them so the problems can be found and fixed!
See the "What is GToTree?" wiki page for some more detail on the processing steps pictured above. For practical ways GToTree can be helpful, check out the Example usage page. And for detailed information on using GToTree, see the User guide.
NOTE: The conda installation takes care of all of the dependencies listed below!
If you use GToTree, please cite these folks :)
- Biopython - citation
- HMMER3 v3.2.1 - citation: they note in the user manual to cite the website, but there is also this paper
- Muscle v3.8 - citation
- Trimal v1.4 - citation
- FastTree v2.1 - citation
If you use GToTree in a manner that uses these tools, please cite these folks :)
- Prodigal v2.6.3 - citation
  - if providing input genomes in fasta format, or GenBank format with no CDS annotations, or NCBI accessions to genomes with no gene calls
- TaxonKit v0.3 - citation
  - if changing genome labels based on lineage information for input genomes with associated NCBI taxids
- GNU Parallel v20161122 - citation info
  - if running in parallel