Taxolotl v0.1.3
For a better rendering and navigation of this document, please download and open ./docs/taxolotl.docs.html
, or visit https://slimsuite.github.io/taxolotl/.
Documentation can also be generated by running Taxolotl with the dochtml=T
option. (R and pandoc must be installed - see below.)
Taxolotl combines the MMseqs2 easy-taxonomy
with GFF parsing to perform taxonomic analysis of a genome assembly
(and any subsets given by taxsubsets=LIST
) using an annotated proteome. Taxonomic assignments are mapped onto
genes as well as assembly scaffolds and (if assembly=FILE
is given) contigs.
See https://slimsuite.github.io/taxolotl/ and the documentation below for details. General SLiMSuite run documentation can be found at https://github.com/slimsuite/SLiMSuite.
Taxolotl is available as part of SLiMSuite, or via a standalone GitHub repo at https://github.com/slimsuite/taxolotl.
Taxolotl is currently unpublished. Please cite the GitHub page and this bioRxiv paper, which has an example of Taxolotl in action:
- Tobias PA, Edwards RJ, Surana P, Mangelson H, Inácio V, do Céu Silva M, Várzea V, Park RF & Batista D. "A chromosome-level genome resource for studying virulence mechanisms and evolution of the coffee rust pathogen Hemileia vastatrix. bioRxiv 2022.07.29.502101 doi: 10.1101/2022.07.29.502101
Taxolotl is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/taxolotl.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone Taxolotl git repo, $CODEPATH
will be the path the to code/
directory. Please see details in the Taxolotl git repo
for running on example data.
MMseqs2 must be installed and either added to the environment $PATH
.
A list of commandline options can be generated at run-time using the -h
or help
flags. Please see the general
SLiMSuite documentation for details of how to
use commandline options, including setting default values with INI files.
### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE : Protein annotation file to assess [annotation.faa]
gffin=FILE : Protein annotation GFF file [annotation.gff]
cdsin=FILE : Optional transcript annotation file [annotation.fna]
assembly=FILE : Optional genome fasta file (required for some outputs) [None]
basefile=X : Prefix for output files [taxolotl]
gffgene=X : Label for GFF gene feature type ['gene']
gffcds=X : Label for GFF CDS feature type ['CDS']
gffmrna=X : Label for GFF mRNA feature type ['mRNA']
gffdesc=X : GFF output field label for annotated proteins (e.g. note, product) [product]
taxlevels=LIST : List of taxonomic levels to report (* for superkingdom and below) ['*']
### ~ Run mode options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
dochtml=T/F : Generate HTML Taxolotl documentation (*.docs.html) instead of main run [False]
### ~ Taxonomy options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
taxdb=FILE : MMseqs2 taxonomy database for taxonomy assignment [None]
taxbase=X : Output prefix for taxonomy output [$SEQBASE.$TAXADB]
taxorfs=T/F : Whether to generate ORFs from assembly if no seqin=FILE given [True]
taxbyseq=T/F : Whether to parse and generate taxonomy output for each assembly (GFF) sequence [True]
taxbyseqfull=T/F: Whether generate full easy taxonomy report outputs for each assembly (GFF) sequence [False]
taxsubsets=FILELIST : Files (fasta/id) with sets of assembly input sequences (matching GFF) to summarise []
taxwarnrank=X : Taxonomic rank (and above) to warn when deviating for consensus [family]
bestlineage=T/F : Whether to enforce a single lineage for best taxa ratings [True]
mintaxnum=INT : Minimum gene count in main dataset to keep taxon, else merge with higher level [2]
### ~ TabReport options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
tabreport=FILE : Convert MMseqs2 report into taxonomy table with counts (if True use taxbase=X) [None]
taxhigh=X : Highest taxonomic level for tabreport [class]
taxlow=X : Lowest taxonomic level for tabreport [species]
taxpart=T/F : Whether to output entries with partial taxonomic levels to tabreport [False]
### ~ System options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
tmpdir=PATH : Temporary directory path for running mmseqs2 [./tmp/]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
The first step is to run MMseqs2:
mmseqs easy-taxonomy $PROTEOME $TAXDB $TAXBASE $TMPDIR
Where $PROTEOME
is the proteome provided with seqin=FILE
, $TAXDB
is a MMseqs2 taxonomic database
(see below for creation), provided with taxdb=FILE
, $TAXBASE
is the easy-taxonomy
output prefix, and
$TMPDIR
is the temporary directory (default tmp
). If pre-existing results exist ($TAXBASE._report
and
$TAXBASE_lca.tsv
) then these will be loaded, unless force=T
is set. If MMseqs2 is not installed, pre-computed
results must be provided. In principle, report
and lca.tsv
files generate by other tools should work as long
as the format is the same.
The core of Taxolotl is the MMSeqs2 "Lowest Common Ancestor" (LCA) assignment, in which each sequence is associated
with the lowest unabmigious taxonomic rank possible. Where amibiguity exists, a sequence will be assigned to a
higher level. Higher levels also receive all the taxonomic assignments of their daughter taxa, and so the sequence
count for any given taxonomic group will always be equal or greater than its lower subdivisions. Conceptually,
Taxolotl separates out the counts into taxnum
, which are counts at that level or below, and taxpure
, which are
the numbers assigned specifically to that level. (i.e. taxnum
will be the sum of taxpure
for that taxonomic
group and all lower divisions.) See the MMseqs2 documentation for more details.
Taxolotl will first read in the *_report
file to build its internal taxonomy tree for the samples. By default, mmseqs will report all possible taxonomic levels, and Taxolotl will retain the following:
species, species subgroup, species group, subgenus, genus, subtribe, tribe, subfamily, family, superfamily, parvorder, infraorder, suborder, order, superorder, infraclass, subclass, class, superclass, subphylum, phylum, superphylum, subkingdom, kingdom, superkingdom
This can be reduced further by specifying a subset of taxonomic levels of interest with taxlevels=LIST
. Any missing levels, along with
"no rank" or "clade" taxa (except unclassified
, root
, and cellular organisms
), will be mapped to the next highest taxonomic level. Any MMseqs2 assignments to that level will be transferred to the higher level. Any taxa failing to meet the mintaxnum=INT
threshold (default=2) will also be mapped onto higher levels.
Next, the *_lca.tsv
file is read and mapped onto the gffin=FILE
GFF file to assign proteins to genes and
sequences. The lowest-level hit for each gene will be kept, remapping to taxlevels
as required. These
collated ratings will be output to *.lca_genes.tsv
and *.lca_genes.gff
Gene ratings are then summed for each assembly sequence, and the dominant
classification for each taxonomic level established for (a) each sequence, and (b) the whole dataset. Full
collated ratings will be output to *.taxolotl_report.tsv
. Ratings per sequence are output to *.taxbyseq.tsv
. Dominant taxa are reported in the log file as #BEST
entries.
To flag contamination, each sequence is assessed against the dominant taxonomic rating at each taxonomic level.
The percentage of genes matching each dominant rating is reported for each sequence in *.taxolotl.tsv
along with the number of genes with a rating at that level, separated with a |
. This will exclude any genes
without ratings at that taxonomic level. A :consensus:
entry will also report the overall values for the whole
assembly.
Any sequences that have a dominant taxonomic label deviating from the overall consensus at any ranking levels
set by taxwarnrank=X
(default family) or above will raise a contamination warning and be output in the log file with a #BADTAX
rating. These sequences will have their dominant taxon and it's
precentage appended to the consensus percentage, also separated by |
. For example, 25.00|20|Chordata|50.00
would indicate that 25% of the 20 genes with ratings at that level matched the consensus, whilst the dominant
classification was Chordata
with 50% of 20 rated genes assigned to this category. Such sequences will also have badtax
rating in the rating
field of *.taxolotl.tsv
. Sequences matching the dominant taxa will have a goodtax
rating, whilst sequences without any genes mapped onto taxa by MMseqs2 will be rated notax
.
Good, Bad and missing sequence counts will be summarised in the log file in #BEST
, BADTAX
, and #NOTAX
entries.
Sequence subsets are output to *.id
and *.fasta
files, and summarised along with the full assembly in
*.seqsummary.tsv
. (Any ratings without sequences will not be output/summarised.) If assembly=FILE
is provided,
sequences without genes will also be summarised. Taxonomy ratings for these subsets are also output to
*.$RATING.taxolotl_report.tsv
files. Any sequence subsets provided by taxsubsets=LIST
(see below) will also be
summarised in *.$SUBSET.taxolotl_report.tsv
files. It is recommended that all the MMseqs2 _report
file is loaded
with all the *.taxolotl_report.tsv
for visualisation with Pavian
(Breitwieser FP and Salzberg SL (2020) Bioinformatics 36(4):1303-1304)
through its Shiny App.
Finally, if assembly=FILE
is provided (unless taxbycontig=F
), contigs will be extracted by splitting scaffolds on mingap=INT
(default 10) consecutive N
s. Genes will be remapped onto contigs as with sequences, and taxonomic ratings output to *.taxbyctg.tsv
and *.ctgtaxolotl.tsv
. These are the contig equivalents of *.taxbyseq.tsv
and *.taxolotl.tsv
. Contigs without taxonomic ratings will be listed in the log file with #BADTAX
entries, unless already reported as an assembly sequence.
Outputs will be given a file prefix set by taxbase=X
. By default, this will be $SEQBASE.$TAXADB
, where
$SEQBASE
is the basename of seqin=FILE
and $TAXADB
is the taxonomy database set by taxdb=FILE
.
The main mmseqs easy-taxonomy
output will generate:
*_lca.tsv
= best assignments per protein sequence (protein, taxid, rank, taxname): required.*_report
= text summary of overall taxonomy that can be loaded by Pavian etc.: required.*_tophit_aln
= top database hits for each protein (not currently used): not required.*_tophit_report
= taxonomic classification of the top hit proteins: not required.
In addition, Taxolotl will output:
*.taxbyseq.tsv
= Rating counts for each taxonomic group by assembly sequence (scaffold).*.taxolotl_report.tsv
= Collated Kraken-style report file.*.lca_genes.tsv
= Best assignments (lowest taxonomic level) for each gene.*.lca_genes.gff
= GFF file with Taxolotl ratings for each gene.*.taxolotl.tsv
= Tab separated file with consensus taxonomic assignment at each taxonomic rank, and ratings per sequence.*.$SUBSET.id
= Sequence identifiers for assembly subsets based on Taxolotl ratings.*.$SUBSET.fasta
= Fasta files of assembly subsets based on Taxolotl ratings.*.seqsummary.tsv
= Summary statistics for assembly subset fasta files.*.taxbyctg.tsv
= Rating counts for each taxonomic group by assembly contig.*.ctgtaxolotl.tsv
= Taxolotl ratings by assembly contig.
If taxbyseq=T
then an additional *.taxbyseq.tsv
file will be produced, with the following fields:
seqname
= assembly sequence namegenenum
= number of genes parsed for that sequenceprotnum
= number of proteins parsed for that sequencerank
= taxonomic rank of ratinggenetax
= number of genes with assignment at that leveltaxid
= taxonomic label identifier numbertaxname
= taxonomic label name at that ranktaxperc
= percentage assignment to this rank or lowertaxnum
= number of genes assigned to this rank or lowertaxpure
= number of genes assigned to this rank specifically
In addition to the main output for the whole proteome, any subsets given by taxsubsets=LIST
will have their own *.taxolotl_report.tsv
file, which can be visualised with Pavian. These must be lists of IDs that match the assembly sequence names in the GFF file. Subsets will be named after the subset file prefix, e.g. assembly.suspect.id
would generate *.assembly.suspect.taxolotl_report.tsv
.
Please see the MMseqs2 documentation for generating a taxonomic database. To date, Taxolotl has been tested with taxonomy databases generated from NCBI nr, using BLAST+ and MMSeqs2 and the NCBI taxonomy dump (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz):
blastdbcmd -db $NCBIPATH/nr -entry all > ncbinr.faa
blastdbcmd -db $NCBIPATH/nr -entry all -outfmt "%a %T" > ncbinr.faa.taxidmapping
mmseqs createdb ncbinr.faa ncbinr.faaDB
mmseqs createtaxdb ncbinr.faaDB tmp --ncbi-tax-dump taxonomy/ --tax-mapping-file ncbinr.faa.taxidmapping
mmseqs createindex ncbinr.faaDB tmp
If the assembly itself is already in RefSeq, it is recommended that the taxa of the assembly is removed before running Taxolotl.
If no proteins are given, ORFs will be generated by SeqSuite
with default settings minorf=100 rftran=6 terminorf=50 orfgaps=F
, i.e. ORFs of 100+ amino acids from all six reading frames, or 50+ amino acids if truncated at the end of a sequence. ORFs will not span assembly gaps, and any ambiguous (X
) translations will be replaced with stop codons (*
), unless orfgaps=T
is set. Note that, due to introns, it is expected that these ORFs will often represent partial coding sequences, and many will be random junk translations.
The idea of ORF mode is to provide a quick, crude impression of the taxonomic profile. However, for large assemblies it can be very slow to process.
In ORF mode, each ORF is assumed to represent a different gene, although this may not be the case. Currently, SeqSuite
will not generate a GFF file for the ORFs. As a result, the taxbycontig
output is not available.
© 2021 Richard Edwards | richard.edwards@unsw.edu.au