Taxolotl: Taxolotl genome assembly taxonomy summary and assessment tool

Taxolotl v0.1.3

For a better rendering and navigation of this document, please download and open ./docs/taxolotl.docs.html, or visit https://slimsuite.github.io/taxolotl/. Documentation can also be generated by running Taxolotl with the dochtml=T option. (R and pandoc must be installed - see below.)

Introduction

Taxolotl combines the MMseqs2 easy-taxonomy with GFF parsing to perform taxonomic analysis of a genome assembly (and any subsets given by taxsubsets=LIST) using an annotated proteome. Taxonomic assignments are mapped onto genes as well as assembly scaffolds and (if assembly=FILE is given) contigs.

See https://slimsuite.github.io/taxolotl/ and the documentation below for details. General SLiMSuite run documentation can be found at https://github.com/slimsuite/SLiMSuite.

Taxolotl is available as part of SLiMSuite, or via a standalone GitHub repo at https://github.com/slimsuite/taxolotl.

Citing Taxolotl

Taxolotl is currently unpublished. Please cite the GitHub page and this bioRxiv paper, which has an example of Taxolotl in action:

Tobias PA, Edwards RJ, Surana P, Mangelson H, Inácio V, do Céu Silva M, Várzea V, Park RF & Batista D. "A chromosome-level genome resource for studying virulence mechanisms and evolution of the coffee rust pathogen Hemileia vastatrix. bioRxiv 2022.07.29.502101 doi: 10.1101/2022.07.29.502101

Running Taxolotl

Taxolotl is written in Python 2.x and can be run directly from the commandline:

python $CODEPATH/taxolotl.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone Taxolotl git repo, $CODEPATH will be the path the to code/ directory. Please see details in the Taxolotl git repo for running on example data.

MMseqs2 must be installed and either added to the environment $PATH.

Commandline options

A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.

### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE      : Protein annotation file to assess [annotation.faa]
gffin=FILE      : Protein annotation GFF file [annotation.gff]
cdsin=FILE      : Optional transcript annotation file [annotation.fna]
assembly=FILE   : Optional genome fasta file (required for some outputs) [None]
basefile=X      : Prefix for output files [taxolotl]
gffgene=X       : Label for GFF gene feature type ['gene']
gffcds=X        : Label for GFF CDS feature type ['CDS']
gffmrna=X       : Label for GFF mRNA feature type ['mRNA']
gffdesc=X       : GFF output field label for annotated proteins (e.g. note, product) [product]
taxlevels=LIST  : List of taxonomic levels to report (* for superkingdom and below) ['*']
### ~ Run mode options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
dochtml=T/F     : Generate HTML Taxolotl documentation (*.docs.html) instead of main run [False]
### ~ Taxonomy options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
taxdb=FILE      : MMseqs2 taxonomy database for taxonomy assignment [None]
taxbase=X       : Output prefix for taxonomy output [$SEQBASE.$TAXADB]
taxorfs=T/F     : Whether to generate ORFs from assembly if no seqin=FILE given [True]
taxbyseq=T/F    : Whether to parse and generate taxonomy output for each assembly (GFF) sequence [True]
taxbyseqfull=T/F: Whether generate full easy taxonomy report outputs for each assembly (GFF) sequence [False]
taxsubsets=FILELIST : Files (fasta/id) with sets of assembly input sequences (matching GFF) to summarise []
taxwarnrank=X   : Taxonomic rank (and above) to warn when deviating for consensus [family]
bestlineage=T/F : Whether to enforce a single lineage for best taxa ratings [True]
mintaxnum=INT   : Minimum gene count in main dataset to keep taxon, else merge with higher level [2]
### ~ TabReport options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
tabreport=FILE  : Convert MMseqs2 report into taxonomy table with counts (if True use taxbase=X) [None]
taxhigh=X       : Highest taxonomic level for tabreport [class]
taxlow=X        : Lowest taxonomic level for tabreport [species]
taxpart=T/F     : Whether to output entries with partial taxonomic levels to tabreport [False]
### ~ System options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X         : Number of parallel sequences to process at once [0]
killforks=X     : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X     : Sleep time (seconds) between cycles of forking out more process [0]
tmpdir=PATH     : Temporary directory path for running mmseqs2 [./tmp/]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

Taxolotl overview

The first step is to run MMseqs2:

mmseqs easy-taxonomy $PROTEOME $TAXDB $TAXBASE $TMPDIR

Where $PROTEOME is the proteome provided with seqin=FILE, $TAXDB is a MMseqs2 taxonomic database (see below for creation), provided with taxdb=FILE, $TAXBASE is the easy-taxonomy output prefix, and $TMPDIR is the temporary directory (default tmp). If pre-existing results exist ($TAXBASE._report and $TAXBASE_lca.tsv) then these will be loaded, unless force=T is set. If MMseqs2 is not installed, pre-computed results must be provided. In principle, report and lca.tsv files generate by other tools should work as long as the format is the same.

The core of Taxolotl is the MMSeqs2 "Lowest Common Ancestor" (LCA) assignment, in which each sequence is associated with the lowest unabmigious taxonomic rank possible. Where amibiguity exists, a sequence will be assigned to a higher level. Higher levels also receive all the taxonomic assignments of their daughter taxa, and so the sequence count for any given taxonomic group will always be equal or greater than its lower subdivisions. Conceptually, Taxolotl separates out the counts into taxnum, which are counts at that level or below, and taxpure, which are the numbers assigned specifically to that level. (i.e. taxnum will be the sum of taxpure for that taxonomic group and all lower divisions.) See the MMseqs2 documentation for more details.

Taxolotl will first read in the *_report file to build its internal taxonomy tree for the samples. By default, mmseqs will report all possible taxonomic levels, and Taxolotl will retain the following:

species, species subgroup, species group, subgenus, genus, subtribe, tribe, subfamily, family, superfamily, parvorder, infraorder, suborder, order, superorder, infraclass, subclass, class, superclass, subphylum, phylum, superphylum, subkingdom, kingdom, superkingdom

This can be reduced further by specifying a subset of taxonomic levels of interest with taxlevels=LIST. Any missing levels, along with "no rank" or "clade" taxa (except unclassified, root, and cellular organisms), will be mapped to the next highest taxonomic level. Any MMseqs2 assignments to that level will be transferred to the higher level. Any taxa failing to meet the mintaxnum=INT threshold (default=2) will also be mapped onto higher levels.

Next, the *_lca.tsv file is read and mapped onto the gffin=FILE GFF file to assign proteins to genes and sequences. The lowest-level hit for each gene will be kept, remapping to taxlevels as required. These collated ratings will be output to *.lca_genes.tsv and *.lca_genes.gff Gene ratings are then summed for each assembly sequence, and the dominant classification for each taxonomic level established for (a) each sequence, and (b) the whole dataset. Full collated ratings will be output to *.taxolotl_report.tsv. Ratings per sequence are output to *.taxbyseq.tsv. Dominant taxa are reported in the log file as #BEST entries.

To flag contamination, each sequence is assessed against the dominant taxonomic rating at each taxonomic level. The percentage of genes matching each dominant rating is reported for each sequence in *.taxolotl.tsv along with the number of genes with a rating at that level, separated with a |. This will exclude any genes without ratings at that taxonomic level. A :consensus: entry will also report the overall values for the whole assembly.

Any sequences that have a dominant taxonomic label deviating from the overall consensus at any ranking levels set by taxwarnrank=X (default family) or above will raise a contamination warning and be output in the log file with a #BADTAX rating. These sequences will have their dominant taxon and it's precentage appended to the consensus percentage, also separated by |. For example, 25.00|20|Chordata|50.00 would indicate that 25% of the 20 genes with ratings at that level matched the consensus, whilst the dominant classification was Chordata with 50% of 20 rated genes assigned to this category. Such sequences will also have badtax rating in the rating field of *.taxolotl.tsv. Sequences matching the dominant taxa will have a goodtax rating, whilst sequences without any genes mapped onto taxa by MMseqs2 will be rated notax.

Good, Bad and missing sequence counts will be summarised in the log file in #BEST, BADTAX, and #NOTAX entries. Sequence subsets are output to *.id and *.fasta files, and summarised along with the full assembly in *.seqsummary.tsv. (Any ratings without sequences will not be output/summarised.) If assembly=FILE is provided, sequences without genes will also be summarised. Taxonomy ratings for these subsets are also output to *.$RATING.taxolotl_report.tsv files. Any sequence subsets provided by taxsubsets=LIST (see below) will also be summarised in *.$SUBSET.taxolotl_report.tsv files. It is recommended that all the MMseqs2 _report file is loaded with all the *.taxolotl_report.tsv for visualisation with Pavian (Breitwieser FP and Salzberg SL (2020) Bioinformatics 36(4):1303-1304) through its Shiny App.

Finally, if assembly=FILE is provided (unless taxbycontig=F), contigs will be extracted by splitting scaffolds on mingap=INT (default 10) consecutive Ns. Genes will be remapped onto contigs as with sequences, and taxonomic ratings output to *.taxbyctg.tsv and *.ctgtaxolotl.tsv. These are the contig equivalents of *.taxbyseq.tsv and *.taxolotl.tsv. Contigs without taxonomic ratings will be listed in the log file with #BADTAX entries, unless already reported as an assembly sequence.

Main taxonomy outputs

Outputs will be given a file prefix set by taxbase=X. By default, this will be $SEQBASE.$TAXADB, where $SEQBASE is the basename of seqin=FILE and $TAXADB is the taxonomy database set by taxdb=FILE.

The main mmseqs easy-taxonomy output will generate:

*_lca.tsv = best assignments per protein sequence (protein, taxid, rank, taxname): required.
*_report = text summary of overall taxonomy that can be loaded by Pavian etc.: required.
*_tophit_aln = top database hits for each protein (not currently used): not required.
*_tophit_report = taxonomic classification of the top hit proteins: not required.

In addition, Taxolotl will output:

*.taxbyseq.tsv = Rating counts for each taxonomic group by assembly sequence (scaffold).
*.taxolotl_report.tsv = Collated Kraken-style report file.
*.lca_genes.tsv = Best assignments (lowest taxonomic level) for each gene.
*.lca_genes.gff = GFF file with Taxolotl ratings for each gene.
*.taxolotl.tsv = Tab separated file with consensus taxonomic assignment at each taxonomic rank, and ratings per sequence.
*.$SUBSET.id = Sequence identifiers for assembly subsets based on Taxolotl ratings.
*.$SUBSET.fasta = Fasta files of assembly subsets based on Taxolotl ratings.
*.seqsummary.tsv = Summary statistics for assembly subset fasta files.
*.taxbyctg.tsv = Rating counts for each taxonomic group by assembly contig.
*.ctgtaxolotl.tsv = Taxolotl ratings by assembly contig.

Taxonomy by sequence output

If taxbyseq=T then an additional *.taxbyseq.tsv file will be produced, with the following fields:

seqname = assembly sequence name
genenum = number of genes parsed for that sequence
protnum = number of proteins parsed for that sequence
rank = taxonomic rank of rating
genetax = number of genes with assignment at that level
taxid = taxonomic label identifier number
taxname = taxonomic label name at that rank
taxperc = percentage assignment to this rank or lower
taxnum = number of genes assigned to this rank or lower
taxpure = number of genes assigned to this rank specifically

Sequence subset analysis

In addition to the main output for the whole proteome, any subsets given by taxsubsets=LIST will have their own *.taxolotl_report.tsv file, which can be visualised with Pavian. These must be lists of IDs that match the assembly sequence names in the GFF file. Subsets will be named after the subset file prefix, e.g. assembly.suspect.id would generate *.assembly.suspect.taxolotl_report.tsv.

Generating a taxonomic database

Please see the MMseqs2 documentation for generating a taxonomic database. To date, Taxolotl has been tested with taxonomy databases generated from NCBI nr, using BLAST+ and MMSeqs2 and the NCBI taxonomy dump (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz):

blastdbcmd -db $NCBIPATH/nr -entry all > ncbinr.faa
blastdbcmd -db $NCBIPATH/nr -entry all -outfmt "%a %T" > ncbinr.faa.taxidmapping

mmseqs createdb ncbinr.faa ncbinr.faaDB
mmseqs createtaxdb ncbinr.faaDB tmp --ncbi-tax-dump taxonomy/ --tax-mapping-file ncbinr.faa.taxidmapping
mmseqs createindex ncbinr.faaDB tmp

If the assembly itself is already in RefSeq, it is recommended that the taxa of the assembly is removed before running Taxolotl.

Simple ORF mode

If no proteins are given, ORFs will be generated by SeqSuite with default settings minorf=100 rftran=6 terminorf=50 orfgaps=F, i.e. ORFs of 100+ amino acids from all six reading frames, or 50+ amino acids if truncated at the end of a sequence. ORFs will not span assembly gaps, and any ambiguous (X) translations will be replaced with stop codons (*), unless orfgaps=T is set. Note that, due to introns, it is expected that these ORFs will often represent partial coding sequences, and many will be random junk translations.

The idea of ORF mode is to provide a quick, crude impression of the taxonomic profile. However, for large assemblies it can be very slow to process.

In ORF mode, each ORF is assumed to represent a different gene, although this may not be the case. Currently, SeqSuite will not generate a GFF file for the ORFs. As a result, the taxbycontig output is not available.