TransAnnot: fast and all-in-one transcriptome annotation
TransAnnot is a toolkit designed to predict protein functions, identify orthologous relationships, and decipher biological pathways for newly sequenced transcriptomes. Utilizing MMseqs2's fast sequence-sequence and sequence-profile search, it identifies the closest homologs from reference databases to infer essential details such as protein function, structure, and orthologous groups.
Optionally, TransAnnot can use Plass for transcriptome assembly, enabling de novo assembly of raw sequence reads at the protein level.
TransAnnot is a free and open source (GPLv3), modular toolkit developed in C++.
Compile from source
Compiling TransAnnot from source allows for system-specific optimization. For the compilation cmake
, g++
and git
are required. After the compilation, TransAnnot will be located in build/bin
directory.
git clone https://github.com/soedinglab/transannot.git
cd transannot && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make -j 4
make install
export PATH=$(pwd)/transannot/bin/:$PATH
❗️ If you compile from source under macOS we recommend to install and use gcc
instead of clang
as a compiler. gcc
can be installed with Homebrew. Force cmake
to use gcc
as a compiler by running:
CC="$(brew --prefix)/bin/gcc-13"
CCX="$(brew --prefix)/bin/g++-13"
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
Other dependencies for the compilation from source are zlib
and bzip
.
Workflow dependencies
- Plass - should be installed separately, see corresponding repository. To perform de novo assembly, it is required to install Plass to the current working directory.
Before starting
tmp folder
tmp
folder keeps temporary files. By default, all the intermediate output files from different modules will be kept in this folder. To clear tmp
pass --remove-tmp-files
parameter.
Quick start
The quickest way to run TransAnnot is using the easy
workflow:
transannot easytransannot <inputReads.fastq> Pfam-A.full eggNOG UniProtKB/Swiss-Prot <resDB> <tmp> [options]
If (one of the) target databases are already downloaded in MMseqs2 format, directly provide the path to them, otherwise simply specify their names, and the databases will be downloaded automatically.
Input
Possible inputs are:
- assembled transcriptomes (aobtained e.g. using Trinity) or raw transcriptome reads, which will be de novo assembled on the protein level using
plass
- metatranscriptomes
- single-organism transcriptomes
Running
Modules
assemblereads
de novo assembles raw sequencing reads to large genomic fragments (contigs).annotate
clusters given input for the reduction of redundancy and runs sequnce-profile and sequence-sequence searches to obtain the closest homologs with annotated function. It also retrieves descriptions of orthologous groups and protein families throgh mapping.
createquerydb
creates a database from the sequence space (obtained fromdownloaddb
module) in a memory-efficient MMSeqs2 format.downloaddb
downloads databases that serve as a search space for homology detectioneasytransannot
easy module for a quick start, performs assembly, downloads DB and executes annotation
(Plass) Assembly
Plass is the default assembler, which is used in the easytransannot
workflow as well. However, we recommend assembly with Trinity since Trinity provides more reliable assemblies compared to Plass. If assembly was performed using Trinity, proceed with createquerydb
and further annotation.
Before running this step Plass must be installed, detailed information about installation can be found here. Please make sure PLASS is located in the current working directory.
In this step, reads will be assembled with Plass and afterwards a MMseqs2 database will be created, you may skip this step if the transcriptome is already assembled. Usage:
transannot assemblereads <inputReads.fastq[.gz|bz]> ... <inputReads.fastq[.gz|bz]> <o: fastaFile with assembly> <o: seqDB> <tmp> [options]
Dowloading databases
In this step, sequence databases for homology searches will be downloaded.
To see detailed information about databases, please use the following command:
mmseqs databases -h
and execute the below command to download the databases (Ensure the same keyword as given in mmseqs database -h
):
transannot downloaddb <selection> <outDB> <tmp> [options]
Hence, transannot
runs 3 searches in annotate
module, this step should be repeated 3 times. For the annotation module Pfam-A.full
, eggNOG
(profile database) and UniProtKB/SwissProt
(sequence database) are standard, so please download them using this module, for more information also check MMseqs2 user guide.
Annotate workflow
In the annotate
module representative sequences will be extracted and used as search input to remove redundancy. 3 searches (one sequence-sequence and two seqeuce-profile) will be performed.
To run annotate module of transannot execute the following command:
transannot annotate <assembledQueryDB> <path to Pfam profileTargetDB> <path to eggNOG profileTargetDB> <path to SwissProt sequenceTargetDB> <o:resTsvFile> <tmp> [options]
Important options of the annotate module
--simple-output
parameter allows user to obtain simplified output, which only includes query and target IDs, header of the target database and E-value. Whereas standard output also contains sequence identity and bit score for each target sequence. Usage:
transannot annotate $1 $2 $3 $4 $5 $6 --simple-output
When no tag is used, standard output will be provided.
--min-seq-id
is a parameter to adjust minimum sequence identity for the searches. Default value is set to 0.3.
--no-run-clust
performs annotation without clustering. All the input sequences will undergo similarity searches.
Output
Outut is a tab-separated .tsv
file containing following columns:
queryID targetID description E-value sequenceIdentity bitScore typeOfSearch nameOfDatabase