HOTSPOT is a learning-based tool designed for plasmid host prediction. Its backbone is a phylogenetic tree of plasmids' hosts (bacteria) from phylum to species. The top-down tree search can accurately predict the hosts' taxonomic labels by incorporating the state-of-the-art language model, Transformer, in each node’s taxon classifier. To use HOTSPOT, you only need to input plasmid sequences (complete or segmented) into the program.
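The top-down search described above can be pictured with a small sketch. This is not HOTSPOT's implementation: `TOY_TREE` is a hypothetical three-level taxonomy, and `gc_stub` is a placeholder that stands in for the per-node Transformer classifier. The only behaviour taken from this README is that the search walks from the root towards the leaves and that a node with a single child needs no classifier call.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy (parent -> children); leaves have no entry.
TOY_TREE: Dict[str, List[str]] = {
    "root": ["Pseudomonadota", "Bacillota"],
    "Pseudomonadota": ["Gammaproteobacteria"],
    "Gammaproteobacteria": ["Enterobacterales", "Moraxellales"],
    "Bacillota": ["Clostridia"],
}

Classifier = Callable[[str, List[str], str], str]


def gc_stub(node: str, children: List[str], sequence: str) -> str:
    """Placeholder classifier (NOT a Transformer): choose a child by GC content."""
    gc = (sequence.count("G") + sequence.count("C")) / max(len(sequence), 1)
    return children[0] if gc >= 0.5 else children[-1]


def top_down_search(tree: Dict[str, List[str]], classify: Classifier,
                    sequence: str, root: str = "root") -> List[str]:
    """Walk from the root to a leaf, asking one classifier per internal node."""
    lineage: List[str] = []
    node = root
    while tree.get(node):
        children = tree[node]
        if len(children) == 1:
            # Only one possible child: take it without calling the classifier.
            node = children[0]
        else:
            node = classify(node, children, sequence)
        lineage.append(node)
    return lineage


print(top_down_search(TOY_TREE, gc_stub, "GGCC"))
print(top_down_search(TOY_TREE, gc_stub, "ATAT"))
```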
If you want to use the GPU to accelerate the program, you also need:
- CUDA
- PyTorch (GPU version)
- For the CPU version of PyTorch:
conda install pytorch torchvision torchaudio cpuonly -c pytorch
- For the GPU version of PyTorch: look up the PyTorch installation command that matches the CUDA version on your computer.
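To confirm which device PyTorch will see after installation, a short check like the one below may help. It is a sketch that is not part of HOTSPOT; the function name `detect_device` is hypothetical, and it relies only on `torch.cuda.is_available()`.

```python
def detect_device() -> str:
    """Return 'cuda' if PyTorch is installed and can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Detected device: {detect_device()}")
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build most likely does not match the local CUDA version.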
After cloning this repository (git clone https://github.com/Orin-beep/HOTSPOT), you can use Anaconda to create the environment from environment.yaml. This installs all the packages you need in GPU mode. Make sure CUDA is installed on your system if you want to use the GPU version; otherwise, the program runs on the CPU. The commands are:
cd HOTSPOT/
conda env create -f environment.yaml -n hotspot
conda activate hotspot
To download the database and pre-trained models, you can simply use these bash scripts:
sh prepare_db.sh # download and unzip the database, 161.2 MB
sh prepare_mdl.sh # download and unzip the models, 13.14 GB
If the bash scripts do not work, you can manually download the database and models using the following links:
After downloading database.tgz and models.tgz to HOTSPOT's main directory, you have to unzip them:
tar -zxvf database.tgz
rm database.tgz
tar -zxvf models.tgz
rm models.tgz
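On systems without `tar`, the same unpack-and-delete step can be sketched with Python's standard `tarfile` module. The helper name `unpack_and_remove` is hypothetical; the archive names are the ones used above.

```python
import tarfile
from pathlib import Path


def unpack_and_remove(archive: str, dest: str = ".") -> None:
    """Extract a gzip-compressed tar archive into dest, then delete the archive."""
    archive_path = Path(archive)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    archive_path.unlink()


# Unpack whichever of the two archives is present in the current directory.
for name in ("database.tgz", "models.tgz"):
    if Path(name).exists():
        unpack_and_remove(name)
```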
Before predicting the hosts, you have to run preprocessing.py, which filters lengths and encodes features for input plasmid sequences. Then, you can use HOTSPOT.py for host prediction with the pre-trained Transformer models. By default, the temporary files are stored in the folder temporary_files/, and the prediction results are stored in the TSV file Results/host_lineage.tsv.
python preprocessing.py --contigs Example_fasta/multiple_plasmids.fasta
python HOTSPOT.py # Using a GPU is recommended to accelerate the program
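If you prefer to drive both steps from a script, the two commands can be chained as in the sketch below. The helper names `build_commands` and `run_pipeline` are hypothetical; the flags `--contigs`, `--midfolder`, and `--out` are the ones documented in the usage sections of this README, and the script is assumed to run from HOTSPOT's main directory.

```python
import subprocess
import sys
from typing import List


def build_commands(contigs: str, midfolder: str = "temporary_files/",
                   out: str = "Results/") -> List[List[str]]:
    """Build the argument lists for the preprocessing and prediction steps."""
    return [
        [sys.executable, "preprocessing.py", "--contigs", contigs,
         "--midfolder", midfolder],
        [sys.executable, "HOTSPOT.py", "--midfolder", midfolder, "--out", out],
    ]


def run_pipeline(contigs: str, **kwargs: str) -> None:
    """Run both steps in order; check=True aborts if a step fails."""
    for cmd in build_commands(contigs, **kwargs):
        subprocess.run(cmd, check=True)


# Example call (run from HOTSPOT's main directory):
# run_pipeline("Example_fasta/multiple_plasmids.fasta")
```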
The output is a TSV file containing the predicted host lineages from phylum to species. Each row corresponds to one input plasmid contig. For example:
Contig | phylum | class | order | family | genus | species |
---|---|---|---|---|---|---|
NZ_CP050042.1 | Pseudomonadota | Gammaproteobacteria | Enterobacterales | Enterobacteriaceae | Escherichia | Escherichia coli |
NZ_CP083619.1 | Bacillota | Clostridia | Eubacteriales* | Peptostreptococcaceae | Clostridioides* | Clostridioides difficile* |
NZ_CP083659.1 | Pseudomonadota | Gammaproteobacteria | Moraxellales | Moraxellaceae* | Acinetobacter | Acinetobacter variabilis |
Z22927.1 | Actinomycetota | Actinomycetes* | Corynebacteriales | Corynebacteriaceae | Corynebacterium* | Corynebacterium glutamicum* |
Notably, a taxon labeled with a star * is not predicted by the taxon classifier because its parent node has only one child in the tree.
The current phylogenetic tree used by HOTSPOT is smaller than the complete bacterial phylogenetic tree because: 1) not all bacteria contain plasmids, and 2) the host taxa covered by available sequenced plasmids are limited. Thus, we advise users to examine starred taxa more carefully.
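To list the starred taxa for each contig, the result file can be scanned with a few lines of Python. This sketch assumes that the TSV header uses the column names shown in the table above (`Contig`, `phylum`, ..., `species`) and that the star is written literally after the taxon name; check both against your own output file. The function name `starred_ranks` is hypothetical.

```python
import csv
import io
from typing import Dict, List

RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def starred_ranks(tsv_text: str) -> Dict[str, List[str]]:
    """Map each contig to the ranks whose predicted taxon ends with '*'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged: Dict[str, List[str]] = {}
    for row in reader:
        flagged[row["Contig"]] = [
            rank for rank in RANKS
            if (row.get(rank) or "").strip().endswith("*")
        ]
    return flagged


# Two rows copied from the example table, joined with tabs (assumed layout).
example = (
    "Contig\tphylum\tclass\torder\tfamily\tgenus\tspecies\n"
    "NZ_CP050042.1\tPseudomonadota\tGammaproteobacteria\tEnterobacterales\t"
    "Enterobacteriaceae\tEscherichia\tEscherichia coli\n"
    "NZ_CP083619.1\tBacillota\tClostridia\tEubacteriales*\t"
    "Peptostreptococcaceae\tClostridioides*\tClostridioides difficile*\n"
)
print(starred_ranks(example))
```

For a real run, the text can be read with `open("Results/host_lineage.tsv").read()` and passed to the same function.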
preprocessing.py:
The usage of preprocessing.py:
[-h, --help] Show the help message and exit
[--contigs INPUT_FA] FASTA file of the input sequences (one or more contigs in a single FASTA file, default test_contigs.fa)
[--len MINIMUM_LEN] Minimum length (bp) of contigs for length filtering (default 1500)
[--threads NUM] Number of threads to use (default 8)
[--dbdir DR] Database directory (default database/)
[--midfolder DIR] Folder to store the intermediate files from preprocessing (default temporary_files/)
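The effect of the `--len` option can be illustrated with a minimal length filter. This is a sketch, not the code in preprocessing.py: the helper names are hypothetical, and it assumes that a contig whose length equals the minimum is kept.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     min_len: int = 1500) -> Dict[str, str]:
    """Keep only the contigs with at least min_len bases (assumed inclusive)."""
    return {header: seq for header, seq in records if len(seq) >= min_len}


# Toy input: contig 'a' has 1600 bases over two lines, contig 'b' has 4.
toy_fasta = [">a", "A" * 1000, "A" * 600, ">b", "ACGT"]
kept = filter_by_length(read_fasta(toy_fasta))
print(sorted(kept))
```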
HOTSPOT.py:
The usage of HOTSPOT.py:
[--midfolder DIR] Folder to store the intermediate files from preprocessing (used as the inputs of HOTSPOT.py, default temporary_files/)
[--mdldir DR] Pre-trained models' directory (default models/)
[--dbdir DR] Database directory (default database/)
[--out OUT] Path to store the output files (default Results/)
[--threads NUM] Number of threads to use if 'cpu' is detected ('cuda' not found, default 8)
[--accurate ACC] If this parameter is 1, the MC-dropout based early stop mechanism will be activated with two sets of uncertainty cutoffs, and the prediction will take more time (default 0). The available modes are:
1. sensitive mode (the default mode without early stop, output: 'Results/host_lineage.tsv')
2. specific mode (enabling the early stop, output: 'Results/host_lineage_specific.tsv')
3. accurate mode (enabling the early stop with a more stringent uncertainty cutoff, leading to more accurate prediction but returning taxa in higher levels for some inputs, output: 'Results/host_lineage_accurate.tsv')
[--mcnum MC] The number of the dropout-enabled forward passes to estimate the prediction uncertainty (works when '--accurate 1' is chosen, default: 100, minimum: 10)
HOTSPOT provides two special modes, specific mode and accurate mode, which aim at higher accuracy by using the MC-dropout based early stop for the tree search. To enable the early stop, use the option --accurate 1 when running HOTSPOT.py; the results of the two modes will then be stored in the output directory. Specifically, the accurate mode has a more stringent uncertainty cutoff than the specific mode, leading to more accurate prediction but returning taxa in higher levels for some inputs. In addition, the number of dropout-enabled forward passes can be chosen with the option --mcnum (default: 100).
For example (the prediction will take more time):
python HOTSPOT.py --accurate 1
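The idea behind the MC-dropout based early stop can be sketched as follows. This is an illustration under stated assumptions, not HOTSPOT's code: the uncertainty measure (standard deviation of the winning class probability across passes), the cutoff values, and all function names are hypothetical, and `make_noisy_classifier` is a random stand-in for a Transformer evaluated with dropout enabled. Only the number of passes (default 100, minimum 10) comes from the `--mcnum` description.

```python
import random
import statistics
from typing import Callable, List, Optional, Tuple


def mc_dropout_estimate(stochastic_forward: Callable[[], List[float]],
                        n_passes: int = 100) -> Tuple[int, float, float]:
    """Return (best class, its mean probability, its std) over n_passes passes."""
    if n_passes < 10:
        raise ValueError("n_passes must be at least 10")
    runs = [stochastic_forward() for _ in range(n_passes)]
    n_classes = len(runs[0])
    mean_probs = [statistics.fmean([r[c] for r in runs]) for c in range(n_classes)]
    best = max(range(n_classes), key=mean_probs.__getitem__)
    spread = statistics.pstdev([r[best] for r in runs])
    return best, mean_probs[best], spread


def predict_or_stop(stochastic_forward: Callable[[], List[float]],
                    cutoff: float, n_passes: int = 100) -> Optional[int]:
    """Return the predicted class, or None to stop the search at this node."""
    best, _, spread = mc_dropout_estimate(stochastic_forward, n_passes)
    return best if spread <= cutoff else None


def make_noisy_classifier(base_probs: List[float], noise: float,
                          rng: random.Random) -> Callable[[], List[float]]:
    """Stand-in for a dropout-enabled model: jitter and renormalise base_probs."""
    def forward() -> List[float]:
        jittered = [max(p + rng.uniform(-noise, noise), 1e-9) for p in base_probs]
        total = sum(jittered)
        return [p / total for p in jittered]
    return forward


rng = random.Random(0)
stable = make_noisy_classifier([0.9, 0.1], noise=0.01, rng=rng)
unstable = make_noisy_classifier([0.5, 0.5], noise=0.4, rng=rng)
print(predict_or_stop(stable, cutoff=0.05))    # low spread: keep descending
print(predict_or_stop(unstable, cutoff=0.05))  # high spread: stop early
```

When the search stops at a node, the lineage is reported only down to that level, which is why a stricter cutoff returns higher-level taxa for some inputs.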
You can download the plasmid contigs and raw data of the datasets evaluated in the paper 'HOTSPOT: Hierarchical hOst predicTion for aSsembled Plasmid cOntigs with Transformer' with the following links:
Datasets | Annotated plasmid contigs | Raw data and description |
---|---|---|
Simulated metagenomic data | plasmid_contigs_mag.fa | original_dataset.tar.gz contains the assembled contigs and the code generating the simulated data |
Mock metagenomic data | SRR072232.fasta, SRR072233.fasta, SRR172902.fasta, SRR172903.fasta | SRR072232, SRR072233, SRR172902, SRR172903. The reference genomes: reference_genomes_mock.fasta |
Hi-C dataset | plasmid_contigs_hi-c.fa | wastewater_hi-c_data.tar.gz. Data source: https://osf.io/ezb8j/wiki/home/ |
As an example, we ran HOTSPOT with 8 threads and a GPU on 4,536 complete plasmids (333 MB). The running time required for the two steps is listed below:
preprocessing.py | HOTSPOT.py | Total running time |
---|---|---|
3h38min | 52.65s | 3h39min |
Thus, most of the time is used to run Prodigal and DIAMOND BLASTP for preprocessing.
If you have any questions, please email us:
yongxinji2-c@my.cityu.edu.hk (Yongxin JI)
jyshang2-c@my.cityu.edu.hk (Jiayu SHANG)
xubotang2-c@my.cityu.edu.hk (Xubo TANG)