Tempo

This is the PyTorch implementation of TEMPO from the paper: [TEMPO: A Transformer-based Mutation Prediction Framework for SARS-CoV-2 Evolution].

Requirements

pytorch
scikit-learn (imported as sklearn)

Data preparation

Protein sequence data:

This data file contains the original and preprocessed protein sequence data for SARS-CoV-2, H1N1, H3N2 and H5N1, which is necessary to run the code. Before running the code, data.zip should be downloaded separately; you can click here to download the data for convenience.

The files contained in data.zip

Preprocessed data used to reproduce the paper, including the SARS-CoV-2, H1N1, H3N2 and H5N1 datasets.
Phylogenetic tree data for SARS-CoV-2, named "tree.txt".
COVID-19 S-protein (spike protein) sequence data aligned with MAFFT, named "spike_prot_processed.csv".

Phylogenetic tree data:

This supplementary data is not required to run the code, but it may help others understand our paper in more depth and build further work on it. The phylogenetic tree data for SARS-CoV-2 can be found here.

Usage

To run the code

Add "data.zip" to the root directory of the project (at the same level as "training.py").
Decompress the data to get a folder named "data":

unzip data.zip

Modify the dataset paths defined in training.py (lines 14 to 31) so that they point to the data folder in your environment.
Train the model with the following command:

python training.py > output.txt
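
As an alternative to the unzip step above, the archive can also be extracted from Python, which may be convenient on systems without an unzip command. This is only a minimal sketch: the function name extract_data is hypothetical and not part of this repository, and it assumes "data.zip" sits in the current directory.

```python
import zipfile
from pathlib import Path


def extract_data(zip_path="data.zip", dest="."):
    """Extract the archive into dest and return the list of member names.

    Hypothetical helper for illustration; not part of the TEMPO code base.
    """
    zip_path = Path(zip_path)
    if not zip_path.is_file():
        raise FileNotFoundError(f"{zip_path} not found; download data.zip first")
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()


if __name__ == "__main__":
    # Print the extracted members so the paths in training.py can be checked.
    for name in extract_data():
        print(name)
```

Listing the extracted members is a quick way to confirm which paths to use when editing training.py.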

Output

Results are reported every 10 epochs during training. The following metrics are recorded in the output.txt file:

T_loss: training loss of this epoch
T_acc: training accuracy of this epoch
T_pre: training precision of this epoch
T_rec: training recall of this epoch
T_fscore: training F1 score of this epoch
T_mcc: training Matthews correlation coefficient of this epoch
V_loss: validation loss of this epoch
V_acc: validation accuracy of this epoch
V_pre: validation precision of this epoch
V_rec: validation recall of this epoch
V_fscore: validation F1 score of this epoch
V_mcc: validation Matthews correlation coefficient of this epoch
BEST_V_loss: best validation loss of all iterations so far
BEST_V_acc: best validation accuracy of all iterations so far
BEST_V_pre: best validation precision of all iterations so far
BEST_V_rec: best validation recall of all iterations so far
BEST_V_fscore: best validation F1 score of all iterations so far
BEST_V_mcc: best validation Matthews correlation coefficient of all iterations so far
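
For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, recall, F1 score and the Matthews correlation coefficient relate to the confusion-matrix counts. It is an illustration only, not the code used in training.py: it assumes binary 0/1 labels, and the function name binary_metrics is hypothetical. scikit-learn provides equivalents in sklearn.metrics (accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef).

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute acc, pre, rec, fscore and mcc for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    fscore = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0

    # MCC is defined as 0 here when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"acc": acc, "pre": pre, "rec": rec, "fscore": fscore, "mcc": mcc}


if __name__ == "__main__":
    print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

Unlike accuracy, MCC uses all four confusion-matrix counts, which makes it informative when the two classes are imbalanced.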
