TABR-BERT: an Accurate and Robust BERT-based Transfer Learning Model for TCR-pMHC Interaction Prediction
Publication: https://doi.org/10.1093/bib/bbad436
Contact: hui.yao@freshwindbiotech.com
There are two ways to run TABR-BERT
Instructions for installing Docker can be found at https://docs.docker.com/
Pull the image of TABR-BERT from dockerhub:
docker pull freshwindbioinformatics/tabr-bert:v1
Run the image in bash:
docker run -it --gpus all freshwindbioinformatics/tabr-bert:v1 bash
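To confirm that the container can see your GPUs (this assumes the NVIDIA Container Toolkit is installed on the host), you can run the following inside the container:
nvidia-smi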
- python == 3.9.12
- mhcnames == 0.4.8
- numpy == 1.21.5
- pandas == 1.2.0
- scikit_learn == 1.1.3
- scipy == 1.8.0
- torch == 1.11.0
* Note : To use a GPU, install CUDA and cuDNN versions compatible with your PyTorch version; compatible version combinations are listed at https://pytorch.org/get-started/previous-versions/
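A quick, optional check that PyTorch was installed with working GPU support:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"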
Command:
conda create -n tabr_bert python==3.9.12
conda activate tabr_bert
pip install -r requirements.txt
* Note : Instructions for downloading and installing conda are available at https://docs.conda.io/
You can find the data used to train TCR-BERT and pMHC-BERT, as well as the healthy TCR dataset, at https://zenodo.org/record/8215354
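Individual files from the record can be fetched with wget; the file name below is a placeholder, see the Zenodo page for the actual file names:
wget https://zenodo.org/record/8215354/files/<filename>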
Usage: pre_train_tcr_embedding_model.py [options]
Required:
--input STRING: The input data to train the TCR embedding model (*.csv)
Required columns: "cdr3"
--model_dir STRING: where to save the model (*.pt)
Optional:
--n_layers INT: number of transformer encoder layers (default: 4)
--d_model INT: embedding dimension (default: 256)
--batchsize INT: mini-batch size (default: 1024)
--lr FLOAT: learning rate (default: 5e-5)
--max_epoch INT: maximum number of training epochs (default: 100)
--GPUs INT: number of GPUs to use (default: 2)
*Note : If you use docker, you can train the TCR embedding model directly with the following command:
python pre_train_tcr_embedding_model.py
This requires two GPUs with more than 8 GB of memory; the memory requirement can be reduced by lowering the batch size, but this may affect the stability and effectiveness of training.
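Outside of docker, a full invocation might look like the following; the file paths are illustrative placeholders, and the optional flags shown simply restate their defaults:
python pre_train_tcr_embedding_model.py --input tcr_data.csv --model_dir tcr_model.pt --n_layers 4 --d_model 256 --batchsize 1024 --lr 5e-5 --max_epoch 100 --GPUs 2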
Usage: pre_train_pmhc_embedding_model.py [options]
Required:
--input STRING: The input data to train the pMHC embedding model (*.csv)
Required columns: ["allele", "peptide", "label"]
--random_peptide STRING: natural peptides for generating negative cases (*.csv)
Required columns: "peptide"
--model_dir STRING: where to save the model (*.pt)
Optional:
--n_layers INT: number of transformer encoder layers (default: 4)
--d_model INT: embedding dimension (default: 256)
--neg_X INT: number of negative cases generated per positive case (default: 2)
--batchsize INT: mini-batch size (default: 1024)
--lr FLOAT: learning rate (default: 5e-5)
--max_epoch INT: maximum number of training epochs (default: 100)
--GPUs INT: number of GPUs to use (default: 2)
*Note : If you use docker, you can train the pMHC embedding model directly with the following command:
python pre_train_pmhc_embedding_model.py
This requires two GPUs with more than 14 GB of memory; the memory requirement can be reduced by lowering the batch size, but this may affect the stability and effectiveness of training.
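Outside of docker, a full invocation might look like this; the file paths are illustrative placeholders:
python pre_train_pmhc_embedding_model.py --input pmhc_data.csv --random_peptide random_peptides.csv --model_dir pmhc_model.pt --neg_X 2 --batchsize 1024 --GPUs 2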
Usage: train_tcr_pmhc_prediction_model.py [options]
Required:
--input STRING: The input data to train the TCR-pMHC prediction model (*.csv)
Required columns: ["allele", "peptide", "cdr3"]
--healthy_tcr STRING: TCRs from healthy people for generating negative cases (*.csv)
Required columns: "cdr3"
--pseudo_sequence_dict STRING: mapping from allele name to pseudo sequence (*.csv)
Required columns: ["allele", "sequence"]
--tcr_model STRING: TCR embedding model dir (*.pt)
--pmhc_model STRING: pMHC embedding model dir (*.pt)
--model_dir STRING: where to save the model (*.pt)
Optional:
--batchsize INT: mini-batch size (default: 256)
--embedding_batchsize INT: mini-batch size for embedding generation (default: 256)
--pmhc_d_model INT: dimension of the pMHC embedding (default: 256)
--tcr_d_model INT: dimension of the TCR embedding (default: 256)
--lr FLOAT: learning rate (default: 5e-4)
--max_epoch INT: maximum number of training epochs (default: 500)
--GPUs INT: number of GPUs to use (default: 2)
*Note : If you use docker, you can train the TCR-pMHC prediction model directly with the following command:
python train_tcr_pmhc_prediction_model.py
This requires two GPUs with more than 5 GB of memory; the memory requirement can be reduced by lowering the batch size, but this may affect the stability and effectiveness of training.
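Outside of docker, a full invocation might look like this; the file paths are illustrative placeholders:
python train_tcr_pmhc_prediction_model.py --input train_data.csv --healthy_tcr healthy_tcr.csv --pseudo_sequence_dict pseudo_sequences.csv --tcr_model tcr_model.pt --pmhc_model pmhc_model.pt --model_dir tcr_pmhc_model.pt --batchsize 256 --lr 5e-4 --GPUs 2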
Usage: predict_tcr_pmhc_binding.py [options]
Required:
--input STRING: The data to be predicted (*.csv)
Required columns: ["allele", "peptide", "cdr3"]
--healthy_tcr STRING: TCRs from healthy people for generating negative cases (*.csv)
Required columns: "cdr3"
--pseudo_sequence_dict STRING: mapping from allele name to pseudo sequence (*.csv)
Required columns: ["allele", "sequence"]
--tcr_pmhc_model STRING: TCR-pMHC prediction model dir (*.pt)
--tcr_model STRING: TCR embedding model dir (*.pt)
--pmhc_model STRING: pMHC embedding model dir (*.pt)
--output STRING: output file dir (*.csv)
Optional:
--batchsize INT: mini-batch size (default: 256)
--embedding_batchsize INT: mini-batch size for embedding generation (default: 256)
--pmhc_d_model INT: dimension of the pMHC embedding (default: 256)
--tcr_d_model INT: dimension of the TCR embedding (default: 256)
--GPUs INT: number of GPUs to use; 1 is recommended if a GPU is available, otherwise 0 (default: 0)
*Note : If you use docker, you can run prediction directly with the following command:
python predict_tcr_pmhc_binding.py --input input_data.csv
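Outside of docker, all required arguments must be supplied explicitly; the file names below are illustrative placeholders:
python predict_tcr_pmhc_binding.py --input input_data.csv --healthy_tcr healthy_tcr.csv --pseudo_sequence_dict pseudo_sequences.csv --tcr_pmhc_model tcr_pmhc_model.pt --tcr_model tcr_model.pt --pmhc_model pmhc_model.pt --output predictions.csv --GPUs 1
The input CSV must contain the columns "allele", "peptide" and "cdr3", for example (illustrative values only):
allele,peptide,cdr3
HLA-A*02:01,GILGFVFTL,CASSIRSSYEQYF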
Jiawei Zhang, Wang Ma, Hui Yao, "Accurate TCR-pMHC interaction prediction using a BERT-based transfer learning method", Briefings in Bioinformatics, Volume 25, Issue 1, January 2024, bbad436, https://doi.org/10.1093/bib/bbad436