Generating protein sequences with specified topological structures.
git clone https://github.com/AI-ProteinGroup/TopoProGenerator.git
cd TopoProGenerator
Check your CUDA version:
nvidia-smi
According to your CUDA version, install a compatible version of PyTorch from the [PyTorch website](https://pytorch.org/)
python
import torch
torch.cuda.is_available()
If this returns True, PyTorch is installed and can use the GPU.
pip install -r requirements.txt
Installation takes less than 1 hour.
Download the protbert parameter file protbert.tar.gz from [Zenodo](https://zenodo.org/record/8129221), then extract it:
tar -xzvf protbert.tar.gz
Put the path of the extracted protbert folder in the following two places (replacing ****) in policy_transformer/src/predict_model.py or policy_LSTM/src/predict_model.py:
self.tokenizer = BertTokenizer.from_pretrained('****/protbert', do_lower_case=False)
self.model = BertModel.from_pretrained('****/protbert')
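Optionally, you can confirm the extracted folder loads before editing predict_model.py (a minimal sketch; /path/to/protbert is a placeholder for wherever you extracted the archive):

```python
# Optional sanity check: confirm the extracted protbert folder can be loaded.
# '/path/to/protbert' is a placeholder for the directory produced by tar above.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('/path/to/protbert', do_lower_case=False)
model = BertModel.from_pretrained('/path/to/protbert')
print(type(tokenizer).__name__, type(model).__name__)
```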
If you want to train a model yourself from scratch, please skip this step.
Choose the model you want to use (Transformer or LSTM):
cd policy_transformer
or
cd policy_LSTM
Download the model file you need from [Zenodo](https://zenodo.org/record/8129221).
We provide model parameter files for TopoProGenerator (model_transformer.pth) and LSTM (model_LSTM.pt) as reference models.
If you want to use a trained model, skip this step.
You need to process the sequence dataset into the following form, where i denotes the topology label 'HHH':
iDEEERRVEELIEEARELEKRNPEEARKVLEEAYELAKRINDPLLEEVEKLLRRLR
iSEHEERIRELLERARRIPDKEEARRLVEEAIRIAEENNDEELLKKAREILEEIKR
Save it as a *.csv file; this is the dataset for pretraining.
Then process the original (unlabeled) sequence sets like:
DEEERRVEELIEEARELEKRNPEEARKVLEEAYELAKRINDPLLEEVEKLLRRLR
SEHEERIRELLERARRIPDKEEARRLVEEAIRIAEENNDEELLKKAREILEEIKR
Save them as *.txt and *.csv files; these are the datasets for fine-tuning.
For TPG, both the pretraining dataset and the fine-tuning dataset are in ./data.
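If you need to build these files for your own sequence set, a minimal preprocessing sketch follows (not part of the repository; all file names in it are placeholders):

```python
# Sketch: build the pretraining and fine-tuning dataset files in the formats described above.
# File names are placeholders; adjust the paths and the topology label to your data.
import csv

LABEL = "i"  # topology label prepended to each sequence; in the example above, i denotes 'HHH'

with open("raw_sequences.txt") as f:
    sequences = [line.strip() for line in f if line.strip()]

# Pretraining dataset: label + sequence, one sequence per row of a CSV file.
with open("pretrain.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for seq in sequences:
        writer.writerow([LABEL + seq])

# Fine-tuning datasets: the original (unlabeled) sequences saved as both .txt and .csv.
with open("finetune_truth.txt", "w") as f:
    f.write("\n".join(sequences) + "\n")

with open("finetune.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for seq in sequences:
        writer.writerow([seq])
```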
!!! Model parameters (such as tgt_len, d_embed, n_layers, and so on) must be consistent across pretraining, fine-tuning, and generation.
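To catch mismatches early, a small check like the sketch below can be run before launching a job (it assumes keys such as tgt_len, d_embed, and n_layers sit at the top level of each JSON config; adjust the key list and paths to your setup):

```python
# Sketch: warn when shared model parameters differ between two config files.
import json

def check_params(cfg_a_path, cfg_b_path, keys=("tgt_len", "d_embed", "n_layers")):
    with open(cfg_a_path) as f:
        cfg_a = json.load(f)
    with open(cfg_b_path) as f:
        cfg_b = json.load(f)
    for key in keys:
        a, b = cfg_a.get(key), cfg_b.get(key)
        if a != b:
            print(f"WARNING: {key} differs: {cfg_a_path}={a} vs {cfg_b_path}={b}")

check_params("./config/pretrain_transformer.json", "./config/fine-tuning_transformer.json")
```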
Edit ./config/pretrain_transformer.json
name | content |
---|---|
datasets | Address of the dataset used for pretraining |
datasets_col | The column where the protein sequence is located (starting from 0) |
save_addr | Address of output model file |
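If you prefer to set these fields from a script instead of editing the file by hand, here is a minimal sketch (the values are placeholders; only the keys listed above are touched):

```python
# Sketch: set the three pretraining fields listed above; other keys are left unchanged.
import json

cfg_path = "./config/pretrain_transformer.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["datasets"] = "./data/pretrain.csv"      # placeholder path to the pretraining CSV
cfg["datasets_col"] = 0                      # 0-based column index of the sequence
cfg["save_addr"] = "./model/pretrained.pth"  # placeholder path for the output model file

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```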
Start pretraining
python pretrain_transformer.py --config ./config/pretrain_transformer.json
Using an A100, pretraining takes less than 5 hours for the transformer and less than 2 days for the LSTM.
Edit ./config/fine-tuning_transformer.json
(model parameters must be consistent with pretraining)
name | content |
---|---|
fine_tuning_datasets | Address of the .csv dataset used for fine-tuning |
datasets_col | The column where the protein sequence is located (starting from 0) |
truth_seq_datasets | Address of the .txt dataset of ground-truth sequences used for fine-tuning |
prime_str | Topology labels specified for generated sequences |
generator_model | Address of pretrained model |
num_epochs | Total number of fine-tuning epochs |
g_epoch | Epochs of generative model training in each round of fine-tuning |
d_epoch | Epochs of discriminative model training in each round of fine-tuning |
fake_data_num | Number of generated sequences used for discriminative model training |
predictor_score_up | Weights of stable sequences |
predictor_score_up | Weights of unstable sequences |
save_addr | Address of output model file |
Start fine-tuning
python fine-tuning_transformer.py --config ./config/fine-tuning_transformer.json
After each epoch of fine-tuning, the model also generates 20,000 sequences.
Using an A100, fine-tuning takes less than 40 hours for the transformer and less than 20 hours for the LSTM; it can be faster with distributed training.
Edit ./config/generate_transformer.json
(model parameters must be consistent with those used for pretraining or fine-tuning)
name | content |
---|---|
prime_str | Topology labels specified for generated sequences |
generator_model | Address of model |
num_seq | Number of generated sequences |
min_length | Minimum length of generated sequence |
max_length | Maximum length of generated sequences, which must be smaller than tgt_len |
seq_save | Address of the output file for generated sequences |
Generate sequences
python generate_transformer.py --config ./config/generate_transformer.json
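For a quick summary of the output, the sketch below can be used (it assumes the file written to seq_save contains one sequence per line; adjust if the output format differs):

```python
# Sketch: basic statistics over the generated sequences.
seq_file = "./generated_sequences.txt"  # placeholder; use the path you set in seq_save

with open(seq_file) as f:
    seqs = [line.strip() for line in f if line.strip()]

lengths = [len(s) for s in seqs]
print(f"{len(seqs)} sequences, lengths {min(lengths)}-{max(lengths)}, "
      f"mean {sum(lengths) / len(lengths):.1f}")
```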
This project is covered under the Apache 2.0 License.