MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization
This repository replicates the experiments for fine-tuning the classifier with a pretrained ALBERT model, as described in the DISAE paper.
- python 3.7
- PyTorch
- rdkit
- Transformers (Hugging Face, version 2.3.0)
All data can be downloaded here and should be placed under this repository, i.e. in the same directory as finetuning_train.py.
There will be four subdirectories in the data folder.
- activity: the train/dev/test split, based on protein similarity at a bitscore threshold of 0.035
- albertdata: the pretrained ALBERT model, pretrained on distilled triplets of the whole Pfam
- Integrated: chemicals collected from several databases
- protein: the mapping from UniProt ID to triplet form
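Before training, it can help to confirm that the downloaded data is laid out as described. The following is a minimal sketch, not part of the repository: the helper name missing_data_subdirs is hypothetical, and it assumes the folder is called "data" and sits next to finetuning_train.py.

```python
from pathlib import Path

# Subdirectory names as listed above. The "data" folder name and its location
# next to finetuning_train.py are assumptions made for this sketch.
EXPECTED_SUBDIRS = ("activity", "albertdata", "Integrated", "protein")


def missing_data_subdirs(data_root="data"):
    """Return the expected subdirectories that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_subdirs()
    if missing:
        print("Missing data subdirectories:", ", ".join(missing))
    else:
        print("Data folder layout looks complete.")
```

An empty result means all four subdirectories were found; the check does not inspect the files inside them.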
- Cluster your protein dataset with cdhit.sh. The input is a FASTA file containing all protein sequences in your dataset.
- Apply multiple sequence alignment to the clusters with Clustal Omega (clustalo.sh).
- Build HMM profiles for the clusters with hmmbuild (hmmer_build.sh).
- Redo the multiple sequence alignment with the HMM profiles and HMP clusters using HMMER (hmmer_align.sh).
- Construct the corpus (singlets and triplets, representative sequences and all sequences) with construct_hmp_singlets_and_triplets.py. This step can take a long time on a single CPU; multiprocessing can significantly reduce computing time.
- Generate TFRecords from the corpus with create_tfrecords.sh.
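To illustrate the corpus construction step and why multiprocessing helps, here is a rough sketch. It is not the logic of construct_hmp_singlets_and_triplets.py. It assumes that "singlets" are single-residue tokens and "triplets" are consecutive three-residue tokens, and all function names (to_singlets, to_triplets, tokenize, build_corpus) are hypothetical. The actual script may select alignment columns or handle gaps differently.

```python
from multiprocessing import Pool


def to_singlets(sequence):
    """Split a sequence into single-residue tokens."""
    return list(sequence)


def to_triplets(sequence, stride=3):
    """Split a sequence into three-residue tokens.

    stride=3 gives non-overlapping triplets, stride=1 overlapping ones.
    A trailing fragment shorter than three residues is dropped.
    (Assumed tokenization; the repository script may differ.)
    """
    return [sequence[i:i + 3] for i in range(0, len(sequence) - 2, stride)]


def tokenize(sequence):
    """Return the space-joined singlet and triplet forms of one sequence."""
    return " ".join(to_singlets(sequence)), " ".join(to_triplets(sequence))


def build_corpus(sequences, workers=1):
    """Tokenize all sequences, optionally across several worker processes."""
    if workers <= 1:
        return [tokenize(seq) for seq in sequences]
    with Pool(processes=workers) as pool:
        # chunksize batches the work so inter-process overhead stays small
        return pool.map(tokenize, sequences, chunksize=64)


if __name__ == "__main__":
    toy = ["MKTAYIAKQ", "GSHMLEDP"]  # toy sequences, not real data
    # Raise workers to the number of available CPU cores for a real corpus.
    for singlets, triplets in build_corpus(toy, workers=1):
        print(singlets, "|", triplets)
```

Each sequence is tokenized independently, so the work splits cleanly across processes and the speed-up should grow with the number of cores until I/O becomes the bottleneck.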
To run the ALBERT model (default: ALBERT frozen transformer):
python finetuning_train.py --protein_embedding_type="albert"
To try other freezing options, change "frozen_list" to choose modules to be frozen.
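How "frozen_list" is consumed inside finetuning_train.py is not shown here, so the following is only a sketch of one common PyTorch pattern: disabling gradients for every parameter whose name falls under a listed module prefix. The helper name freeze_modules is hypothetical.

```python
def freeze_modules(model, frozen_list):
    """Disable gradients for parameters under any module prefix in frozen_list.

    `model` is expected to behave like a torch.nn.Module, i.e. to expose
    named_parameters() yielding (name, parameter) pairs. Returns the names of
    the parameters that were frozen. (Illustrative helper, not repository code.)
    """
    frozen = []
    for name, param in model.named_parameters():
        # Match whole path components so "encoder" does not match "encoder_extra".
        if any(name == p or name.startswith(p + ".") for p in frozen_list):
            param.requires_grad = False
            frozen.append(name)
    return frozen
```

With a real model, the prefixes must match the names reported by model.named_parameters(), so it is worth printing those names first. After freezing, a common follow-up is to pass only the parameters that still have requires_grad set to the optimizer.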
To run the LSTM model:
python finetuning_train.py --protein_embedding_type="lstm"
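For orientation, a flag like --protein_embedding_type is typically declared with argparse. The sketch below is an assumption about how such a flag could be defined, not the parser in finetuning_train.py: it restricts the choices to the two values shown above and assumes "albert" as the default.

```python
import argparse


def build_parser():
    """Build a parser exposing the encoder switch used in the commands above."""
    parser = argparse.ArgumentParser(description="Fine-tuning sketch")
    parser.add_argument(
        "--protein_embedding_type",
        choices=["albert", "lstm"],  # only the two values shown in this README
        default="albert",            # assumed default for this sketch
        help="protein encoder to use",
    )
    return parser
```

Calling build_parser().parse_args(["--protein_embedding_type=lstm"]) yields a namespace whose protein_embedding_type attribute is "lstm"; any value outside the listed choices is rejected with a usage error.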