DISAE

MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization

This repository replicates the experiments from the DISAE paper on fine-tuning a classifier with pretrained ALBERT.

All data can be downloaded here. Put the data folder under this repository, i.e. in the same directory as finetuning_train.py.

There will be four subdirectories in the data folder.

activity: the train/dev/test split, based on protein similarity at a bitscore threshold of 0.035
albertdata: the pretrained ALBERT model, pretrained on distilled triplets of the whole Pfam
Integrated: chemicals collected from several databases
protein: the mapping from UniProt ID to triplet form
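
Before training, it can help to confirm that the download was unpacked completely. The following is a minimal sketch, not part of the repository: the helper name `missing_data_subdirs` is hypothetical, and the parent folder name `data` is an assumption based on the description above.

```python
from pathlib import Path

# Subdirectory names as listed above. The parent folder name "data"
# used below is an assumption, not something this README confirms.
EXPECTED_SUBDIRS = ("activity", "albertdata", "Integrated", "protein")


def missing_data_subdirs(data_root):
    """Return the expected subdirectories that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_subdirs("data")
    if missing:
        print("Missing data subdirectories:", ", ".join(missing))
    else:
        print("Data folder layout looks complete.")
```

Run it from the repository root, next to finetuning_train.py.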

To construct the corpus from your own protein dataset:

1. Cluster your protein dataset with cdhit.sh. The input is a FASTA file containing all protein sequences in your dataset.
2. Apply multiple sequence alignment to the clusters with Clustal Omega (clustalo.sh).
3. Build HMM profiles for the clusters with hmmbuild (hmmer_build.sh).
4. Redo the multiple sequence alignment with the HMM profiles and HMP clusters using HMMER (hmmer_align.sh).
5. Construct the corpus (singlets and triplets, for representative sequences and for all sequences) with construct_hmp_singlets_and_triplets.py. This step can take a long time on a single CPU; multiprocessing significantly reduces the computing time.
6. Generate TFRecords from the corpus with create_tfrecords.sh.
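
To illustrate what singlets and triplets look like, here is a minimal sketch. It assumes a triplet is a window of three consecutive residues, and it exposes the stride as a parameter because this README does not say whether the windows overlap. The function names are hypothetical, and the actual logic in construct_hmp_singlets_and_triplets.py may differ, for example in how alignment gaps are handled.

```python
def to_singlets(sequence):
    """Split a sequence into single-residue tokens."""
    return list(sequence)


def to_triplets(sequence, stride=1):
    """Split a sequence into 3-residue tokens.

    stride=1 gives overlapping windows, stride=3 gives non-overlapping ones.
    Trailing residues that do not fill a whole window are dropped.
    This windowing scheme is an assumption made for illustration.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return [sequence[i:i + 3] for i in range(0, len(sequence) - 2, stride)]


if __name__ == "__main__":
    seq = "MKTAYI"  # toy sequence, not taken from the dataset
    print(" ".join(to_singlets(seq)))
    print(" ".join(to_triplets(seq)))
    print(" ".join(to_triplets(seq, stride=3)))
```

Joining the tokens with spaces gives one corpus line per sequence, which is a common input format for building TFRecords, though the format expected by create_tfrecords.sh is not specified here.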

To run the ALBERT model (default: ALBERT with frozen transformer):

python finetuning_train.py --protein_embedding_type="albert"

To try other freezing options, change "frozen_list" to choose modules to be frozen.
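
As a rough illustration of how a list like "frozen_list" might be applied, the sketch below disables gradients for every parameter whose name starts with one of the listed prefixes. This is an assumption about the mechanism, not the repository's code: the format of "frozen_list", the helper `freeze_by_prefix`, the stand-in `Param` class, and the parameter names are all hypothetical. With a PyTorch model, the `(name, parameter)` pairs would come from `model.named_parameters()`.

```python
class Param:
    """Stand-in for a framework parameter; only requires_grad matters here."""

    def __init__(self):
        self.requires_grad = True


def freeze_by_prefix(named_params, frozen_list):
    """Disable gradients for parameters whose name starts with a listed prefix.

    named_params: iterable of (name, parameter) pairs.
    frozen_list: module name prefixes to freeze (assumed format).
    Returns the names of the parameters that were frozen.
    """
    frozen = []
    for name, param in named_params:
        if any(name.startswith(prefix) for prefix in frozen_list):
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Hypothetical parameter names, chosen only for this example.
params = {
    "embeddings.word_embeddings.weight": Param(),
    "encoder.layer.0.attention.weight": Param(),
    "pooler.weight": Param(),
    "classifier.weight": Param(),
}
frozen = freeze_by_prefix(params.items(), frozen_list=["encoder"])
print("Frozen parameters:", frozen)
```

Returning the frozen names makes it easy to log which modules were actually affected, which helps catch a prefix that matches nothing.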

To run LSTM model:

python finetuning_train.py --protein_embedding_type="lstm"