PFresGO: an attention mechanism-based deep-learning approach for protein annotation by integrating gene ontology inter-relationships

PFresGO is an attention-based deep-learning approach that incorporates hierarchical structures in Gene Ontology (GO) graphs and advances in natural language processing algorithms for the functional annotation of proteins.

This repository contains the scripts used to train the PFresGO model, together with the scripts for running protein function prediction.

Dependencies

  • The code was developed and tested with Python 3.7
  • TensorFlow 2.4.1

Scripts

train_PFresGO.py - this script trains the PFresGO model.

To train PFresGO, run:

python train_PFresGO.py --num_hidden_layers 1 --ontology 'bp' --model_name 'BP_PFresGO'
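If you want to train one model per GO sub-ontology, the command above can be generated in a loop. The following is only a sketch: this README shows 'bp' alone, so the 'mf' and 'cc' ontology codes and the derived model names are assumptions that should be checked against the argument parser in train_PFresGO.py.

```python
import shlex

# Assumption: only 'bp' appears in this README; 'mf' and 'cc' are guessed
# by analogy and must be verified against train_PFresGO.py.
ONTOLOGIES = ["bp", "mf", "cc"]


def build_train_command(ontology, num_hidden_layers=1):
    """Return the argument list for one training run."""
    # Hypothetical naming scheme that mirrors the documented 'BP_PFresGO'.
    model_name = ontology.upper() + "_PFresGO"
    return [
        "python", "train_PFresGO.py",
        "--num_hidden_layers", str(num_hidden_layers),
        "--ontology", ontology,
        "--model_name", model_name,
    ]


def format_command(args):
    """Join an argument list into a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    for onto in ONTOLOGIES:
        # Only print the command; pass the list to
        # subprocess.run(cmd, check=True) to actually start training.
        print(format_command(build_train_command(onto)))
```

Printing the commands first makes it easy to review them before starting long training runs.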

predict.py - this script predicts protein function.

To use PFresGO for prediction, download the trained model from https://huggingface.co/datasets/Biocollab/PFresGO/tree/main

Then prepare your sequence file in FASTA format, generate the protein residue-level embeddings (see fasta-embedding.py), put the resulting .h5 file into ./Datasets/, and run:

python predict.py --num_hidden_layers 1 --ontology 'bp' --model_name 'BP_PFresGO' --res_embeddings './Datasets/per_residue_embeddings.h5'

train_autoencoder.py - this script generates the pretrained autoencoder model.

To train the autoencoder, run:

python train_autoencoder.py --input_dims 1024 --model_name 'Autoencoder_128'

fasta-embedding.py - this script generates the protein residue-level embeddings.

The protein residue-level embeddings are generated by the pre-trained language model ProtT5. Before you run this script, create the working directories, download the checkpoint, and install the required packages (the leading ! is notebook syntax; omit it in a regular shell):

!mkdir protT5
!mkdir protT5/protT5_checkpoint
!mkdir protT5/sec_struct_checkpoint
!mkdir protT5/output
!wget -nc -P protT5/sec_struct_checkpoint http://data.bioembeddings.com/public/embeddings/feature_models/t5/secstruct_checkpoint.pt
!pip install torch transformers sentencepiece h5py

Then put your own FASTA-format protein sequences into ./Datasets/ and run:

python fasta-embedding.py --seq_path './Datasets/nrPDB-GO_2019.06.18_sequences.fasta'

For detailed configuration, refer to https://github.com/agemagician/ProtTrans
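Before running fasta-embedding.py on your own sequences, it can help to confirm that the FASTA file parses cleanly. The sketch below is a standalone check and not part of this repository; the function name parse_fasta is hypothetical, and using the first token of each header line as the sequence ID is an assumption about how the sequences are keyed.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping ID -> sequence."""
    records = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("header line without a sequence ID")
            # Assumption: the first whitespace-delimited token is the ID.
            current_id = tokens[0]
            if current_id in records:
                raise ValueError("duplicate sequence ID: " + current_id)
            records[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may span several lines; collect the pieces.
            records[current_id].append(line.upper())
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


if __name__ == "__main__":
    sample = [">protA example protein", "MKT", "AYI", ">protB", "gsh"]
    for seq_id, seq in parse_fasta(sample).items():
        print(seq_id, len(seq))
```

To check a real file, open it and pass the file handle, for example parse_fasta(open(path)), since the function accepts any iterable of lines.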

label_embedding.py - this script generates the GO term embeddings.

The GO term embeddings are generated by the pre-trained model Anc2Vec. Before you run this script, put your own .obo-format GO terms file into ./Datasets/ and install the Anc2Vec package via:

pip install -U "anc2vec @ git+https://github.com/aedera/anc2vec.git"

For detailed configuration, refer to https://github.com/sinc-lab/anc2vec
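To check which GO terms an .obo file contains before generating embeddings, the file can be read with a few lines of plain Python. This is a minimal sketch of the OBO stanza layout ([Term] blocks with id, name, namespace and is_a tags); it is not how label_embedding.py or Anc2Vec reads the file, the function name parse_obo_terms is hypothetical, and obsolete terms are not filtered out.

```python
def parse_obo_terms(lines):
    """Collect name, namespace and is_a parents for each [Term] stanza."""
    terms = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # file header, other stanza types, or blank lines
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "id":
            terms[value] = current
        elif key in ("name", "namespace"):
            current[key] = value
        elif key == "is_a":
            # "is_a: GO:0008150 ! biological_process" -> keep only the ID.
            current["is_a"].append(value.split("!")[0].strip())
    return terms


if __name__ == "__main__":
    sample = [
        "[Term]",
        "id: GO:0009987",
        "name: cellular process",
        "namespace: biological_process",
        "is_a: GO:0008150 ! biological_process",
    ]
    for go_id, info in parse_obo_terms(sample).items():
        print(go_id, info["namespace"], info["is_a"])
```

Counting the terms per namespace in the result is a quick way to confirm that the file covers the ontology you plan to train on.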

/preprocessing/Seq2TFRecord.py - this script generates the training and validation TFRecords.

python Seq2TFRecord.py -prot_list '../Datasets/nrPDB-GO_2019.06.18_train.txt' -num_threads 30 -num_shards 30 -tfr_prefix '../Datasets/TFRecords_sequences/PDB_GO_train'

The resulting TFRecords used for PFresGO training and validation are stored in /Datasets/TFRecords_sequences/
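To illustrate what the -num_shards option implies, the sketch below divides a list of protein IDs into nearly equal shards, one per output file. This is an assumption about the general idea of sharding only; the actual splitting logic and output file naming in Seq2TFRecord.py may differ, and the function name split_into_shards is hypothetical.

```python
def split_into_shards(items, num_shards):
    """Split a list into num_shards contiguous, nearly equal-sized parts."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(len(items), num_shards)
    shards = []
    start = 0
    for index in range(num_shards):
        # The first `extra` shards receive one additional item.
        size = base + (1 if index < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards


if __name__ == "__main__":
    # Hypothetical protein IDs, used only for illustration.
    proteins = ["prot%03d" % i for i in range(100)]
    shards = split_into_shards(proteins, 30)
    print(len(shards), [len(shard) for shard in shards[:12]])
```

With more shards than proteins, some shards are simply empty, so a small test list can still be processed with the same settings.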

Datasets - this directory contains the data used to train our method and make predictions.