Struct2GO: protein function prediction based on a graph pooling algorithm and AlphaFold2 structure information

Abstract

Struct2GO is a protein function prediction model based on self-attention graph pooling, which utilizes structural information from AlphaFold2 to augment the accuracy and generality of the model's predictions.

Data

The processed data for training and testing is available there.
The source data is available there.
The predicted_struct_protein_data.tar.gz, protein_contact_map.tar.gz, and struct_feature.tar.gz archives are provided as a supplement there.
The data include:

File/Folder name               Description
predicted_struct_protein_data  AlphaFold2-predicted human protein 3D structure datasets.
protein_contact_map            Computed CA-CA protein contact map.
struct_feature                 Protein structural features.
dict_sequence_feature          Protein sequence features.
gos_bp.csv                     GO terms corresponding to all human proteins in the BP branch.
gos_mf.csv                     GO terms corresponding to all human proteins in the MF branch.
gos_cc.csv                     GO terms corresponding to all human proteins in the CC branch.
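
Once downloaded, the three .tar.gz archives listed above need to be unpacked before the scripts can read them. The following is a minimal sketch using Python's standard tarfile module. The function name extract_archives and the directory arguments are hypothetical and are not part of this repository.

```python
import tarfile
from pathlib import Path

# Archive names are taken from the list above; where they are stored
# locally is an assumption and must be supplied by the caller.
ARCHIVES = [
    "predicted_struct_protein_data.tar.gz",
    "protein_contact_map.tar.gz",
    "struct_feature.tar.gz",
]


def extract_archives(download_dir, target_dir, names=ARCHIVES):
    """Unpack each archive found in download_dir into target_dir.

    Archives that are not present are skipped. Returns the names of the
    archives that were actually extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in names:
        archive = Path(download_dir) / name
        if not archive.is_file():
            continue  # not downloaded yet, nothing to unpack
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(name)
    return extracted
```

For example, extract_archives("./downloads", "./data") would unpack whichever of the three archives exist in a hypothetical ./downloads folder into ./data.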

Usage

Train the model

Run the run_train.sh script directly to train the model (e.g., for MFO):

bash run_train.sh

Note: Remember to update the file directory in the script to your local directory if you wish to run the MFO model or the other two models.

Evaluate the model

Run the run_test.sh script directly to evaluate the model (e.g., for MFO):

bash run_test.sh

Note: Remember to update the file directory in the script to your local directory if you wish to evaluate the MFO model or the other two models.

Processing raw data

We provide the processed data for training and evaluation directly there. The steps below explain how to process the raw data.

Protein structure data

  • Download protein structure data and convert the three-dimensional atomic structure of proteins into protein contact maps.
cd ./data_processing
python predicted_protein_struct2map.py
  • Obtain amino acid residue-level features through the Node2vec algorithm.
cd ./angel-master/spark-on-angel/example/local

(Note: run Node2VecExample.scala from IntelliJ IDEA.)

cd ./data_processing
python sort.py
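
To illustrate the first step above, here is a minimal sketch of turning a PDB-format structure into a CA-CA contact map. It is not the code in predicted_protein_struct2map.py. The function names are hypothetical, and the 10 angstrom distance threshold is an assumption chosen only for illustration.

```python
import math


def parse_ca_coords(pdb_text):
    """Return the (x, y, z) coordinates of every C-alpha atom in PDB-format text."""
    coords = []
    for line in pdb_text.splitlines():
        # Fixed-column PDB layout: atom name in columns 13-16,
        # x, y, z coordinates in columns 31-38, 39-46 and 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def contact_map(coords, threshold=10.0):
    """Build a symmetric binary contact map from C-alpha coordinates.

    Entry [i][j] is 1 when residues i and j are closer than `threshold`
    angstroms. The diagonal is left at 0 (no self-contacts). The default
    threshold is an assumption, not the repository's setting.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap
```

The resulting matrix can be read as the adjacency matrix of a residue graph, which is the kind of input a graph embedding method such as Node2vec operates on.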

Protein sequence data

  • Download protein sequence data and obtain protein sequence features through the SeqVec model.
cd ./data_processing
python seq2vec.py

Fuse protein structure and sequence data and divide the dataset

cd ./model
python labels_load.py
cd ./data_processing
python divide_data.py
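
As a rough illustration of dividing a dataset, the sketch below shuffles a list of protein IDs with a fixed seed and splits it into three parts. It is not the logic of divide_data.py. The function name, the train/validation/test scheme, the 80/10/10 fractions, and the seed are all assumptions.

```python
import random


def split_dataset(protein_ids, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle protein IDs reproducibly and split them into three lists.

    Returns (train, valid, test). The fractions and the seed are
    illustrative assumptions, not the values used by the repository.
    """
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    # Sort first so the result does not depend on the input order.
    ids = sorted(protein_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_valid = int(len(ids) * valid_frac)
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]  # whatever remains
    return train, valid, test
```

Using a seeded random.Random instance keeps the split reproducible across runs without touching the global random state.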