GraphSA

Introduction

The GraphSA toolbox is a Python package for predicting specific activity (SA) values.

Requirements

The following assumes that you use Miniconda or Anaconda. In a terminal, execute:

conda create -n GraphSA python=3.8
conda activate GraphSA

Required packages:

paddlehelix==1.0.1
pgl==2.2.4
paddlepaddle-gpu==2.3.2
matplotlib
scikit-learn
rdkit
PubChemPy
hyperopt==0.2.7
ESM

Note: paddlepaddle-gpu==2.3.2 is installed from the command line with:

conda install paddlepaddle-gpu==2.3.2 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge

Please refer to the ESM GitHub site for ESM installation instructions.
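
After installation, it can be useful to confirm that the pinned versions listed above are the ones actually present in the environment. The following is a minimal sketch using only the standard library; `check_pins` is a hypothetical helper that is not part of GraphSA, and it assumes the distribution names match the package names in the requirements list.

```python
from importlib import metadata

# Pinned versions copied from the requirements list above.
PINNED = {
    "paddlehelix": "1.0.1",
    "pgl": "2.2.4",
    "paddlepaddle-gpu": "2.3.2",
    "hyperopt": "0.2.7",
}


def check_pins(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package.

    `found` is None when the package is not installed.
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run it inside the activated GraphSA environment; no output means every pinned package was found at the expected version.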

Input files

Before data preprocessing, a json file and a csv file should be ready. The csv file is generated from the json file by generate_esm_vector_gpu.py. Run the following command:

python generate_esm_vector_gpu.py -i my_data.json -o sequences_embeddings.csv

Train

Preprocess

python data_preprocess.py -i my_data.json -l SA -input_seq my_protein_sequences_embeddings.csv -o my_dataset.npz

Training

python train.py -d path_to/my_dataset.npz --model_config path_to/gin_config.json -l SA --model_dir path_to/ --results_dir path_to/
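
The embedding, preprocessing, and training steps can also be chained from a small driver script. The sketch below only assembles the command lines shown above and runs them in order; `build_commands` and `run_pipeline` are hypothetical helpers, not part of GraphSA, and all paths and the label value are supplied by the caller.

```python
import subprocess
import sys


def build_commands(data_json, embeddings_csv, dataset_npz, model_config, out_dir, label):
    """Return the three command lines (as argument lists) in run order.

    Only the flags shown in this README are used; `label` is passed
    unchanged to the -l option of data_preprocess.py and train.py.
    """
    return [
        [sys.executable, "generate_esm_vector_gpu.py",
         "-i", data_json, "-o", embeddings_csv],
        [sys.executable, "data_preprocess.py",
         "-i", data_json, "-l", label,
         "-input_seq", embeddings_csv, "-o", dataset_npz],
        [sys.executable, "train.py",
         "-d", dataset_npz, "--model_config", model_config, "-l", label,
         "--model_dir", out_dir, "--results_dir", out_dir],
    ]


def run_pipeline(commands):
    """Run each command in order; check=True stops at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline(build_commands("my_data.json", "sequences_embeddings.csv", "my_dataset.npz", "gin_config.json", "results", "SA"))` would run the three scripts from the repository root. Passing argument lists rather than a single shell string avoids quoting problems with paths that contain spaces.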

Training results

| Methods | MSE | r.m.s.e. | R2 |
|---|---|---|---|
| GIN-based | 0.967 | 0.984 | 0.718 |
| GCN-based | 0.925 | 0.962 | 0.730 |
| GAT-based | 0.872 | 0.934 | 0.745 |
| GAT_GCN-based | 0.869 | 0.932 | 0.746 |
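
For readers less familiar with the three columns, the sketch below computes them with their usual textbook definitions and checks that the reported r.m.s.e. values equal the square root of the reported MSE within rounding. This is an illustration only and is not necessarily how train.py computes its metrics; `regression_metrics` and `max_rmse_gap` are hypothetical helpers.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot


# Reported (MSE, r.m.s.e.) pairs copied from the table above.
REPORTED = {
    "GIN": (0.967, 0.984),
    "GCN": (0.925, 0.962),
    "GAT": (0.872, 0.934),
    "GAT_GCN": (0.869, 0.932),
}


def max_rmse_gap(reported):
    """Largest |sqrt(MSE) - r.m.s.e.| across the reported rows."""
    return max(abs(math.sqrt(mse) - rmse) for mse, rmse in reported.values())
```

Because both columns are rounded to three decimals, a gap just under 0.001 between `sqrt(MSE)` and the reported r.m.s.e. is expected.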

Prediction

The input for prediction.py:

If you want to predict SA values for different sequences paired with different substrate SMILES codes, use a csv file as input. For the format of the csv file, refer to the example.csv file. Command line example for prediction:
```
python prediction.py -c --csv_file example.csv -l SA -input_seq example.tsv -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config gin_config.json
```

If you want to predict SA values for different sequences that all share one substrate SMILES code, use a FASTA file as input.

Command line example for prediction:

python prediction.py -l SA -f --fasta_file example.fasta -input_seq my_sequences_embeddings.tsv -S substrate.txt -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config path_to/gin_config.json
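
Before running prediction on a FASTA file, it can help to confirm that the file parses into the headers and sequences you expect. The following is a minimal standard-library sketch; `read_fasta` is a hypothetical helper that is not part of GraphSA, and prediction.py may apply its own parsing rules.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples.

    Blank lines are skipped, and sequence lines that appear before the
    first '>' header are ignored.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Since an open file iterates over its lines, `read_fasta(open("example.fasta"))` returns the records directly; printing `len(records)` and the length of each sequence is a quick sanity check.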

Tip

Use the -h flag for more help.

python data_preprocess.py -h
python train.py -h
python train_xgb.py -h
python prediction.py -h
