The GraphKM toolbox is a Python package for predicting specific activity (SA) values.
These instructions assume that you use Miniconda or Anaconda. In a terminal, execute:
conda create -n GraphSA python=3.8
conda activate GraphSA
Required packages:
paddlehelix==1.0.1
pgl==2.2.4
paddlepaddle-gpu==2.3.2
matplotlib
scikit-learn
rdkit
PubChemPy
hyperopt==0.2.7
ESM
Note: paddlepaddle-gpu==2.3.2 is installed from the command line with:
conda install paddlepaddle-gpu==2.3.2 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
Please refer to the ESM GitHub repository for ESM installation instructions.
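Once the environment is set up, one way to confirm that the pinned versions were actually installed is to query the package metadata. The following is a minimal sketch, not part of the toolbox. The helper name `check_pins` is hypothetical, and the sketch assumes the installed distribution names match the entries in the list above (this may not hold for every package, for example ESM or rdkit, depending on how they were installed).

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements list above.
PINNED = {
    "paddlehelix": "1.0.1",
    "pgl": "2.2.4",
    "paddlepaddle-gpu": "2.3.2",
    "hyperopt": "0.2.7",
}


def check_pins(pins):
    """Return {name: (expected, installed or None)} for every mismatch."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    # Print one line per package that is missing or has a different version.
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package was found with the expected version.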
Before data preprocessing, a json file and a csv file should be ready. The csv file is generated from the json file by generate_esm_vector_gpu.py. Run the following commands:
python generate_esm_vector_gpu.py -i my_data.json -o sequences_embeddings.csv
python data_preprocess.py -i my_data.json -l SA -input_seq sequences_embeddings.csv -o my_dataset.npz
python train.py -d path_to/my_dataset.npz --model_config path_to/gin_config.json -l SA --model_dir path_to/ --results_dir path_to/
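The embedding, preprocessing, and training steps can also be chained from Python so that file names stay consistent between them. This is only a sketch under stated assumptions: the helper names `build_pipeline` and `run_pipeline` are hypothetical, the flags are copied from the commands in this section, the embeddings file written by the first step is assumed to be the one passed to -input_seq, and the same label (SA) is assumed for preprocessing and training.

```python
import shlex
import subprocess
import sys


def build_pipeline(json_file, embeddings_csv, dataset_npz, model_config, out_dir):
    """Return the three commands from this section as argument lists."""
    py = sys.executable  # run the scripts with the active interpreter
    return [
        # Step 1: compute ESM embeddings for the sequences in the json file.
        [py, "generate_esm_vector_gpu.py", "-i", json_file, "-o", embeddings_csv],
        # Step 2: build the .npz dataset from the json file and the embeddings.
        [py, "data_preprocess.py", "-i", json_file, "-l", "SA",
         "-input_seq", embeddings_csv, "-o", dataset_npz],
        # Step 3: train a model on the dataset.
        [py, "train.py", "-d", dataset_npz, "--model_config", model_config,
         "-l", "SA", "--model_dir", out_dir, "--results_dir", out_dir],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        print("running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Calling `run_pipeline(build_pipeline(...))` from the directory that contains the scripts would execute the steps in order. Building the commands separately makes it possible to print and review them before anything runs.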
Methods | MSE | r.m.s.e. | R2 |
---|---|---|---|
GIN-based | 0.967 | 0.984 | 0.718 |
GCN-based | 0.925 | 0.962 | 0.730 |
GAT-based | 0.872 | 0.934 | 0.745 |
GAT_GCN-based | 0.869 | 0.932 | 0.746 |
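For readers comparing their own runs against the table, the sketch below shows how the three metrics are conventionally computed. It assumes the standard definitions (the r.m.s.e. is the square root of the MSE, and R2 is one minus the ratio of residual to total sum of squares). The source does not state how the toolbox computes them or on what scale the SA values are expressed, and the function name `regression_metrics` and the numbers used are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean of the targets.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot


# Made-up values for illustration only; they are not from the benchmark.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```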
The input for prediction.py:
- If you want to predict SA values for different sequences paired with different substrate SMILES codes, use a csv file as input. For the format of the csv file, refer to example.csv. Command line example for prediction:
python prediction.py -c --csv_file example.csv -l SA -input_seq example.tsv -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config gin_config.json
- If you want to predict SA values for different sequences paired with a single substrate SMILES code, use a FASTA file as input. Command line example for prediction:
python prediction.py -l SA -f --fasta_file example.fasta -input_seq my_sequences_embeddings.tsv -S substrate.txt -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config path_to/gin_config.json
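Before running a FASTA-based prediction, it can help to check that the file parses into the expected number of sequences. The reader below is a generic sketch of the FASTA format, not the parser used by prediction.py. The function name `read_fasta` and the demo records are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record example; real input would come from open("example.fasta").
demo = [">seq1 demo protein", "MKTAYIAK", "QRQISFVK", ">seq2", "MSDNE"]
for name, seq in read_fasta(demo):
    print(name, len(seq))
```

Because an open file object yields lines, `read_fasta(open("example.fasta"))` would work the same way on a real file.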
Use the -h flag for more help:
python data_preprocess.py -h
python train.py -h
python train_xgb.py -h
python prediction.py -h