/GPCR-public

Official repository for "A ligand image-based and structure-based representation deep learning framework identifies candidate therapeutics for treating chronic pain by targeting GPCRs"

GPCR

Initial Setup

This code has been tested on Python 3.7 to 3.10 and on PyTorch 1.12 to 2.0.1 (the latest version at the time of writing). Python versions newer than 3.11 or older than 3.6 are not supported. Adjust the installation commands to suit your machine and preferences.

Install Anaconda or Miniconda, then create and activate a conda environment:

conda create -n GPCR python=3.9 -y \
    && conda activate GPCR

Install packages:

If your machine has NVIDIA GPUs that support CUDA:

conda install pytorch torchvision pytorch-cuda=11.7 -c pytorch -c nvidia -y
python -m pip install pandas scikit-learn tqdm rdkit==2022.03.3 openpyxl xlrd tensorboardx pyyaml tensorboard matplotlib seaborn
python -m pip install grad-cam

Otherwise:

conda install pytorch torchvision cpuonly -c pytorch -y
python -m pip install pandas scikit-learn tqdm rdkit==2022.03.3 openpyxl xlrd tensorboardx pyyaml tensorboard matplotlib seaborn
python -m pip install grad-cam

Clone the repo to your machine

git clone https://github.com/yuxin212/GPCR-public.git
cd GPCR-public

Install this package in editable mode:

python -m pip install -e .
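
As a quick sanity check after installation, the following minimal sketch verifies that the interpreter falls within the tested Python range (3.7 to 3.10) and reports whether PyTorch can see a CUDA device. The helper names `python_version_tested` and `describe_environment` are illustrative and are not part of this repository.

```python
import sys


def python_version_tested(version_info=sys.version_info):
    """Return True if the interpreter is within the tested range (3.7 to 3.10)."""
    return (3, 7) <= tuple(version_info[:2]) <= (3, 10)


def describe_environment():
    """Summarize the Python version and, if installed, the PyTorch setup."""
    lines = [
        "Python {}.{} (within tested range: {})".format(
            sys.version_info[0],
            sys.version_info[1],
            "yes" if python_version_tested() else "no",
        )
    ]
    try:
        import torch  # installed by the conda commands above
    except ImportError:
        lines.append("PyTorch is not installed in this environment")
    else:
        lines.append("PyTorch {}".format(torch.__version__))
        lines.append("CUDA available: {}".format(torch.cuda.is_available()))
    return lines


if __name__ == "__main__":
    print("\n".join(describe_environment()))
```

If the CUDA line reports `False` on a GPU machine, the CPU-only build of PyTorch was probably installed.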

If you want to use AlphaFold2 to generate the intermediate representations yourself, note that AlphaFold2 runs only on Linux systems.

Install dependencies first:

python -m pip install absl-py==1.0.0 docker==5.0.0

Clone the modified AlphaFold2 repo:

git clone https://github.com/yuxin212/alphafold.git
cd alphafold

Follow "First time setup" and steps 1 and 2 of "Running AlphaFold" in that repository's README to prepare the AlphaFold2 dataset and build the Docker image.

Prepare the dataset

Our dataset consists of 2 parts: 1) Ligand images generated using RDKit and 2) GPCR representations generated using AlphaFold2. For each part, we provide both the processed data and raw data with scripts to process the raw data.

Download the processed data

We provide a script to download all the data and final models used in our paper. Before running it, change into this repository's root directory and make sure wget is installed.

  • Download all 4 datasets: Top-20 GPCR dataset, pain-related GPCR dataset, FDA-approved drugs dataset, and 380 gut-metabolites dataset:

    scripts/download_data.sh
  • To download only some of the datasets, or only the final trained models, use one or more of the following scripts:

    scripts/download_top20.sh # download Top-20 GPCR dataset only
    scripts/download_pain.sh # download pain-related GPCR dataset only
    scripts/download_fda.sh # download FDA-approved drugs dataset only
    scripts/download_gut.sh # download 380 gut-metabolites dataset only
    scripts/download_models.sh # download final trained models only

download_data.sh also downloads the required GPCR representations generated using AlphaFold2 and the final models used in our paper. After the download, the directory structure should look like this:

data/
├── representations/        # GPCR representations
│   ├── pain/               # Pain-related GPCR representations
│   ├── top20/              # Top-20 GPCR representations
├── ligands/                
│   ├── top20/              # Top-20 GPCR dataset
│   │   ├── anno/           # annotations for training set and test set
│   │   ├── imgs/           # ligand images
│   ├── pain/               # Pain-related GPCR dataset
│   │   ├── anno/           # annotations for training set and test set
│   │   ├── imgs/           # ligand images
├── pred/                   
│   ├── fda/                # FDA-approved drugs dataset
│   ├── gut/                # 380 gut-metabolites dataset
trained_models/
├── train/              
│   ├── original_models/    # final models used in our paper
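
To confirm that the download produced the layout above, a small check such as the following can be run from the repository root. This is a minimal sketch: the directory names are copied from the tree above, while the helper name `missing_directories` is illustrative and not part of this repository.

```python
from pathlib import Path

# Directories taken from the tree above, relative to the repository root.
EXPECTED_DIRS = [
    "data/representations/pain",
    "data/representations/top20",
    "data/ligands/top20/anno",
    "data/ligands/top20/imgs",
    "data/ligands/pain/anno",
    "data/ligands/pain/imgs",
    "data/pred/fda",
    "data/pred/gut",
    "trained_models/train/original_models",
]


def missing_directories(root="."):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_directories()
    if missing:
        print("Missing directories:")
        for d in missing:
            print("  " + d)
    else:
        print("All expected directories are present.")
```

If only some of the per-dataset scripts were run, the corresponding directories for the other datasets are expected to be reported as missing.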

Download and process raw dataset or customize dataset

If you wish to download and process the raw data by yourself, you can follow the instructions below.

  1. Download the raw data:

    scripts/download_raw_data.sh

    All the raw ligand data will be downloaded to data/original and fasta files of GPCRs will be downloaded to alphafold/fastas.

  2. Generate GPCR representations:

    Please note:

    1. If you already have the original AlphaFold2 Docker image on your machine and have built a second image from our modified AlphaFold2, you may need to specify the name of the modified image when running the following commands.

    2. Generating each representation with the original AlphaFold2 may take several hours, depending on your machine and the length of each GPCR's amino acid sequence. Our modified AlphaFold2 drastically reduces this time because it runs only one model and performs no recycling.

    Please specify $ALPHAFOLD_DATA_DIR and $ALPHAFOLD_OUTPUT_DIR as absolute paths.

    cd alphafold
    python docker/run_docker.py \
        --fasta_paths=fastas/GPCR1.fasta,fastas/GPCR2.fasta,... \
        --data_dir=$ALPHAFOLD_DATA_DIR \
        --output_dir=$ALPHAFOLD_OUTPUT_DIR \
        --max_template_date=2021-11-01
    
    python extract_representations.py \
        --data-dir $ALPHAFOLD_OUTPUT_DIR \
        --out-dir ../data/representations 
  3. Process the raw data to generate ligand images and datasets. Here we use the Top-20 GPCR dataset as an example:

    cd .. # go back to the root directory of this repo
    
    export DATASET=top20 # specify the dataset name
    export GPCR_COL=uniprot_id # specify the column name of GPCR in the raw data
    export SMILES_COL=smiles # specify the column name of SMILES in the raw data
    export LABEL_COL=pKi # specify the column name of label in the raw data
    export FILE_NAME_COL=inchi_key # specify the column name of file name in the raw data
    
    python scripts/prepare_dataset.py \
        --dataset data/raw/${DATASET}_raw.csv \
        --gpcr-col ${GPCR_COL} \
        --smiles-col ${SMILES_COL} \
        --label-col ${LABEL_COL} \
        --file-name-col ${FILE_NAME_COL} \
        --rep-path data/representations/${DATASET}/{}.npy \
        --save-path data/ligands/${DATASET}/imgs \
        --anno-path data/ligands/${DATASET}/anno \
        -j 12 \
        --test-size 0.3 \
        --task regression

    After running the above command, the annotations for the training and test sets are written to data/ligands/${DATASET}/anno and the ligand images are saved in data/ligands/${DATASET}/imgs.
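
Before running `prepare_dataset.py`, it can help to confirm that every GPCR in the raw CSV has a matching representation file. The sketch below assumes that the `{}` placeholder in `--rep-path` is filled with the value from the GPCR column (for example a UniProt ID); this is inferred from the command above, not confirmed from the script itself. The function name `find_missing_representations` is illustrative.

```python
import csv
from pathlib import Path


def find_missing_representations(csv_path, gpcr_col, rep_template):
    """Return sorted GPCR IDs from `csv_path` that lack a representation file.

    `rep_template` is a path containing one `{}` placeholder, e.g.
    "data/representations/top20/{}.npy". Assumption: the placeholder is
    replaced by the value of the GPCR column.
    """
    with open(csv_path, newline="") as handle:
        gpcr_ids = {row[gpcr_col] for row in csv.DictReader(handle)}
    return sorted(
        gpcr_id
        for gpcr_id in gpcr_ids
        if not Path(rep_template.format(gpcr_id)).is_file()
    )
```

For the Top-20 example, this could be called as `find_missing_representations("data/raw/top20_raw.csv", "uniprot_id", "data/representations/top20/{}.npy")`; an empty list means every GPCR in the CSV has a representation.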

Running model

Running model for inference

python scripts/prediction.py \
    --cfg configs/prediction/pain.yml \
    --data-dir data/pred/fda \
    --rep-path data/representations/pain/P08908.npy \
    --out-dir output/prediction/fda/P08908
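
To screen the same ligand set against several GPCRs, the command above can be repeated once per representation file. The sketch below only builds the argument lists, using the flags shown above. The helper name `build_prediction_command` is illustrative, and naming the output directory after the representation file follows the example rather than a documented requirement.

```python
import sys
from pathlib import Path


def build_prediction_command(rep_path, data_dir="data/pred/fda",
                             cfg="configs/prediction/pain.yml",
                             out_root="output/prediction/fda"):
    """Build the argument list for one scripts/prediction.py run.

    The output directory is named after the representation file stem
    (e.g. P08908), mirroring the example command.
    """
    rep_path = Path(rep_path)
    return [
        sys.executable, "scripts/prediction.py",
        "--cfg", cfg,
        "--data-dir", data_dir,
        "--rep-path", rep_path.as_posix(),
        "--out-dir", (Path(out_root) / rep_path.stem).as_posix(),
    ]


if __name__ == "__main__":
    # Print one command per representation found. To execute a command,
    # pass the list to subprocess.run(cmd, check=True) from the repo root.
    for rep in sorted(Path("data/representations/pain").glob("*.npy")):
        print(" ".join(build_prediction_command(rep)))
```

Building the command as a list rather than a single string avoids shell quoting issues if a path contains spaces.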

Inference output

Inference writes a single file, per_sample_result.json, to the --out-dir directory. The file contains the predicted value for each sample in the dataset:

[
    {
        "drug": "data/pred/fda/RUVINXPYWBROJD-ONEGZZNKSA-N.png",
        "protein": "data/representations/P08908.npy",
        "prediction": 1.234
    },
    {
        "drug": "data/pred/fda/GJSURZIOUXUGAL-UHFFFAOYSA-N.png",
        "protein": "data/representations/P08908.npy",
        "prediction": 5.678
    },
    ...
]
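
The result file can be post-processed with the standard library. The following minimal sketch loads it and ranks the entries by predicted value, using the `drug`, `protein`, and `prediction` keys shown above. It assumes that a higher predicted value indicates a stronger predicted interaction, which holds for pKi-style labels but should be checked for the model in use. The function names are illustrative.

```python
import json
from pathlib import Path


def load_results(out_dir):
    """Load per_sample_result.json from an inference output directory."""
    with open(Path(out_dir) / "per_sample_result.json") as handle:
        return json.load(handle)


def rank_predictions(results, top_k=10):
    """Return the `top_k` entries with the highest predicted value."""
    ranked = sorted(results, key=lambda r: r["prediction"], reverse=True)
    return ranked[:top_k]
```

For example, `rank_predictions(load_results("output/prediction/fda/P08908"))` returns the ten highest-scoring ligands for that GPCR; `Path(entry["drug"]).stem` recovers the image file name without its extension.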

Running model for training

Run the following commands to train and evaluate the model without cross-validation:

# training
python scripts/train.py \
    --cfg configs/train/<DATASET>.yml 

# evaluating
python scripts/train.py \
    --cfg configs/eval/<DATASET>.yml

Run the following commands to train and evaluate the model with cross-validation:

# training
python scripts/kfold_cv.py \
    --cfg configs/train/<DATASET>_10fold.yml 

# evaluating
python scripts/kfold_cv.py \
    --cfg configs/eval/<DATASET>_10fold.yml

Run the following commands to train and evaluate the model without using GPCR representations:

# training
python scripts/finetune.py \
    --cfg configs/train/<DATASET>_no_rep.yml 

# evaluating
python scripts/finetune.py \
    --cfg configs/eval/<DATASET>_no_rep.yml

Training output

Training output is saved in the trained_models/train directory. The logs folder contains the TensorBoard logs of the training, validation, and test results. If you are using cross-validation, the trained models are saved in the fold_0, fold_1, ..., fold_9 folders. If you are not using cross-validation.
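
After a cross-validation run, the per-fold folders can be collected programmatically. The sketch below searches for `fold_<n>` directories anywhere under `trained_models/train` and sorts them numerically. The exact nesting of these folders is not stated here, so the recursive search is an assumption, and the helper name `find_fold_dirs` is illustrative.

```python
import re
from pathlib import Path


def find_fold_dirs(root="trained_models/train"):
    """Return fold_<n> directories found anywhere under `root`, sorted by n."""
    folds = []
    for path in Path(root).rglob("fold_*"):
        match = re.fullmatch(r"fold_(\d+)", path.name)
        if match and path.is_dir():
            folds.append((int(match.group(1)), path))
    # Sort on the integer index so that fold_10 comes after fold_9.
    return [path for _, path in sorted(folds)]


if __name__ == "__main__":
    for fold_dir in find_fold_dirs():
        print(fold_dir)
```

Sorting on the parsed integer rather than the folder name keeps the order correct if a run uses more than ten folds.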