PharML

PharML is a framework for predicting compound affinity for protein structures. It utilizes a novel Molecular-Highway Graph Neural Network (MH-GNN) architecture based on state-of-the-art techniques in deep learning. This repository contains the visualization, preprocessing, training, and inference code written in Python and C. In addition, we provide an ensemble of pre-trained models which can readily be used for quickly generating rank-ordered predictions of compound affinity relative to a given target. DISCLAIMER: Compounds predicted by PharML.Bind should not be used without consulting a doctor or pharmacist - all results should be considered unverified and used only as a starting point for further investigation. Use at your own risk!

Setup

Edit the conda environment script to reflect your system configuration

vim tools/environments/setup_conda_env.sh

-> Ensure you have cudatoolkit and appropriate drivers to match the conda environment -> See README under /tools/environments for more details
Activate the conda environment

source activate /path/to/conda-pharml-env

-> Install the following packages or do pip install -r tools/requirements.txt

[todo: list package version requirements and put those in tools/environments/requirements.txt]
Preprocess the Dataset

-> The examples directory is setup with scripts for a) preprocessing of COVID-19 structure PDB, and BindingDB FDA-approved compounds in SDF format b) preprocessing of the full BindingDB dataset c) visualization of structures with associated compounds

-> After preprocessing completes, you will have a directory containing -> data/lig: the ligand graph files -> data/nhg: the protein neighborhood graph (NHG) files -> data/pdb: the raw pdb files used to generate ligands and NHG -> data/map: the map file used for inference that specifies the ligand-to-target tests which will be tested

Running

Test Inference Across example map file

-> Launch with the following command to test against the COVID-19 6VSU structure and bindingDB's FDA-approved compound list generated in step 3

 python mldock_gnn.py \
     --map_test ../datasets/covid19/map/6VSU.map \
     --batch_size_test 16 \
     --mlp_latent ${MLP_LATENT} \
     --mlp_layers ${MLP_LAYERS} \
     --gnn_layers ${GNN_LAYERS} \
     --hvd True \
     --num_features ${NUM_FEATURES} \
     --data_threads 2 \
     --mode classification \
     --inference_only True
     --restore=../pretrain-models/mh-gnnx5-ensemble/ensemble_member_${n}/checkpoints/model0.ckpt" \
     --inference_out covid19_inference_${MAP_TEST_NAME}.out \
     --epochs 1 2>&1 |& tee log-covid19-${MAP_TEST_NAME}.out

-> Using the --inference_out options tells PharML.Bind to save the outputs to disk, indexed by the compound ID

Run Inference with each ensemble member to generate rank-ordered compound set

-> You can also use the runit_inference.sh script as follows:

 Note: This will launch one pre-trained ensemble member each iteration indexed by n. When looping over each memeber, it generates predictions for the compound set on target PDB ID set by MAP_TEST_NAME. MAP_TEST_NAME is the index of the structure list which by default is set by PDB_ID_LIST="6LZG 6VSB 6LU7"

 salloc -N 8 -n 64 --ntasks-per-node=8 -t 24:00:00

 #Start CUDA MPS Server on each of the Dense GPU nodes
 time srun --cpu_bind=none -p spider -C V100 -l -N 8 --ntasks-per-node=1 -u ./restart_mps.sh 2>&1 |& tee mps_result.txt

 #Start the inference run on a single model
 srun -c 4 --hint=multithread -C V100 -p spider -l -N 8 -n 64 --ntasks-per-node=8 --ntasks-per-socket=4 -u --cpu_bind=none python mldock_gnn.py \
 --map_test ../datasets/covid19/map/${MAP_TEST_NAME}.map \
 --batch_size_test 16 \
 --mlp_latent ${MLP_LATENT} \
 --mlp_layers ${MLP_LAYERS} \
 --gnn_layers ${GNN_LAYERS} \
 --hvd True \
 --num_features ${NUM_FEATURES} \
 --data_threads 2 \
 --mode classification \
 --inference_only \
 --restore=../pretrain-models/mh-gnnx5-ensemble/ensemble_member_${n}/checkpoints/model0.ckpt" \
 --inference_out covid19_inference_${MAP_TEST_NAME}.out \
 --epochs 1 2>&1 |& tee covid19-${MAP_TEST_NAME}.out

-> You should see the following output if you use SLURM workload manager to launch via srun command using the runit_inference.sh script:

 "Starting COVID-19 Structure Inference for 6VSB, 6LZG, 6LU7 for ensemble member 0..."
 "========================================================================================"
 "--> 6LZG: Spike receptor-binding domain complexed with its receptor ACE2: https://www.rcsb.org/structure/6LZG"
 "--> 6VSB: Prefusion 2019-nCoV spike glycoprotein with a single receptor-binding domain up: https://www.rcsb.org/structure/6vsb"
 "--> Starting at: 4/3/2020 12:00:00 PM CST"

-> Next, you will see PharML.Bind initialize and tensorflow will print some logging info

 0: WARNING:tensorflow:From mldock_gnn.py:57: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
 0: Initialization of horovod complete...
 0: Rank 0  is saving inference output to model0_dataset_inference_0.map
 0: ----------------------------------------
 0: PharML.Bind-GNN: Version 1.0.1 - Framework for Open Therapeutics with Graph Neural Networks.
 0: ----------------------------------------
 0: ============================================================================================
 0: ----------------------------------------
 0:   Developed by
 0: ----------------------------------------
 0: ----------------------------------------
 0:       Jacob Balma: jb.mt19937@gmail.com
 0: ----------------------------------------
 0: ----------------------------------------
 0:       Aaron Vose:  avose@aaronvose.net
 0: ----------------------------------------
 0: ----------------------------------------
 0:       Yuri Petersen: yuripeterson@gmail.com
 0: ----------------------------------------
 0: 
 0: ----------------------------------------
 0: This work is supported by collaboration with Cray, Inc, Medical University of South Carolina (MUSC) and Hewlett Packard Enterprise (HPE). 
 0: ----------------------------------------
 0: ----------------------------------------
 0: ============================================================================================
 0: ----------------------------------------
 0: Namespace(batch_size=2, batch_size_test=2, data_threads=4, debug=True, epochs=1, gnn_layers='5,5', hvd=True, inference_only=True, inference_out='model0_dataset_inference.map', lr_init=0.01, map_test='/lus/scratch/jbalma/DataSets/Binding/mldock/tools/covid19-fda-bindingdb/data-6vsb-bindingdb-fda/map/dataset.map', map_train='/lus/scratch/jbalma/avose_backup/data/map/l0_1pct_train.map', mlp_latent='32,32', mlp_layers='2,2', mode='classification', num_features='16,16', plot_history=False, restore='./pretrained-models/mh-gnnx5-ensemble/model_0/checkpoints/model0.ckpt', use_clr=False, use_fp16=False)
 0: ----------------------------------------
 0: Loading data.
 0: ----------------------------------------
 0: Loading map file as distributed dataset for 8 ranks:
 0:   Map file:          /lus/scratch/jbalma/data/6vsb-bindingdb-fda/map/l0_1pct_train.map
 0:   Map items:         23608
 0:   Worker chunk size: 2951
 0:   Number of items:   2951
 0:   Sum(target[0]):    1422.000000
 0:   Avg(target[0]):    0.481871
 0:   Total load time:   0.4
 0: Done processing map file.
 0: Loading map file as distributed dataset for 8 ranks:
 0:   Map file:          /lus/scratch/jbalma/data/6vsb-bindingdb-fda/map/dataset.map
 0:   Map items:         440
 0:   Worker chunk size: 55
 0:   Number of items:   55
 0:   Sum(target[0]):    55.000000
 0:   Avg(target[0]):    1.000000
 0:   Total load time:   0.0
 0: Done processing map file.

-> When inference finishes over all ensemble members, you will find the results stored to disk (if --inference_out was set) in the pharml/results/${MAP_TEST_NAME} directory

Evaluate and Visualize

Evaluate the results

-> Using the scripts provided in pharml/results, we can evaluate the rank-ordered compound set generated in the previous step

"[TODO: provide example of using the ensemble summary tool]"
Visualize the target with predicted compounds

-> Using mlvoxelizer, we can visualize the compounds, proteins, or both relative to one another (when active-site training was used)

"[TODO: Fill in an example of using mlvoxelizer for visualization of a covid-19 structure]"

jbalma/pharml

PharML

Setup

Running

Evaluate and Visualize