DeepPBS

Geometric deep learning of protein–DNA binding specificity

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

Try it out on our web server (CPU only)

Code ocean capsule

(Same code structure as on GitHub. If you wish to modify the code or change the input, copy the capsule into a new capsule that you own on Code Ocean.)

Installation

(should take 5-10 minutes with proper system setup)

1. Git clone the repository

git clone https://github.com/timkartar/DeepPBS

2. Install pythonic dependencies

We recommend installing via the conda package management tool. If you do not have conda, please refer to the conda installation instructions.

# gcc and cuda configs: gcc/12.3.0 cuda/12.2.1 (works with 12.2 and 12.1, just FYI)

conda create -n deeppbs_install python=3.10

conda init bash

conda activate deeppbs_install

# look here for other versions: https://pytorch.org/get-started/previous-versions/
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install torch_geometric

# look here for other versions: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html
pip install torch_scatter torch_sparse torch_cluster -f https://data.pyg.org/whl/torch-2.3.0+cu121.html

pip install -U --no-cache-dir biopython==1.83 logomaker matplotlib==3.5.2 networkx pandas==1.4.4 pdb2pqr scipy==1.14.1 seaborn==0.13.2 freesasa==2.2.1

3. Install DeepPBS

cd DeepPBS
pip install -e .

4. Third party packages

The preprocessing scripts depend on 3DNA and Curves. We provide the required packages in dependencies/bin and show how to source them in run/process/proc_source.sh. However, please refer to x3dna-v2.3-linux-64bit/x3dna-v2.3/license.txt for fair usage of this version of the 3DNA software.

Note: The installation was tested on Linux systems with CUDA 11.3 and CUDA 11.6; you may have to adjust the PyTorch version number for your system.

UPDATE (Feb 29, 2024): The latest version on github is tested on CUDA 12.2, PyTorch 2.3 and PyG 2.5. The .yml file has been updated accordingly.

The project was developed on PyG 2.0.1. Later versions of PyG have been backwards compatible so far, but we cannot guarantee stability on all versions. For more information, refer to the installation pages for PyTorch and PyG.
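Because the notes above tie the installation to specific PyTorch and PyG versions, it can help to print what actually ended up in the environment. The following is a minimal sketch, not part of DeepPBS: the helper name `installed_versions` is hypothetical, and it assumes the distributions are registered under the names used in the pip commands above.

```python
from importlib import metadata

# Distribution names taken from the install commands above (assumed to match
# the names pip recorded in this environment).
PACKAGES = ["torch", "torch_geometric", "torch_scatter", "torch_sparse", "torch_cluster"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the activated conda environment and compare the printed versions with the ones listed in the update note.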

Usage pipeline for pre-trained DeepPBS

An example pipeline for processing and prediction is as follows:

  1. cd run/process/
  2. Put the PDB files containing the biological assemblies of interest into the pdb directory
  3. Run ls pdb > input.txt
  4. Run ./process_and_predict.sh (you can parallelize the steps in this script through multiple job submissions)

This will process the listed PDB files and put the processed npz files into the npz directory.
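To check what a processed file contains without loading any arrays, note that an .npz file is a zip archive with one .npy member per stored array. The sketch below lists those member names using only the standard library. The function names are hypothetical, and it makes no assumption about which array names DeepPBS writes.

```python
import zipfile
from pathlib import Path


def npz_array_names(npz_path):
    """Return the array names stored in an .npz archive, without loading the data."""
    suffix = ".npy"
    with zipfile.ZipFile(npz_path) as archive:
        return sorted(
            member[: -len(suffix)]
            for member in archive.namelist()
            if member.endswith(suffix)
        )


def summarize_npz_dir(npz_dir="npz"):
    """Map each .npz file name in npz_dir to the list of array names it holds."""
    return {
        path.name: npz_array_names(path)
        for path in sorted(Path(npz_dir).glob("*.npz"))
    }


if __name__ == "__main__":
    # Assumes the script is run from run/process after processing has finished.
    for file_name, names in summarize_npz_dir("npz").items():
        print(file_name, names)
```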

Note: If you parallelize this script, make sure you create a separate working directory for each job. Otherwise, temporary files generated during processing may conflict.

It will then make predictions using the DeepPBS ensemble and put them in the output directory (in run/process). Combined pre-processing and inference time for one biological assembly is on the order of seconds (e.g., about 15-20 seconds for PDB ID 5x6g).
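For the parallel case mentioned in the note above, one way to prepare a separate working directory per job is sketched below. This is an illustration only: the `job_N` layout and the function name are hypothetical, and the sketch only prepares a pdb directory and an input.txt for each job. How process_and_predict.sh is then launched from each directory depends on your cluster setup and is not covered here.

```python
import shutil
from pathlib import Path


def make_job_dirs(pdb_dir, jobs_root, n_jobs):
    """Split PDB files across up to n_jobs directories, each with pdb/ and input.txt."""
    pdb_files = sorted(
        p for p in Path(pdb_dir).iterdir()
        if p.is_file() and not p.name.startswith(".")
    )
    job_dirs = []
    for job in range(n_jobs):
        chunk = pdb_files[job::n_jobs]  # round-robin assignment of files to jobs
        if not chunk:
            continue  # fewer files than jobs: skip empty jobs
        job_dir = Path(jobs_root) / f"job_{job}"
        (job_dir / "pdb").mkdir(parents=True, exist_ok=True)
        for pdb in chunk:
            shutil.copy2(pdb, job_dir / "pdb" / pdb.name)
        # Same content as `ls pdb > input.txt`, restricted to this job's files.
        (job_dir / "input.txt").write_text("".join(f"{p.name}\n" for p in chunk))
        job_dirs.append(job_dir)
    return job_dirs
```

Round-robin assignment keeps the jobs roughly balanced in file count; it does not account for differences in assembly size.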

Compute and Visualize perturbation based heavy atom interpretability

  1. cd run/process
  2. ./vis_interpret.sh <pdb_name_without .pdb>, for example ./vis_interpret.sh 5x6g

This will compute and store the perturbation outcomes and other required information in run/plot_scripts/interpret_output

  3. You need a PyMOL executable for this step. Once it is installed, you can run the following:
  • pymol (opens pymol GUI)
  • pip install matplotlib (in the pymol GUI command prompt)
  • close the pymol GUI
  • pymol ../plot_scripts/vis_interpret.py ../plot_scripts/ 5x6g.pdb (run from terminal)

This will open a pymol session for the visualization (screenshot below) and save a .psw file in run/plot_scripts/interpret_output

[Screenshot: 5x6g]
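The two commands in the steps above take the structure name in different forms: vis_interpret.sh expects it without the .pdb extension, while the PyMOL call expects the file name with it. A small sketch that builds both argument lists from a single file name may help avoid mixing them up. The function name is hypothetical, and the paths are copied from the steps above, so they assume the working directory is run/process.

```python
from pathlib import Path


def interpret_commands(pdb_file):
    """Build the two command lines from the steps above for one PDB file name."""
    name = Path(pdb_file).name
    stem = name[:-4] if name.lower().endswith(".pdb") else name
    # Step 2: the shell script takes the name without the .pdb extension.
    compute = ["./vis_interpret.sh", stem]
    # Final step: PyMOL runs the plotting script with the .pdb file name.
    visualize = [
        "pymol",
        "../plot_scripts/vis_interpret.py",
        "../plot_scripts/",
        f"{stem}.pdb",
    ]
    return compute, visualize


if __name__ == "__main__":
    compute, visualize = interpret_commands("5x6g.pdb")
    print(" ".join(compute))
    print(" ".join(visualize))
    # To execute instead of print, one could pass each list to
    # subprocess.run(..., check=True) from within run/process.
```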

Simulation trajectories stored as snapshots in PDB format can be processed in a similar manner:

[Figure: output]

Data availability

Figshare link: https://doi.org/10.6084/m9.figshare.25678053

Run training

Download item number 2 from Data availability, place it somewhere on your system, and configure its path in /run/config.json ("data_dir"). Also configure "output_path" as you wish.
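The two settings can be edited by hand, but a short script makes the change repeatable. The sketch below assumes that config.json is a JSON object with "data_dir" and "output_path" as top-level keys, which the source does not confirm. The function name and the example paths are hypothetical.

```python
import json
from pathlib import Path


def set_training_paths(config_path, data_dir, output_path):
    """Update "data_dir" and "output_path" in a JSON config, keeping other keys."""
    config_file = Path(config_path)
    config = json.loads(config_file.read_text())
    config["data_dir"] = str(data_dir)        # assumed top-level key
    config["output_path"] = str(output_path)  # assumed top-level key
    config_file.write_text(json.dumps(config, indent=4) + "\n")
    return config


if __name__ == "__main__":
    # Placeholder paths; replace them with locations on your own system.
    set_training_paths("run/config.json", "/path/to/data", "/path/to/output")
```

Rewriting the file this way normalizes its indentation, so keep a copy if the original layout matters to you.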

Run ./submit_cross.sh. This will submit 5 cross-validation models to train simultaneously. Modify this script according to your needs.