NEWS!!! Our paper has recently been accepted by Advanced Science (impact factor = 17.5).
This is the official code for ProtMD, a novel pre-training method based on the trajectories of protein-ligand pairs generated by molecular dynamics simulations. ProtMD consists of two stages: first, pre-training the molecule encoder; second, transferring it to downstream tasks via fine-tuning or linear probing. For more details, please refer to our paper on arXiv.
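For readers unfamiliar with the transfer stage, the sketch below contrasts linear probing with fine-tuning in generic PyTorch terms. The encoder, head, and checkpoint handling are illustrative placeholders, not the actual ProtMD modules.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; the real ProtMD encoder and task head are defined in this repo.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
head = nn.Linear(64, 1)  # e.g. a binding-affinity regression head
# In practice the encoder weights would be restored from the pre-trained checkpoint.

# Linear probing: freeze the encoder and train only the head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Fine-tuning instead: keep requires_grad=True for the encoder and optimize both
# modules jointly, typically with a smaller learning rate, e.g.
# torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
```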
python==3.9.10
torch==1.10.2+cu113
Start by setting up a virtual environment.
conda create -n md python=3.9
source activate md
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install einops
# for visualization
pip install pandas matplotlib seaborn scipy
pip install -U scikit-learn
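To confirm the environment matches the versions listed above (PyTorch 1.10.2 built against CUDA 11.3), a quick sanity check:

```python
import torch

print(torch.__version__)          # expected: 1.10.2+cu113
print(torch.cuda.is_available())  # should be True on a machine with CUDA 11.3
print(torch.cuda.device_count())  # number of visible GPUs
```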
Unfortunately, the MD data is currently not publicly available. We will release it as soon as possible once our paper gets accepted. The following programs and packages are convenient ways to process the MD trajectories.
- GROMACS
  - Download the zipped [file](https://manual.gromacs.org/documentation/2021.5/download.html).
  - Install GROMACS following [this](https://manual.gromacs.org/documentation/5.1/install-guide/index.html).
- MDAnalysis (highly recommended!); a minimal loading example follows this list.
  - Tutorial
  - `conda install -c conda-forge mdanalysis`
  - `conda install -c conda-forge MDAnalysisTests` (for sample data)
- mdtraj
  - `conda install -c conda-forge mdtraj`
- VMD
  - Register and download VMD.
  - Load the MD trajectories (.xtc file) into VMD along with its conformation (.gro file).
  - Save all the trajectories (.pdb file) at once.
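As a quick illustration of trajectory processing, here is a minimal MDAnalysis sketch that loads an .xtc trajectory together with its .gro conformation and iterates over frames; the file names are placeholders, not files shipped with this repository.

```python
import MDAnalysis as mda

# Placeholder file names; substitute your own topology/trajectory pair.
u = mda.Universe("complex.gro", "complex.xtc")

protein = u.select_atoms("protein")
for ts in u.trajectory:
    # Per-frame coordinates of the protein atoms as an (N, 3) array.
    coords = protein.positions
    print(ts.frame, coords.shape)
```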
Atom3d: https://github.com/drorlab/atom3d
- Install atom3d: `pip install atom3d`
- Download the 'split-by-sequence-identity-30' and 'split-by-sequence-identity-60' datasets from `https://www.atom3d.ai/`
- Preprocess the data by running `python pdbbind/dataloader_pdb.py`
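Once downloaded, the LMDB-formatted LBA data can be inspected with the atom3d package. A minimal sketch, assuming the default directory layout of the downloaded archive (the path below is a placeholder):

```python
from atom3d.datasets import LMDBDataset

# Placeholder path to the downloaded 'split-by-sequence-identity-30' LBA data.
dataset = LMDBDataset("lba/split-by-sequence-identity-30/data/train")

print(len(dataset))
item = dataset[0]
print(item.keys())  # dict-like entries holding atoms, labels, etc.
```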
!!! Attention
The 60% identity split has been changed in the official ATOM3D data source; the latest training and test sets contain 3563 and 452 samples, respectively. However, we followed the results of HoloProt, which used the original ATOM3D splits with 60% identity. It is therefore necessary to obtain the previous version of the splits, which is available in HoloProt's GitHub repository: https://github.com/vsomnath/holoprot . The training, validation, and test sets have 3678, 460, and 460 samples, respectively.
Then simply use the indices in the identity60_split.json file to split the PDBbind dataset.
Updated! The author of HoloProt told me: 'Indeed there were some pdb files missing - some of them were just corrupted and surface associated binaries could not be used on them so they were left out. But I did notice that the test set had only a couple of files missing and decided to train the models on a limited training set.' This means the dataset used by HoloProt has some issues...
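For reference, here is a hedged sketch of applying the HoloProt split file to a PDBbind-style dataset. The JSON layout (keys 'train'/'valid'/'test' holding PDB identifiers) and the 'id' field are assumptions; check them against the actual identity60_split.json from the HoloProt repository.

```python
import json

# Assumed layout: {"train": [...], "valid": [...], "test": [...]} with PDB codes.
with open("identity60_split.json") as f:
    split = json.load(f)

train_ids = set(split["train"])  # ~3678 entries in the HoloProt version
valid_ids = set(split["valid"])  # ~460 entries
test_ids = set(split["test"])    # ~460 entries

def subset(dataset, ids):
    # 'dataset' is assumed to be a sequence of dicts carrying an 'id' field.
    return [entry for entry in dataset if entry["id"] in ids]
```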
Atom3d data: https://zenodo.org/record/4914734
Equibind preprocessed dataset: https://zenodo.org/record/6034088#.Yk0ehHpBxPY .
There are several different implementations and reproductions of diverse methods. We mainly refer
to three papers published in top AI conferences.
- ATOM3D: Tasks On Molecules in Three Dimensions (link)
- HoloProt: Multi-Scale Representation Learning on Proteins (link)
- MXMNet: Molecular Mechanics-Driven Graph Neural Network with Multiplex Graph for Molecular Structures (link)
Tutorial: https://ailab120.synology.me/pdf/Multi-GPU%20Training%20on%20single%20node.pdf .
I encountered a problem where PyTorch works well with 4 GPUs but gets stuck with 2-GPU parallelism.
import torch
from torch.nn.parallel import DistributedDataParallel

args.gpu = '0,1,2,3'
# Introduction to distributed training (in Chinese): https://zhuanlan.zhihu.com/p/86441879
torch.distributed.init_process_group(backend='nccl', init_method='tcp://localhost:23456', rank=0, world_size=1)
# Note the difference between nn.DataParallel and nn.DistributedDataParallel: https://discuss.pytorch.org/t/dataparallel-vs-distributeddataparallel/77891/4
# Errors when some model outputs do not participate in the loss computation: https://github.com/pytorch/pytorch/issues/43259#
model = DistributedDataParallel(model, device_ids=[int(x) for x in args.gpu.split(',')], find_unused_parameters=True)
The GitHub repository: https://github.com/NVIDIA/apex .
The main API: https://nvidia.github.io/apex/parallel.html .
ResNet training example: https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py .
Some posts reporting a similar issue: https://discuss.pytorch.org/t/distrubuteddataparallel-and-dataparallel-hangs-in-specified-model/74322
This person resolved the problem by adopting AMP, but it did not work for me.
https://discuss.pytorch.org/t/dataparallel-and-distributeddataparallel-stuck-at-100-gpu-usage/125490/6
Someone recommends using DistributedDataParallel, but it still fails for me.
https://discuss.pytorch.org/t/nn-dataparallel-gets-stuck/125427/7 .
A potential solution:
pytorch/pytorch#1637 (comment) .
python pretrain.py --gpu=0,1,2,3
python main.py --data=lba --split=30 --linear_probe --pretrain=model.pt --gpu=0,1,2,3
If you are interested in our work and find it useful, please cite it!
@article{wu2022pre,
title={Pre-training of Deep Protein Models with Molecular Dynamics Simulations for Drug Binding},
author={Wu, Fang and Zhang, Qiang and Radev, Dragomir and Wang, Yuyang and Jin, Xurui and Jiang, Yinghui and Niu, Zhangming and Li, Stan Z},
journal={arXiv preprint arXiv:2204.08663},
year={2022}
}