NEWS!!! Our paper has recently been accepted by Advanced Science (impact factor = 17.5).
This is the official code for ProtMD, a novel pre-training method based on the trajectories of protein-ligand pairs generated by molecular dynamics simulations. ProtMD consists of two stages: first, pre-training the molecule encoder; second, transferring it to downstream tasks via fine-tuning or linear probing. For more details, please refer to our paper on arXiv.
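For readers unfamiliar with the transfer stage, the sketch below contrasts linear probing with fine-tuning in generic PyTorch terms. The encoder, head, and checkpoint handling are illustrative placeholders, not the actual ProtMD modules.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; the real ProtMD encoder and task head are defined in this repo.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
head = nn.Linear(64, 1)  # e.g. a binding-affinity regression head
# In practice the encoder weights would be restored from the pre-trained checkpoint.

# Linear probing: freeze the encoder and train only the head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Fine-tuning instead: keep requires_grad=True for the encoder and optimize both
# modules jointly, typically with a smaller learning rate, e.g.
# torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
```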
python==3.9.10
torch==1.10.2+cu113
Start by setting up a virtual environment.
conda create -n md python=3.9
source activate md
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install einops
# for visualization
pip install pandas matplotlib seaborn scipy
pip install -U scikit-learn
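To confirm the environment matches the versions listed above (PyTorch 1.10.2 built against CUDA 11.3), a quick sanity check:

```python
import torch

print(torch.__version__)          # expected: 1.10.2+cu113
print(torch.cuda.is_available())  # should be True on a machine with CUDA 11.3
print(torch.cuda.device_count())  # number of visible GPUs
```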
Unfortunately, the MD data is currently not publicly available. We will release it as soon as possible once our paper gets accepted. The following programs and packages are convenient ways to process the MD trajectories.
- GROMACS
  - Download the zipped [file](https://manual.gromacs.org/documentation/2021.5/download.html).
  - Install GROMACS following [this](https://manual.gromacs.org/documentation/5.1/install-guide/index.html).
- MDAnalysis (highly recommended!); a minimal loading example follows this list.
  - Tutorial
  - `conda install -c conda-forge mdanalysis`
  - `conda install -c conda-forge MDAnalysisTests` (for sample data)
- mdtraj
  - `conda install -c conda-forge mdtraj`
- VMD
  - Register and download VMD.
  - Load the MD trajectories (.xtc file) into VMD along with its conformation (.gro file).
  - Save all the trajectories (.pdb file) at once.
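As a quick illustration of trajectory processing, here is a minimal MDAnalysis sketch that loads an .xtc trajectory together with its .gro conformation and iterates over frames; the file names are placeholders, not files shipped with this repository.

```python
import MDAnalysis as mda

# Placeholder file names; substitute your own topology/trajectory pair.
u = mda.Universe("complex.gro", "complex.xtc")

protein = u.select_atoms("protein")
for ts in u.trajectory:
    # Per-frame coordinates of the protein atoms as an (N, 3) array.
    coords = protein.positions
    print(ts.frame, coords.shape)
```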
Atom3d: https://github.com/drorlab/atom3d
- Install atom3d: `pip install atom3d`
- Download the 'split-by-sequence-identity-30' and 'split-by-sequence-identity-60' datasets from `https://www.atom3d.ai/`
- Preprocess the data by running `python pdbbind/dataloader_pdb.py`
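Once downloaded, the LMDB-formatted LBA data can be inspected with the atom3d package. A minimal sketch, assuming the default directory layout of the downloaded archive (the path below is a placeholder):

```python
from atom3d.datasets import LMDBDataset

# Placeholder path to the downloaded 'split-by-sequence-identity-30' LBA data.
dataset = LMDBDataset("lba/split-by-sequence-identity-30/data/train")

print(len(dataset))
item = dataset[0]
print(item.keys())  # dict-like entries holding atoms, labels, etc.
```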
!!! Attention
The 60% identity split has been changed in the official ATOM3D data source; the latest training and test sets contain 3563 and 452 samples, respectively. However, we followed the results of HoloProt, which used the original ATOM3D splits with 60% identity. It is therefore necessary to obtain the previous version of the splits, which is available in HoloProt's GitHub repository: https://github.com/vsomnath/holoprot . The training, validation, and test sets have 3678, 460, and 460 samples, respectively.
Then simply use the indices in the identity60_split.json file to split the PDBbind dataset.
Updated! The author of HoloProt told me: 'Indeed there were some pdb files missing - some of them were just corrupted and surface associated binaries could not be used on them so they were left out. But I did notice that the test set had only a couple of files missing and decided to train the models on a limited training set.' This means the dataset used by HoloProt has some issues...
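For reference, here is a hedged sketch of applying the HoloProt split file to a PDBbind-style dataset. The JSON layout (keys 'train'/'valid'/'test' holding PDB identifiers) and the 'id' field are assumptions; check them against the actual identity60_split.json from the HoloProt repository.

```python
import json

# Assumed layout: {"train": [...], "valid": [...], "test": [...]} with PDB codes.
with open("identity60_split.json") as f:
    split = json.load(f)

train_ids = set(split["train"])  # ~3678 entries in the HoloProt version
valid_ids = set(split["valid"])  # ~460 entries
test_ids = set(split["test"])    # ~460 entries

def subset(dataset, ids):
    # 'dataset' is assumed to be a sequence of dicts carrying an 'id' field.
    return [entry for entry in dataset if entry["id"] in ids]
```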
Atom3d data: https://zenodo.org/record/4914734
Equibind preprocessed dataset: https://zenodo.org/record/6034088#.Yk0ehHpBxPY .
There are several different implementations and reproductions of diverse methods. We mainly refer
to three papers published in top AI conferences.
- ATOM3D: Tasks On Molecules in Three Dimensions (link)
- HoloProt: Multi-Scale Representation Learning on Proteins (link)
- MXMNet: Molecular Mechanics-Driven Graph Neural Network with Multiplex Graph for Molecular Structures (link)
Tutorial: https://ailab120.synology.me/pdf/Multi-GPU%20Training%20on%20single%20node.pdf .
I encountered a problem where PyTorch works well with 4 GPUs but gets stuck with 2-GPU parallelism.
import torch
from torch.nn.parallel import DistributedDataParallel

args.gpu = '0,1,2,3'
# Introduction to distributed training (in Chinese): https://zhuanlan.zhihu.com/p/86441879
torch.distributed.init_process_group(backend='nccl', init_method='tcp://localhost:23456', rank=0, world_size=1)
# Note the difference between nn.DataParallel and nn.DistributedDataParallel: https://discuss.pytorch.org/t/dataparallel-vs-distributeddataparallel/77891/4
# Errors when some model outputs do not participate in the loss computation: https://github.com/pytorch/pytorch/issues/43259#
model = DistributedDataParallel(model, device_ids=[int(x) for x in args.gpu.split(',')], find_unused_parameters=True)
The GitHub repository: https://github.com/NVIDIA/apex .
The main API: https://nvidia.github.io/apex/parallel.html .
ResNet training example: https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py .
Some posts reporting a similar issue: https://discuss.pytorch.org/t/distrubuteddataparallel-and-dataparallel-hangs-in-specified-model/74322
This person resolved the problem by adopting AMP, but it did not work for me.
https://discuss.pytorch.org/t/dataparallel-and-distributeddataparallel-stuck-at-100-gpu-usage/125490/6
Someone recommends using DistributedDataParallel, but it still fails for me.
https://discuss.pytorch.org/t/nn-dataparallel-gets-stuck/125427/7 .
A potential solution:
pytorch/pytorch#1637 (comment) .
python pretrain.py --gpu=0,1,2,3
python main.py --data=lba --split=30 --linear_probe --pretrain=model.pt --gpu=0,1,2,3
If you are interested in our work and find it useful, please cite it!
@article{wu2022pre,
title={Pre-training of Deep Protein Models with Molecular Dynamics Simulations for Drug Binding},
author={Wu, Fang and Zhang, Qiang and Radev, Dragomir and Wang, Yuyang and Jin, Xurui and Jiang, Yinghui and Niu, Zhangming and Li, Stan Z},
journal={arXiv preprint arXiv:2204.08663},
year={2022}
}