choderalab/openmmtools

Example for running REMD on multiple GPUs

msuruzhon opened this issue · 6 comments

Hi openmmtools developers and users,

I have recently been trying to run the ReplicaExchangeSampler from openmmtools, and I have been having problems parallelising it over multiple GPUs. For example, the following command:

mpirun -np 4 python script.py

doesn't actually run the job across 4 GPUs, but rather just runs 4 copies of the same job. I was wondering if anybody could share a very simple reproducible example of a script that runs REMD in parallel - that would really help me investigate my problem. Currently I don't know whether I am doing something wrong on the openmmtools front, or whether there is something wrong with my MPI setup.

Apologies if there is already such an example that I haven't found. Many thanks in advance!

OK, I think I figured it out. It turns out that it wasn't running the same job after all, but part of my code needed to be decorated with @mpiplus.on_single_node(rank=0, broadcast_result=False, sync_nodes=False) to run correctly. One also needs a wrapper script like the following to make sure the processes don't all run on the same GPU (let's call it gpu_bind.sh):

#!/usr/bin/bash

# Total number of GPUs available on the node.
export TOTAL_GPUS=2

# Local MPI rank of this process (set by OpenMPI).
PROC_ID=$OMPI_COMM_WORLD_LOCAL_RANK

# Assign each rank to a different GPU, cycling through the available devices.
export CUDA_VISIBLE_DEVICES=$((PROC_ID % TOTAL_GPUS))

# Run the wrapped command (quoted so arguments with spaces survive).
exec "$@"

And then one can run the following command (works with OpenMPI as well):

mpirun -np 2 ./gpu_bind.sh python script.py

I hope this helps anyone that might have similar problems.
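
For completeness, here is a rough sketch of the kind of script.py this was used with, adapted from the replica-exchange example in the openmmtools documentation. Treat it as a starting point rather than a reference: the alanine dipeptide test system, the temperature ladder, the move settings and the remd.nc file name are all placeholders, and the decorated helper at the end is only there to show where @mpiplus.on_single_node fits in.

# script.py -- minimal REMD sketch with placeholder parameters.
from openmm import unit
import mpiplus
from openmmtools import testsystems, states, mcmc
from openmmtools.multistate import ReplicaExchangeSampler, MultiStateReporter

# Small test system so the example runs quickly.
testsystem = testsystems.AlanineDipeptideVacuum()

# One thermodynamic state (temperature) per replica.
n_replicas = 4
temperatures = [(300.0 + 10.0 * i) * unit.kelvin for i in range(n_replicas)]
thermodynamic_states = [states.ThermodynamicState(system=testsystem.system, temperature=T)
                        for T in temperatures]

# Propagation move applied between exchange attempts.
move = mcmc.LangevinDynamicsMove(timestep=2.0 * unit.femtoseconds,
                                 collision_rate=1.0 / unit.picoseconds,
                                 n_steps=500)

sampler = ReplicaExchangeSampler(mcmc_moves=move, number_of_iterations=100)
reporter = MultiStateReporter("remd.nc", checkpoint_interval=10)
sampler.create(thermodynamic_states=thermodynamic_states,
               sampler_states=states.SamplerState(testsystem.positions),
               storage=reporter)

# Any custom setup/analysis code should only run on one MPI rank.
@mpiplus.on_single_node(rank=0, broadcast_result=False, sync_nodes=False)
def report(message):
    print(message)

report("Starting replica exchange...")
sampler.run()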

Hi @msuruzhon, I am trying to parallelize a REMD simulation across multiple GPUs as well and I came across your issue. I tried to use mpirun, but I feel like it was running the same job multiple times, because it produced multiple *.nc files with different energies (each *.nc file contains the number of replicas I specified). I wonder how you can tell if the job is successfully distributing replicas to different GPUs, and what kind of output it produces? Thank you so much!

Hi @xiaowei-xie2, did you use something along the lines of what I shared in my second post with gpu_bind.sh? As for testing it, the easiest way is to run the job in the background and then execute nvidia-smi in the foreground, which gives you a snapshot of the load on each GPU, so it's quite obvious when you are only utilising one.
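
For example, something like watch -n 1 nvidia-smi (or nvidia-smi -l 1) in another terminal refreshes the utilisation of each card every second, which makes it easy to see whether all of them are actually busy.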

Hi @msuruzhon Thank you for the response! I think I was able to get it to work by specifying a configfile like the following with mpiexec.hydra. It was using both GPUs when I ran nvidia-smi, and it only produces one *.nc file now (I don't remember what I did before to produce multiple *.nc files). But I only see a ~10% speed-up using 2 GPUs compared to 1. I know it probably depends on the system and force field, but I wonder whether you saw a large speed-up by parallelizing across multiple GPUs? I was using an ML force field and I was hoping to see a 2x speed-up.

-np 1 -env CUDA_VISIBLE_DEVICES 0 python test.py :
-np 1 -env CUDA_VISIBLE_DEVICES 1 python test.py
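
(For anyone curious, assuming the two lines above are saved as something like config.txt, the launch command is then along the lines of: mpiexec.hydra -configfile config.txt)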

Hi @xiaowei-xie2, from what I remember the speed-up was satisfactory for a classical force field, but I don't have exact numbers. Your limited speed-up could be related to using an ML model; I think it might be worth creating a separate issue (I don't know how much the REMD implementation has been tested for this case).

Hi @msuruzhon, thank you for the information! I think my second GPU was not so good. I tried on another cluster and I am getting a 1.4x speed-up using 2 GPUs vs 1 GPU and a 2.5x speed-up using 4 GPUs vs 1 GPU, which I guess could be reasonable. I will experiment more and see if I need to open another issue.