Example for running REMD on multiple GPUs
msuruzhon opened this issue · 6 comments
Hi openmmtools developers and users,
I have recently been trying to run the ReplicaExchangeSampler using openmmtools, and I have been having problems trying to parallelise it over multiple GPUs. E.g. the following command:
mpirun -np 4 python script.py
This doesn't actually run the job on 4 GPUs, but rather just runs 4 copies of the same job. I was wondering if anybody could share a very simple reproducible example of a script that runs REMD in parallel - that would really help me investigate my problem. Currently I don't know if I am doing something wrong on the openmmtools front, or if there is something wrong with my MPI setup.
Apologies if there is already such an example that I haven't found. Many thanks in advance!
Ok I think I figured it out: it turns out that it wasn't running the same job after all, and part of my code needed @mpiplus.on_single_node(rank=0, broadcast_result=False, sync_nodes=False) to run correctly. Also, one needs the following file to make sure the processes are not run on the same GPU (let's call it gpu_bind.sh):
#!/usr/bin/bash
# Wrapper that pins each MPI rank to its own GPU (round-robin over TOTAL_GPUS).
export TOTAL_GPUS=2
# OMPI_COMM_WORLD_LOCAL_RANK is set by OpenMPI for every process on this node.
export PROC_ID=$OMPI_COMM_WORLD_LOCAL_RANK
# Expose only one GPU to this rank.
export CUDA_VISIBLE_DEVICES=$((PROC_ID % TOTAL_GPUS))
# Run the wrapped command (e.g. python script.py), keeping its arguments intact.
"$@"
And then one can run the following command (works with OpenMPI as well):
mpirun -np 2 ./gpu_bind.sh python script.py
I hope this helps anyone who might have similar problems.
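For anyone looking for a minimal script.py to pair with the commands above, here is a rough sketch along the lines of the parallel-tempering example in the openmmtools documentation; the test system, temperatures, move parameters and file names are illustrative and not taken from this thread:

# script.py - minimal parallel-tempering REMD sketch (illustrative parameters).
from openmm import unit  # on older OpenMM installations: from simtk import unit
from openmmtools import testsystems, states, mcmc
from openmmtools.multistate import ReplicaExchangeSampler, MultiStateReporter

# Small built-in test system so the example is self-contained.
testsystem = testsystems.AlanineDipeptideImplicit()

# One thermodynamic state (here: one temperature) per replica.
temperatures = [t * unit.kelvin for t in (300.0, 310.0, 321.0, 333.0)]
thermodynamic_states = [
    states.ThermodynamicState(system=testsystem.system, temperature=T)
    for T in temperatures
]

# Propagate each replica with Langevin dynamics between exchange attempts.
move = mcmc.LangevinDynamicsMove(
    timestep=2.0 * unit.femtoseconds,
    collision_rate=1.0 / unit.picoseconds,
    n_steps=500,
)

sampler = ReplicaExchangeSampler(mcmc_moves=move, number_of_iterations=100)
reporter = MultiStateReporter("repex.nc", checkpoint_interval=10)
sampler.create(
    thermodynamic_states=thermodynamic_states,
    sampler_states=states.SamplerState(testsystem.positions),
    storage=reporter,
)
sampler.run()

When this is launched with mpirun as above, the sampler distributes the replica propagation over the MPI ranks, and any extra setup that should only happen once can be wrapped with the @mpiplus.on_single_node(rank=0, broadcast_result=False, sync_nodes=False) decorator mentioned earlier.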
Hi @msuruzhon, I am trying to parallelize a REMD simulation across multiple GPUs as well and I came across your issue. I tried to use mpirun, but I feel like it was running the same job multiple times because it produced multiple *.nc files with different energies (each *.nc file contains the number of replicas I specified). I wonder how you can tell whether the job is successfully distributing replicas to different GPUs, and what kind of output it produces? Thank you so much!
Hi @xiaowei-xie2, did you use something along the lines of what I shared in my second post with gpu_bind.sh? As for testing it, the easiest way is to run the job in the background and then execute nvidia-smi in the foreground, which gives you a snapshot of the load on each GPU, so it's quite obvious when you are only utilising one.
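For example, a quick way to do that check (assuming the two-GPU gpu_bind.sh wrapper above and a script named script.py):

mpirun -np 2 ./gpu_bind.sh python script.py &   # run the REMD job in the background
watch -n 2 nvidia-smi                           # refresh the GPU utilisation view every 2 seconds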
Hi @msuruzhon, thank you for the response! I think I was able to get it to work by specifying a configfile like the following with mpiexec.hydra:

-np 1 -env CUDA_VISIBLE_DEVICES 0 python test.py :
-np 1 -env CUDA_VISIBLE_DEVICES 1 python test.py

It was using both GPUs when I ran nvidia-smi, and it only produces one *.nc file now (I don't remember what I did before that produced multiple *.nc files). But I only see a ~10% speed-up using 2 GPUs compared to 1. I know it probably depends on the system and force field, but I wonder whether you saw a large speed-up by parallelizing across multiple GPUs? I was using an ML force field and I was hoping to see a 2x speed-up.
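For completeness, and not from this thread: with the Hydra process manager such a configfile is typically passed via its -configfile option. A sketch, assuming the two-line configfile above is saved as config.txt:

mpiexec.hydra -configfile config.txt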
Hi @xiaowei-xie2, from what I remember the speedup was satisfactory for a classical force field, but I don't have exact numbers. The limited speed-up could be related to having an ML model; I think it might be worth creating a separate issue (I don't know how much the REMD implementation has been tested for this case).
Hi @msuruzhon, thank you for the information! I think my second GPU was not so good. I tried on another cluster and I am getting a 1.4x speed-up using 2 GPUs vs 1 GPU, and a 2.5x speed-up using 4 GPUs vs 1 GPU, which I guess could be reasonable. I will experiment more and see if I need to open another issue.