gpauloski/kfac-pytorch

Weird Behavior With 2 GPUs and mpiexec // horovodrun


Hi,

I am trying to run "pytorch_cifar10_resnet.py" on the "kfac-lw" branch (I'm interested in this branch only, for now).
I am using 2 GPUs of the same kind.
As a sanity check that the code runs correctly, I am using --kfac-update-freq 0 to keep the test simple.

I am looking at

  1. hvd.size(). I check this by adding a "print('\nhvd.size() is = {}\n'.format(hvd.size()))" line right after the "hvd.init()" line in "pytorch_cifar10_resnet.py" (see the snippet after this list).

  2. Comparing the execution speed with 2 GPUs vs 1 GPU, by (2.1) checking which epoch and what percentage of it the training reaches within 5 minutes, and (2.2) using the noisier iter/s measurement from the provided tqdm prints.
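For reference, the sanity-check print from point 1 is simply the following, placed right after hvd.init() in pytorch_cifar10_resnet.py (a minimal sketch of the change I made):

import horovod.torch as hvd

hvd.init()
# sanity check: in a correct 2-GPU run, every process should report size 2
print('\nhvd.size() is = {}\n'.format(hvd.size()))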

I have tried to run the code with mpiexec as you suggest, but the behaviour seems wrong.
Because of that, I have also tried running it with horovodrun, although I think I've seen you mention somewhere that not using mpiexec would be slow.
To understand why the behaviour seems wrong to me, compare the following 3 runs:

A) (1 GPU benchmark) mpiexec with -N 1 on a single GPU gives:
1) hvd.size gives 1, the print happens once;
(2.1) gets to epoch 7 and 19% in 5:00 from the time the first line of code is run;
(2.2) I'm getting about 11-12.5 iter/s

B) (2 GPUs with mpiexec) mpiexec with -N 2 on 2 GPUs gives:
1) hvd.size gives 1, the print happens 2 times;
(2.1) gets to epoch 7 and 14% in 5:00 from the time the first line of code is run;
(2.2) I'm getting about 11-12.5 iter/s

C) (2 GPUs with horovodrun) horovodrun -np 2 on 2 GPUs gives:
1) hvd.size gives 2, the print happens once;
(2.1) gets to epoch 5 and 74% in 5:00 from the time the first line of code is run;
(2.2) I'm getting about 5-6 iter/s but some outputs report as low as approx 2 iter/s

All the measurements were repeated over multiple runs and were always virtually the same. All GPUs are of the same kind.


Points to note:

  • While horovodrun -np 2 gives the expected hvd.size behaviour, it is actually slower, and thus not useful.
  • mpiexec -N 2, on the other hand, seems to create 2 separate models and training step sequences that never communicate with each other. In a nutshell, it seems that the command in B is perfectly equivalent to running the command in A twice in parallel. This is suggested by the zero speedup from "mpiexec -N 1" to "mpiexec -N 2", by the 2 hvd.size prints of 1, and, furthermore, by the fact that with "mpiexec -N 2" all of your prints happen twice, despite the fact that, from looking at your code, only the "master GPU" should ever print. (A minimal rank/size check illustrating this is sketched after this list.)
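To make the second point concrete, the minimal rank/size check I have in mind (my own sketch, not code from the repo) is the following, again placed right after hvd.init():

import horovod.torch as hvd

hvd.init()
# In a correctly launched 2-GPU job this prints "rank 0 of 2" and "rank 1 of 2".
# If each process instead prints "rank 0 of 1", the launcher has started two
# independent single-GPU jobs that never communicate.
print('rank {} of {} (local rank {})'.format(hvd.rank(), hvd.size(), hvd.local_rank()))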

Questions:
Q1) Is this the right behaviour? Which behaviour would you expect?

Q2) Should I be getting hvd.size() equal to 1 printed once per GPU? If not (which is what I suspect), why am I getting this behaviour with "mpiexec -N 2", and how do I fix it?

Q3) Should I completely forget the idea to run with horovodrun?

Q4) I am doing all this on a cluster, and I've looked at your slurm submission files: the syntax is slightly different from the one I use, though you use SBATCH the same as I do. Could all these discrepancies arise from the SLURM system?


For completeness, please find the mpiexec and horovodrun commands below.

A) mpiexec -N 1:

mpiexec -hostfile /path/to/hostfile -N 1 python /path/to/file/pytorch_cifar10_resnet.py
--base-lr 0.1 --epochs 100 --kfac-update-freq 0 --model resnet32 --lr-decay 35 75 90

B) mpiexec -N 2:

mpiexec -N 2 -H localhost:2 python /path/to/file/pytorch_cifar10_resnet.py
--base-lr 0.1 --epochs 100 --kfac-update-freq 0 --model resnet32 --lr-decay 35 75 90

C) horovodrun -np 2:

horovodrun -np 2 -H localhost:2 python /path/to/file/pytorch_cifar10_resnet.py
--base-lr 0.1 --epochs 100 --kfac-update-freq 0 --model resnet32 --lr-decay 35 75 90 --batch-size 128

Hi, @ConstantinPuiu. It does seem that horovod is not initializing correctly. Unfortunately, I have not used horovod in many years (not since I last committed to kfac-lw), so I am not sure what has changed.

I briefly tried to run the branch, but I was unable to install horovod 0.19.5 with HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod==0.19.5, even after modifying the HOROVOD CUDA environment variables. This is one of the many reasons I switched away from horovod, but I'm also using a much newer machine, so maybe the older horovod version is just incompatible.

A few recommendations:

  • I am not sure what software versions you are using, so you should try the versions from our SC20 paper. TL;DR: PyTorch 1.1, CUDA 10.{0,1}, horovod 0.19.
  • mpiexec arguments change between implementations, so you'll have to find the correct configuration yourself. The machine those example SBATCH files were written for is old enough that it has been decommissioned, so it's likely things are different now.
  • horovodrun being noticeably slower can be because it's using MPI or Gloo rather than NCCL for communication. Check the horovod docs for details; I know they provide a CLI tool to check which features are enabled (see the command after this list).
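For example (assuming a recent enough horovod version; if I remember correctly the flag has existed since around 0.18/0.19), the following prints which frameworks and collective backends (NCCL, MPI, Gloo) your horovod build was compiled with:

horovodrun --check-build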

If you have other questions I'm happy to answer as best I can, but unfortunately I don't have the time to support the kfac-lw/kfac-opt branches since they are very old. All of the functionality of those branches is available in main (just maybe under different names) and I actively maintain that branch.

For example, on main the command would be:

torchrun --standalone --nnodes 1 --nproc_per_node=2 examples/torch_cifar10_resnet.py --model resnet32 --batch-size 128 --epochs 100 --base-lr 0.1 --lr-decay 35 75 90 --kfac-inv-update-steps 0

If you want to reproduce the kfac-lw algorithm, you can use --kfac-inv-update-steps 10 --kfac-strategy mem-opt.
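Putting those flags together with the command above (just combining the pieces mentioned here; double-check the flag names against the examples on main):

torchrun --standalone --nnodes 1 --nproc_per_node=2 examples/torch_cifar10_resnet.py --model resnet32 --batch-size 128 --epochs 100 --base-lr 0.1 --lr-decay 35 75 90 --kfac-inv-update-steps 10 --kfac-strategy mem-opt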

I'll go ahead and close this, but if you try out main and have other questions, feel free to open another issue.