jean-zay-users/jean-zay-doc

FEAT - Multi node example

OlivierDehaene opened this issue · 4 comments

Feature

An example of a multi-node run in the PyTorch and/or TensorFlow tutorials.

Motivation

Implementing a multi-node job is both challenging and intimidating, especially on a cluster you have never used before. Adding a new tutorial covering this would go a long way for new users.

Pitch

A very simple CNN training example on MNIST, CIFAR10, or STL10 that can:

  • Distribute on multiple GPUs
  • Distribute on multiple nodes
  • Auto-requeue on Slurm QoS timeouts (optional)
  • Use mixed precision for maximum GPU utilization (optional)

I'm ready to work on a PR.
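For the multi-node and auto-requeue points above, the Slurm submission script could look along these lines. This is only a sketch: the job name, script name, resource values, and module setup are illustrative placeholders, not verified Jean Zay settings.

```shell
#!/bin/bash
#SBATCH --job-name=cnn-multinode   # hypothetical job name
#SBATCH --nodes=2                  # distribute over two nodes
#SBATCH --ntasks-per-node=4        # one task (process) per GPU
#SBATCH --gres=gpu:4               # request 4 GPUs per node
#SBATCH --time=01:00:00
#SBATCH --requeue                  # let Slurm requeue the job on QoS timeout

# srun starts one process per task on every allocated node;
# each process can read its rank from SLURM_PROCID.
srun python train.py
```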

Hi @OlivierDehaene,

That would definitely be a nice feature!

PR #24 implemented a TensorFlow version of a simple multi-node, multi-GPU training. The PR is still open (only a couple of modifications are needed before merging), but there has been no news from the main author for some weeks or months. I'll ask him whether he plans to continue working on this PR; otherwise I'll close it.

In the meantime, if you want to propose another PR using PyTorch in a multi-node/multi-GPU environment, I'll be glad to review it!

I think other people in the WILLOW team should be able to provide feedback, because they have used multi-GPU, multi-node training on Jean Zay with torch.distributed.

In general, I would say the simpler the example the better, so the auto-requeue and the mixed precision (I am guessing this is where you are using https://github.com/NVIDIA/apex, as you said on Gitter) could go in a different tip/example.

Having said that, I would do whatever is simplest for you at first. It is very nice of you to offer to do a PR, and I don't want to add unnecessary work if you already have a working example that does everything you listed.

Hi @OlivierDehaene @lesteve,

I am doing multi-node distributed training on Jean Zay using vanilla torch.distributed and SLURM. I will be happy to review the PR or to help if needed.
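For reference, the vanilla torch.distributed + SLURM setup usually boils down to reading the process layout from Slurm's environment variables before calling `init_process_group`. A minimal sketch (the fallbacks and default master address/port are assumptions for running outside Slurm, not Jean Zay specifics):

```python
import os

import torch
import torch.distributed as dist


def init_distributed():
    """Initialise torch.distributed from Slurm environment variables.

    Falls back to a single-process setup when not running under Slurm.
    """
    # Slurm exposes the global rank and world size to every task.
    rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NTASKS", 1))
    # All processes must agree on the rendezvous address; under Slurm
    # this is typically derived from the allocated node list.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # "nccl" is the usual backend on GPU nodes; "gloo" works on CPU.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return rank, world_size


if __name__ == "__main__":
    rank, world_size = init_distributed()
    t = torch.ones(1)
    dist.all_reduce(t)  # sums the tensor across all processes
    print(f"rank {rank}/{world_size}: sum = {t.item()}")
    dist.destroy_process_group()
```

Launched with `srun python train.py`, each task picks up its own rank automatically; run directly on one machine, it degrades to a single-process group.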

I will finally have time to work on the PR this week! Sorry for the delay.