jean-zay-users/jean-zay-doc

Example tensorflow/pytorch script

zaccharieramzi opened this issue · 6 comments

It would be nice to have an MNIST-type example for both PyTorch and TensorFlow.
This would include the Python script as well as the SLURM file; a minimal sketch of the SLURM side is shown below.

These could potentially be expanded a little further, for example with the use of multiple GPUs.

I can volunteer to do the TensorFlow example.
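As a rough sketch of the SLURM side of such an example (the script name mnist_example.py, the module name and the resource values are placeholders, not an actual Jean Zay configuration):

```bash
#!/bin/bash
#SBATCH --job-name=mnist_example   # job name shown by squeue
#SBATCH --ntasks=1                 # a single process
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --cpus-per-task=10         # CPU cores for data loading (illustrative value)
#SBATCH --time=00:30:00            # wall-clock limit
#SBATCH --output=mnist_%j.out      # %j = job id

# placeholder module name, check module avail on the cluster
module load tensorflow-gpu/py3

# mnist_example.py is the hypothetical training script of the example
srun python mnist_example.py
```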

Thanks @zaccharieramzi for PR #11.

I suggest keeping the issue open until the PyTorch example is added.

Following an exchange I had with assist@idris.fr, it seems that the way I implemented the multi-GPU setup is not recommended and can lead to some unexplained errors (I ran into one).

Apparently it's better to use a configuration file (a bit like this) and call srun --multi-prog ./mpmd.conf. I will submit a PR soon to correct this, even if I find it a bit cumbersome to have two files per experiment (I sent an email about this, still waiting for a response).
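For illustration, here is a rough sketch of what the two files could look like; the script name, hyperparameters, module name and resource values are placeholders, and the exact GPU-binding options should be checked against the IDRIS documentation. The configuration file maps each task rank to the command that task should run:

```
0  python train_mnist.py --lr 1e-3
1  python train_mnist.py --lr 1e-4
```

and the batch file requests one task per line of that file and launches them all with a single srun call:

```bash
#!/bin/bash
#SBATCH --job-name=mnist_multi
#SBATCH --ntasks=2                 # one task per line of mpmd.conf
#SBATCH --gres=gpu:2               # one GPU per task
#SBATCH --time=00:30:00
#SBATCH --output=mnist_%j.out      # %j = job id

# placeholder module name, check module avail on the cluster
module load tensorflow-gpu/py3

srun --multi-prog ./mpmd.conf
```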

OK, I am not an expert on these topics, but it is great to see that there is some interaction with IDRIS user support to make our doc better! Hopefully this will also be a good way to help them become more familiar with the kind of issues AI users face.

My personal feeling (biased of course) is that user support sometimes tends to be less sensitive to "cumbersomeness" than we are, but it is an ongoing effort to make them more aware that it is actually super important for us.

As it turns out, there is something less cumbersome called #SBATCH --array. The solution I implemented recently was for processes that might communicate with each other. In this case, since we are only training the same model with potentially different hyperparameters, it would be more sensible to use #SBATCH --array. I will implement it, and hopefully it will be my last PR on this topic.
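A minimal sketch of the job-array alternative, assuming a hypothetical train_mnist.py script that takes the hyperparameter as a --lr flag (the learning-rate list, module name and resource values are only illustrative):

```bash
#!/bin/bash
#SBATCH --job-name=mnist_array
#SBATCH --array=0-2                # one independent job per hyperparameter value
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1               # each array task gets its own GPU
#SBATCH --time=00:30:00
#SBATCH --output=mnist_%A_%a.out   # %A = array job id, %a = array task id

# placeholder module name, check module avail on the cluster
module load tensorflow-gpu/py3

# pick the hyperparameter for this array task from a bash array
learning_rates=(1e-2 1e-3 1e-4)
lr=${learning_rates[$SLURM_ARRAY_TASK_ID]}

srun python train_mnist.py --lr "$lr"
```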

I propose a PyTorch example, similar to the MNIST-type example of PR #11 (currently WIP).