pytorch/examples

main.py: TensorBoard with Multi-processing Distributed Data Parallel Training

jecampagne opened this issue

Dear developers,
It is so great that you've provided the examples/imagenet/main.py script, which looks amazing.
I'm looking into how to set up Multi-processing Distributed Data Parallel Training, for instance with 8 GPUs on a single node, but I can also use multi-node, multi-GPU setups. I must say that I have never had access to such great infrastructure, which I am discovering as I go.

Now, I am used to watching the evolution of the accuracies (Top 1, Top 5, train/val) during training (rather common, isn't it?), but looking at the code (main.py) I do not see the usual

from torch.utils.tensorboard import SummaryWriter
...
    writer = SummaryWriter(logs_dir)
...

nor any similar code in the train/validate routines, such as

    if writer is not None:
        suffix = "train"
        writer.add_scalar(f'top5_{suffix}', top5.avg, global_step=epoch)
        writer.add_scalar(f'top1_{suffix}', top1.avg, global_step=epoch)

Now, for multi-GPU processing I imagine one has to decide which GPU (or rather which process) among the whole set should do the logging. But I am pretty sure that many experts do this routinely.
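
For what it's worth, here is a rough sketch of what I have in mind: create the writer in only one process, reusing the same kind of rank test that main.py (if I read it correctly) already uses to decide which process saves checkpoints. The logs_dir argument and the make_writer helper are just my own placeholders:

from torch.utils.tensorboard import SummaryWriter

def make_writer(logs_dir, rank, ngpus_per_node, multiprocessing_distributed):
    """Return a SummaryWriter in the logging process, None everywhere else.

    Only the first GPU of the first node logs, mirroring the condition that
    main.py uses when saving checkpoints.
    """
    if not multiprocessing_distributed or rank % ngpus_per_node == 0:
        return SummaryWriter(logs_dir)
    return None

In main_worker() one would then call writer = make_writer(...) once and pass writer down to train()/validate(), guarding every writer.add_scalar call with if writer is not None: as in the snippet above.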

Is a new version of main.py foreseen that would integrate such TensorBoard features for Multi-processing Distributed Data Parallel Training? In the meantime, maybe someone can help set up such modifications.
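
One more point I am unsure about: with DistributedSampler each process only sees its own shard of the validation set, so I suppose the Top 1 / Top 5 averages should be reduced across processes before the logging process writes them. Here is a sketch of what I mean, assuming the process group has already been initialized (as main.py does via dist.init_process_group); global_average is a hypothetical helper of mine:

import torch
import torch.distributed as dist

def global_average(local_avg, device):
    """Average a scalar metric (e.g. top1.avg) over all DDP processes.

    This is exact as long as every rank processes the same number of samples,
    which DistributedSampler arranges by padding the dataset.
    """
    t = torch.tensor([local_avg], dtype=torch.float64, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum the per-rank averages
    return (t / dist.get_world_size()).item()  # divide by the number of ranks

The validate() routine could then return global_average(top1.avg, ...) and the logging process would pass that value to writer.add_scalar. Does that sound like the right approach, or is there a more standard pattern?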