main.py: TensorBoard in case of Multi-processing Distributed Data Parallel Training
jecampagne opened this issue · 0 comments
Dear developers
It is so great that you've provided an examples/imagenet/main.py script; it looks amazing.
I'm looking into how to set up Multi-processing Distributed Data Parallel Training, for instance with 8 GPUs on a single node, though I can also use multiple nodes with multiple GPUs. I must say that I have never had such great infrastructure before, and I am discovering it as I go.
Now, I am used to watching the evolution of the accuracies (Top 1, Top 5, train/val) during training (rather common, isn't it?), but looking at the code (main.py) I do not see the usual
from torch.utils.tensorboard import SummaryWriter
...
writer = SummaryWriter(logs_dir)
...
nor the similar code used in the train/validate routines, like
if writer is not None:
    suffix = "train"
    writer.add_scalar(f'top5_{suffix}', top5.avg, global_step=epoch)
    writer.add_scalar(f'top1_{suffix}', top1.avg, global_step=epoch)
Now, in multi-GPU processing I imagine one has to deal with the question of which GPU among the whole set should/must do the job. But I am pretty sure that many experts do such things routinely.
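Here is roughly what I have in mind, as an untested sketch. I simply reuse the condition that main.py (if I read the code correctly) already uses to decide which process saves the checkpoints, so only that process creates a writer; args.rank, args.multiprocessing_distributed and ngpus_per_node are the names I see in main_worker, and logs_dir is just a placeholder of mine.

from torch.utils.tensorboard import SummaryWriter

def make_writer(args, ngpus_per_node, logs_dir):
    # Create a SummaryWriter only on the process that should do the logging
    # (the same process that main.py lets save the checkpoints, if I read it
    # correctly); every other process gets None and skips the add_scalar calls.
    if not args.multiprocessing_distributed or (
            args.multiprocessing_distributed and args.rank % ngpus_per_node == 0):
        return SummaryWriter(logs_dir)
    return None

def log_accuracies(writer, suffix, top1, top5, epoch):
    # Called at the end of train() (suffix="train") and validate() (suffix="val");
    # top1/top5 would be the AverageMeter objects already used in main.py.
    if writer is not None:
        writer.add_scalar(f'top1_{suffix}', top1.avg, global_step=epoch)
        writer.add_scalar(f'top5_{suffix}', top5.avg, global_step=epoch)

One thing I am not sure about: since each process only sees its own shard of the data, the top1.avg/top5.avg of that single process are only an approximation of the global accuracy. Maybe the meter values should be reduced across processes (torch.distributed.all_reduce) before logging, or maybe the single-process values are good enough for monitoring purposes.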
Is there a new version of main.py foreseen that would integrate such TensorBoard features in the case of Multi-processing Distributed Data Parallel Training? In the meanwhile, maybe someone can help set up such modifications.