microsoft/UniVL

About multi-gpu loss calculation

forence opened this issue · 10 comments

Thanks for your nice work! I noticed that mean() is called when the program runs on multiple GPUs, but there is no gather operation. In other words, the loss in

loss = loss.mean()

is a scalar, not a list of tensors. Am I right?

Hi @forence, as far as I know, the gather operation is performed by PyTorch's torch.nn.parallel.DistributedDataParallel. You can print the loss to confirm it.

Hi Arrow, I printed the loss in a test on two GPUs. The results are as follows:
device: 1 loss: 0.20037230849266052
device: 0 loss: 0.19869431853294373
device: 0 loss: 0.2036360800266266
device: 1 loss: 0.2001209855079651
device: 0 loss: 0.20593053102493286
device: 1 loss: 0.20257243514060974
device: 1 loss: 0.19430749118328094
device: 0 loss: 0.19785669445991516
device: 1 loss: 0.19507986307144165
In my view, the mean() has no effect here, since there is no gather function that explicitly collects the losses from the GPUs; only the gradients from the different GPUs are synchronized automatically by PyTorch's DDP, as you mentioned. Am I missing something?

Hi @forence, you are right. I confused torch.nn.DataParallel with torch.nn.parallel.DistributedDataParallel. Thank you for pointing it out. The mean() is indeed redundant in our code. Thanks.
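To illustrate the conclusion above: under DistributedDataParallel each process computes its own scalar loss, so calling mean() on it changes nothing, and only the gradients are averaged across processes. If a cross-process average is wanted (for logging, say), it has to be requested explicitly. The following is a minimal sketch under that assumption; `global_mean_loss` is a hypothetical helper and is not part of the UniVL code:

```python
import torch
import torch.distributed as dist


def global_mean_loss(loss: torch.Tensor) -> torch.Tensor:
    """Return the loss averaged over all DDP processes (for logging only).

    Hypothetical helper, not taken from UniVL. Falls back to the local
    value when no process group is initialised (single-process run).
    """
    # Detach and clone so the collective never touches the autograd graph.
    averaged = loss.detach().clone()
    if dist.is_available() and dist.is_initialized():
        # Sum the per-process scalars in place, then divide by the
        # number of processes. With the NCCL backend the tensor must
        # live on the GPU, which is where a training loss normally is.
        dist.all_reduce(averaged, op=dist.ReduceOp.SUM)
        averaged /= dist.get_world_size()
    return averaged


# Under DDP the per-process loss is already a 0-dim tensor,
# so calling .mean() on it returns the same value.
local_loss = torch.tensor(0.2004)
assert local_loss.dim() == 0
assert torch.equal(local_loss.mean(), local_loss)
```

The averaged value is meant for reporting only; backpropagation should still start from the local `loss`, since DDP already averages the gradients.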

By the way, I am pre-training the model at stage one using maxMarginRankingLoss, and the loss is extremely low, about 0.002, at the beginning (bsz is 2048, gradient_accumulation_steps is 16). Is this normal? How can I judge when training is ready to move on to stage two?

Does "at the beginning" mean before the first epoch has finished? Our loss is not that small at the beginning. The important thing is whether the loss converges. Besides, what is your pre-training dataset, and why do you use maxMarginRankingLoss? I think the NCE loss will work better for pre-training.

Yes, 0.002 is the loss at the end of the 1st epoch. However, I do see the loss declining. I use maxMarginRankingLoss because our dataset has only one positive per sample.

A dataset with only one positive per sample can still use the NCE or CE loss. If you use maxMarginRankingLoss to pre-train in your setting, you need to set a larger learning rate, if I remember correctly. In my experience, the loss decreases quickly during the first epoch (see the log printed via here).
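As a rough illustration of the point that one positive per sample is enough for a CE-style loss, here is a sketch in plain Python that evaluates a margin ranking loss and a softmax cross-entropy loss on the same similarity matrix, with the diagonal entry as the only positive in each row and column. The function names, the margin of 0.1, and the normalisation are assumptions made for this example and do not reproduce the UniVL implementation:

```python
import math


def _log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def max_margin_ranking_loss(sim, margin=0.1):
    """Hinge loss: each negative should score at least `margin` below
    the positive. sim[i][j] is the similarity of video i and text j;
    the positive pair is i == j. Averaged over all negative terms."""
    n = len(sim)
    if n < 2:
        raise ValueError("need at least one negative per sample")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Negative texts for video i (row direction).
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Negative videos for text i (column direction).
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / (2 * n * (n - 1))


def cross_entropy_contrastive_loss(sim):
    """Softmax cross-entropy over each row and each column,
    with the diagonal entry as the single target class."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        row = sim[i]
        col = [sim[k][i] for k in range(n)]
        total += _log_sum_exp(row) - sim[i][i]
        total += _log_sum_exp(col) - sim[i][i]
    return total / (2 * n)


# Toy batch of two pairs; every positive beats its negatives by more
# than the margin, so the hinge loss is exactly zero.
sim = [[0.9, 0.1],
       [0.2, 0.8]]
hinge = max_margin_ranking_loss(sim)
ce = cross_entropy_contrastive_loss(sim)
```

On this toy batch the hinge loss is already zero, while the cross-entropy loss is still positive and keeps providing a gradient. This is a general property of the two loss shapes, not a diagnosis of the specific run discussed here.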

Oh right, I will try this later! Could you provide the typical loss range for each of the two stages for reference?

For your reference: 0.13 -> 0.02 and 0.12 -> 0.09 at the two stages. These numbers are not exact because our logs were damaged by machine problems. Once again, convergence is what matters most.

Thanks for your kind response! All the best :)