lopuhin/transformer-lm

Validation loss not computed

Opened this issue · 5 comments

Is there a reason why the validation loss is neither computed nor logged when the model is trained with more than one GPU?

IIRC I didn't manage to make it work for some reason, so I ended up running validation from a separate process - but I also didn't train long enough to overfit.

Could you share that validation script?
I'm using this GPT model to train a different language altogether. Therefore, having the validation loss would be of great help!

If you pass the --only-validate option, the validation loss will be computed. The only caveat is that you need to make sure you're not using multiple GPUs (e.g. limit the run to one GPU with the CUDA_VISIBLE_DEVICES=0 environment variable):

transformer-lm/lm/main.py

Lines 251 to 256 in fa3f529

if only_validate:
    if world_size != 1:
        print('multi-GPU validation is not supported yet')
        sys.exit(1)
    if is_main:
        print(f'Validation loss: {get_valid_loss():.4f}')
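To make the snippet above concrete: the GPU restriction can be applied either in the shell or from Python before CUDA is initialized. This is an illustrative sketch, not part of the repo; the module path in the shell comment is inferred from lm/main.py, and your usual training arguments are elided.

```python
import os

# Restrict this process to a single GPU *before* torch/CUDA is initialized,
# so that world_size == 1 and the --only-validate branch above is taken.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Equivalent shell form (illustrative; pass your usual training arguments):
#   CUDA_VISIBLE_DEVICES=0 python -m lm.main ... --only-validate
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note that CUDA_VISIBLE_DEVICES has no effect once CUDA has already been initialized in the process, which is why it must be set first.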

Got it! Thanks
Should I close this issue, given that the underlying problem of multi-GPU validation is still not solved?

Let's leave it open until it's supported. Thanks for the report - I hope this issue will be useful in the meantime.