Validation loss not computed
Is there a reason why validation loss is neither computed nor logged when the model is trained with more than one GPU?
IIRC I didn't manage to make it work for some reason, so I think I ended up running validation from a separate process - but also I didn't get to train long enough to overfit.
Could you share that validation script?
I'm using this GPT model to train on a different language altogether, so having the validation loss would be a great help!
If you pass the --only-validate option, the validation loss will be computed. The only caveat is that you need to make sure you're not using multiple GPUs (e.g. limit to one GPU with the CUDA_VISIBLE_DEVICES=0 environment variable):
Lines 251 to 256 in fa3f529
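For anyone scripting this, here is a minimal sketch of a wrapper that re-runs an existing training command in validation-only mode on a single GPU. Only the --only-validate flag and the CUDA_VISIBLE_DEVICES variable come from this thread; the helper name single_gpu_validation_command is made up for illustration, and the training command itself is whatever you normally run, passed in as arguments.

```python
import os
import subprocess
import sys


def single_gpu_validation_command(train_cmd, gpu_index=0, base_env=None):
    """Return (cmd, env) for a validation-only run pinned to one GPU.

    Hypothetical helper: train_cmd is your usual training command as a
    list of strings; nothing here is part of the project's own API.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Expose a single device so the multi-GPU code path is not taken.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    cmd = list(train_cmd)
    # Add the flag from this thread unless the caller already passed it.
    if "--only-validate" not in cmd:
        cmd.append("--only-validate")
    return cmd, env


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the training command.
    cmd, env = single_gpu_validation_command(sys.argv[1:])
    subprocess.run(cmd, env=env, check=True)
```

Saved as, say, validate_once.py, it would be invoked with your usual training command appended after the script name. This only works around the limitation; it does not make validation run during multi-GPU training.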
Got it! Thanks
Should I close this issue, given that the underlying problem of computing validation loss on multiple GPUs is still not solved?
Let's leave it open until it's supported. Thanks for the report, I hope this issue will be useful in the meantime.