Network diverges after some time
melgor opened this issue · 4 comments
Sometimes when I use 'optnet' or 'shareGradInput' from fb.resnet, the network diverges.
I mean that I train the network for some epochs, then suddenly I get a gradient explosion (with no change in the learning parameters). On the other hand, without it, and with a 2x smaller batch size and a smaller learning rate, everything works.
This problem is rare with 'optnet', but I hit it sometimes. I have logs from 2 types of networks:
VGG (fine-tuning) and ResNet (only the variant without batchnorm).
Example output:
https://gist.github.com/melgor/236668ba7ba27a8efd63152e8dfedd16
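For reference, this is roughly how the two memory-saving modes get switched on in my training script (a minimal sketch; the helper name, the 224x224 sample size, and the option table are illustrative and may differ from the real fb.resnet.torch code):

```lua
require 'cunn'
local optnet = require 'optnet'

-- Hypothetical helper: pick one of the two memory-saving modes for a model.
local function applyMemoryOptimization(model, useOptnet)
   if useOptnet then
      -- optnet analyses the graph on a sample batch and shares intermediate buffers
      local sampleInput = torch.zeros(4, 3, 224, 224):cuda()  -- illustrative size
      optnet.optimizeMemory(model, sampleInput, {inplace = false, mode = 'training'})
   else
      -- fb.resnet.torch's shareGradInput instead shares gradInput storages between
      -- modules of the same type (implemented in its models/init.lua, not shown here)
   end
   return model
end
```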
Have you ever encountered such a problem?
@melgor can you check that gradients are correct with inplace ops and optnet with this script? https://gist.github.com/szagoruyko/120b1c84ca3df532a597ca1f4db655dd
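(Not the gist itself, just a sketch of the kind of check it performs: compare the gradients of an optnet-optimized copy of the network against an untouched copy on the same input. It assumes optnet.optimizeMemory and a model with a single tensor output; the function name is made up.)

```lua
require 'nn'
local optnet = require 'optnet'

-- Hypothetical check: do optimized and unoptimized copies produce the same gradients?
local function checkGradients(model, input)
   local reference = model:clone()
   local optimized = model:clone()
   optnet.optimizeMemory(optimized, input, {mode = 'training'})

   -- one forward/backward pass, return a copy of the flattened gradients
   local function gradsOf(net)
      local _, gradParams = net:getParameters()
      gradParams:zero()
      local output = net:forward(input)
      net:backward(input, output:clone():fill(1))  -- dummy gradOutput of ones
      return gradParams:clone()
   end

   local diff = (gradsOf(reference) - gradsOf(optimized)):abs():max()
   print('max abs gradient difference:', diff)
   return diff
end
```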
I ran that script with my model definition and everything seems to work fine; there are no errors in the gradients.
I forgot to mention: I use 2-GPU training; I don't know if that matters.
To make the script work, I had to delete this line from my model definition (if that matters):
model:get(1).gradInput = nil
I have the same issue when training the net without optnet, so it is not related to it. Maybe you can point out what could cause such a case:
LR = 0.06
momentum = 0.9
WD = 0.0005
optnet = false
and it diverges at epoch 2. If it happened in the first epoch, the LR would clearly be too high. But at epoch 2 my accuracy is > 20% and it suddenly blows up. Do you know where I should look for the cause of such a situation?
@melgor If you had BatchNorm, I'd say that your network had one layer for which the output was constantly 0, which makes the result diverge after some time. But that's not the case, so I can't say much more with only what you've described. The learning rate is still probably too high.
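For completeness, this is the kind of check I would run if BatchNorm were involved: flag any module whose output is identically zero on a representative batch (a minimal sketch, assuming a standard nn/cunn model; the function name is made up):

```lua
require 'cunn'

-- Flag modules whose output is all zeros on the given batch; a BatchNorm layer
-- fed by such a module is the failure mode described above.
local function findZeroOutputs(model, input)
   model:forward(input)
   for i, m in ipairs(model:listModules()) do
      if torch.isTensor(m.output) and m.output:nElement() > 0
         and m.output:clone():abs():max() == 0 then
         print(('module %d (%s) produces an all-zero output'):format(i, torch.type(m)))
      end
   end
end
```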
Anyway, this seems not to be related to optnet. I'm closing the issue.