wenwei202/caffe

One trivial modification saved my training

hiyijian opened this issue · 3 comments

Dear @wenwei202 ,
I found it very hard to converge with the group lasso term on the scnn branch. The task is face classification on my own data, and the loss is simply SoftmaxWithLoss. At the very beginning of training, the loss drops smoothly. At about 2K iterations, the loss suddenly jumps to 87.33, which according to my log is caused by some NaN weights. One more thing: I train from scratch.
I think this is really ridiculous. After digging into the code, I found that this line may be problematic. I changed it to if(res > 0), inspired by the CPU version of the same math function. After that, everything works normally.

Although I think the two checks are almost equivalent, the results are very different. The only explanation I can think of is numerical stability, but I don't know why or how. Could you please shed some light on this? Thank you.
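
A note on the 87.33 figure: it matches -log(FLT_MIN), where FLT_MIN is the smallest positive normal float32. The sketch below assumes the loss layer clamps the predicted probability at FLT_MIN before taking the log. The helper name `clamped_log_loss` is hypothetical and is not taken from the Caffe source.

```python
import math

# Smallest positive normal float32 (FLT_MIN in C), about 1.1754944e-38.
FLT_MIN = 2.0 ** -126


def clamped_log_loss(prob):
    """Negative log-likelihood with the probability clamped at FLT_MIN.

    The clamp is an assumption about how the loss layer avoids log(0).
    """
    return -math.log(max(prob, FLT_MIN))


# A probability that has collapsed to zero saturates the loss.
saturated_loss = clamped_log_loss(0.0)
print(round(saturated_loss, 4))  # 87.3365
```

Under that assumption, a loss stuck at about 87.33 means the probability of the true class has collapsed, which fits the NaN weights seen in the log.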

This is weird, since y is always positive. Numerical stability may be the issue. On the GPU, floating-point computation is not deterministic. In my experience, even with the same seed, the GPU does not reproduce the same accuracy/loss, which may be related.
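
One way two "almost equal" checks can differ even when the value is never negative is NaN: a NaN counts as nonzero, but every ordered comparison with NaN is false. The thread does not show the original condition, so the loose variant below is purely hypothetical. `group_lasso_grad` and `strict_guard` are illustrative names, and this is a plain-Python sketch, not the CUDA kernel.

```python
import math


def group_lasso_grad(weights, strict_guard=True):
    """Gradient of the L2 norm of one weight group: w_i / ||w||.

    strict_guard=True  mirrors a check like `if (res > 0)`.
    strict_guard=False mirrors a hypothetical nonzero check like `if (res)`.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    passes = (norm > 0) if strict_guard else bool(norm)
    if passes:
        return [w / norm for w in weights]
    # Guard failed: contribute no regularization gradient for this group.
    return [0.0 for _ in weights]


nan = float("nan")
loose = group_lasso_grad([nan, 1.0], strict_guard=False)  # NaN spreads to every entry
strict = group_lasso_grad([nan, 1.0], strict_guard=True)  # NaN is filtered out
print(loose, strict)
```

With the strict check a NaN norm yields a zero gradient instead of spreading further. This does not by itself explain where the NaN first came from.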

Yes, cuDNN is not deterministic. Specifically, the cuDNN backward API gives different gradients even when all inputs are exactly the same. However, this trivial modification makes training consistently normal across many experiments in my case.
Thank you
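
A common reason for run-to-run differences on a GPU is that floating-point addition is not associative, so a parallel reduction that accumulates in a different order can give a different sum. The minimal sketch below shows the effect in plain Python. It illustrates the arithmetic only and makes no claim about how cuDNN orders its reductions.

```python
def sum_in_order(values):
    """Accumulate left to right, as a sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two accumulation orders.
small_first = sum_in_order([1.0, 1e16, -1e16])  # the 1.0 is absorbed by 1e16
large_first = sum_in_order([1e16, -1e16, 1.0])  # the large terms cancel first
print(small_first, large_first)
```

The two results differ even though the inputs are identical, which is the kind of discrepancy that can accumulate over thousands of iterations.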

Thanks, I have committed the modification and mentioned this issue thread. You may review e0a58fd.