stephenyan1231/caffe-public

loss is nan during training HD-CNN

Opened this issue · 4 comments

Hi,
I saw this page: https://sites.google.com/site/homepagezhichengyan/home/hdcnn/code
and tried training on CIFAR-100, but during training the displayed loss is nan, even though the accuracy seems to improve little by little. Could you kindly explain this?

After the whole training phase, the accuracy ends up below 0.01.
To be specific, I am talking about this part:

Train a CNN using 'train_train' set as training data and 'train_val' set as testing data.
command: ./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh

Hi,

I observed a similar problem and found that cuDNN was to blame; it introduced bugs on the CIFAR-100 dataset that I could not pin down. I fixed it by disabling cuDNN, i.e. setting 'USE_CUDNN := 0' in Makefile.config and rebuilding. Can you try this?
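For reference, a minimal sketch of the relevant Makefile.config lines (not taken from the HD-CNN instructions; the exact comment wording may differ in your copy, and a clean rebuild is needed afterwards, e.g. 'make clean && make all -j8'):

    # cuDNN acceleration switch (uncomment to build with cuDNN).
    # Leave it commented out, or set it to 0, to build Caffe without cuDNN.
    # USE_CUDNN := 1
    USE_CUDNN := 0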

Thank you for your kind advice. In my setup, USE_CUDNN is already disabled (to be precise, the line is commented out).

By the way, I changed the number of GPUs used for training from 2 to 1, and the problem is solved (accuracy is 0.6). I think the multi-GPU part has a slight problem.

It might be a multi-GPU issue. Fortunately, on the CIFAR-100 dataset the training speed with a single GPU is fine.
Usually multi-GPU training only works when the two GPUs have peer-to-peer access, which can be verified with one of the examples in the CUDA samples installation folder.
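As a rough sketch (not code from this repo), a check along these lines reports whether each pair of GPUs can access the other's memory directly; the simpleP2P example in the CUDA samples performs a more thorough version of the same test:

    // p2p_check.cu -- minimal sketch (hypothetical file, not part of this repo):
    // reports whether each pair of GPUs has peer-to-peer access.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        if (n < 2) {
            std::printf("Found %d GPU(s); need at least 2 for a P2P check.\n", n);
            return 0;
        }
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                if (i == j) continue;
                int can = 0;
                // 'can' becomes 1 if device i can directly access device j's memory.
                cudaDeviceCanAccessPeer(&can, i, j);
                std::printf("GPU %d -> GPU %d : P2P %s\n", i, j,
                            can ? "supported" : "NOT supported");
            }
        }
        return 0;
    }

Compile it with nvcc (e.g. 'nvcc p2p_check.cu -o p2p_check') and run it on the training machine; if the two GPUs you trained with report "NOT supported", that would be consistent with the multi-GPU failure above.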

Thanks
Zhicheng "Stephen"
