error -9 when training caffe-alexnet model

Question


srinivasmangipudi opened this issue 7 years ago · 3 comments

The job ran for about 2 minutes, but at process #60 it crashes with the following error (see the attached image).

Answer 1 · 2018-03-21T14:41:00.000Z

I'm not entirely sure, but my guess is that either your system is running out of memory or it is picking incorrectly between CPU and GPU.

Could this be NVIDIA/DIGITS#1402?
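For what it's worth, if the -9 here is a subprocess-style return code (an assumption, since the screenshot isn't reproduced in this thread), Python reports "killed by signal N" as -N, so -9 would mean SIGKILL, which is what the kernel's out-of-memory killer sends. A minimal sketch for decoding such a code; `describe_exit_code` is a hypothetical helper, not part of DIGITS or Caffe:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable description."""
    if code < 0:
        # Python's subprocess module reports "killed by signal N" as -N.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-9))
```

A SIGKILL on its own doesn't prove an out-of-memory condition, but it makes that the first thing worth checking.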

Answer 2 · 2018-04-10T09:37:19.000Z

Reproduced the error on my Docker box. By default, Docker allocates 2 GB of memory for the pod on my MacBook, which is insufficient in this case. According to the DIGITS dashboard, the training uses about 3 GB of memory.

In my case, increasing the memory in the Docker preferences panel works. Click the Docker whale icon -> Preferences -> Advanced -> Memory, then increase the value accordingly.
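To confirm the new setting took effect, one option is to check how much memory the container actually sees. A rough sketch, assuming a Linux container where `/proc/meminfo` reflects the memory given to Docker; the 3 GiB threshold is only the approximate figure from the dashboard above, and the helper names are made up for illustration:

```python
from pathlib import Path

REQUIRED_GIB = 3.0  # approximate usage seen in the DIGITS dashboard (assumption)


def parse_mem_total_gib(meminfo_text: str) -> float:
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def has_enough_memory(meminfo_text: str, required_gib: float = REQUIRED_GIB) -> bool:
    """Check whether the visible memory meets the required amount."""
    return parse_mem_total_gib(meminfo_text) >= required_gib


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        text = meminfo.read_text()
        print(f"visible memory: {parse_mem_total_gib(text):.2f} GiB")
        print("enough for training:", has_enough_memory(text))
    else:
        print("/proc/meminfo not available on this system")
```

Run it inside the container rather than on the host. MemTotal is usually a little below the configured amount, so leave some headroom above the training's actual usage.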

Answer 3 · 2018-04-10T14:19:02.000Z

@ln3333 thanks for this, I've added a note and pushed it. Closing.

I'm in the process of rewriting this for TensorFlow and TensorFlow.js right now in #14, so I think further debugging of DIGITS issues isn't necessary.