facebookresearch/multipathnet

FATAL THREAD PANIC - while training coco

estaudt opened this issue · 8 comments

I'm trying to train on the COCO dataset and I'm running into the following errors. When attempting to train with train_multipathnet_coco.sh, I see this:

train_nGPU=2 test_nGPU=1 ./scripts/train_multipathnet_coco.sh
...
model_opt
{
model_conv345_norm : true
model_foveal_exclude : -1
model_het : true
}
/home/elliot/torch/install/bin/luajit: /home/elliot/torch/install/share/lua/5.1/nn/Sequential.lua:29: index out of range
stack traceback:
[C]: in function 'error'
/home/elliot/torch/install/share/lua/5.1/nn/Sequential.lua:29: in function 'remove'
/home/elliot/Devel/multipathnet/models/multipathnet.lua:32: in main chunk
[C]: in function 'dofile'
train.lua:104: in main chunk
[C]: in function 'dofile'
...liot/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
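
The "index out of range" is raised by nn.Sequential:remove when it is asked to remove an index larger than the number of modules in the container. The traceback points at models/multipathnet.lua:32, which strips fixed layer indices from the pretrained network's classifier (the loop quoted further down in this thread, for i,v in ipairs{9,8,1} do classifier:remove(v) end); that only works if the loaded classifier actually has nine or more modules. A standalone illustration of the failure mode (not the repo's code):

require 'nn'

-- a classifier with only two modules, standing in for whatever the
-- pretrained model on disk actually contains
local classifier = nn.Sequential()
classifier:add(nn.Linear(10, 10))
classifier:add(nn.ReLU())

classifier:remove(9)  -- raises "index out of range", as in the traceback above

So the likely cause is that the pretrained ImageNet model being loaded is not the one the script expects, leaving its classifier with fewer layers than the hard-coded indices assume.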

When I attempt to train with train_coco.sh, I see this.

train_nGPU=1 test_nGPU=1 ./scripts/train_coco.sh
...
Loading proposals at {
1 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/train.t7"
2 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/val.t7"
}
Done loading proposals

proposal images 123287

dataset images 118287

images 123287

nImages 118287
PANIC: unprotected error in call to Lua API (not enough memory)

Changing train_nGPU=1 to train_nGPU=2 yields the same output but with a different error.
FATAL THREAD PANIC: (pcall) not enough memory
FATAL THREAD PANIC: (write) not enough memory

I'm running on Ubuntu 14.04 LTS with two Titan X GPUs and 64GB of RAM.
Any ideas?
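
The 64GB of system RAM is probably not the limiting factor here: the error comes from the Lua allocator, and stock LuaJIT on x86-64 caps each Lua state's heap at roughly 1-2GB unless it is built with GC64. Plain Lua tables and strings created while loading and serializing the proposals count against that cap, and the threads library gives every loader thread its own Lua state. A standalone sketch of that ceiling (illustrative only; it deliberately allocates until it fails):

-- keep allocating unique Lua strings until the Lua heap is exhausted; on a
-- stock x86-64 LuaJIT build this fails with "not enough memory" far below
-- the machine's physical RAM
local chunks, i = {}, 0
while true do
   i = i + 1
   chunks[i] = tostring(i) .. string.rep('x', 2^20)   -- roughly 1 MB per iteration
   if i % 256 == 0 then
      print(string.format('%.0f MB in Lua heap', collectgarbage('count') / 1024))
   end
end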

Try making nDonkeys smaller; that may help.

I can reproduce this and will try to fix it. In the meantime you can train on train instead of trainval; I think that should not have this problem.

Update: Changing trainval to train and nDonkeys from 6 to 4 worked.

I changed trainval to train in train_coco.sh and ran into the following error.

Loading proposals at {
1 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/train.t7"
}
Done loading proposals

proposal images 82783

dataset images 82783

images 82783

nImages 82783
/home/elliot/torch/install/bin/luajit: ...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 6 callback] not enough memory
stack traceback:
[C]: in function 'error'
...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:142: in function 'specific'
...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:125: in function 'Threads'
...are/lua/5.1/torchnet/dataset/paralleldatasetiterator.lua:85: in function '__init'
/home/elliot/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/elliot/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'getIterator'
train.lua:122: in main chunk
[C]: in function 'dofile'
...liot/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

However, when I then changed nDonkeys from 6 to 4, training commenced. I'm not actually sure what nDonkeys stands for. Regardless, thanks for the tips @szagoruyko and @northeastsquare.

@estaudt reducing nDonkeys turns off integral loss and increases data loading time
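
For reference, the "donkeys" are the background data-loading threads: nDonkeys is the thread count handed to torchnet's ParallelDatasetIterator, roughly as in the sketch below (the torchnet API is real; the surrounding names and the toy dataset are made up for illustration). Each thread runs the closure and builds its own copy of the dataset state, which is why fewer donkeys means less memory at the cost of slower data loading.

local tnt = require 'torchnet'

local nDonkeys = 4   -- number of background loader threads ("donkeys")

local iterator = tnt.ParallelDatasetIterator{
   nthread = nDonkeys,
   init    = function() require 'torchnet' end,   -- runs once in each thread
   closure = function()
      -- each thread constructs its own dataset, so any state it loads
      -- (e.g. proposal tables) is held nDonkeys times in total
      return tnt.ListDataset{
         list = torch.range(1, 16):long(),
         load = function(idx)
            return { input = torch.randn(3, 224, 224), target = torch.LongTensor{1} }
         end,
      }
   end,
}

for sample in iterator() do
   -- consume samples here
end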

Getting the same error when executing:

train_nGPU=1 test_nGPU=1 ./scripts/train_multipathnet_coco.sh
...
model_opt
{
model_conv345_norm : true
model_foveal_exclude : -1
model_het : true
}
/home/vijay/torch/install/bin/luajit: /home/vijay/torch/install/share/lua/5.1/nn/Sequential.lua:29: index out of range
stack traceback:
[C]: in function 'error'
/home/vijay/torch/install/share/lua/5.1/nn/Sequential.lua:29: in function 'remove'
...

Any fix?

I commented out this line in models/multipathnet.lua:
--for i,v in ipairs{9,8,1} do classifier:remove(v) end

Doing that results in the following :(

{
  1 : CudaTensor - size: 4x3x224x224
  2 : CudaTensor - empty
}

...

/home/demo/torch/install/bin/luajit: ./modules/ModelParallelTable.lua:357: ModelParallelTable only supports CudaTensor, not torch.FloatTensor
stack traceback:
[C]: in function 'error'
./modules/ModelParallelTable.lua:357: in function 'type'
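
Commenting the loop out avoids the crash but leaves the classifier with layers the rest of the pipeline does not expect, which is presumably why it then falls over with the FloatTensor/CudaTensor mismatch above. If the goal is just to get past the index-out-of-range error, a gentler tweak (an untested sketch, not a verified fix; the more likely root cause is that the pretrained model being loaded is not the one the script expects) is to strip only the indices that actually exist:

-- in models/multipathnet.lua, guard the removal instead of deleting it
for _, v in ipairs{9, 8, 1} do
   if v <= #classifier.modules then
      classifier:remove(v)
   end
end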

As another update, when I reduced nDonkeys, training seemed to run but spat out NaN for the loss and 0 for everything else.