Training the example model fails with: Segmentation fault
Lingogo opened this issue · 2 comments
Lingogo commented
Hi:
When I train the de-en model with the command from the GitHub README, I get the following error:
| [en] Dictionary: 24738 types
| [de] Dictionary: 35474 types
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 160215 examples
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 7282 examples
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 6750 examples
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 7282 examples
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 6750 examples
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9601/cutorch/lib/THC/generic/THCTensorMath.cu line=26 error=77 : an illegal memory access was encountered
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9601/cutorch/lib/THC/generic/THCStorage.cu line=66 error=77 : an illegal memory access was encountered
/home/yulinlin/torch/install/bin/luajit: ...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 6 callback] /home/yulinlin/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
/home/yulinlin/torch/install/share/lua/5.1/nn/Dropout.lua:26: Creating MTGP constants failed. at /tmp/luarocks_cutorch-scm-1-9601/cutorch/lib/THC/THCTensorRandom.cu:33
stack traceback:
[C]: in function 'bernoulli'
/home/yulinlin/torch/install/share/lua/5.1/nn/Dropout.lua:26: in function </home/yulinlin/torch/install/share/lua/5.1/nn/Dropout.lua:17>
[C]: in function 'xpcall'
/home/yulinlin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/yulinlin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'func'
...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:370: in function <...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:347>
[C]: in function 'xpcall'
...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:41>
[C]: in function 'pcall'
...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:13: in main chunk
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/yulinlin/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...e/yulinlin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'func'
...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:370: in function <...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:347>
[C]: in function 'xpcall'
...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:41>
[C]: in function 'pcall'
...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
[C]: in function 'error'
...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:385: in function 'doTrain'
...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:189: in function 'train'
...in/torch/install/share/lua/5.1/fairseq/scripts/train.lua:410: in main chunk
[C]: in function 'require'
...rch/install/lib/luarocks/rocks/fairseq/scm-1/bin/fairseq:17: in main chunk
[C]: at 0x00406670
Segmentation fault
Does anyone know what might cause this?
jgehring commented
The backtrace points to an error in the nn.Dropout module. I can only guess, but are you maybe running out of GPU memory? Does your GPU work well for other use-cases?
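One way to check the memory question is to look at per-GPU usage just before and during training. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration and are not part of fairseq or Torch.

```python
import shutil
import subprocess

# Real nvidia-smi flags: query selected fields, print bare CSV values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines like '0, 1024, 8192' into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        try:
            index, used, total = (int(field) for field in fields)
        except ValueError:
            continue  # skip rows where a field is reported as e.g. "[N/A]"
        rows.append((index, used, total))
    return rows


def query_gpu_memory():
    """Hypothetical helper: return parsed rows, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, used, total in query_gpu_memory():
        free = total - used
        print(f"GPU {index}: {used} MiB used of {total} MiB ({free} MiB free)")
```

If a GPU already shows very little free memory before training starts, another process may be holding it, which would be consistent with the out-of-memory guess above. It would not prove it, though: an "illegal memory access" can also come from a driver or hardware problem.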