jpuigcerver/Laia

IAM database example fails

BeatWolf opened this issue · 2 comments

When trying to run the IAM database example, I get the following error:

laia.ImageDistorter {
dilate_rrate = 1
translate_stdv = 0.02
shear_prec = 4
rotate_prob = 0.5
erode_prob = 0.5
translate_prob = 0.5
erode_srate = 0.8
scale_prob = 0.5
erode_rrate = 1.2
dilate_srate = 0.4
dilate_prob = 0.5
rotate_prec = 100
scale_stdv = 0.12
shear_prob = 0.5
}
[2019-02-27 17:06:43 INFO] /opt/torch/share/lua/5.1/laia/CTCTrainer.lua:98: CTCTrainer uses the weight regularizer:
laia.WeightDecayRegularizer {
weight_l2_decay = 0
weight_l1_decay = 0
}
[2019-02-27 17:06:43 INFO] /opt/torch/share/lua/5.1/laia/CTCTrainer.lua:88: CTCTrainer uses the adversarial regularizer:
laia.AdversarialRegularizer {
adversarial_weight = 0
adversarial_epsilon = 0.0019607843137255
}
/opt/torch/bin/luajit: /opt/torch/share/lua/5.1/nn/Container.lua:67:
In 7 module of nn.Sequential:
/opt/torch/share/lua/5.1/cudnn/init.lua:166: Error in CuDNN: CUDNN_STATUS_EXECUTION_FAILED (cudnnSetDropoutDescriptor)
stack traceback:
[C]: in function 'error'
/opt/torch/share/lua/5.1/cudnn/init.lua:166: in function 'errcheck'
/opt/torch/share/lua/5.1/cudnn/RNN.lua:130: in function 'resetDropoutDescriptor'
/opt/torch/share/lua/5.1/cudnn/RNN.lua:526: in function </opt/torch/share/lua/5.1/cudnn/RNN.lua:449>
[C]: in function 'xpcall'
/opt/torch/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/opt/torch/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/opt/torch/share/lua/5.1/laia/CTCTrainer.lua:453: in function '_fbPass'
/opt/torch/share/lua/5.1/laia/CTCTrainer.lua:371: in function '_trainBatch'
/opt/torch/share/lua/5.1/laia/CTCTrainer.lua:307: in function 'opfunc'
/opt/torch/share/lua/5.1/optim/rmsprop.lua:35: in function '_optimizer'
/opt/torch/share/lua/5.1/laia/CTCTrainer.lua:305: in function 'trainEpoch'
/opt/torch/lib/luarocks/rocks/laia/scm-1/bin/laia-train-ctc:303: in main chunk
[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/opt/torch/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/opt/torch/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/opt/torch/share/lua/5.1/laia/CTCTrainer.lua:453: in function '_fbPass'
/opt/torch/share/lua/5.1/laia/CTCTrainer.lua:371: in function '_trainBatch'
/opt/torch/share/lua/5.1/laia/CTCTrainer.lua:307: in function 'opfunc'
/opt/torch/share/lua/5.1/optim/rmsprop.lua:35: in function '_optimizer'
/opt/torch/share/lua/5.1/laia/CTCTrainer.lua:305: in function 'trainEpoch'
/opt/torch/lib/luarocks/rocks/laia/scm-1/bin/laia-train-ctc:303: in main chunk
[C]: at 0x00405d50

Sadly, I could not find any information anywhere about what could be causing this error.
I was able to run the Spanish numbers example, but I encountered this same error when I tried to run the example with my own dataset.

Edit: As a clarification, I'm using the latest Docker image of Laia. I modified the scripts in the IAM steps folder to use the laia-docker commands instead of the normal ones.

How much memory does your GPU have? It might be a memory issue. If you do not have enough memory, you will have to reduce the batch size.
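
One way to answer the memory question is to ask nvidia-smi from inside the same container that runs the training. The following is only a sketch, assuming nvidia-smi and Python are available in that environment; the helper names parse_gpu_memory and query_gpu_memory are made up for illustration and are not part of Laia.

```python
import subprocess

# Standard nvidia-smi query: one CSV line per GPU, values in MiB, no header.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse the CSV text produced by QUERY into a list of dicts (MiB)."""
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so a comma in the GPU name cannot break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        total, used = int(total), int(used)
        gpus.append({
            "name": name,
            "total_mib": total,
            "used_mib": used,
            "free_mib": total - used,
        })
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory figures (hypothetical helper)."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_memory(output)
```

Calling query_gpu_memory() on the host and again inside the container, then comparing the total and free values, would show whether the container sees less memory than expected. If free memory is low before training starts, reducing the batch size is the first thing to try.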

Hi, it is a Tesla V100 with 16 GB of memory, so I hope memory is not the issue (unless nvidia-docker somehow limits the memory usage).
That said, I found out about PyLaia, which seems to be much easier to get running, so I hope I'll have more luck there.
Thank you for the fast answer.