Unhandled exception of type 'St13runtime_error' at a late phase of training
yonatanbitton opened this issue · 2 comments
yonatanbitton commented
Hello. I'm running the code on a VM, and everything works until the training step.
After a few models have been trained, I get the following error:
[2019-08-17 08:16:54] Starting epoch 23
[2019-08-17 08:16:54] [sqlite] Selecting shuffled data
[2019-08-17 08:17:10] Ep. 23 : Up. 500 : Sen. 1,949 : Cost 1.56588244 : Time 99.42s : 1223.69 words/s : L.r. 1.0000e-04
[2019-08-17 08:17:12] Saving model weights and runtime parameters to ./models/EnVi.r2l.2/model.npz.best-ce-mean-words.npz
[2019-08-17 08:17:30] [valid] Ep. 23 : Up. 500 : ce-mean-words : 0.563603 : new best
[2019-08-17 08:17:55] Saving model weights and runtime parameters to ./models/EnVi.r2l.2/model.npz.best-translation.npz
[2019-08-17 08:18:04] Error: Unhandled exception of type 'St13runtime_error': npz_save: error saving to file: ./models/EnVi.r2l.2/model.npz.best-translation.npz
[2019-08-17 08:18:04] Error: Aborted from void unhandledException() in /home/jon/news-translit-nmt/tools/marian-dev/src/common/logging.cpp:107
[CALL STACK]
[0x6900c6]
[0x7f2f88ea86b6] + 0x8d6b6
[0x7f2f88ea8701] + 0x8d701
[0x7f2f88ea8919] + 0x8d919
[0x700fa3] marian::io:: saveItemsNpz (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&, std::vector<marian::io::Item,std::allocator<marian::io::Item>> const&) + 0x2113
[0x704c33] marian::io:: saveItems (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&, std::vector<marian::io::Item,std::allocator<marian::io::Item>> const&) + 0x303
[0x8f1fe0] marian::EncoderDecoder:: save (std::shared_ptr<marian::ExpressionGraph>, std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&, bool) + 0x140
[0x842244] marian::models::Stepwise:: save (std::shared_ptr<marian::ExpressionGraph>, std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&, bool) + 0x44
[0x988df8] marian::Validator<marian::data::Corpus>:: keepBest (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&) + 0x108
[0x989231] marian::Validator<marian::data::Corpus>:: updateStalled (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&, float) + 0x81
[0x9b3da6] marian::TranslationValidator:: validate (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&) + 0x49e6
[0x92db3f] marian::Scheduler:: validate (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&, bool) + 0x19f
[0x9529d2] marian::SyncGraphGroup:: update (std::vector<std::shared_ptr<marian::data::Batch>,std::allocator<std::shared_ptr<marian::data::Batch>>>, unsigned long) + 0xcb2
[0x953f96] marian::SyncGraphGroup:: update (std::shared_ptr<marian::data::Batch>) + 0x12f6
[0x66f46a] marian::Train<marian::SyncGraphGroup>:: run () + 0xbea
[0x59f86a] mainTrainer (int, char**) + 0x2ca
[0x57d38a] main + 0x8a
[0x7f2f88552830] __libc_start_main + 0xf0
[0x59d089] _start + 0x29
train-model.sh: line 49: 4015 Aborted (core dumped) $MARIAN/marian --devices $GPUS $OPTIONS --model $MODEL/model.npz --type s2s --train-sets $DATA/$LANGS.train.{src,trg} --vocabs $MODEL/vocab.yml $MODEL/vocab.yml --sqlite $MODEL/corpus.sqlite3 --max-length 80 --mini-batch-fit -w 3000 --mini-batch 100 --maxi-batch 1000 --best-deep --dropout-rnn 0.2 --dropout-src 0.2 --dropout-trg 0.1 --tied-embeddings-all --layer-normalization --exponential-smoothing --learn-rate 0.0001 --lr-decay 0.8 --lr-decay-strategy stalled --lr-decay-start 1 --lr-report --valid-freq 500 --save-freq 2000 --disp-freq 100 --valid-metrics ce-mean-words translation --valid-translation-output $MODEL/dev.out --quiet-translation --valid-sets $DATA/$LANGS.valid.{src,trg} --valid-script-path $MODEL/validate.sh --valid-mini-batch 64 --beam-size 10 --normalize 1.0 --early-stopping 10 --cost-type ce-mean-words --overwrite --keep-best --log $MODEL/train.log --valid-log $MODEL/valid.log
I do get some results when I run:
jon@jon:~/news-translit-nmt/experiments$ ./show-results.sh
ACC Fscore MRR MAPref
models/EnVi.1 0.4600 0.8726 0.5497 0.4600
models/EnVi.2 0.4800 0.8774 0.5701 0.4800
models/EnVi.3 0.4720 0.8801 0.5665 0.4720
models/EnVi.4 0.4600 0.8770 0.5562 0.4600
models/EnVi.r2l.1 0.4700 0.8804 0.5714 0.4700
But I can't run prediction, apparently because the ensemble file is missing:
jon@jon:~/news-translit-nmt/experiments$ head data/EnVi.dev.src | ./translate.sh EnVi file.tmp 0 > file.out
[2019-08-17 08:28:21] Error: File './models/EnVi.ens/ensemble.yml' does not exist
[2019-08-17 08:28:21] Error: Aborted from marian::io::InputFileStream::InputFileStream(const string&) in /home/jon/news-translit-nmt/tools/marian-dev/src/common/file_stream.h:139
[CALL STACK]
[0x59a434] marian::io::InputFileStream:: InputFileStream (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&) + 0x254
[0x609071] marian::ConfigParser:: loadConfigFiles (std::vector<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::allocator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>> const&) + 0xb1
[0x60bc23] marian::ConfigParser:: parseOptions (int, char**, bool) + 0x413
[0x5e2bd6] marian::Config:: initialize (int, char**, marian::cli::mode, bool) + 0x96
[0x5e52fd] marian::Config:: Config (int, char**, marian::cli::mode, bool) + 0x2d
[0x5e536d] marian:: parseOptions (int, char**, marian::cli::mode, bool) + 0x3d
[0x52badb] main + 0x3b
[0x7fe50ae06830] __libc_start_main + 0xf0
[0x542c49] _start + 0x29
./translate.sh: line 15: 5303 Done tee $PREFIX.in
5304 Aborted (core dumped) | $MARIAN/marian-decoder -c $MODEL.ens/ensemble.yml -d $GPUS --n-best --mini-batch 64 --maxi-batch 1000 --maxi-batch-sort src --quiet -w 4000 > $PREFIX.nbest.0
These are the models I have:
jon@jon:~/news-translit-nmt/experiments/models$ ls
EnVi.1 EnVi.2 EnVi.3 EnVi.4 EnVi.r2l.1 EnVi.r2l.2
How can I solve this? Thanks.
snukky commented
ensemble.yml is created automatically within ensemble.sh: https://github.com/snukky/news-translit-nmt/blob/master/experiments/ensemble.sh#L21
With regard to the training error: is there enough disk space? It looks like the first model has been saved successfully.
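A quick way to rule this out before restarting training is to check the free space on the filesystem that holds the model directory. The following is only a minimal sketch using Python's standard library; the helper names, the path, and the 2 GiB safety margin are assumptions for illustration and do not come from the repository or from Marian.

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


def format_gib(num_bytes: int) -> str:
    """Format a byte count as GiB with two decimals."""
    return f"{num_bytes / 2**30:.2f} GiB"


if __name__ == "__main__":
    # Assumption: run from the experiments directory; point this at ./models if it exists.
    target = "."
    # Assumption: 2 GiB is an arbitrary margin, not a figure taken from Marian.
    margin = 2 * 2**30

    free = shutil.disk_usage(target).free
    print(f"Free space on filesystem of {target!r}: {format_gib(free)}")
    if not has_free_space(target, margin):
        print(f"Less than {format_gib(margin)} free; saving checkpoints may fail.")
```

With --keep-best and two validation metrics, each model directory holds several checkpoint files, so the space needed grows with the number of models being trained.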
yonatanbitton commented
That's correct. The OS disk size was the problem. Thanks.