snukky/news-translit-nmt

Unhandled exception of type 'St13runtime_error' at a late phase of training

yonatanbitton opened this issue · 2 comments

Hello. I'm running the code on a VM, and everything works successfully up to the training part.
After training a few models, I receive the following error:

[2019-08-17 08:16:54] Starting epoch 23
[2019-08-17 08:16:54] [sqlite] Selecting shuffled data
[2019-08-17 08:17:10] Ep. 23 : Up. 500 : Sen. 1,949 : Cost 1.56588244 : Time 99.42s : 1223.69 words/s : L.r. 1.0000e-04
[2019-08-17 08:17:12] Saving model weights and runtime parameters to ./models/EnVi.r2l.2/model.npz.best-ce-mean-words.npz
[2019-08-17 08:17:30] [valid] Ep. 23 : Up. 500 : ce-mean-words : 0.563603 : new best
[2019-08-17 08:17:55] Saving model weights and runtime parameters to ./models/EnVi.r2l.2/model.npz.best-translation.npz
[2019-08-17 08:18:04] Error: Unhandled exception of type 'St13runtime_error': npz_save: error saving to file: ./models/EnVi.r2l.2/model.npz.best-translation.npz
[2019-08-17 08:18:04] Error: Aborted from void unhandledException() in /home/jon/news-translit-nmt/tools/marian-dev/src/common/logging.cpp:107

[CALL STACK]
[0x6900c6]
[0x7f2f88ea86b6]                                                       + 0x8d6b6
[0x7f2f88ea8701]                                                       + 0x8d701
[0x7f2f88ea8919]                                                       + 0x8d919
[0x700fa3]          marian::io::  saveItemsNpz  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  std::vector<marian::io::Item,std::allocator<marian::io::Item>> const&) + 0x2113
[0x704c33]          marian::io::  saveItems  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  std::vector<marian::io::Item,std::allocator<marian::io::Item>> const&) + 0x303
[0x8f1fe0]          marian::EncoderDecoder::  save  (std::shared_ptr<marian::ExpressionGraph>,  std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  bool) + 0x140
[0x842244]          marian::models::Stepwise::  save  (std::shared_ptr<marian::ExpressionGraph>,  std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  bool) + 0x44
[0x988df8]          marian::Validator<marian::data::Corpus>::  keepBest  (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&) + 0x108
[0x989231]          marian::Validator<marian::data::Corpus>::  updateStalled  (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&,  float) + 0x81
[0x9b3da6]          marian::TranslationValidator::  validate  (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&) + 0x49e6
[0x92db3f]          marian::Scheduler::  validate  (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&,  bool) + 0x19f
[0x9529d2]          marian::SyncGraphGroup::  update  (std::vector<std::shared_ptr<marian::data::Batch>,std::allocator<std::shared_ptr<marian::data::Batch>>>,  unsigned long) + 0xcb2
[0x953f96]          marian::SyncGraphGroup::  update  (std::shared_ptr<marian::data::Batch>) + 0x12f6
[0x66f46a]          marian::Train<marian::SyncGraphGroup>::  run  ()   + 0xbea
[0x59f86a]          mainTrainer  (int,  char**)                        + 0x2ca
[0x57d38a]          main                                               + 0x8a
[0x7f2f88552830]    __libc_start_main                                  + 0xf0
[0x59d089]          _start                                             + 0x29

train-model.sh: line 49:  4015 Aborted                 (core dumped) $MARIAN/marian --devices $GPUS $OPTIONS --model $MODEL/model.npz --type s2s --train-sets $DATA/$LANGS.train.{src,trg} --vocabs $MODEL/vocab.yml $MODEL/vocab.yml --sqlite $MODEL/corpus.sqlite3 --max-length 80 --mini-batch-fit -w 3000 --mini-batch 100 --maxi-batch 1000 --best-deep --dropout-rnn 0.2 --dropout-src 0.2 --dropout-trg 0.1 --tied-embeddings-all --layer-normalization --exponential-smoothing --learn-rate 0.0001 --lr-decay 0.8 --lr-decay-strategy stalled --lr-decay-start 1 --lr-report --valid-freq 500 --save-freq 2000 --disp-freq 100 --valid-metrics ce-mean-words translation --valid-translation-output $MODEL/dev.out --quiet-translation --valid-sets $DATA/$LANGS.valid.{src,trg} --valid-script-path $MODEL/validate.sh --valid-mini-batch 64 --beam-size 10 --normalize 1.0 --early-stopping 10 --cost-type ce-mean-words --overwrite --keep-best --log $MODEL/train.log --valid-log $MODEL/valid.log

I do get some results when running:

jon@jon:~/news-translit-nmt/experiments$  ./show-results.sh
                                                ACC     Fscore  MRR     MAPref
models/EnVi.1                                   0.4600  0.8726  0.5497  0.4600
models/EnVi.2                                   0.4800  0.8774  0.5701  0.4800
models/EnVi.3                                   0.4720  0.8801  0.5665  0.4720
models/EnVi.4                                   0.4600  0.8770  0.5562  0.4600
models/EnVi.r2l.1                               0.4700  0.8804  0.5714  0.4700

But I can't run prediction (it seems the reason is that I don't have the ensemble file):

jon@jon:~/news-translit-nmt/experiments$  head data/EnVi.dev.src | ./translate.sh EnVi file.tmp 0 > file.out
[2019-08-17 08:28:21] Error: File './models/EnVi.ens/ensemble.yml' does not exist
[2019-08-17 08:28:21] Error: Aborted from marian::io::InputFileStream::InputFileStream(const string&) in /home/jon/news-translit-nmt/tools/marian-dev/src/common/file_stream.h:139

[CALL STACK]
[0x59a434]          marian::io::InputFileStream::  InputFileStream  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&) + 0x254
[0x609071]          marian::ConfigParser::  loadConfigFiles  (std::vector<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::allocator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>> const&) + 0xb1
[0x60bc23]          marian::ConfigParser::  parseOptions  (int,  char**,  bool) + 0x413
[0x5e2bd6]          marian::Config::  initialize  (int,  char**,  marian::cli::mode,  bool) + 0x96
[0x5e52fd]          marian::Config::  Config  (int,  char**,  marian::cli::mode,  bool) + 0x2d
[0x5e536d]          marian::  parseOptions  (int,  char**,  marian::cli::mode,  bool) + 0x3d
[0x52badb]          main                                               + 0x3b
[0x7fe50ae06830]    __libc_start_main                                  + 0xf0
[0x542c49]          _start                                             + 0x29

./translate.sh: line 15:  5303 Done                    tee $PREFIX.in
      5304 Aborted                 (core dumped) | $MARIAN/marian-decoder -c $MODEL.ens/ensemble.yml -d $GPUS --n-best --mini-batch 64 --maxi-batch 1000 --maxi-batch-sort src --quiet -w 4000 > $PREFIX.nbest.0

These are the models I have:

jon@jon:~/news-translit-nmt/experiments/models$ ls
EnVi.1  EnVi.2  EnVi.3  EnVi.4  EnVi.r2l.1  EnVi.r2l.2

How can I solve it? Thanks

ensemble.yml is created automatically by ensemble.sh: https://github.com/snukky/news-translit-nmt/blob/master/experiments/ensemble.sh#L21
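For illustration only, a Marian ensemble config usually just lists the model checkpoints and their vocabularies; the exact keys and paths used here come from ensemble.sh, so treat the following as a hypothetical sketch rather than the file it actually generates:

# hypothetical models/EnVi.ens/ensemble.yml; real contents/paths are written by ensemble.sh
models:
  - ../EnVi.1/model.npz.best-translation.npz
  - ../EnVi.2/model.npz.best-translation.npz
  - ../EnVi.3/model.npz.best-translation.npz
  - ../EnVi.4/model.npz.best-translation.npz
vocabs:
  - ../EnVi.1/vocab.yml
  - ../EnVi.1/vocab.yml

Once ensemble.sh has produced that file, translate.sh should pick it up via -c $MODEL.ens/ensemble.yml.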

With regard to the training error: is there enough disk space? It looks like the first model has been saved successfully.
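For example (just a sanity check, not part of the repo's scripts), you could compare the free space on that filesystem against the size of the saved checkpoints:

# check free space where the checkpoints live; npz_save aborts with a
# runtime_error like the one above when the disk fills up
df -h ~/news-translit-nmt/experiments/models
du -sh ~/news-translit-nmt/experiments/models/EnVi.*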

That's correct. The OS disk size was the problem. Thanks.