EdinburghNLP/nematus

DataLossError (see above for traceback): Unable to open table file /wmt16_systems/en-de/model.npz: Data loss: not an sstable

simonefrancia opened this issue · 6 comments

Hi,
I am trying to use the pretrained en-de model from http://data.statmt.org/rsennrich/wmt16_systems/ and translate English sentences with this script:

# this sample script translates a test set, including
# preprocessing (tokenization, truecasing, and subword segmentation),
# and postprocessing (merging subword units, detruecasing, detokenization).

# instructions: set paths to mosesdecoder, subword_nmt, and nematus,
# then run "./translate.sh < input_file > output_file"

# suffix of source language
SRC=en

# suffix of target language
TRG=de

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=../../mosesdecoder

# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=../../subword-nmt

# path to nematus ( https://www.github.com/rsennrich/nematus )
nematus=../../nematus

# theano device
device=cpu

# preprocess
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l $SRC -penn | \
$mosesdecoder/scripts/recaser/truecase.perl -model truecase-model.$SRC | \
$subword_nmt/apply_bpe.py -c $SRC$TRG.bpe | \
# translate
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,on_unused_input=warn python $nematus/nematus/translate.py \
     -m model.npz \
     -k 12 -n | \
# optional: -p 1 --suppress-unk
# postprocess
sed 's/\@\@ //g' | \
$mosesdecoder/scripts/recaser/detruecase.perl | \
$mosesdecoder/scripts/tokenizer/detokenizer.perl -l $TRG

When I execute ./translate.sh < en_text.txt > output.txt, I get this error:

DataLossError (see above for traceback): Unable to open table file /wmt16_systems/en-de/model.npz: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
	 [[{{node model0/save/RestoreV2}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_model0/save/Const_0_0, model0/save/RestoreV2/tensor_names, model0/save/RestoreV2/shape_and_slices)]]

ERROR: Translate worker process 600 crashed with exitcode 1
Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de
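(For context: the "not an sstable" message comes from TensorFlow's RestoreV2 op, which expects a TF checkpoint. The WMT16 model.npz files are plain NumPy zip archives of parameter arrays, the format used by the Theano version of Nematus. A minimal sketch of how to confirm a .npz is a NumPy archive, using a dummy stand-in file since the real model is not available here:)

```python
import numpy as np

# The WMT16 model.npz files are ordinary NumPy zip archives of parameter
# arrays (Theano-Nematus format), not TensorFlow checkpoints.
# Simulate one with a dummy parameter to show the format check:
np.savez("model_check.npz", Wemb=np.zeros((10, 4), dtype=np.float32))

with np.load("model_check.npz") as model:
    # A real model lists parameter names such as Wemb, decoder_U, ...
    print(sorted(model.files))
```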

Could you give me any suggestions?
Thanks

Thanks for the response.
In reference to this issue (marian-nmt/marian#219), I am using Nematus models inside Marian-NMT with good results, as you can see:

Command:

./marian-decoder \
> --type nematus \
> --models /wmt17_systems/de-en/model.l2r.ens1.npz \
> --vocabs /wmt17_systems/de-en/vocab.de.json /wmt17_systems/de-en/vocab.en.json  \
> --dim-vocabs 74383 51100 \
> --enc-depth 1     \
> --enc-cell-depth 4     \
> --enc-type bidirectional      \
> --dec-depth 1   \
> --dec-cell-base-depth 8  \
> --dec-cell-high-depth 1   \
> --dec-cell gru-nematus \
> --enc-cell gru-nematus   \
> --tied-embeddings true \
> --layer-normalization true
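(Side note: the two --dim-vocabs values should match the sizes of the vocabulary files passed to --vocabs. A quick sanity check, assuming the Nematus convention that a vocab file is a JSON object mapping token to integer id; the helper name is mine:)

```python
import json

def vocab_size(path: str) -> int:
    # Nematus vocabulary files are JSON objects mapping token -> integer id,
    # so the vocabulary size is simply the number of entries.
    with open(path) as f:
        return len(json.load(f))

# e.g. vocab_size("/wmt17_systems/de-en/vocab.de.json") should match 74383
```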

INPUT:

Verbrachte 24 Stunden
Ich brauche mehr Stunden mit dir
Du hast das Wochenende verbracht
Gleich werden, ooh ooh
Wir haben die späten Nächte verbracht
Dinge richtig machen, zwischen uns
Aber jetzt ist alles gut, Baby
Rollen Sie das Backwood-Baby
Und spiel mich in der Nähe

OUTPUT:

UK@@ 24 hours
I need more hours with you
you 've spent the weekend
Vilnius
we 've spent the late nights
things Right unify us
but now everything IS Baby
roll the Atlantic
and play me in the video@@ game me nearby

I have two questions:

  1. How many of the Nematus pretrained models (different language pairs) can I use like in this example?
    Could I apply the same command with the same parameters also, for example, to the en->ru model
    (http://data.statmt.org/wmt17_systems/en-ru/)?

  2. Is there also a post-processing step that handles special parts of the output like "UK@@" and "video@@ "?

Thanks

  1. This should work with all 11 language pairs on http://data.statmt.org/wmt17_systems/

  2. Each directory has a script, postprocess.sh, which performs the necessary post-processing. For example, check http://data.statmt.org/wmt17_systems/de-en/postprocess.sh
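The key step in that post-processing is undoing the BPE segmentation, i.e. removing the "@@ " markers (the same thing the `sed 's/\@\@ //g'` line in the translate script does). A minimal sketch:

```python
def merge_bpe(line: str) -> str:
    # BPE marks every non-final subword unit with a trailing "@@";
    # deleting "@@ " rejoins the pieces: "video@@ game" -> "videogame".
    return line.replace("@@ ", "")

print(merge_bpe("UK@@ 24 hours"))  # -> "UK24 hours"
```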

Thanks for the clear response.
Another question: for some language pairs for which there is no pretrained model, I would like to train my own.
For this purpose, is it sufficient to follow only these instructions (http://data.statmt.org/wmt17_systems/training/)?
I will try training my own model in the coming days.

Thanks in advance

Yes, these instructions should help you train your own model. You may want to change some things, e.g. the preprocessing, depending on the language pair.

Thanks a lot!