EdinburghNLP/nematus

DataLossError (see above for traceback): Unable to open table file /wmt16_systems/en-de/model.npz: Data loss: not an sstable

simonefrancia opened this issue · 6 comments

Hi,
I am trying to use the pretrained en-de model from http://data.statmt.org/rsennrich/wmt16_systems/ and translate English sentences with this script:

# this sample script translates a test set, including
# preprocessing (tokenization, truecasing, and subword segmentation),
# and postprocessing (merging subword units, detruecasing, detokenization).

# instructions: set paths to mosesdecoder, subword_nmt, and nematus,
# then run "./translate.sh < input_file > output_file"

# suffix of source language
SRC=en

# suffix of target language
TRG=de

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=../../mosesdecoder

# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=../../subword-nmt

# path to nematus ( https://www.github.com/rsennrich/nematus )
nematus=../../nematus

# theano device
device=cpu

# preprocess
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l $SRC -penn | \
$mosesdecoder/scripts/recaser/truecase.perl -model truecase-model.$SRC | \
$subword_nmt/apply_bpe.py -c $SRC$TRG.bpe | \
# translate
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,on_unused_input=warn python $nematus/nematus/translate.py \
     -m model.npz \
     -k 12 -n | \
# optional: -p 1 --suppress-unk
# postprocess
sed 's/\@\@ //g' | \
$mosesdecoder/scripts/recaser/detruecase.perl | \
$mosesdecoder/scripts/tokenizer/detokenizer.perl -l $TRG

When I execute ./translate.sh < en_text.txt > output.txt, I get this error:

DataLossError (see above for traceback): Unable to open table file /wmt16_systems/en-de/model.npz: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
	 [[{{node model0/save/RestoreV2}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_model0/save/Const_0_0, model0/save/RestoreV2/tensor_names, model0/save/RestoreV2/shape_and_slices)]]

ERROR: Translate worker process 600 crashed with exitcode 1
Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de
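(For context: the "not an sstable" message comes from TensorFlow's RestoreV2 op, which expects a TF checkpoint. The WMT16 model.npz files are plain NumPy zip archives of parameter arrays, the format used by the Theano version of Nematus. A minimal sketch of how to confirm a .npz is a NumPy archive, using a dummy stand-in file since the real model is not available here:)

```python
import numpy as np

# The WMT16 model.npz files are ordinary NumPy zip archives of parameter
# arrays (Theano-Nematus format), not TensorFlow checkpoints.
# Simulate one with a dummy parameter to show the format check:
np.savez("model_check.npz", Wemb=np.zeros((10, 4), dtype=np.float32))

with np.load("model_check.npz") as model:
    # A real model lists parameter names such as Wemb, decoder_U, ...
    print(sorted(model.files))
```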

Could you give me any suggestions?
Thanks

Thanks for the response.
In reference to this issue (marian-nmt/marian#219), I am using Nematus models inside Marian-NMT with good results, as you can see:

Command:

./marian-decoder \
> --type nematus \
> --models /wmt17_systems/de-en/model.l2r.ens1.npz \
> --vocabs /wmt17_systems/de-en/vocab.de.json /wmt17_systems/de-en/vocab.en.json  \
> --dim-vocabs 74383 51100 \
> --enc-depth 1     \
> --enc-cell-depth 4     \
> --enc-type bidirectional      \
> --dec-depth 1   \
> --dec-cell-base-depth 8  \
> --dec-cell-high-depth 1   \
> --dec-cell gru-nematus \
> --enc-cell gru-nematus   \
> --tied-embeddings true \
> --layer-normalization true
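(Side note: the two --dim-vocabs values should match the sizes of the vocabulary files passed to --vocabs. A quick sanity check, assuming the Nematus convention that a vocab file is a JSON object mapping token to integer id; the helper name is mine:)

```python
import json

def vocab_size(path: str) -> int:
    # Nematus vocabulary files are JSON objects mapping token -> integer id,
    # so the vocabulary size is simply the number of entries.
    with open(path) as f:
        return len(json.load(f))

# e.g. vocab_size("/wmt17_systems/de-en/vocab.de.json") should match 74383
```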

INPUT:

Verbrachte 24 Stunden
Ich brauche mehr Stunden mit dir
Du hast das Wochenende verbracht
Gleich werden, ooh ooh
Wir haben die späten Nächte verbracht
Dinge richtig machen, zwischen uns
Aber jetzt ist alles gut, Baby
Rollen Sie das Backwood-Baby
Und spiel mich in der Nähe

OUTPUT:

UK@@ 24 hours
I need more hours with you
you 've spent the weekend
Vilnius
we 've spent the late nights
things Right unify us
but now everything IS Baby
roll the Atlantic
and play me in the video@@ game me nearby

I have two questions:

  1. How many of the Nematus pretrained models (different language pairs) can I use like in this example?
    Could I apply the same command with the same parameters also, for example, to the en->ru model
    (http://data.statmt.org/wmt17_systems/en-ru/)?

  2. Is there also a post-processing step that handles special parts of the output like "UK@@" and "video@@ "?

Thanks

  1. This should work with all 11 language pairs on http://data.statmt.org/wmt17_systems/

  2. Each directory has a script, postprocess.sh, which performs the necessary post-processing. For example, check http://data.statmt.org/wmt17_systems/de-en/postprocess.sh
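The key step in that post-processing is undoing the BPE segmentation, i.e. removing the "@@ " markers (the same thing the `sed 's/\@\@ //g'` line in the translate script does). A minimal sketch:

```python
def merge_bpe(line: str) -> str:
    # BPE marks every non-final subword unit with a trailing "@@";
    # deleting "@@ " rejoins the pieces: "video@@ game" -> "videogame".
    return line.replace("@@ ", "")

print(merge_bpe("UK@@ 24 hours"))  # -> "UK24 hours"
```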

Thanks for the clear response.
Another question: for some language pairs for which there is no pretrained model, I would like to train my own.
For this purpose, is it sufficient to follow only these instructions (http://data.statmt.org/wmt17_systems/training/)?
I will try training my own model in the coming days.

Thanks in advance

Yes, these instructions should help you train your own model. You may want to change some things, e.g. the preprocessing, depending on the language pair.

Thanks a lot!