Unable to reproduce result on ES-EN TED talk dataset
Hello, after cloning your repository for research purposes, I ran into
some difficulties when trying to reproduce the results mentioned in your
paper (Document-Level Neural Machine Translation with Hierarchical
Attention Networks). In particular, I was unable to reproduce the score
you obtained on the TED Talks ES-EN dataset from the IWSLT 2014
evaluation campaign, which I downloaded here. So I would like to make
sure I am running your pipeline correctly.
Firstly, I followed these steps:
- I modified your files (get_text.py and prepare.sh) in
preprocess_TED_zh-en to preprocess the es-en files coming from the same
source (IWSLT 2014). In get_text.py, I simply replaced all occurrences
of "zh" with "es". In prepare.sh, I removed the zh-specific
preprocessing and used the same preprocessing and parameters for 'es'
and 'en' (punctuation normalization, tokenization, and truecasing). I
then obtained the dataset.pt files (dataset.train.pt,
dataset.valid.pt, ...).
- Then, I ran the sentence-level training by running the script as is
from the repository (changing only the model and dataset names, of
course).
- I finally ran the second-stage training (HAN encoder) by running the
second script as is from the repository.
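For reference, this is essentially the substitution I applied to get_text.py (a sketch; `swap_lang_code` is a helper name I made up, not part of the repository). I used a word-boundary match rather than a blind string replace so that "zh" inside other identifiers is not corrupted:

```python
import re

def swap_lang_code(text, old="zh", new="es"):
    """Replace the language code only where it stands alone (word
    boundaries on both sides), so substrings of other identifiers
    such as 'zhang' are left untouched."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)

# e.g. file-name patterns inside get_text.py:
swap_lang_code("train.tags.zh-en.zh")  # -> "train.tags.es-en.es"
```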
After training, I evaluated the model by translating test2010-test2012
(a concatenation of the three files, followed by the same preprocessing
as the training files), as mentioned in your paper, and I got the
following results when running multi-bleu on the tokenized output and
reference sentences:
Model                      BLEU (multi-bleu)
Base model                 34.65 (20 epochs)
Base model + HAN encoder   28.98 (1 epoch)
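For the scores above I used multi-bleu on tokenized text. As a sanity check on my evaluation, here is a minimal stdlib-only reimplementation of the same computation, clipped n-gram precision with a brevity penalty (my own sketch for a single reference, not the Moses script itself):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU (0-100): geometric mean of clipped 1..4-gram
    precisions times the brevity penalty, as multi-bleu.perl computes
    on pre-tokenized input with a single reference."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:          # some order has no match: BLEU is 0
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```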
I also tried to increase the number of epochs during fine-tuning, but
the results got worse. This is the learning curve I observed when
fine-tuning with the HAN encoder: (figure not rendered here)
I also tried to modify the optimization parameters (warmup_step: 8000 ->
900, lr: 2 -> 0.1) when fine-tuning.
The new results were:

Model                      BLEU (multi-bleu)
Base model + HAN_enc       32.68 (1 epoch)
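To see how those two changes interact, I am assuming the training uses the standard "Noam" schedule (OpenNMT-py's -decay_method noam; `noam_lr` below is a hypothetical helper, not from the repository). Its peak learning rate is lr * d_model^-0.5 * warmup^-0.5, so lowering both warmup_step and lr shrinks the peak by roughly 7x:

```python
def noam_lr(step, warmup_steps, lr_scale=2.0, d_model=512):
    """Noam schedule: linear warmup for `warmup_steps`, then inverse
    square-root decay. `lr_scale` plays the role of the -learning_rate
    flag. The peak, reached at step == warmup_steps, equals
    lr_scale * d_model**-0.5 * warmup_steps**-0.5."""
    return lr_scale * d_model ** -0.5 * min(step ** -0.5,
                                            step * warmup_steps ** -1.5)

# peak LR of the original config vs. my modified one (assumed values):
noam_lr(8000, 8000, lr_scale=2.0)   # original: warmup 8000, lr 2
noam_lr(900, 900, lr_scale=0.1)     # modified: warmup 900,  lr 0.1
```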
Whereas the results reported in the paper are: (table from the paper not
rendered here)
I also tried to visualize the attention weights when translating the
following paragraph: (paragraph not rendered here)
When I manually inspected the attention weights in the HAN encoder while
translating the word "su" in the last sentence, I got these results:
- Word attention: (figure not rendered here)
- Sentence attention: (figure not rendered here)
These weights do not seem to correspond to those in the paper; they look
quite uniform.
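To quantify "quite uniform": the entropy of an attention distribution approaches log2(n) as it flattens, so comparing the two is a quick check (stdlib-only sketch; `attention_entropy` is a name I made up):

```python
import math

def attention_entropy(weights):
    """Shannon entropy (bits) of an attention distribution. Values close
    to log2(len(weights)) indicate near-uniform attention; a sharply
    peaked distribution has entropy close to 0."""
    total = sum(weights)                      # normalize defensively
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# near-uniform attention over 8 sentences vs. a peaked distribution:
attention_entropy([0.125] * 8)           # close to log2(8) = 3 bits
attention_entropy([0.93] + [0.01] * 7)   # much lower entropy
```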
So I would like to know: am I doing something wrong in my pipeline,
perhaps in the preprocessing or in the parameters?
Am I doing something wrong when testing, perhaps in the BLEU
computation?
Do you use a specific configuration for es-en?
Is a specific Python environment necessary?