idiap/HAN_NMT

Unable to reproduce result on ES-EN TED talk dataset


Hello, after cloning your repository for research purposes, I ran into
some difficulties when trying to reproduce the results reported in your
paper (Document-Level Neural Machine Translation with Hierarchical
Attention Networks). In particular, I could not reproduce the score you
obtained on the TED Talks ES-EN dataset from the IWSLT 2014 evaluation
campaign, which I downloaded here. So I would like to make sure I am
running your pipeline correctly.

Firstly, I followed these steps:

  • I modified your files (get_text.py and prepare.sh) in
    preprocess_TED_zh-en to preprocess the es-en files coming from the
    same source (IWSLT 2014). In get_text.py, I just replaced all
    occurrences of "zh" with "es". In prepare.sh, I removed the
    'zh'-specific preprocessing and used the same preprocessing and
    parameters for 'es' and 'en' (punctuation normalization,
    tokenization and truecasing). I then obtained the dataset.pt files
    (dataset.train.pt, dataset.valid.pt, ...)

  • Then I ran the sentence-level training script as is from the
    repository (changing only the model and dataset names, of course)

  • Finally, I ran the second-stage training (HAN encoder) with the
    second script, also as is from the repository.
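Before evaluating, I concatenated test2010, test2011 and test2012 into a single test set, roughly like this (a sketch I wrote myself; the filenames are illustrative, my actual files came from the IWSLT 2014 download):

```python
def concat_test_sets(paths, out_path):
    """Concatenate per-year test files in order, one sentence per line.
    The files are joined as-is, without shuffling, so that sentence
    order (which document-level models rely on) is preserved."""
    with open(out_path, "w", encoding="utf-8") as out:
        for p in paths:
            with open(p, encoding="utf-8") as f:
                for line in f:
                    out.write(line)

# Illustrative call -- real filenames differ:
# concat_test_sets(["test2010.es", "test2011.es", "test2012.es"],
#                  "test2010-2012.es")
```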

After training, I evaluated the model by translating
test2010-test2012 (a concatenation of the three files, followed by the
same preprocessing as the training files), as mentioned in your paper,
and I got the following results when running multi-bleu on the
tokenized output and reference sentences:

Model                      BLEU (multi-bleu)
Base model                 34.65 (20 epochs)
Base model + HAN encoder   28.98 (1 epoch)
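To rule out a scoring mistake on my side, I also cross-checked the numbers with a small self-contained BLEU implementation (a simplified single-reference, unsmoothed corpus BLEU that I wrote as a sanity check only; it is not multi-bleu.perl itself):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU with uniform weights, a single reference per
    hypothesis, no smoothing. Inputs are pre-tokenized strings."""
    clipped = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_prec)
```

corpus_bleu returns a value in [0, 1]; multiply by 100 to compare with multi-bleu output.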

I also tried increasing the number of epochs during fine-tuning, but
the results got worse.

I observed the following curve when fine-tuning with the HAN encoder:
[image 1]

I also tried modifying the optimization parameters when fine-tuning
(warmup_steps: 8000 -> 900, lr: 2 -> 0.1).
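My understanding (an assumption on my side, based on stock OpenNMT-py, which your code builds on) is that these two options feed the Transformer "Noam" schedule, so lowering both warmup_steps and the learning-rate factor moves the peak earlier and makes it much smaller:

```python
def noam_lr(step, model_dim=512, warmup_steps=8000, factor=2.0):
    """"Noam" learning-rate schedule from the Transformer paper:
    linear warmup until step == warmup_steps, then inverse-sqrt decay.
    The mapping of the repo's options onto these arguments is my
    assumption, not something I verified in the code."""
    return factor * model_dim ** -0.5 * min(step ** -0.5,
                                            step * warmup_steps ** -1.5)
```

If that mapping is right, factor=2/warmup_steps=8000 peaks at roughly 9.9e-4 at step 8000, while factor=0.1/warmup_steps=900 peaks at roughly 1.5e-4 at step 900, so my second run used a much smaller and earlier peak learning rate.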

The new result was:

Model                      BLEU (multi-bleu)
Base model + HAN_enc       32.68 (1 epoch)

Whereas in the paper the results are:

[image 2]

I also tried to visualize the attention weights when translating the
following paragraph:

[image 3]

When I manually visualized the attention weights of the HAN encoder
while translating the word "su" in the last sentence, I got these
results:

  • Word attention:

[image 4]

  • Sentence attention:

[image 5]

These weights do not seem to correspond to those in the paper; they
look quite uniform.
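To quantify "quite uniform", I compared the entropy of each attention distribution to that of the uniform distribution over the same support (a small check I wrote myself, not something from your codebase):

```python
import math

def entropy_ratio(weights):
    """Shannon entropy of an attention distribution, normalized by the
    entropy of the uniform distribution over the same support (assumes
    at least two weights). Values near 1.0 mean the attention is close
    to uniform, i.e. it is not focusing on anything in particular."""
    total = sum(weights)
    probs = [w / total for w in weights]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))
```

A focused distribution such as [0.97, 0.01, 0.01, 0.01] scores around 0.12, whereas the rows I extracted all came out close to 1.0.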

So I would like to know whether I am doing something wrong in my
pipeline, maybe in the preprocessing or in the parameters?

Am I doing something wrong at test time, maybe in the BLEU
computation?

Did you use a specific configuration for es-en?

Is a specific Python environment necessary?