Unable to reproduce result on ES-EN TED talk dataset
Hello, after cloning your repository for research purposes, I ran into
some difficulties when trying to reproduce the results mentioned in your
paper (Document-Level Neural Machine Translation with Hierarchical
Attention Networks). In particular, I was unable to reproduce the score
you obtained on the TED Talks ES-EN dataset from the IWSLT 2014
evaluation campaign, which I downloaded here. So I would like to make
sure I am running your pipeline correctly.
Firstly, I followed these steps:
- I modified your files (get_text.py and prepare.sh) in
preprocess_TED_zh-en to preprocess the es-en files coming from the same
source (IWSLT 2014). In get_text.py, I simply replaced all occurrences
of "zh" with "es". In prepare.sh, I removed the zh-specific
preprocessing and used the same preprocessing and parameters for 'es'
and 'en' (punctuation normalization, tokenization, and truecasing). I
then obtained the dataset.pt files (dataset.train.pt,
dataset.valid.pt, ...).
- Then, I ran the sentence-level training by running the script as is
from the repository (changing only the model and dataset names, of
course).
- I finally ran the second-stage training (HAN encoder) by running the
second script as is from the repository.
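For reference, this is essentially the substitution I applied to get_text.py (a sketch; `swap_lang_code` is a helper name I made up, not part of the repository). I used a word-boundary match rather than a blind string replace so that "zh" inside other identifiers is not corrupted:

```python
import re

def swap_lang_code(text, old="zh", new="es"):
    """Replace the language code only where it stands alone (word
    boundaries on both sides), so substrings of other identifiers
    such as 'zhang' are left untouched."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)

# e.g. file-name patterns inside get_text.py:
swap_lang_code("train.tags.zh-en.zh")  # -> "train.tags.es-en.es"
```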
After training, I evaluated the model by translating test2010-test2012
(a concatenation of the three files, followed by the same preprocessing
as the training files), as mentioned in your paper, and I got the
following results when running multi-bleu on the tokenized output and
reference sentences:
Model                      BLEU (multi-bleu)
Base model                 34.65 (20 epochs)
Base model + HAN encoder   28.98 (1 epoch)
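For the scores above I used multi-bleu on tokenized text. As a sanity check on my evaluation, here is a minimal stdlib-only reimplementation of the same computation, clipped n-gram precision with a brevity penalty (my own sketch for a single reference, not the Moses script itself):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU (0-100): geometric mean of clipped 1..4-gram
    precisions times the brevity penalty, as multi-bleu.perl computes
    on pre-tokenized input with a single reference."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:          # some order has no match: BLEU is 0
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```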
I also tried to increase the number of epochs during fine-tuning, but
the results got worse. This is the learning curve I observed when
fine-tuning with the HAN encoder: (figure not rendered here)
I also tried to modify the optimization parameters (warmup_step: 8000 ->
900, lr: 2 -> 0.1) when fine-tuning.
The new results were:

Model                      BLEU (multi-bleu)
Base model + HAN_enc       32.68 (1 epoch)
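To see how those two changes interact, I am assuming the training uses the standard "Noam" schedule (OpenNMT-py's -decay_method noam; `noam_lr` below is a hypothetical helper, not from the repository). Its peak learning rate is lr * d_model^-0.5 * warmup^-0.5, so lowering both warmup_step and lr shrinks the peak by roughly 7x:

```python
def noam_lr(step, warmup_steps, lr_scale=2.0, d_model=512):
    """Noam schedule: linear warmup for `warmup_steps`, then inverse
    square-root decay. `lr_scale` plays the role of the -learning_rate
    flag. The peak, reached at step == warmup_steps, equals
    lr_scale * d_model**-0.5 * warmup_steps**-0.5."""
    return lr_scale * d_model ** -0.5 * min(step ** -0.5,
                                            step * warmup_steps ** -1.5)

# peak LR of the original config vs. my modified one (assumed values):
noam_lr(8000, 8000, lr_scale=2.0)   # original: warmup 8000, lr 2
noam_lr(900, 900, lr_scale=0.1)     # modified: warmup 900,  lr 0.1
```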
Whereas the results reported in the paper are: (table from the paper not
rendered here)
I also tried to visualize the attention weights when translating the
following paragraph: (paragraph not rendered here)
When I manually inspected the attention weights in the HAN encoder while
translating the word "su" in the last sentence, I got these results:
- Word attention: (figure not rendered here)
- Sentence attention: (figure not rendered here)
These weights do not seem to correspond to those in the paper; they look
quite uniform.
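To quantify "quite uniform": the entropy of an attention distribution approaches log2(n) as it flattens, so comparing the two is a quick check (stdlib-only sketch; `attention_entropy` is a name I made up):

```python
import math

def attention_entropy(weights):
    """Shannon entropy (bits) of an attention distribution. Values close
    to log2(len(weights)) indicate near-uniform attention; a sharply
    peaked distribution has entropy close to 0."""
    total = sum(weights)                      # normalize defensively
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# near-uniform attention over 8 sentences vs. a peaked distribution:
attention_entropy([0.125] * 8)           # close to log2(8) = 3 bits
attention_entropy([0.93] + [0.01] * 7)   # much lower entropy
```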
So I would like to know: am I doing something wrong in my pipeline,
perhaps in the preprocessing or in the parameters?
Am I doing something wrong when testing, perhaps in the BLEU
computation?
Do you use a specific configuration for es-en?
Is a specific Python environment necessary?