facebookresearch/UnsupervisedMT

Using this code (transformer) on Multi30k English French monolingual

ahmadrash opened this issue · 5 comments

Does anyone have experience running this code on a smaller dataset such as Multi30k? The BLEU reported in the first paper (https://arxiv.org/pdf/1711.00043.pdf) was around 27.48/28.07. I am trying to get something close to that with a Transformer-based encoder-decoder. Any suggestions? @glample Thanks!

For the French-English pair, I am able to get results within a few BLEU points of the paper using the Transformer in this code. At iteration 25, I got 25.37 (En->Fr) and 26.84 (Fr->En).

I use newstest as the monolingual corpus to train the joint embeddings. I take the English and French corpora from newstest2007 to newstest2017, randomly sample 10M for each language, concatenate the two corpora, apply BPE to the result with 60k codes, and train joint word embeddings with fastText using default parameters.
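
As a rough illustration of the sampling and concatenation steps, here is a minimal Python sketch. The function names `sample_lines` and `build_joint_corpus` are hypothetical and not part of this repository, and the fixed seeds are an assumption for reproducibility. The BPE and fastText steps are run on the resulting joint corpus and are not shown here.

```python
import random


def sample_lines(lines, k, seed=0):
    """Draw k lines uniformly without replacement (all lines if fewer than k)."""
    lines = list(lines)
    if len(lines) <= k:
        return lines
    return random.Random(seed).sample(lines, k)


def build_joint_corpus(en_lines, fr_lines, k, seed=0):
    """Sample k lines per language and concatenate them into one corpus.

    The joint corpus is what BPE codes and word embeddings would then be
    learned on, so both languages share one vocabulary and one embedding space.
    """
    en_sample = sample_lines(en_lines, k, seed)
    fr_sample = sample_lines(fr_lines, k, seed + 1)
    return en_sample + fr_sample
```

For the setup described here, `k` would be 10,000,000; for corpora of that size, reading the files line by line and sampling indices first would be kinder to memory than loading everything into a list.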

For UnsupervisedMT parameters, here are the values I use:

      attention: True
      attention_dropout: 0.0
      batch_size: 32
      beam_size: 0
      clip_grad_norm: 5.0
      dec_optimizer: enc_optimizer
      decoder_attention_heads: 8
      decoder_normalize_before: False
      dropout: 0.2
      emb_dim: 512
      enc_optimizer: adam,lr=0.0001
      encoder_attention_heads: 8
      encoder_normalize_before: False
      epoch_size: 58000
      freeze_dec_emb: False
      freeze_enc_emb: False
      group_by_size: True
      hidden_dim: 512
      id2lang: {0: 'en', 1: 'fr'}
      label_smoothing: 0.1
      lambda_dis: 0
      lambda_lm: 0
      lambda_xe_back: 0
      lambda_xe_mono: 1
      lambda_xe_otfa: 0
      lambda_xe_otfd: 1
      lambda_xe_para: 0
      lang2id: {'en': 0, 'fr': 1}
      langs: ['en', 'fr']
      length_penalty: 1.0
      lstm_proj: False
      max_epoch: 200
      max_len: 175
      max_vocab: -1
      mono_directions: ['en', 'fr']
      n_back: 0
      n_dec_layers: 4
      n_dis: 0
      n_enc_layers: 4
      n_langs: 2
      n_mono: -1
      n_para: 0
      otf_backprop_temperature: -1.0
      otf_num_processes: 30
      otf_sample: -1.0
      otf_sync_params_every: 1000
      otf_update_dec: True
      otf_update_enc: True
      para_directions: []
      pivo_directions: [('en', 'fr', 'en'), ('fr', 'en', 'fr')]
      pretrained_out: True
      reload_dec: False
      reload_dis: False
      reload_enc: False
      reload_model: 
      relu_dropout: 0.0
      seed: -1
      share_dec: 4
      share_decpro_emb: True
      share_enc: 4
      share_encdec_emb: True
      share_lang_emb: True
      share_lstm_proj: False
      share_output_emb: True
      stopping_criterion: bleu_en_fr_valid,10
      transformer: True
      transformer_ffn_emb_dim: 512
      vocab: {}
      vocab_min_count: 0
      word_blank: 0.2
      word_dropout: 0.1
      word_shuffle: 3.0
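
For readers unfamiliar with the last three parameters: they control the noise applied to the input of the denoising auto-encoder. The sketch below is my reading of that noise model, not the code from this repository. The function name `add_noise`, the `<blank>` token, and the order in which the three steps are applied are all assumptions made for illustration.

```python
import random


def add_noise(words, shuffle_k=3.0, p_drop=0.1, p_blank=0.2,
              blank="<blank>", rng=None):
    """Apply local shuffling, word dropout, and word blanking to a token list."""
    rng = rng or random.Random(0)

    # Local shuffle: jitter each position by a uniform offset in [0, shuffle_k]
    # and sort by the jittered position, so words only move a short distance.
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(words))]
    order = sorted(range(len(words)), key=lambda i: keys[i])
    shuffled = [words[i] for i in order]

    # Word dropout: remove each word with probability p_drop,
    # but keep at least one word of a non-empty sentence.
    kept = [w for w in shuffled if rng.random() >= p_drop]
    if not kept and shuffled:
        kept = [rng.choice(shuffled)]

    # Word blanking: replace each remaining word with a placeholder
    # with probability p_blank.
    return [blank if rng.random() < p_blank else w for w in kept]
```

Calling `add_noise("a man rides a bike".split())` returns a perturbed copy of the sentence; the model is then trained to reconstruct the original from it.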

Thanks a lot @pipibjc. I was only training the word embeddings on the 10 million. I was using the tokenization from Multi30k, but my results were really low. I will try your setup.

@pipibjc Did you use the entire Multi30k dataset, or did you split it into half English and half French as in the setup of the original paper?

@ahmadrash I split it into half English and half French, as in the setup of the original paper. There are 14,500 sentences in each monolingual corpus.
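
One way to produce such a split, sketched in Python. The function name `split_disjoint` is hypothetical, and shuffling before splitting is an assumption; taking the first half for one language and the second half for the other would work equally well. The point is that no sentence pair contributes to both monolingual corpora, so the two sides contain no translations of each other.

```python
import random


def split_disjoint(en_sents, fr_sents, seed=0):
    """Split aligned sentence pairs into two non-overlapping monolingual halves.

    English is taken from one half of the pair indices and French from the
    other half, so the resulting corpora are not parallel.
    """
    if len(en_sents) != len(fr_sents):
        raise ValueError("en_sents and fr_sents must be aligned line by line")
    idx = list(range(len(en_sents)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    en_mono = [en_sents[i] for i in idx[:half]]
    fr_mono = [fr_sents[i] for i in idx[half:]]
    return en_mono, fr_mono
```

With 29,000 aligned training pairs as input, this yields 14,500 sentences per language, matching the numbers above.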

Thanks!