Using this code (Transformer) on Multi30k English-French monolingual data
ahmadrash opened this issue · 5 comments
Any experience with running this code on a smaller dataset such as Multi30k? The BLEU scores in the first paper (https://arxiv.org/pdf/1711.00043.pdf) were around 27.48/28.07. I am trying to get something close to that with a Transformer-based encoder-decoder. Any suggestions? @glample Thanks!
On French-English, I am able to get within a few BLEU points of that using the Transformer with this code. At iteration 25, I got 25.37 (En->Fr) and 26.84 (Fr->En).
I use newstest data as the monolingual corpus to train the joint embeddings. I take the English and French corpora from newstest2007 to newstest2017, randomly sample 10M sentences for each language, concatenate the two corpora, apply BPE with 60k codes, and train joint word embeddings with the default fastText parameters.
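In case it helps, here is a rough sketch of that preprocessing in Python, assuming tokenized monolingual files and the fastBPE / fastText binaries from the standard setup; all file names below are placeholders, not the exact ones I used:

```python
import random
import subprocess

def sample_lines(src_path, dst_path, n=10_000_000, seed=0):
    # Randomly sample n lines from a tokenized monolingual corpus.
    with open(src_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n])

sample_lines("news.en.tok", "mono.en")   # 10M English sentences
sample_lines("news.fr.tok", "mono.fr")   # 10M French sentences

# Concatenate the two languages, learn and apply 60k joint BPE codes,
# then train joint skip-gram embeddings with default fastText settings
# (add "-dim 512" if the vectors should match emb_dim in the config below).
subprocess.run("cat mono.en mono.fr > mono.enfr", shell=True, check=True)
subprocess.run("./fastBPE/fast learnbpe 60000 mono.enfr > bpe.codes", shell=True, check=True)
subprocess.run("./fastBPE/fast applybpe mono.enfr.bpe mono.enfr bpe.codes", shell=True, check=True)
subprocess.run("./fastText/fasttext skipgram -input mono.enfr.bpe -output all.en-fr.60000", shell=True, check=True)
```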
For the UnsupervisedMT parameters, here are the values I used (a sketch of the corresponding main.py invocation follows the list):
attention: True
attention_dropout: 0.0
batch_size: 32
beam_size: 0
clip_grad_norm: 5.0
dec_optimizer: enc_optimizer
decoder_attention_heads: 8
decoder_normalize_before: False
dropout: 0.2
emb_dim: 512
enc_optimizer: adam,lr=0.0001
encoder_attention_heads: 8
encoder_normalize_before: False
epoch_size: 58000
freeze_dec_emb: False
freeze_enc_emb: False
group_by_size: True
hidden_dim: 512
id2lang: {0: 'en', 1: 'fr'}
label_smoothing: 0.1
lambda_dis: 0
lambda_lm: 0
lambda_xe_back: 0
lambda_xe_mono: 1
lambda_xe_otfa: 0
lambda_xe_otfd: 1
lambda_xe_para: 0
lang2id: {'en': 0, 'fr': 1}
langs: ['en', 'fr']
length_penalty: 1.0
lstm_proj: False
max_epoch: 200
max_len: 175
max_vocab: -1
mono_directions: ['en', 'fr']
n_back: 0
n_dec_layers: 4
n_dis: 0
n_enc_layers: 4
n_langs: 2
n_mono: -1
n_para: 0
otf_backprop_temperature: -1.0
otf_num_processes: 30
otf_sample: -1.0
otf_sync_params_every: 1000
otf_update_dec: True
otf_update_enc: True
para_directions: []
pivo_directions: [('en', 'fr', 'en'), ('fr', 'en', 'fr')]
pretrained_out: True
reload_dec: False
reload_dis: False
reload_enc: False
reload_model:
relu_dropout: 0.0
seed: -1
share_dec: 4
share_decpro_emb: True
share_enc: 4
share_encdec_emb: True
share_lang_emb: True
share_lstm_proj: False
share_output_emb: True
stopping_criterion: bleu_en_fr_valid,10
transformer: True
transformer_ffn_emb_dim: 512
vocab: {}
vocab_min_count: 0
word_blank: 0.2
word_dropout: 0.1
word_shuffle: 3.0
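These settings map onto the repository's main.py flags roughly as below. This is only a sketch: the experiment name, the binarized dataset paths, and the pretrained embedding path are placeholders for your own preprocessed files, and the remaining flags from the list above can be passed the same way.

```python
import subprocess

# Placeholder paths; the .pth files and the .vec embeddings come from
# the preprocessing sketched earlier in the thread.
cmd = [
    "python", "main.py",
    "--exp_name", "unsup_enfr",
    "--transformer", "True",
    "--n_enc_layers", "4", "--n_dec_layers", "4",
    "--share_enc", "4", "--share_dec", "4",
    "--share_lang_emb", "True", "--share_output_emb", "True",
    "--langs", "en,fr",
    "--n_mono", "-1",
    "--mono_dataset", "en:./data/mono.en.60000.pth,,;fr:./data/mono.fr.60000.pth,,",
    "--para_dataset", "en-fr:,./data/valid.XX.60000.pth,./data/test.XX.60000.pth",
    "--mono_directions", "en,fr",
    "--pivo_directions", "fr-en-fr,en-fr-en",
    "--word_shuffle", "3", "--word_dropout", "0.1", "--word_blank", "0.2",
    "--pretrained_emb", "./data/all.en-fr.60000.vec",
    "--pretrained_out", "True",
    "--lambda_xe_mono", "1", "--lambda_xe_otfd", "1",
    "--otf_num_processes", "30", "--otf_sync_params_every", "1000",
    "--enc_optimizer", "adam,lr=0.0001",
    "--emb_dim", "512", "--transformer_ffn_emb_dim", "512",
    "--batch_size", "32", "--dropout", "0.2", "--label_smoothing", "0.1",
    "--epoch_size", "58000",
    "--stopping_criterion", "bleu_en_fr_valid,10",
]
subprocess.run(cmd, check=True)
```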
Thanks a lot @pipibjc. I was just training the word embeddings on the 10 million sentences. I was using the tokenization from Multi30k, but my results were really low. I will try your setup.
@pipibjc Did you use the entire Multi30k dataset, or did you split it into half English and half French as in the setup of the original paper?
@ahmadrash I split it into half English and half French, as in the setup of the original paper. There are 14,500 sentences in each monolingual corpus.
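In case it is useful, the split can be done along these lines (a minimal sketch; the Multi30k file names are placeholders):

```python
def split_multi30k(en_path="train.en", fr_path="train.fr"):
    # Keep the first half of the English side and the second half of the
    # French side, so the two monolingual corpora share no sentence pairs.
    en = open(en_path, encoding="utf-8").read().splitlines()
    fr = open(fr_path, encoding="utf-8").read().splitlines()
    assert len(en) == len(fr)
    half = len(en) // 2  # 14,500 sentences for the Multi30k training set
    with open("mono.en", "w", encoding="utf-8") as f:
        f.write("\n".join(en[:half]) + "\n")
    with open("mono.fr", "w", encoding="utf-8") as f:
        f.write("\n".join(fr[half:]) + "\n")

split_multi30k()
```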
Thanks!