RMLNMT

This repository contains the code for the paper "Improving Both Domain Robustness and Domain Adaptability in Machine Translation" (COLING 2022). The code is based on the public fairseq toolkit; we provide implementations of the different domain classifiers and of word-level domain mixing.


Requirements

  1. Fairseq (v0.6.0)
  2. PyTorch
  3. All remaining requirements are listed in requirements.txt; install them with pip install -r requirements.txt

Pipeline

To reproduce the results of our experiments, please clean your OPUS corpus first; in particular, de-duplicate it (see the Appendix of the paper for details). A minimal de-duplication sketch is shown below.

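For example, de-duplication of a parallel corpus can be as simple as the following sketch (the file names train.en/train.de and dedup.en/dedup.de are placeholders, not this repository's actual data layout):

    # Minimal de-duplication sketch: drop exact duplicate sentence pairs
    # while preserving corpus order. File names are placeholders.
    seen = set()
    with open("train.en") as fe, open("train.de") as fd, \
         open("dedup.en", "w") as oe, open("dedup.de", "w") as od:
        for src, tgt in zip(fe, fd):
            pair = (src.strip(), tgt.strip())
            if pair not in seen:
                seen.add(pair)
                oe.write(src)
                od.write(tgt)
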
  1. Train a domain classifier (e.g., BERT- or CNN-based) with domain_classification/Bert_classfier.py or domain_classification/main.py (a minimal sketch of such a classifier is shown after this pipeline).

  2. Score each sentence by its domain similarity to the general domain (see the scoring sketch after this pipeline):

    python meta_score_prepare.py \
    --num_labels 11 \
    --device_id 7 \
    --model_name bert-base-uncased \
    --input_path $YOUR_INPUT_PATH \
    --cls_data $YOUR_CLASSIFICATION_PATH \
    --out_data $YOUR_OUTPUT_PATH \
    --script_path $SCRIPT_PATH
  3. Run the baseline systems: vanilla fairseq, Meta-MT, and Meta-Curriculum.

  4. The code for word-level domain mixing is in word_moudles; use the following command to reproduce the results in our paper (an illustrative sketch of the mixing layer appears after this pipeline):

    python -u $code_dir/meta_ws_adapt_training.py $DATA_DIR \
        --train-subset meta-train-spm $META_DEV \
        --damethod bayesian \
        --arch transformer_da_bayes_iwslt_de_en \
        --criterion $CRITERION $BASELINE \
        --domains $DOMAINS --max-tokens 1 \
        --user-dir $user_dir \
        --domain-nums 5 \
        --translation-task en2de \
        --source-lang en --target-lang de \
        --is-curriculum --split-by-cl --distributed-world-size $GPUS \
        --required-batch-size-multiple 1 \
        --tensorboard-logdir $TF_BOARD \
        --optimizer $OPTIMIZER --lr $META_LR $DO_SAVE \
        --save-dir $PT_OUTPUT_DIR --save-interval-updates $SAVEINTERVALUPDATES \
        --max-epoch 20 \
        --skip-invalid-size-inputs-valid-test \
        --flush-secs 1 --train-percentage 0.99 --restore-file $PRE_TRAIN --log-format json \
        --- --task word_adapt_new --is-curriculum \
        --train-subset support --test-subset query --valid-subset dev_sub \
        --max-tokens 2000 --skip-invalid-size-inputs-valid-test \
        --update-freq 10000 \
        --domain-nums 5 \
        --translation-task en2de \
        --distributed-world-size 1 --max-epoch 1 --optimizer adam \
        --damethod bayesian --criterion cross_entropy_da \
        --lr 5e-05 --lr-scheduler inverse_sqrt --no-save \
        --support-tokens 8000 --query-tokens 16000 \
        --source-lang en --label-smoothing 0.1 \
        --adam-betas '(0.9, 0.98)' --warmup-updates 4000 \
        --warmup-init-lr '1e-07' --weight-decay 0.0001 \
        --target-lang de \
        --user-dir $user_dir
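
For steps 1 and 2 above, the following is a minimal sketch of the kind of BERT domain classifier and confidence-based sentence scoring involved. It is written against the HuggingFace transformers API rather than this repository's code, and the names train_step and domain_score are hypothetical:

    # Hedged sketch: a BERT sentence classifier over domain labels, plus a
    # softmax-confidence score used as a proxy for domain similarity.
    # train_step and domain_score are illustrative names, not repo functions.
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    NUM_DOMAINS = 11  # matches --num_labels in step 2
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=NUM_DOMAINS)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    def train_step(sentences, labels):
        # One supervised step on a batch of (sentence, domain-label) pairs.
        batch = tokenizer(sentences, padding=True, truncation=True,
                          return_tensors="pt")
        loss = model(**batch, labels=torch.tensor(labels)).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()

    @torch.no_grad()
    def domain_score(sentence):
        # Confidence of the predicted domain; higher values indicate the
        # sentence looks more like a domain the classifier knows well.
        batch = tokenizer([sentence], return_tensors="pt")
        return torch.softmax(model(**batch).logits, dim=-1).max().item()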

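Step 4's word-level domain mixing combines per-domain sub-layers according to a token-level domain distribution. The module below is only an illustrative sketch of that idea; the class name and details are assumptions, not the actual implementation in word_moudles:

    # Illustrative sketch of word-level domain mixing: per-domain linear
    # experts whose outputs are mixed per token. Details are assumptions.
    import torch
    import torch.nn as nn

    class DomainMixingLinear(nn.Module):
        def __init__(self, d_model, num_domains):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Linear(d_model, d_model) for _ in range(num_domains))
            self.domain_proj = nn.Linear(d_model, num_domains)

        def forward(self, x):  # x: (batch, seq_len, d_model)
            # Token-level domain distribution (e.g. 5 domains, as in
            # --domain-nums above).
            mix = torch.softmax(self.domain_proj(x), dim=-1)
            # Expert outputs stacked to (batch, seq_len, d_model, num_domains).
            outs = torch.stack([e(x) for e in self.experts], dim=-1)
            # Weighted sum over the domain dimension.
            return (outs * mix.unsqueeze(-2)).sum(dim=-1)
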
Citation

If you find our work useful, please cite our paper. Thanks!

@inproceedings{lai-etal-2022-improving-domain,
    title = "Improving Both Domain Robustness and Domain Adaptability in Machine Translation",
    author = "Lai, Wen  and
      Libovick{\'y}, Jind{\v{r}}ich  and
      Fraser, Alexander",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.461",
    pages = "5191--5204",
}

Contact

If you have any questions about our paper, please feel free to contact me by email: lavine@cis.lmu.de