
🪱 PARASITE || A parallel sentence data preprocessing toolkit. Originally developed as part of the winning `en-ru` submission to the WMT20 Biomedical Translation Task.

Primary language: Python. License: MIT.

parasite

A parallel sentence preprocessing toolkit

Interface

The codebase uses python-fire to provide a flexible, pipelined CLI.

The module parasite.pipeline exposes a CLI over all the basic concepts of the codebase.

We recommend using AlignedBiText from_files for working with a single bi-text document or AlignedBiText batch_from_files to work with multiple files.
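As a rough illustration of how a pipelined command is read, the sketch below groups command-line tokens into stages at each bare `-` token, which is the default separator python-fire uses to end the arguments of one call before the next call is applied to its result. It only emulates the grouping, not python-fire's dispatch; `split_pipeline`, the path `raw/*_en.txt`, and the directory `out` are hypothetical names used for illustration.

```python
def split_pipeline(tokens, separator="-"):
    """Group command-line tokens into stages at each bare separator token."""
    stages, current = [], []
    for token in tokens:
        if token == separator:
            # A bare separator closes the current stage (if it has any tokens).
            if current:
                stages.append(current)
            current = []
        else:
            current.append(token)
    if current:
        stages.append(current)
    return stages


# Tokens mimic the shape of a parasite.pipeline invocation; paths are placeholders.
tokens = [
    "AlignedBiText", "batch_from_files", "raw/*_en.txt", "--src-lang=en", "--tgt-lang=ru",
    "-", "apply", "segmenter", "reset",
    "-", "apply", "aligner", "greedy-one2one", "--distance=euclidean",
    "-", "to_files", "--output-dir=out",
]

for stage in split_pipeline(tokens):
    print(stage)
```

Each printed list corresponds to one step of the pipeline: load the bi-texts, reset segmentation, align, and write the output.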

Results

Here is the effect of using different components as part of the preprocessing, filtering, and monotonic alignment pipeline.

All numbers are BLEU scores on the WMT20 MEDLINE (local test) set for different data preprocessing configurations, with the exact same architecture and learning parameters.

Model                             en → ru   ru → en
baseline configuration            30.7      31.3
+ greedy alignments               30.1      31.8
+ detect subsection names         30.7      32.3
+ remove titles                   31.3      32.5
+ optimize total similarity       30.4      32.2
+ normalize distance matrix       30.8      32.1
+ penalize source/target ratio    31.2      31.5
+ one-to-many (K=3)               32.2      32.3
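To make the table easier to read, the short script below prints each configuration's change in BLEU relative to the baseline and reports the best row per direction. The scores are copied from the table; the variable names are illustrative only.

```python
# BLEU scores copied from the table above: (en → ru, ru → en).
scores = {
    "baseline configuration": (30.7, 31.3),
    "+ greedy alignments": (30.1, 31.8),
    "+ detect subsection names": (30.7, 32.3),
    "+ remove titles": (31.3, 32.5),
    "+ optimize total similarity": (30.4, 32.2),
    "+ normalize distance matrix": (30.8, 32.1),
    "+ penalize source/target ratio": (31.2, 31.5),
    "+ one-to-many (K=3)": (32.2, 32.3),
}

base_en_ru, base_ru_en = scores["baseline configuration"]

# Print the BLEU difference of every configuration against the baseline.
for name, (en_ru, ru_en) in scores.items():
    print(f"{name:32s} {en_ru - base_en_ru:+.1f} {ru_en - base_ru_en:+.1f}")

# Best-scoring configuration for each translation direction.
best_en_ru = max(scores, key=lambda name: scores[name][0])
best_ru_en = max(scores, key=lambda name: scores[name][1])
print("best en → ru:", best_en_ru)
print("best ru → en:", best_ru_en)
```

On these numbers the highest en → ru score (32.2, +1.5 BLEU over the baseline) comes from the one-to-many (K=3) row, while the highest ru → en score (32.5, +1.2 BLEU) comes from the remove titles row.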

Training graphs on Aim for en → ru, averaged across different runs, show the best configuration performing significantly better than the others.

Example

To replicate the preprocessing of our best submission (run 2, the winning models for the en-ru language pair in the WMT20 Biomedical Translation Task), run:

python -m parasite.pipeline \
    AlignedBiText batch_from_files /datasets/wmt20.biomed.ru-en.medline_train/raw_files/*_en.txt \
        --suffix=".txt" --src-lang="en" --tgt-lang="ru" \
    - apply segmenter reset \
    - apply segmenter scispacy --only-src \
    - apply segmenter razdel --only-tgt \
    - apply segmenter remove-title --only-tgt --blacklist='Резюме' \
    - apply segmenter keyword --only-src --path='examples/medline_keywords/eng_few.txt' \
    - apply segmenter keyword --only-tgt --path='examples/medline_keywords/rus_few.txt' \
    - apply segmenter remove-title --only-src \
    - apply encoder pretrained-transformer "xlm-roberta-large" \
        --normalize=2 --force-lowercase --normalize-length=avg --fp16 \
    - apply aligner greedy-one2one --distance=euclidean \
    --progress \
    - split --mapping-path="examples/wmt20.biomed.ru-en.medline_train.yerevann.splits.txt" \
    - to_files --output-dir="/datasets/wmt20.biomed.ru-en.medline_train/preprocessed_files"
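For intuition about the alignment step in the command above, here is a minimal sketch of one plausible reading of greedy one-to-one alignment with Euclidean distance over L2-normalized sentence vectors. It is not taken from the toolkit: the function names and the toy vectors are assumptions, real embeddings would come from the encoder, and this version does not enforce monotonic order.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_one2one(src_vecs, tgt_vecs):
    """Repeatedly pair the closest still-unmatched (source, target) sentences."""
    src = [l2_normalize(v) for v in src_vecs]
    tgt = [l2_normalize(v) for v in tgt_vecs]
    # All candidate pairs, closest first.
    candidates = sorted(
        (euclidean(s, t), i, j)
        for i, s in enumerate(src)
        for j, t in enumerate(tgt)
    )
    used_src, used_tgt, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            pairs.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(pairs)


# Toy 2-d "embeddings": three source sentences, two target sentences.
src_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_vecs = [[0.9, 0.1], [0.1, 0.9]]
print(greedy_one2one(src_vecs, tgt_vecs))  # → [(0, 0), (1, 1)]
```

The third source sentence stays unmatched because both targets are already taken, which is the kind of case a one-to-many aligner is meant to handle.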

To replicate our overall best preprocessing (not submitted, but described in the paper), run:

python -m parasite.pipeline \
    AlignedBiText batch_from_files /datasets/wmt20.biomed.ru-en.medline_train/raw_files/*_en.txt \
        --suffix=".txt" --src-lang="en" --tgt-lang="ru" \
    - apply segmenter reset \
    - apply segmenter syntok \
    - apply segmenter remove-title --only-tgt --blacklist='Резюме' \
    - apply segmenter keyword --only-src --path='examples/medline_keywords/eng.txt' \
    - apply segmenter keyword --only-tgt --path='examples/medline_keywords/rus.txt' \
    - apply segmenter remove-title --only-src \
    - apply encoder pretrained-transformer "xlm-roberta-large" \
        --encode-windows=3 --normalize-length=avg --fp16 \
    - apply aligner dynamic \
        --max-k=3 --penalty-ratio=2 --distance=euclidean --normalize=True \
    --progress \
    - split --mapping-path="examples/wmt20.biomed.ru-en.medline_train.yerevann.splits.txt" \
    - to_files --output-dir="/datasets/wmt20.biomed.ru-en.medline_train/preprocessed_files"
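The dynamic aligner in the command above allows one-to-many groupings of up to three sentences. A minimal dynamic-programming sketch of monotonic one-to-many alignment follows. It is an illustration under stated assumptions, not the toolkit's implementation: a group is represented by the mean of its sentence vectors, each group's distance is weighted by its size so that merging is not automatically cheaper, and the ratio penalty and distance-matrix normalization are not modeled. `monotonic_align` and the toy vectors are hypothetical.

```python
import math


def mean_vec(vecs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def monotonic_align(src, tgt, max_k=3):
    """Return (total_cost, groups); each group is (src_indices, tgt_indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    # Allowed steps: one source with k targets, or k sources with one target.
    steps = [(1, k) for k in range(1, max_k + 1)]
    steps += [(k, 1) for k in range(2, max_k + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for a, b in steps:
                if i + a > n or j + b > m:
                    continue
                dist = euclidean(mean_vec(src[i:i + a]), mean_vec(tgt[j:j + b]))
                cost = best[i][j] + dist * max(a, b)
                if cost < best[i + a][j + b]:
                    best[i + a][j + b] = cost
                    back[i + a][j + b] = (a, b)
    if best[n][m] == inf:
        raise ValueError("no monotonic alignment covers every sentence")
    # Walk the back-pointers from the end to recover the groups in order.
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        a, b = back[i][j]
        groups.append((list(range(i - a, i)), list(range(j - b, j))))
        i, j = i - a, j - b
    return best[n][m], groups[::-1]


# Toy example: the second source sentence corresponds to two target sentences.
src = [[2.0, 0.0], [0.0, 2.0]]
tgt = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
cost, groups = monotonic_align(src, tgt)
print(cost, groups)  # → 0.0 [([0], [0]), ([1], [1, 2])]
```

Because every step advances through both documents, the resulting groups never cross, which is what makes the alignment monotonic.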

Citation

To cite our work, please use the following BibTeX entry:

@inproceedings{hambardzumyan-etal-2020-yerevanns,
    title = "{Y}ereva{NN}{'}s Systems for {WMT}20 Biomedical Translation Task: The Effect of Fixing Misaligned Sentence Pairs",
    author = "Hambardzumyan, Karen  and
      Tamoyan, Hovhannes  and
      Khachatrian, Hrant",
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.wmt-1.88",
    pages = "820--825",
}