/WSPAlign

WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction, to appear at ACL 2023 main conference.

Primary LanguagePythonOtherNOASSERTION

WSPAlign

WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction, published at ACL 2023 main conference.

This repository includes the source codes of paper WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction. Part of the implementation is from word_align. The implementation of inference and evaluation are at WSPAlign.InferEval.

Requirements

Run pip install -r requirements.txt to install all the required packages.

Model list

Model List Description
qiyuw/WSPAlign-xlm-base Pretrained on xlm-roberta
qiyuw/WSPAlign-mbert-base Pretrained on mBERT
qiyuw/WSPAlign-ft-kftt Finetuned with English-Japanese KFTT dataset
qiyuw/WSPAlign-ft-deen Finetuned with German-English dataset
qiyuw/WSPAlign-ft-enfr Finetuned with English-French dataset
qiyuw/WSPAlign-ft-roen Finetuned with Romanian-English dataset

Use our model checkpoints with huggingface

Note: For Japanese, Chinese, and other asian languages, we recommend to use mbert-based models like qiyuw/WSPAlign-mbert-base for better performance as we discussed in the original paper.

Data preparation

Dataset list Description
qiyuw/wspalign_pt_data Pre-training dataset
qiyuw/wspalign_ft_data Finetuning dataset
qiyuw/wspalign_few_ft_data Few-shot fintuning dataset
qiyuw/wspalign_test_data Test dataset for evaluation

Construction of Finetuning and Test dataset can be found at word_align.

Run download_dataset.sh to download all the above datasets.

Pre-train and finetune

You can do pre-train, finetune and evaluate by running the following scripts.

Pre-train

See pretrain.sh for details.

You can also use pre-traned model to directly do word alignment (zero-shot), see zero-shot.sh for details.

Finetune

See finetune.sh, fewshot.sh for details.

Evaluate

Refer to WSPAlign Inference for details.

Citation

If you use our code or model, please cite our paper:

@inproceedings{wu-etal-2023-wspalign,
    title = "{WSPA}lign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction",
    author = "Wu, Qiyu  and Nagata, Masaaki  and Tsuruoka, Yoshimasa",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.621",
    pages = "11084--11099",
}

License

This software is released under the CC-BY-NC-SA-4.0 License, see LICENSE.txt.