This repository contains source code for our EMNLP 2021 Findings paper: Subword Mapping and Anchoring across Languages.
In our paper we propose a novel method to construct bilingual subword vocabularies. We identify false positives (identical subwords with different meanings across languages) and false negatives (different subwords with similar meanings) as limitations of jointly constructed subword vocabularies. SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique and uses them to create cross-lingual anchors based on subword similarities.
We first learn subwords separately for each language and then train the corresponding embeddings. We then apply a mapping method to obtain similarity scores between the embeddings, which we use to extract alignments between subwords of the two languages. We finally tie the parameters of the aligned subwords to create anchors during training.
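As a rough illustration of the anchoring step, the sketch below ties aligned subwords of the two languages to a single embedding row, so that updates from either language move the same parameters. The class, its names and the alignment dictionary are illustrative only, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the anchoring idea (illustrative, not the repo's code):
# aligned subwords of the two languages share one embedding row ("anchor"),
# so gradient updates from either language move the same parameters.
class AnchoredEmbedding(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim, tgt2src):
        super().__init__()
        self.src = nn.Embedding(src_vocab, dim)   # English subwords
        self.tgt = nn.Embedding(tgt_vocab, dim)   # target-only subwords
        self.tgt2src = tgt2src                    # {tgt_id: src_id} alignments

    def embed_target(self, ids):                  # ids: 1-D LongTensor
        rows = torch.tensor([self.tgt2src.get(i, -1) for i in ids.tolist()])
        out = self.tgt(ids).clone()
        aligned = rows >= 0
        out[aligned] = self.src(rows[aligned])    # tied: reuse the source row
        return out

# toy usage: target subword 7 is anchored to source subword 3
emb = AnchoredEmbedding(src_vocab=10, tgt_vocab=10, dim=4, tgt2src={7: 3})
print(emb.embed_target(torch.tensor([7, 2])))
```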
- Python 3.7.9
- Pytorch (tested on 1.6.0)
- FastText
- FastAlign (requires cmake)
- VecMap
- Transformers (tested on 4.1.0)
- Tokenizers (tested on 0.9.4)
Create an environment (optional): ideally, you should create a dedicated environment for the project.
conda create -n smala_env python=3.7.9
conda activate smala_env
Install PyTorch 1.6.0:
conda install pytorch==1.6.0 torchvision==0.7.0 -c pytorch
Clone the project:
git clone https://github.com/GeorgeVern/smala.git
cd smala
Then install the rest of the requirements:
pip install -r requirements.txt
Install tools (*) necessary for data extraction, preprocessing and alignment:
bash install-tools.sh
(*) You will have to change "from .extract" to "from extract" in line 66 of the wikiextractor/WikiExtractor.py script, otherwise you will get a relative import error.
Download and preprocess Wikipedia data and learn language-specific subword vocabularies for English (en) and another language, e.g. Greek (el):
bash get-mono-data.sh en
bash get-mono-data.sh el el-tokenizer
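For reference, a language-specific WordPiece tokenizer such as el-tokenizer could be trained with the Tokenizers library roughly as follows; the file path and vocabulary size are illustrative and may differ from what get-mono-data.sh actually does.

```python
# Hedged sketch: training a language-specific WordPiece tokenizer with the
# Tokenizers library. Paths and vocab_size are illustrative assumptions.
import os
from tokenizers import BertWordPieceTokenizer

os.makedirs("el-tokenizer", exist_ok=True)
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["data/mono/txt/el/el.train"], vocab_size=30000)
tokenizer.save_model("el-tokenizer")   # writes el-tokenizer/vocab.txt
```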
Learn subword embeddings for each language:
bash learn_subw_embs.sh en
bash learn_subw_embs.sh el el-tokenizer
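As a hedged sketch of this step, subword embeddings can be trained with the fastText Python bindings and exported in the .vec text format expected by VecMap; the paths and hyperparameters below are illustrative, and learn_subw_embs.sh may use different settings.

```python
# Hedged sketch: fastText skip-gram embeddings over WordPiece-tokenized text,
# exported as a word2vec-style .vec file (header + one vector per line).
import fasttext

model = fasttext.train_unsupervised("data/mono/txt/el/WP/el.train.wp",
                                    model="skipgram", dim=300)
words = model.get_words()
with open("data/mono/txt/el/WP/el.train.wp.vec", "w", encoding="utf-8") as f:
    f.write(f"{len(words)} {model.get_dimension()}\n")
    for w in words:
        vec = " ".join(f"{x:.4f}" for x in model.get_word_vector(w))
        f.write(f"{w} {vec}\n")
```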
Map the monolingual subword embeddings into a common space using the unsupervised version of VecMap, since we don't want to rely on seed dictionaries or identical (sub)words. Clone the GitHub repo of VecMap and then run:
python3 vecmap/map_embeddings.py --unsupervised smala/data/mono/txt/en/WP/en.train.wp.vec smala/data/mono/txt/el/WP/el.train.wp.vec smala/data/mono/txt/en/WP/mapped_en_el_embs.txt smala/data/mono/txt/el/WP/mapped_el_embs.txt
Extract subword alignments from the mapped subword embeddings:
python3 extract_alignments.py --src_emb data/mono/txt/en/WP/mapped_en_el_embs.txt --tgt_emb data/mono/txt/el/WP/mapped_el_embs.txt --similarity cosine --alignment_dir en-el --initialize
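For intuition, one simple way to extract alignments from the mapped embeddings is to compute cosine similarities and keep mutual nearest neighbours, as in the sketch below; the actual criterion used by extract_alignments.py may differ.

```python
# Hedged sketch of alignment extraction: cosine similarity between mapped
# source and target subword embeddings, keeping mutual nearest neighbours.
import numpy as np

def load_vec(path):
    # read a word2vec-style text file: header "n d", then "word v1 ... vd"
    with open(path, encoding="utf-8") as f:
        n, d = map(int, f.readline().split())
        words, vecs = [], np.empty((n, d), dtype=np.float32)
        for i, line in enumerate(f):
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs[i] = np.asarray(parts[1:], dtype=np.float32)
    return words, vecs

src_w, src_v = load_vec("data/mono/txt/en/WP/mapped_en_el_embs.txt")
tgt_w, tgt_v = load_vec("data/mono/txt/el/WP/mapped_el_embs.txt")
src_v /= np.linalg.norm(src_v, axis=1, keepdims=True)
tgt_v /= np.linalg.norm(tgt_v, axis=1, keepdims=True)
sim = src_v @ tgt_v.T                              # cosine similarity matrix
fwd, bwd = sim.argmax(axis=1), sim.argmax(axis=0)  # best match in each direction
alignments = [(src_w[i], tgt_w[j]) for i, j in enumerate(fwd) if bwd[j] == i]
```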
Create a new vocabulary for the target language (so that aligned subwords point to the same embedding in both languages) based on the alignments:
python3 utils/create_new_vocabs.py --tgt_tokenizer el-tokenizer --model_type ours --alignment_dir alignments/en-el
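The idea behind the new vocabulary is sketched below: aligned target subwords reuse the id of their English counterpart, while target-only subwords get fresh ids, so anchors end up sharing a single embedding row. The function and its output format are illustrative, not the actual output of create_new_vocabs.py.

```python
# Hedged illustration of the vocabulary mapping (format may differ from the
# repo's): aligned target subwords map to the English id, the rest get new ids.
def build_joint_ids(src_vocab, tgt_vocab, tgt2src_subword):
    src_id = {w: i for i, w in enumerate(src_vocab)}
    tgt_id, next_id = {}, len(src_vocab)
    for w in tgt_vocab:
        if w in tgt2src_subword:                 # aligned -> reuse the source row
            tgt_id[w] = src_id[tgt2src_subword[w]]
        else:                                    # target-only -> new row
            tgt_id[w] = next_id
            next_id += 1
    return tgt_id

print(build_joint_ids(["the", "##ing"], ["το", "##ing"], {"##ing": "##ing"}))
# -> {'το': 2, '##ing': 1}
```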
Initialize the embedding layer of the target model:
python3 utils/init_weight.py --tgt_vocab alignments/en-el/new_tgt_vocab.txt --prob alignments/en-el/prob_vector --tgt_model emb_layer/el/bert-ours_align_embs
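Conceptually, the initialization copies the pretrained English vector for anchors and uses a probability-weighted average of English vectors (RAMEN-style) for the remaining target subwords. The sketch below illustrates this; the tensor names and shapes are assumptions, not init_weight.py's actual interface.

```python
# Hedged sketch of embedding-layer initialization: anchors copy the English
# row, non-aligned target subwords get a probability-weighted average.
import torch

def init_target_embeddings(src_emb, prob, tgt2src):
    # src_emb: (V_src, d) pretrained English embedding matrix
    # prob:    (V_tgt, V_src) translation/alignment probabilities
    # tgt2src: {tgt_id: src_id} anchors extracted by SMALA
    tgt_emb = prob @ src_emb                 # weighted average for every row
    for t, s in tgt2src.items():
        tgt_emb[t] = src_emb[s]              # anchors reuse the English vector
    return tgt_emb

# toy usage with random weights and a uniform probability matrix
tgt_emb = init_target_embeddings(torch.randn(5, 8),
                                 torch.full((4, 5), 0.2), {0: 3})
```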
The above steps serve to employ SMALA with additional initialization of the non-aligned subwords (ours+align in the paper). To compare with the other models that are included in the paper you need to modify these steps:
- ours: as above, but run the extract_alignments.py script without the --initialize flag and the init_weight.py script with the --prob None flag.
- joint: skip the subword mapping and the first step of anchoring, run the extract_alignments.py script with the --similarity surface_form flag and without the --initialize flag, run the create_new_vocabs.py script with the --model_type joint flag and the init_weight.py script with the --prob None flag.
- ramen: skip the above steps, see RAMEN on how to create the probability vector (we also lowercase) and run the init_weight.py script with the correct --prob flag and the original tokenizer (e.g. --tgt_vocab el-tokenizer/vocab.txt).
Our method can also exploit parallel data (in the paper we use data from Europarl and the United Nations corpus). To do so, you must first download a parallel corpus (e.g. into data/para/en-el) and preprocess it (tokenize and lowercase). Then run:
python3 utils/apply_tokenizer.py --tokenizer bert --file data/para/en-el/en-el.en.txt
python3 utils/apply_tokenizer.py --tokenizer el-tokenizer --file data/para/en-el/en-el.el.txt
Then run FastAlign:
bash run_fast-align.sh en el data/para/en-el/WP/en-el.en.wp data/para/en-el/WP/en-el.el.wp data/para/en-el/WP/fast-align
To get the similarity matrix from the fast-align output, clone the RAMEN repo and run:
python3 ramen/code/alignment/get_prob_para.py --bitxt smala/data/para/en-el/WP/fast-align/cleared.en-el --align smala/data/para/en-el/WP/fast-align/align.en-el --save smala/data/para/en-el/WP/fast-align/probs.para.en-el.pth
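To make the content of this similarity matrix concrete, the sketch below counts co-aligned subword pairs from fast-align links and normalizes them into probabilities; RAMEN's get_prob_para.py is the actual implementation and may differ in its details.

```python
# Hedged sketch: turning fast-align links ("i-j" = source i aligned to target j)
# over "src ||| tgt" bitext into per-target-subword translation probabilities.
from collections import Counter, defaultdict

def alignment_probs(bitext_lines, align_lines):
    counts = defaultdict(Counter)
    for sent, links in zip(bitext_lines, align_lines):
        src, tgt = [s.split() for s in sent.split(" ||| ")]
        for link in links.split():
            i, j = map(int, link.split("-"))
            counts[tgt[j]][src[i]] += 1      # count (target, source) pairs
    return {t: {s: c / sum(cs.values()) for s, c in cs.items()}
            for t, cs in counts.items()}

probs = alignment_probs(["the cat ||| η γατα"], ["0-0 1-1"])
print(probs["γατα"])   # {'cat': 1.0}
```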
Finally, to extract alignments, create new vocabulary and initialize the embedding layer of the target model, run:
python3 extract_alignments_para.py --tgt_tokenizer el-tokenizer --similarity_matrix data/para/en-el/WP/fast-align/probs.para.en-el.pth --alignment_dir en-el_fastalign
python3 utils/create_new_vocabs.py --tgt_tokenizer el-tokenizer --model_type ours --alignment_dir alignments/en-el_fastalign
python3 utils/init_weight.py --tgt_vocab alignments/en-el_fastalign/new_tgt_vocab.txt --prob alignments/en-el_fastalign/prob_vector --tgt_model emb_layer/el/bert-ours_align_para_embs
To transfer a pretrained LM to a new language using SMALA run:
python3 fine-tune_biBERTLM.py \
--tgt_lang el \
--output_dir ckpts/greek_ours_align \
--foreign_model emb_layer/el/bert-ours_align_embs \
--biLM_model_name ours \
--alignment_dir alignments/en-el \
--tgt_tokenizer_name alignments/en-el/new_tgt_vocab.txt \
--do_train --do_eval \
--evaluation_strategy steps \
--seed 12 \
--per_device_eval_batch_size 38 \
--max_steps 120000 \
--eval_steps 5000 \
--logging_steps 5000 \
--save_steps 5000 \
--per_device_train_batch_size 38 \
--eval_accumulation_steps 1
To fine-tune the transferred LM on XNLI (in English) run:
(Download the XNLI 1.0 and XNLI-MT 1.0 files from the XNLI repo and unzip them inside the data folder.)
python3 fine-tune_xnli.py \
--data_dir data/ \
--biLM_model_name ours \
--biLM ckpts/greek_ours_align/checkpoint-120000/ \
--foreign_model emb_layer/el/bert-ours_align_embs \
--language en \
--output_dir ckpts/greek_xnli_ours_align/ \
--tgt_tokenizer_name alignments/en-el/new_tgt_vocab.txt \
--alignment_dir alignments/en-el/ \
--do_train --do_eval \
--seed 12
To zero-shot test in the target language (e.g. Greek) run:
python3 fine-tune_xnli.py \
--data_dir data/ \
--biLM_model_name ours \
--biLM ckpts/greek_ours_align/checkpoint-120000/ \
--foreign_model emb_layer/el/bert-ours_align_embs \
--language el \
--output_dir ckpts/greek_xnli_ours_align/ \
--tgt_tokenizer_name alignments/en-el/new_tgt_vocab.txt \
--alignment_dir alignments/en-el/ \
--do_test \
--seed 12
To reproduce our results use seed 12 for LM training and seeds 12, 93, 2319, 1210 and 21 for XNLI fine-tuning.
We would like to thank the community for releasing their code! This repository contains code from HuggingFace and from the RAMEN, VecMap, XLM and SimAlign repositories.
If you use this repo in your research, please cite the paper:
@misc{vernikos2021subword,
title={Subword Mapping and Anchoring across Languages},
author={Giorgos Vernikos and Andrei Popescu-Belis},
year={2021},
eprint={2109.04556},
archivePrefix={arXiv},
primaryClass={cs.CL}
}