docalign
This package reproduces the algorithm described by El-Kishky and Guzmán (2020) [1] for aligning massively multilingual documents using cross-lingual Sentence-Mover’s Distance.
The code in this repository focuses on aligning English and Sinhala document pairs, but it can be used with any other language pair after a few code-level changes.
The LASER project is used to compute sentence embeddings, as described in the original paper.
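To make the idea of a sentence-level mover's distance concrete, here is a minimal sketch of a greedy approximation in the spirit of the one described in [1]. It is not this repository's implementation: the function name `greedy_movers_distance`, the use of Euclidean distance, and the way sentence weights are passed in are all assumptions made for illustration. Each document is treated as a weighted bag of sentence embeddings, and mass is moved between the closest sentence pairs first.

```python
import math


def greedy_movers_distance(src_emb, tgt_emb, src_w, tgt_w):
    """Greedy approximation of a mover's distance between two documents.

    src_emb, tgt_emb: lists of sentence embedding vectors (hypothetical input).
    src_w, tgt_w: non-negative sentence weights; normalized to sum to 1 here.
    """
    src_mass = [w / sum(src_w) for w in src_w]
    tgt_mass = [w / sum(tgt_w) for w in tgt_w]

    # All sentence pairs, ordered from closest to farthest (Euclidean distance).
    pairs = sorted(
        (math.dist(s, t), i, j)
        for i, s in enumerate(src_emb)
        for j, t in enumerate(tgt_emb)
    )

    total = 0.0
    for d, i, j in pairs:
        # Move as much remaining mass as both sentences allow.
        flow = min(src_mass[i], tgt_mass[j])
        if flow > 0:
            total += flow * d
            src_mass[i] -= flow
            tgt_mass[j] -= flow
    return total
```

A lower value means the two documents are closer in the embedding space, so candidate pairs can be ranked by this score. The greedy pass gives an upper bound on the exact transport cost rather than the optimal value.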
Build docalign
Install Python 3 and pip3, then install the requirements:
pip install -r requirements.txt
Install sinling
Embed your own documents
For source document
python embedding_creator.py example/source.json ./example/se.json en
For target document
python embedding_creator.py example/target.json ./example/te.json si
Source and target documents should be in JSON format, as follows:
[
{
"content": "doc1"
},
{
"content": "doc2"
}
]
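If your documents are plain strings in memory, an input file in the format above can be produced with a few lines of Python. This is only a sketch: the helper name `write_documents` and the output path are hypothetical and not part of this package. `ensure_ascii=False` keeps non-Latin text such as Sinhala readable in the file.

```python
import json
import os
import tempfile


def write_documents(texts, path):
    """Write a list of document strings as [{"content": ...}, ...]."""
    docs = [{"content": text} for text in texts]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)
    return docs


# Hypothetical output location; point this at your own source or target file.
out_path = os.path.join(tempfile.gettempdir(), "source.json")
write_documents(["doc1", "doc2"], out_path)
```

The resulting file can then be passed to embedding_creator.py as the input document file.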
Run docalign (using embedded documents)
python main.py ./example/se.json ./example/te.json
Optional parameters:
python main.py --help
python embedding_creator.py --help
References
[1] Ahmed El-Kishky and Francisco Guzmán. 2020. Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance.