docalign

This package reproduces the algorithm described by El-Kishky and Guzmán (2020)[1] for aligning massively multilingual documents using cross-lingual Sentence-Mover’s Distance.
The code in this repository focuses on aligning English and Sinhala document pairs, but it can be used with any other language pair after a few code-level changes.
The LASER project is used to compute sentence embeddings, as described in the original paper.
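
To make the idea of a sentence-level mover's distance concrete, the following is a minimal sketch of a greedy approximation: each document is treated as a weighted bag of sentence embeddings, and mass is moved between the cheapest sentence pairs first. This is an illustration only, not the code in this repository. The function name `greedy_movers_distance`, the plain-list embedding format, the Euclidean distance, and the caller-supplied weights are all assumptions made for the example.

```python
import math


def greedy_movers_distance(src_vecs, src_weights, tgt_vecs, tgt_weights):
    """Greedy approximation of a mover's distance between two weighted
    sets of sentence embeddings (hypothetical helper, for illustration)."""
    # Normalise the weights so that each document carries a total mass of 1.
    src_total = sum(src_weights)
    tgt_total = sum(tgt_weights)
    src_mass = [w / src_total for w in src_weights]
    tgt_mass = [w / tgt_total for w in tgt_weights]

    # Enumerate every (source, target) sentence pair with its Euclidean
    # distance and visit the pairs from cheapest to most expensive.
    pairs = sorted(
        (math.dist(u, v), i, j)
        for i, u in enumerate(src_vecs)
        for j, v in enumerate(tgt_vecs)
    )

    total_cost = 0.0
    for dist, i, j in pairs:
        # Move as much of the remaining mass as this pair allows.
        flow = min(src_mass[i], tgt_mass[j])
        if flow <= 0.0:
            continue
        total_cost += flow * dist
        src_mass[i] -= flow
        tgt_mass[j] -= flow
    return total_cost
```

Under these assumptions, a lower value means the two documents are closer in embedding space, so candidate document pairs could be ranked by this score. Passing equal weights for every sentence is the simplest choice; other weighting schemes can be supplied through the weight arguments.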

Build docalign

Install Python 3 and pip3, then install the dependencies:
pip install -r requirements.txt
Install sinling

Embed your own documents

For the source documents:
python embedding_creator.py example/source.json ./example/se.json en

For the target documents:
python embedding_creator.py example/target.json ./example/te.json si

Source and target documents should be in JSON format, as follows:
[
    { "content": "doc1" },
    { "content": "doc2" }
]
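
If your documents are plain strings, a small helper along the following lines could produce and check files in this layout. It is a sketch rather than part of this package: the function names `write_documents` and `read_documents` are hypothetical, and the only assumption about the format is what is shown above, namely a top-level list of objects that each have a string `content` field.

```python
import json


def write_documents(texts, path):
    """Write a list of document strings in the [{"content": ...}] layout."""
    records = [{"content": text} for text in texts]
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-Latin scripts (e.g. Sinhala) readable.
        json.dump(records, handle, ensure_ascii=False, indent=4)


def read_documents(path):
    """Load a document file and check that it matches the expected layout."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("top-level JSON value must be a list")
    for index, record in enumerate(records):
        if not isinstance(record, dict) or not isinstance(record.get("content"), str):
            raise ValueError(f"record {index} needs a string 'content' field")
    return [record["content"] for record in records]
```

For example, `write_documents(["doc1", "doc2"], "example/source.json")` would write a file equivalent to the sample above, and `read_documents` raises `ValueError` if a record lacks the `content` field.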

Run docalign (using embedded documents)

python main.py ./example/se.json ./example/te.json

To list the optional parameters:
python main.py --help
python embedding_creator.py --help

References

[1] Ahmed El-Kishky and Francisco Guzmán. 2020. Massively multilingual document alignment with cross-lingual Sentence-Mover’s Distance.