Feng et al. (2020) proposed the LaBSE model, which is a multilingual sentence embedding model trained on 109 languages, including some Indic languages.
This toolkit uses LaBSE to:
- Assign quality scores to a parallel corpus
- Extract a high-quality parallel corpus from a noisy pseudo-parallel corpus
- Perform sentence alignment on a given misaligned parallel corpus
The output of all the above tasks is stored in a folder named out, which is created in the current working directory.
- Install Sentence-Transformers:
pip install sentence-transformers
- Install PyTorch with CUDA 11.1:
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
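After installing, it can help to confirm that both packages are importable and that PyTorch can see a GPU. The following is a minimal sketch and not part of the toolkit; the helper names check_dependencies and cuda_available are hypothetical.

```python
import importlib.util


def check_dependencies(packages=("sentence_transformers", "torch")):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


def cuda_available():
    """Return True only if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check itself never fails on import

    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
    print("CUDA available:", cuda_available())
```

If CUDA is reported as unavailable, LaBSE scoring should still run on the CPU, only more slowly.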
To assign quality scores to a parallel corpus:
python LaBSE-toolkit/sent_align.py --source test-en.txt --target test-mr.txt --batch_size 1000 --operation score
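A common way to score a sentence pair with LaBSE is the cosine similarity between the two sentence embeddings. The sketch below illustrates that idea under the assumption that line i of the source file is paired with line i of the target file; it is not the code in sent_align.py, and the function names are hypothetical. The similarity is written in plain Python so the logic is easy to follow; embed_with_labse uses the Sentence-Transformers API and downloads the model on first use.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_u * norm_v)


def score_pairs(src_embeddings, tgt_embeddings):
    """Score line i of the source against line i of the target."""
    if len(src_embeddings) != len(tgt_embeddings):
        raise ValueError("source and target must have the same number of lines")
    return [cosine_similarity(s, t) for s, t in zip(src_embeddings, tgt_embeddings)]


def embed_with_labse(sentences, batch_size=1000):
    """Embed sentences with LaBSE via Sentence-Transformers (assumed setup)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer("sentence-transformers/LaBSE")
    return model.encode(sentences, batch_size=batch_size).tolist()
```

For two files read into lists src_lines and tgt_lines, the scores would be score_pairs(embed_with_labse(src_lines), embed_with_labse(tgt_lines)).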
Parallel sentences are extracted based on the quality-score threshold provided with --threshold:
python LaBSE-toolkit/sent_align.py --source test-en.txt --target test-mr.txt --batch_size 1000 --operation score --threshold 0.8
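The extraction step can be pictured as a simple filter over scored pairs. This is a minimal sketch, assuming that a pair is kept when its score is greater than or equal to the threshold; the toolkit's exact comparison and output format are not documented here, and filter_by_threshold is a hypothetical name.

```python
def filter_by_threshold(src_lines, tgt_lines, scores, threshold=0.8):
    """Keep (source, target, score) triples whose score is at least `threshold`."""
    if not (len(src_lines) == len(tgt_lines) == len(scores)):
        raise ValueError("sources, targets and scores must have the same length")
    return [
        (src, tgt, score)
        for src, tgt, score in zip(src_lines, tgt_lines, scores)
        if score >= threshold  # assumption: the threshold is inclusive
    ]


if __name__ == "__main__":
    # Toy data for illustration only; real scores come from LaBSE.
    kept = filter_by_threshold(
        ["src-1", "src-2", "src-3"],
        ["tgt-1", "tgt-2", "tgt-3"],
        [0.91, 0.42, 0.80],
    )
    for src, tgt, score in kept:
        print(f"{score:.2f}\t{src}\t{tgt}")
```

A higher threshold keeps fewer but cleaner pairs, so the value is usually tuned per language pair.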
To perform sentence alignment on a misaligned parallel corpus:
python LaBSE-toolkit/sent_align.py --source test-en.txt --target test-mr.txt --batch_size 1000 --operation sent-align --threshold 0.8
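One simple way to align a misaligned corpus with sentence embeddings is to match each source sentence to its most similar target sentence and keep the match only if the similarity reaches the threshold. The sketch below shows this greedy nearest-neighbour strategy as an illustration; the strategy actually used by sent_align.py may differ, and align_sentences is a hypothetical name.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(src_embeddings, tgt_embeddings, threshold=0.8):
    """Return (source_index, target_index, score) for each source sentence
    whose best-matching target scores at least `threshold`."""
    alignments = []
    for i, src_vec in enumerate(src_embeddings):
        best_j, best_score = None, float("-inf")
        for j, tgt_vec in enumerate(tgt_embeddings):
            score = cosine_similarity(src_vec, tgt_vec)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            alignments.append((i, best_j, best_score))
    return alignments
```

This version compares every source sentence with every target sentence and may map several source sentences to the same target. For large files, a batched matrix product or an approximate nearest-neighbour index would be the more practical choice.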
-src, --source: Path to the source file
-tgt, --target: Path to the target file
-th, --threshold: LaBSE score threshold for extracting high-quality data
-b, --batch_size: Batch size for LaBSE scoring
-op, --operation: Operation to perform, either score or sent-align
-mp, --model_path: Path to the saved LaBSE model
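For reference, the arguments above could be declared with argparse roughly as follows. This is a sketch only: the flag names come from the list above, but the defaults, types, and which arguments are required are assumptions, not the toolkit's actual settings.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Score or align a parallel corpus with LaBSE."
    )
    parser.add_argument("-src", "--source", required=True,
                        help="path to the source file")
    parser.add_argument("-tgt", "--target", required=True,
                        help="path to the target file")
    parser.add_argument("-th", "--threshold", type=float, default=None,
                        help="LaBSE score threshold for extracting high-quality data")
    parser.add_argument("-b", "--batch_size", type=int, default=1000,
                        help="batch size for LaBSE scoring")
    parser.add_argument("-op", "--operation", choices=["score", "sent-align"],
                        default="score", help="operation to perform")
    parser.add_argument("-mp", "--model_path", default=None,
                        help="path to the saved LaBSE model")
    return parser
```

Calling build_parser().parse_args() in a script would then accept the same command lines shown in the examples above.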