Lingtrain Alignment Studio is an ML-based app for accurate alignment of texts in different languages.
- Extracts parallel corpora from two texts.
- Makes a formatted parallel book from them with sentence highlighting.
The automated alignment process relies on sentence embedding models. Embeddings are multidimensional vectors that represent the meaning of a sentence, so the distance between two vectors can be used to measure how close two sentences are. You can also plug in your own model using the interface described in the models directory. The list of supported languages depends on the selected backend model.
- distiluse-base-multilingual-cased-v2
  - more reliable and fast
  - moderate weights size: 500 MB
  - supports 50+ languages
  - full list of supported languages can be found in this paper
- LaBSE (Language-agnostic BERT Sentence Embedding)
  - can be used for rare languages
  - pretty heavy weights: 1.8 GB
  - supports 100+ languages
  - full list of supported languages can be found here
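To make the idea of a distance between sentence embeddings concrete, here is a minimal sketch in plain Python. It is not the app's actual alignment logic: the vectors are made-up 3-dimensional stand-ins for real embeddings, and `cosine_similarity` and `nearest_sentences` are hypothetical helper names. It uses cosine similarity (cosine distance is one minus this value), which is a common choice for comparing embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def nearest_sentences(
    source_vecs: Sequence[Sequence[float]],
    target_vecs: Sequence[Sequence[float]],
) -> List[int]:
    """For each source vector, return the index of the most similar target vector."""
    return [
        max(range(len(target_vecs)),
            key=lambda j: cosine_similarity(src, target_vecs[j]))
        for src in source_vecs
    ]


# Toy "embeddings", invented for illustration only. Real models produce
# vectors with hundreds of dimensions.
source = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]]
target = [[0.1, 0.9, 0.1], [0.8, 0.2, 0.1], [0.0, 0.1, 1.0]]

print(nearest_sentences(source, target))  # [1, 0, 2]
```

Here the first source sentence is matched to the second target sentence, the second to the first, and the third to the third. A real aligner also has to handle sentences that were split, merged, or dropped in translation, which a simple nearest-neighbour lookup like this does not cover.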
You can run the application on your computer using Docker.
- Make sure that Docker is installed by running the docker version command in your console.
- Images configured to run locally are available on Docker Hub.
- Run the following commands in your console:

    docker pull lingtrain/aligner:habr
    docker run -p 80:80 lingtrain/aligner:habr

- The app will be available in your browser at the localhost address.
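If you prefer to check from a script that the container is answering, something like the following sketch may help. It assumes the 80:80 port mapping from the docker run command above, so the app is expected at http://localhost/. The function name `app_is_up` is hypothetical, and the check only confirms that some HTTP server responds at that address, not that it is Lingtrain specifically.

```python
import urllib.error
import urllib.request


def app_is_up(url: str = "http://localhost/", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `app_is_up()` right after starting the container may return False for a short while, since the model weights can take some time to load.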
Clone this repo on your machine.
Flask/uwsgi backend REST API service. It's pretty simple and contains all the alignment logic.
cd /be
python main.py
SPA. Vue + vuex + vuetify. A UI for managing the alignment process through the backend, and a tool for translators to edit the documents being processed.
cd /fe
npm install
npm run serve
You can create an issue or send me a message on Telegram: @averkij
This work is licensed under an Attribution-NonCommercial-NoDerivatives 4.0 International license. See LICENSE.