Simple low-code library to facilitate the translation of English corpora using APIs or models.
When working on medical NLP in non-English languages, we often encounter a lack of source material, ranging from raw unlabeled text to labeled corpora in XML/JSONL format.
We have no doubt that this occurs in other fields as well. Hopefully this library can be of use to others.
Intended applications are:
- translation of text elements of XML/JSON objects
- translation of lists of pairs/triplets/...
- inclusion of a glossary, e.g. UMLS translations
- translation of pandas DataFrames
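
As a rough illustration of the first application, the sketch below walks a nested JSON-like object and translates only the string values stored under selected keys. The names `translate_json`, `text_keys` and `fake_translate` are hypothetical and not part of this library; in practice the callback would wrap an API or model call.

```python
from typing import Any, Callable, Iterable


def translate_json(obj: Any, translate: Callable[[str], str],
                   text_keys: Iterable[str]) -> Any:
    """Return a copy of obj in which strings stored under text_keys are translated."""
    keys = set(text_keys)

    def walk(node: Any, in_text: bool) -> Any:
        if isinstance(node, dict):
            # Each value is translatable only if its own key is a text key.
            return {k: walk(v, k in keys) for k, v in node.items()}
        if isinstance(node, list):
            # List items inherit the status of the key that holds the list.
            return [walk(item, in_text) for item in node]
        if isinstance(node, str) and in_text:
            return translate(node)
        # Numbers, booleans, None and non-text strings are copied unchanged.
        return node

    return walk(obj, False)


def fake_translate(text: str) -> str:
    # Stand-in for a real translation call (assumption for this example).
    return f"[nl] {text}"


record = {
    "id": "doc-1",
    "text": "The patient has a fever.",
    "labels": ["symptom"],
    "sections": [{"text": "No known allergies.", "code": "A1"}],
}
translated = translate_json(record, fake_translate, text_keys=["text"])
```

Identifiers, labels and codes are left untouched, so the translated object can be written back in the same structure as the source.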
- There is a config file that contains the API keys and other settings: `assets/config.json`
- There is a glossary file to enforce translations of certain terms: `assets/glossary.json`
- The main processing tasks are:
  - `translate_text`: translate a single text element
  - `translate_list`: translate a list of text elements
- The supporting tasks are:
  - `load_config`: load the config file
  - `load_glossary`: load the glossary file
  - `upload_glossary`: upload the glossary to the cloud service (for Google Translate)
  - `load_model`: load a model
  - `clean_text`: clean a text element
  - `clean_list`: clean a list of text elements
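
How these tasks could fit together is sketched below. This is not the library's implementation: the function names with the `_sketch` suffix, their signatures, and the glossary layout (a flat JSON object mapping an English term to its required translation) are assumptions made for illustration, and `translate_fn` stands in for whichever API or model performs the translation. Glossary terms are masked with placeholders before translation and restored afterwards, which is one common way to enforce fixed translations.

```python
import json
import re
from typing import Callable, Dict, List


def load_glossary_sketch(path: str) -> Dict[str, str]:
    """Load a glossary, assumed to be a flat JSON object {source_term: target_term}."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def clean_text_sketch(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def translate_text_sketch(text: str, glossary: Dict[str, str],
                          translate_fn: Callable[[str], str]) -> str:
    """Translate one text while forcing glossary terms to their fixed translation."""
    masked = clean_text_sketch(text)
    restore: Dict[str, str] = {}
    # Mask longer terms first so "heart failure" wins over "heart".
    for i, term in enumerate(sorted(glossary, key=len, reverse=True)):
        placeholder = f"__TERM{i}__"
        pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
        masked, count = pattern.subn(placeholder, masked)
        if count:
            restore[placeholder] = glossary[term]
    translated = translate_fn(masked)
    # Assumes the translator leaves the placeholders intact; real systems may not.
    for placeholder, target in restore.items():
        translated = translated.replace(placeholder, target)
    return translated


def translate_list_sketch(texts: List[str], glossary: Dict[str, str],
                          translate_fn: Callable[[str], str]) -> List[str]:
    """Apply translate_text_sketch to every element of a list."""
    return [translate_text_sketch(t, glossary, translate_fn) for t in texts]
```

With a cloud service that supports glossaries natively, the masking step would be unnecessary, since the uploaded glossary is applied by the service itself.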
The following Hugging Face (HF) models are recommended:
- yhavinga/ul2-large-en-nl
- yhavinga/ul2-base-nl36-en-nl
- yhavinga/ul2-base-en-nl
- michaelfeil/ct2fast-nllb-200-3.3B
- facebook/nllb-200-1.3B
- facebook/nllb-200-3.3B
- Transformers
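
For orientation, the sketch below shows one way to run an NLLB checkpoint from the list above directly with Transformers (PyTorch is also required). It is independent of this library's `load_model`; the function names, defaults and batch size are assumptions, and the language codes are the FLORES-200 codes that NLLB uses (`eng_Latn` for English, `nld_Latn` for Dutch). The other models in the list may need different loading code.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_nllb(texts: List[str],
                   model_name: str = "facebook/nllb-200-1.3B",
                   src_lang: str = "eng_Latn",
                   tgt_lang: str = "nld_Latn",
                   batch_size: int = 8,
                   max_length: int = 512) -> List[str]:
    """Translate texts with an NLLB model; downloads the weights on first use."""
    # Imported lazily so that `batched` works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # NLLB selects the output language through the first generated token.
    target_id = tokenizer.convert_tokens_to_ids(tgt_lang)

    results: List[str] = []
    for batch in batched(texts, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        outputs = model.generate(**inputs, forced_bos_token_id=target_id,
                                 max_length=max_length)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

Calling `translate_nllb(["The patient has a fever."])` returns a list with one Dutch string; the larger checkpoints follow the same pattern but need considerably more memory.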