/lowresCCWR

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

Improving Low-Resource Languages in Pre-Trained Multilingual Language Models

Method

Install

pip install -r requirements.txt

Run

See scripts/ for examples scripts:

  • ./scripts/example_vocab_extend.sh: for building vocabulary extended LMs (eBERT)
  • ./scripts/example_unsup_alignment.sh: for unsupervised mining based alignment and evaluation
  • ./scripts/example_ner.sh: for NER evaluation
  • ./scripts/mine_word_pairs.sh: for mining word pairs from a given source and target dataset pair

For further details please look at the relevant scripts and the parameter descriptions in the used python scripts.

External Resources and Tools

Data

We used the following datasets for the experiments:

Tokenizers

The datasets are tokenized with the following tools:

  • English: Moses Tokenizer: https://github.com/moses-smt/mosesdecoder
  • Nepali: Tokenizer and Normalizer from indic-nlp: https://github.com/anoopkunchukuttan/indic_nlp_library
  • Swahili: Moses Tokenizer
  • Malayalam: Tokenizer and Normalizer from indic-nlp
  • Sinhala: Tokenizer and Normalizer from indic-nlp
  • Maori: Moses Tokenizer
  • Sindhi: Moses Tokenizer
  • Amharic: https://github.com/uhh-lt/amharicprocessor
  • Gujrati: Tokenizer and Normalizer from indic-nlp
  • Kannada: Tokenizer and Normalizer from indic-nlp
  • Bengali: Tokenizer and Normalizer from indic-nlp
  • Afrikaans: Moses Tokenizer
  • Macedonian: Moses Tokenizer
  • Basque: Moses Tokenizer
  • Bulgarian: Moses Tokenizer
  • Nepali: Tokenizer and Normalizer from indic-nlp

Cite

Related publications:

[1] Viktor Hangya, Hossain Shaikh Saadi, and Alexander Fraser. 2022. Improving Low-Resource Languages in Pre-Trained Multilingual Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11993–12006

@inproceedings{hangya-etal-2022-improving,
    title = "Improving Low-Resource Languages in Pre-Trained Multilingual Language Models",
    author = "Hangya, Viktor  and
      Saadi, Hossain Shaikh  and
      Fraser, Alexander",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.822",
    pages = "11993--12006",
}