/salt

Language experimentation tools to accompany the SALT dataset

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Sunbird African Language Technology (SALT)

This package contains utilities and helper functions to accompany the SALT datasets. These tools are intended to make it easy to experiment with open-source NLP and speech technology in African languages, including:

  • Creation of multilingual datasets
  • Training and evaluation of multilingual models
  • Preprocessing data (augmentation, formatting)
  • Helper functions for training HuggingFace models

Datasets

SALT data is hosted on HuggingFace. It currently contains:

  • Translation data, with ~25,000 sentences translated between English, Luganda, Swahili, Ateso, Lugbara, Acholi and Runyankole.
  • Speech recognition data, with ~5,000 sentences read out by a variety of speakers in English (Ugandan accent), Luganda, Acholi, Ateso, Lugbara and Runyankole.
  • Text-to-speech data, with ~5,000 sentences read out by professional voice actors in a studio setting in English (Ugandan accent), English (Kenyan accent), Swahili, Luganda, Acholi, Ateso, Lugbara and Runyankole.
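
Because the translation data is parallel across several languages, a single sentence record can be expanded into training examples for every translation direction. The snippet below is a minimal sketch of that idea in plain Python. It does not use this package's API: the record layout (one field per language), the ISO 639-3 style language codes, and the function name translation_pairs are all assumptions made for illustration, and the translations are placeholders rather than real data.

```python
from itertools import permutations

# Assumed language codes for English, Luganda, Swahili, Ateso, Lugbara,
# Acholi and Runyankole. The actual field names in the hosted dataset may differ.
LANGUAGES = ["eng", "lug", "swa", "teo", "lgg", "ach", "nyn"]


def translation_pairs(record, languages=LANGUAGES):
    """Yield one example per ordered (source, target) language pair.

    `record` is a hypothetical dict mapping a language code to the sentence
    in that language. Languages that are missing or empty are skipped.
    """
    available = [lang for lang in languages if record.get(lang)]
    for src, tgt in permutations(available, 2):
        yield {
            "source_language": src,
            "target_language": tgt,
            "source": record[src],
            "target": record[tgt],
        }


# Placeholder record: the non-English values stand in for real translations.
record = {
    "eng": "Good morning.",
    "lug": "<Luganda translation>",
    "ach": "<Acholi translation>",
}

pairs = list(translation_pairs(record))
for pair in pairs:
    print(pair["source_language"], "->", pair["target_language"])
```

A record with all seven languages present would yield 7 × 6 = 42 directed pairs, so in practice it may be worth restricting the languages argument to the directions a given model is meant to cover.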

Publications

Multilingual Model and Data Resources for Text-To-Speech in Ugandan Languages. Isaac Owomugisha, Benjamin Akera, Ernest Tonny Mwebaze, John Quinn. 4th Workshop on African Natural Language Processing, 2023. [pdf]

Machine Translation For African Languages: Community Creation Of Datasets And Models In Uganda. Benjamin Akera, Jonathan Mukiibi, Lydia Sanyu Naggayi, Claire Babirye, Isaac Owomugisha, Solomon Nsumba, Joyce Nakatumba-Nabende, Engineer Bainomugisha, Ernest Mwebaze, John Quinn. 3rd Workshop on African Natural Language Processing, 2022. [pdf]