AfricaNLP resources

A list of the resources we developed in collaboration with LSV and Masakhane during my doctoral studies and beyond.

Labelled Datasets for AfricaNLP

Dataset Name	NLP Task	Link to Publication	Languages covered
MasakhaNER	named entity recognition	MasakhaNER: Named Entity Recognition for African Languages	amh, hau, ibo, kin, lug, luo, pcm, swa, wol, yor
MAFAND-MT	machine translation	A Few Thousand Translations Go a Long Way	amh, bam, bbj, ewe, fon, hau, ibo, kin, lug, luo, mos, nya, pcm, sna, swa, tsn, twi, wol, xho, yor, zul
ANTC	news-topic classification	multilingual adaptive fine-tuning (MAFT)	lin, pcm, mlg, som, zul
MENYO-20K	machine translation	MENYO-20k: A Multi-domain English–Yoruba Corpus for Machine Translation	yor
NaijaSenti	sentiment classification	NaijaSenti: A Nigerian Twitter Sentiment Corpus	hau, ibo, pcm, yor
Hausa and Yoruba News Topic	news-topic classification	Transfer Learning and Distant Supervision for Multilingual Transformer Models	hau, yor
Hausa VOA NER	named entity recognition	Transfer Learning and Distant Supervision for Multilingual Transformer Models	hau, yor
Yoruba GV NER	named entity recognition	Massive vs. Curated Word Embeddings for Low-Resourced Languages	yor
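To find which of the datasets above cover a given language, the table can be transcribed into a small lookup. The following is a minimal sketch: the `DATASETS` dictionary and the `datasets_for` helper are illustrative names, not part of any released package, and the entries are copied by hand from the table.

```python
# Hand transcription of the table above: name -> (task, ISO 639-3 codes).
DATASETS = {
    "MasakhaNER": ("named entity recognition",
                   "amh hau ibo kin lug luo pcm swa wol yor"),
    "MAFAND-MT": ("machine translation",
                  "amh bam bbj ewe fon hau ibo kin lug luo mos nya pcm "
                  "sna swa tsn twi wol xho yor zul"),
    "ANTC": ("news-topic classification", "lin pcm mlg som zul"),
    "MENYO-20K": ("machine translation", "yor"),
    "NaijaSenti": ("sentiment classification", "hau ibo pcm yor"),
    "Hausa and Yoruba News Topic": ("news-topic classification", "hau yor"),
    "Hausa VOA NER": ("named entity recognition", "hau yor"),
    "Yoruba GV NER": ("named entity recognition", "yor"),
}


def datasets_for(lang, task=None):
    """Return the names of datasets covering `lang`, in table order.

    If `task` is given, only datasets for that NLP task are returned.
    """
    return [
        name
        for name, (dataset_task, langs) in DATASETS.items()
        if lang in langs.split() and (task is None or dataset_task == task)
    ]


if __name__ == "__main__":
    print(datasets_for("zul"))
    print(datasets_for("yor", task="named entity recognition"))
```

Language codes are stored as whitespace-separated strings only to keep the transcription close to the table; a set per dataset would work equally well.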

Unlabelled Corpora for AfricaNLP

African News corpus: Please cite our MAFT paper if you use it.
AfroMAFT Corpus: A language adaptation corpus for 17 African languages plus English, French and Arabic. Please cite the MAFAND paper if you use it. We used this corpus to train all the multilingual PLMs listed below.

Multilingual Pre-trained Language Models

The models below were created by applying multilingual adaptive fine-tuning (MAFT) to a distilled XLM-R model, XLM-R, mT5, ByT5 and mBART. For each model we list its size (in millions of parameters) and architecture. The models cover the following 20 languages: afr, amh, ara, eng, fra, hau, ibo, mlg, nya, orm, pcm, kin, run, sna, som, sot, swa, xho, yor, zul.

Model	Size (parameters)	Architecture
AfroXLMR-mini	117M	Masked LM
AfroXLMR-small	140M	Masked LM
AfroXLMR-base	270M	Masked LM
AfroXLMR-large	550M	Masked LM
AfriMT5	580M	Seq-to-Seq
AfriByT5	580M	Seq-to-Seq
AfriMBART	610M	Seq-to-Seq
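The architecture column determines which Hugging Face `transformers` auto class is appropriate when loading a checkpoint. The sketch below assumes the models are loaded through `transformers`; the `MODELS` table and helper names are illustrative, and `hub_id` is a placeholder argument that you must fill in with the actual model identifier, since it is not given in the table.

```python
# Model table above: name -> (size in millions of parameters, architecture).
MODELS = {
    "AfroXLMR-mini": (117, "Masked LM"),
    "AfroXLMR-small": (140, "Masked LM"),
    "AfroXLMR-base": (270, "Masked LM"),
    "AfroXLMR-large": (550, "Masked LM"),
    "AfriMT5": (580, "Seq-to-Seq"),
    "AfriByT5": (580, "Seq-to-Seq"),
    "AfriMBART": (610, "Seq-to-Seq"),
}

# transformers auto class matching each architecture in the table.
AUTO_CLASS = {
    "Masked LM": "AutoModelForMaskedLM",
    "Seq-to-Seq": "AutoModelForSeq2SeqLM",
}


def auto_class_name(model_name):
    """Return the name of the transformers auto class for `model_name`."""
    _, architecture = MODELS[model_name]
    return AUTO_CLASS[architecture]


def load(model_name, hub_id):
    """Load tokenizer and model; `hub_id` is supplied by the caller.

    transformers is imported lazily so that the lookup helpers above
    work without it installed. Calling this downloads the checkpoint.
    """
    import transformers

    model_cls = getattr(transformers, auto_class_name(model_name))
    tokenizer = transformers.AutoTokenizer.from_pretrained(hub_id)
    model = model_cls.from_pretrained(hub_id)
    return tokenizer, model
```

The masked LMs are the natural choice for classification and tagging tasks such as NER, while the sequence-to-sequence models suit generation tasks such as machine translation.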

Language Adaptive Fine-tuning (LAFT) Models

The following PLMs were created by adapting a multilingual model to a single language using a monolingual corpus in that language. The monolingual corpora used to create them are described in the MasakhaNER and MAFT papers.

Language	mBERT	XLM-R-base
amh	Davlan/bert-base-multilingual-cased-finetuned-amharic	Davlan/xlm-roberta-base-finetuned-amharic
hau	Davlan/bert-base-multilingual-cased-finetuned-hausa	Davlan/xlm-roberta-base-finetuned-hausa
ibo	Davlan/bert-base-multilingual-cased-finetuned-igbo	Davlan/xlm-roberta-base-finetuned-igbo
kin	Davlan/bert-base-multilingual-cased-finetuned-kinyarwanda	Davlan/xlm-roberta-base-finetuned-kinyarwanda
lin		Davlan/xlm-roberta-base-finetuned-lingala
lug	Davlan/bert-base-multilingual-cased-finetuned-luganda	Davlan/xlm-roberta-base-finetuned-luganda
luo	Davlan/bert-base-multilingual-cased-finetuned-luo	Davlan/xlm-roberta-base-finetuned-luo
mlg
nya		Davlan/xlm-roberta-base-finetuned-chichewa
pcm	Davlan/bert-base-multilingual-cased-finetuned-naija	Davlan/xlm-roberta-base-finetuned-naija
sna		Davlan/xlm-roberta-base-finetuned-shona
som		Davlan/xlm-roberta-base-finetuned-somali
swa	Davlan/bert-base-multilingual-cased-finetuned-swahili	Davlan/xlm-roberta-base-finetuned-swahili
wol	Davlan/bert-base-multilingual-cased-finetuned-wolof	Davlan/xlm-roberta-base-finetuned-wolof
xho		Davlan/xlm-roberta-base-finetuned-xhosa
yor	Davlan/bert-base-multilingual-cased-finetuned-yoruba	Davlan/xlm-roberta-base-finetuned-yoruba
zul		Davlan/xlm-roberta-base-finetuned-zulu
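The model identifiers in the table follow a regular pattern, so they can be built from the language code. The sketch below transcribes the table and shows one way to try a model with the `transformers` fill-mask pipeline; the helper names are illustrative, and `mlg` is left out because the table lists no model for it.

```python
# Suffix used in the model identifiers above, per ISO 639-3 code.
LANG_NAMES = {
    "amh": "amharic", "hau": "hausa", "ibo": "igbo", "kin": "kinyarwanda",
    "lin": "lingala", "lug": "luganda", "luo": "luo", "nya": "chichewa",
    "pcm": "naija", "sna": "shona", "som": "somali", "swa": "swahili",
    "wol": "wolof", "xho": "xhosa", "yor": "yoruba", "zul": "zulu",
}

# Languages with an mBERT entry in the table; all others are XLM-R only.
MBERT_LANGS = {"amh", "hau", "ibo", "kin", "lug", "luo", "pcm", "swa",
               "wol", "yor"}

PREFIX = {
    "mbert": "Davlan/bert-base-multilingual-cased-finetuned-",
    "xlmr": "Davlan/xlm-roberta-base-finetuned-",
}


def laft_model_id(lang, base="xlmr"):
    """Return the model identifier for `lang`, or None if not listed."""
    if lang not in LANG_NAMES:
        return None
    if base == "mbert" and lang not in MBERT_LANGS:
        return None
    return PREFIX[base] + LANG_NAMES[lang]


def fill_mask(lang, text_with_mask, base="xlmr"):
    """Run fill-mask with the adapted model (downloads the checkpoint).

    Write the literal string "{mask}" where the masked word should go;
    it is replaced by the tokenizer's own mask token, which differs
    between mBERT and XLM-R.
    """
    from transformers import pipeline  # lazy import

    model_id = laft_model_id(lang, base)
    if model_id is None:
        raise ValueError(f"no {base} model listed for {lang!r}")
    fill = pipeline("fill-mask", model=model_id)
    return fill(text_with_mask.replace("{mask}", fill.tokenizer.mask_token))
```

For example, `laft_model_id("hau")` gives the XLM-R-base Hausa model from the table, and `laft_model_id("zul", base="mbert")` returns `None` because only an XLM-R model is listed for Zulu.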

FastText Embeddings for African languages

We provide word embeddings of better quality than the pre-trained FastText embeddings trained on Common Crawl and Wikipedia. Although we did not evaluate them on every language, our evaluation on Yoruba and Twi shows that they perform better on word similarity tasks. The FastText embeddings are trained on curated data from JW300, the Bible, VOA, BBC, and other news websites. Details of the data sources are in my PhD dissertation.
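A word similarity evaluation is commonly scored as the Spearman correlation between the cosine similarity of two word vectors and a human similarity rating for the same pair. The sketch below illustrates that general procedure in plain Python; it is not the exact evaluation script behind the results above, and the toy vectors, placeholder words and ratings are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def word_similarity_score(vectors, gold_pairs):
    """Spearman correlation between model and human similarity.

    `vectors` maps word -> vector; `gold_pairs` is a list of
    (word1, word2, human_rating). Pairs with a missing word are skipped.
    """
    model, human = [], []
    for w1, w2, rating in gold_pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(rating)
    return pearson(ranks(model), ranks(human))


if __name__ == "__main__":
    # Toy data for illustration only (placeholder words, invented ratings).
    toy_vectors = {"w1": [1.0, 0.0], "w2": [0.9, 0.1],
                   "w3": [0.0, 1.0], "w4": [-1.0, 0.0]}
    toy_gold = [("w1", "w2", 9.0), ("w1", "w3", 5.0), ("w1", "w4", 1.0)]
    print(word_similarity_score(toy_vectors, toy_gold))
```

With real FastText models, out-of-vocabulary words can still receive a vector from their character n-grams, so the skip step matters mainly when the vectors are exported to a plain word-to-vector mapping as assumed here.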

We trained the FastText embeddings using Gensim 3.8.1. All embedding models can be downloaded from Zenodo; the links are listed below.