A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
- TANZIL: Translations of the Quran into 42 languages, including African languages such as Amharic, Hausa, Somali, and Swahili.
- MENYO-20k: A Yorùbá-English multi-domain parallel text dataset.
- FFR: A Fon-French parallel text dataset.
- Hausa Corpus: A Hausa-English parallel text dataset.
- CCAligned: A parallel text dataset for English and 137 languages, including 30 African languages.
- ParaCrawl: A parallel text dataset for 41 languages, including Somali and Swahili.
- WikiMatrix: A parallel text dataset for 85 languages, including Swahili, Malagasy, and Egyptian Arabic.
- Ethiopian MT datasets: A parallel text dataset for English paired with 7 Ethiopian languages.
- English-Luganda: An English-Luganda parallel text dataset.
- French-Fon and French-Ewe: A parallel text dataset for French paired with Fon and Ewe.
- Amharic-English: An Amharic-English parallel text dataset.
- Tigrinya-English: A Tigrinya-English parallel text dataset (Free registration required).
- Lingala-French: A Lingala-French parallel text dataset (Free registration required).
- Congolese Swahili-French (Min, Small, Medium): Congolese Swahili-French parallel text datasets (Free registration required).
- Swahili-French: A synthetic Swahili-French parallel text dataset (Free registration required).
- English-Hausa (Min, Small): English-Hausa parallel text datasets (Free registration required).
- English-Swahili: An English-Swahili parallel text dataset (Free registration required).
- English-Kanuri: An English-Kanuri parallel text dataset (Free registration required).
- English-Akuapem Twi: An English-Akuapem Twi parallel text dataset.
- FLORES-101: A parallel text dataset for 101 languages, including 20 African languages.
- isiXhosa-English: An isiXhosa-English parallel text dataset.
- Tatoeba: A parallel text dataset for 409 languages, including 28 African languages.
- Gnome: A technical domain parallel text dataset for 197 languages, including 16 African languages.
- Ubuntu: A technical domain parallel text dataset for 244 languages, including 24 African languages.
- OPUS-100: A parallel text dataset for 100 languages, including 9 African languages.
- TICO-19: A COVID-19 domain parallel text dataset for 37 languages, including 13 African languages.
- Mozilla localization: A parallel text dataset for 197 languages, including 18 African languages.
- KINNEWS and KIRNEWS: News classification datasets for Kinyarwanda (KINNEWS) and Kirundi (KIRNEWS).
- Setswana and Sepedi: News classification datasets for Setswana and Sepedi.
- Swahili News: A news classification dataset for Swahili.
- Amharic News Text classification: A news text classification dataset for Amharic.
- VOA Hausa and BBC Yoruba news classification: A news title classification dataset for Hausa and Yoruba.
- TUNIZI: A Tunisian Arabizi sentiment analysis dataset.
- NaijaSenti: A sentiment analysis dataset for Hausa, Igbo, Yoruba, and Nigerian Pidgin.
- Amharic Summarization: A dataset for Amharic abstractive text summarization.
- XL-Sum: A dataset for multilingual abstractive text summarization for 44 languages, including 10 African languages.
- MasakhaNER: A dataset for Named Entity Recognition of 10 African languages.
- WikiANN: A dataset for Named Entity Recognition for 282 languages, including several African languages.
- Yoruba GV NER: A Yoruba Named Entity Recognition dataset.
- Hausa VOA NER: A Hausa Named Entity Recognition dataset.
- ALFFA: An ASR dataset for Amharic, Hausa, Swahili, and Wolof.
- AMMI ASR dataset: An ASR dataset for 19 languages, including 16 African languages.
- CommonVoice: An ongoing ASR dataset project for 60 languages (as of May 2021), including Kinyarwanda, Kabyle, Luganda, and Hausa.
- Fon: An ASR dataset for Fon.
- Swahili: A Swahili speech dataset (Free registration required).
- Congolese Swahili: A Congolese Swahili speech dataset (Free registration required).
- BembaSpeech: An ASR dataset for Bemba.
- SPCS Speech: A Sepedi speech dataset.
- SADiLaR TTS: Text-to-speech (TTS) datasets for Afrikaans, Sesotho, Setswana, and isiXhosa.
- NCHLT Speech: Speech datasets for South Africa's eleven official languages, including Afrikaans, Xitsonga, Setswana, Sesotho, Sepedi, isiZulu, Tshivenda, Siswati, isiXhosa, and isiNdebele.
- IARPA Babel Swahili data: An ASR dataset for Swahili (requires a $25 payment).
- Mboshi: A Mboshi-French parallel speech dataset.
- IWSLT 2021 Speech Translation: Speech translation datasets for Swahili-English and Congolese Swahili-French.
- Swahili Language Modeling: A Swahili dataset for language modeling and additional datasets for Swahili Syllabic Alphabet and Swahili Word Analogy.
- OSCAR: A multilingual dataset for 166 languages, including Amharic, Somali, Yoruba, Egyptian Arabic, Malagasy, Swahili, and Afrikaans.
- Luganda Agriculture data (Bukedde, Wikipedia): Monolingual datasets for Luganda in the agricultural domain from Bukedde and Wikipedia.
- isiXhosa: A monolingual dataset for isiXhosa.
- mC4: A multilingual dataset for 101 languages, including 13 African languages.
- MOT v1.0: A multilingual dataset for 44 languages, including 11 African languages.
- ipa-dict: A phonetic dictionary for 23 languages, including Swahili.
- za-lex: Lexical pronunciation datasets for 6 languages spoken in South Africa: Afrikaans, Southern Sotho, Xhosa, Zulu, SA English, and Tswana.

This is a growing list of NLP datasets for African languages. If I have missed a publicly available dataset, feel free to add it through a pull request, contact me on Twitter, or email me at niyongabor.andre@gmail.com.