AfroXLMR: Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning
This repository contains code and data for our COLING22 paper Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning.
In this paper, we propose multilingual adaptive fine-tuning (MAFT) as a method for simultaneously adapting multilingually pre-trained language models (PLMs) to 17 of Africa's most-resourced languages and three other high-resource languages widely spoken on the continent. This is crucial because, when used on downstream tasks, these PLMs typically show a considerable drop in performance on languages not seen during pretraining. As a further contribution of this paper, we show that we can specialize a PLM to African languages by removing vocabulary tokens from its embedding layer that do not correspond to African language scripts, effectively reducing the model size by 50%. We compare MAFT with the related strategy of language adaptive fine-tuning (LAFT) and evaluate our proposed approaches on three distinct NLP tasks for which African-language datasets exist. To ensure that our techniques are tested on typologically diverse languages, we also curated a new dataset, ANTC (African News Topic Classification), covering five African languages.
We release two kinds of pretrained models (each in a base and a large version):
- XLM-R + MAFT on 20 languages (i.e., AfroXLMR)
- XLM-R + LAFT on 20 languages
These models can be downloaded from the Hugging Face Hub.
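Once downloaded (or pulled directly from the Hub), the models are drop-in replacements for XLM-R. A minimal loading sketch, assuming the base MAFT model is published under the ID `Davlan/afro-xlmr-base` (verify the exact name on the Hub):

```python
# Minimal sketch: load AfroXLMR like any other XLM-R checkpoint.
# The model ID is an assumption; check the Hugging Face Hub for the exact name.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Davlan/afro-xlmr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("Ìwé yìí dára gan-an.", return_tensors="pt")  # example Yorùbá sentence
outputs = model(**inputs)
```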
Parameter-Efficient Models:
- Sparse Fine-tunings on AfroXLMR
- MAD-X 2.0 Adapters on AfroXLMR on Zenodo
Monolingual texts used to train adapters and sparse fine-tunings can be found on Zenodo
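The MAD-X 2.0 adapters from Zenodo can be plugged into AfroXLMR with the adapter-transformers library. A rough sketch only: the model ID and local adapter path below are placeholders, and the exact API may differ between adapter-transformers versions.

```python
# Rough sketch, assuming the adapter-transformers fork of transformers is installed.
# Both the model ID and the local adapter directory (downloaded from Zenodo) are placeholders.
from transformers import AutoAdapterModel, AutoTokenizer

model_id = "Davlan/afro-xlmr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAdapterModel.from_pretrained(model_id)

adapter_name = model.load_adapter("./mad-x-2.0-adapter-hau")  # path to an unzipped language adapter
model.set_active_adapters(adapter_name)
```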
In this work, we evaluated our models on three downstream tasks:
- NER: To obtain the NER dataset, please download it from this repository.
- Text Classification: To obtain the topic classification dataset, please download it from this repository. This repository also includes the newly created ANTC text classification dataset for five African languages.
- Sentiment Analysis: To obtain the sentiment classification dataset, please download it from this repository.
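For reference, the NER evaluation in the paper uses the MasakhaNER corpus, which is also mirrored on the Hugging Face Hub. A small loading sketch (the dataset and config names follow the Hub listing; recent versions of the datasets library may additionally require `trust_remote_code=True`):

```python
# Sketch: load one MasakhaNER language split (Yorùbá) from the Hugging Face Hub.
from datasets import load_dataset

ner_yor = load_dataset("masakhaner", "yor")
print(ner_yor["train"][0])  # tokens and NER tags for the first sentence
```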
To perform MAFT or LAFT, we have provided the training scripts and instructions in ./AdaptiveFinetuning/. Follow the instructions and run the command:
bash train.sh
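Conceptually, MAFT and LAFT boil down to continued masked-language-model training of XLM-R on monolingual text (all 20 languages combined for MAFT, one language at a time for LAFT). The sketch below is illustrative only: the corpus path and hyperparameters are placeholders, and the repo's train.sh sets the actual configuration.

```python
# Illustrative sketch of adaptive fine-tuning as continued MLM training.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Hypothetical text file holding the (concatenated) monolingual corpora.
raw = load_dataset("text", data_files={"train": "combined_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./afro-xlmr-base", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```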
For vocabulary reduction, follow the instructions in ./VocabReduction/. This involves two steps: sub-token collection and the removal of unwanted sub-tokens from the PLM's vocabulary (a minimal sketch of the idea is shown below).
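The two steps can be sketched in a few lines of Python. This is not the repo's script: the corpus path is a placeholder, and a complete pipeline would also rebuild the SentencePiece tokenizer and, for MLM models, the output head.

```python
# Minimal sketch of vocabulary reduction:
# (1) collect the sub-tokens that actually occur in the African-language corpora,
# (2) keep only those rows of the embedding matrix.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# Step 1: sub-token collection over the adaptation corpus (path is hypothetical).
kept_ids = set(tokenizer.all_special_ids)
with open("african_corpus.txt", encoding="utf-8") as f:
    for line in f:
        kept_ids.update(tokenizer(line.strip())["input_ids"])
kept_ids = sorted(kept_ids)

# Step 2: remove unwanted sub-tokens by slicing the embedding matrix.
old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = torch.nn.Embedding(len(kept_ids), old_embeddings.size(1))
new_embeddings.weight.data.copy_(old_embeddings[kept_ids])
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(kept_ids)
```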
For downstream tasks, see ./ClassificationTasks/.
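The scripts in ./ClassificationTasks/ handle the full fine-tuning; for orientation, AfroXLMR plugs into a standard sequence-classification head like any XLM-R checkpoint. The model ID and label count below are placeholders.

```python
# Sketch only: the actual fine-tuning scripts live in ./ClassificationTasks/.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Davlan/afro-xlmr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=5)
```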
If you find this repository useful, please consider citing our paper:
@inproceedings{alabi-etal-2022-adapting,
title = "Adapting Pre-trained Language Models to {A}frican Languages via Multilingual Adaptive Fine-Tuning",
author = "Alabi, Jesujoba O. and
Adelani, David Ifeoluwa and
Mosbach, Marius and
Klakow, Dietrich",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.382",
pages = "4336--4349",
}