
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

Primary language: Python. License: Apache-2.0.

This repository contains the annotated English dataset, the script that extends the annotations to other languages, and the code to run baseline text classification models.

Required dependencies

  • python
    • transformers : state-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.
    • sklearn (published on PyPI as scikit-learn)
    • evaluate
    • datasets
    • pandas

Install the dependencies with:

pip install -r code/requirements.txt
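
To confirm that the installation worked before running anything heavier, a small check like the following may help. It is only a sketch: the module names are taken from the dependency list above, and `missing_modules` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names of the dependencies listed above.
# Note: scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["transformers", "sklearn", "evaluate", "datasets", "pandas"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```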

Create SIB dataset

sh get_flores_and_annotate.sh

Alternatively, download it from the Hugging Face Hub, where it is published as the dataset Davlan/sib200.
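
Loading the Hub copy from Python could look roughly like the sketch below. It assumes that each language is exposed as a separate configuration named after its FLORES-200 code (for example `eng_Latn`); check the dataset card for the actual configuration and split names. `sib200_config` and `load_sib200` are hypothetical helper names used only for illustration.

```python
def sib200_config(lang_code: str, script: str) -> str:
    """Build a FLORES-200 style configuration name such as 'eng_Latn'.

    Assumption: the Hub dataset names its per-language configurations this way.
    """
    return f"{lang_code}_{script}"


def load_sib200(lang_code: str = "eng", script: str = "Latn"):
    """Download one language of SIB-200 from the Hugging Face Hub."""
    # Imported here so the helper above stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("Davlan/sib200", sib200_config(lang_code, script))
```

Calling `load_sib200()` requires network access and returns whatever splits the dataset defines; printing the returned object shows their names and sizes.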

Run our baseline model using XLM-R

cd code/
sh xlmr_all.sh
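
For readers who want a feel for what such a baseline involves, here is a minimal sketch of setting up XLM-R for seven-way topic classification with transformers. It is not the contents of xlmr_all.sh: the checkpoint name `xlm-roberta-base`, the label order, and the helpers `build_classifier` and `accuracy` are assumptions made for illustration.

```python
# The seven SIB-200 topic categories; the order (and so the ids) is an arbitrary choice here.
LABELS = [
    "science/technology",
    "travel",
    "politics",
    "sports",
    "health",
    "entertainment",
    "geography",
]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def build_classifier(model_name: str = "xlm-roberta-base"):
    """Load a tokenizer and an XLM-R model with a fresh classification head."""
    # Imported here so the label mapping above is usable without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


def accuracy(predictions, references):
    """Fraction of predicted label ids that match the reference label ids."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

The classification head returned by `build_classifier` is randomly initialised, so the model still has to be fine-tuned on the training split before its predictions mean anything.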

BibTeX entry and citation info

@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects}, 
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}