sib-200: A Python repository from dadelani

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

This repository contains the annotated English dataset, the script that extends the annotations to other languages, and the code to run baseline text classification models.

Required dependencies

python
- transformers : state-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.
- sklearn (installed from PyPI as scikit-learn)
- evaluate
- datasets
- pandas

pip install -r code/requirements.txt
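After installing, it can help to confirm that the dependencies listed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of this repository, and the module names are the import names of the packages above.

```python
import importlib.util

# Import names of the dependencies listed above
# (sklearn is the import name of the scikit-learn package).
REQUIRED_MODULES = ["transformers", "sklearn", "evaluate", "datasets", "pandas"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, rerunning the pip command above from the repository root is the first thing to try.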

Create SIB dataset

sh get_flores_and_annotate.sh

Alternatively, download the ready-made dataset from the Hugging Face Hub: Davlan/sib200
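A minimal sketch of loading the Hub copy with the datasets library is shown here. It assumes that each language is exposed as a configuration named by its FLORES-200 code (for example `eng_Latn`) and that the topic label is stored in a column called `category`; both are assumptions to check against the dataset card. The helper names `load_sib200` and `category_counts` are hypothetical.

```python
from collections import Counter


def load_sib200(language="eng_Latn"):
    """Load one language configuration of Davlan/sib200 from the Hugging Face Hub.

    Assumption: configurations are named by FLORES-200 language codes.
    """
    from datasets import load_dataset  # imported lazily; requires network access

    return load_dataset("Davlan/sib200", language)


def category_counts(rows, label_key="category"):
    """Count how many examples carry each topic label.

    Assumption: the label column is called "category".
    """
    return Counter(row[label_key] for row in rows)
```

For example, `category_counts(load_sib200()["train"])` would summarise the label distribution of the English training split, assuming a split named `train` exists.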

Run our baseline model using XLM-R

cd code/
sh xlmr_all.sh
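For readers who want to see roughly what fine-tuning XLM-R for topic classification involves, the sketch below prepares label mappings and loads a sequence classification model with transformers. It is not the content of xlmr_all.sh. The checkpoint name `xlm-roberta-base` and the helper names `build_label_maps` and `load_xlmr_classifier` are assumptions made for illustration.

```python
def build_label_maps(categories):
    """Map each distinct topic label to an integer id, and back."""
    labels = sorted(set(categories))
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def load_xlmr_classifier(label2id, id2label, checkpoint="xlm-roberta-base"):
    """Load a tokenizer and an XLM-R model with a fresh classification head.

    Assumption: "xlm-roberta-base" is used as the checkpoint; the repository's
    script may use a different XLM-R size or different settings.
    """
    # Imported lazily; downloading the checkpoint requires network access.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label,
    )
    return tokenizer, model
```

Deriving the label maps from the data instead of hardcoding them keeps the classification head in sync with whatever label set the dataset actually contains.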

BibTeX entry and citation info

@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects}, 
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
