This repository contains data related to under-represented languages.
In particular, it covers the following datasets and papers:
- language_metadata: van Esch, D., Lucassen, T., Ruder, S., Caswell, I., & Rivera, C. (2022). Writing System and Speaker Metadata for 2,800+ Language Varieties. In Proceedings of LREC 2022.
- mgsm: Shi, F. & Suzgun, M., et al. (2022). Language Models are Multilingual Chain-of-Thought Reasoners. arXiv preprint arXiv:2210.03057.
- square_one_bias: Ruder, S., Vulić, I., & Søgaard, A. (2022). Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold. In Findings of the Association for Computational Linguistics: ACL 2022, 2340–2354.
- tata: Gehrmann, S., Ruder, S., Nikolaev, V., Botha, J. A., Chavinda, M., Parikh, A., & Rivera, C. (2022). TaTA: A Multilingual Table-to-Text Dataset for African Languages. arXiv preprint.
- GATITOS: Jones, A., Caswell, I., Saxena, I., & Firat, O. (2023). BiLex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation. arXiv preprint.
- FUN-LangID: The 1600+ language LangID model described here