Under-Represented Languages NLP

This repository contains data related to under-represented languages.

In particular, it contains data for the following datasets and papers:

language_metadata: van Esch, D., Lucassen, T., Ruder, S., Caswell, I., & Rivera, C. (2022). Writing System and Speaker Metadata for 2,800+ Language Varieties. In Proceedings of LREC 2022.
mgsm: Shi, F. & Suzgun, M., et al. (2022). Language Models are Multilingual Chain-of-Thought Reasoners. arXiv preprint arXiv:2210.03057.
square_one_bias: Ruder, S., Vulić, I., & Søgaard, A. (2022). Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold. In Findings of the Association for Computational Linguistics: ACL 2022, 2340–2354.
tata: Gehrmann, S., Ruder, S., Nikolaev, V., Botha, J. A., Chavinda, M., Parikh, A., & Rivera, C. (2022). TaTA: A Multilingual Table-to-Text Dataset for African Languages. arXiv preprint.
GATITOS: Jones, A., Caswell, I., Saxena, I., & Firat, O. (2023). BiLex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation. arXiv preprint.
FUN-LangID: The 1600+ language LangID model described here

google-research/url-nlp