We train an Indigenous language model adopting the unified and flexible text-to-text transfer Transformer (T5) approach . T5 treats every text-based language task as a “text-to-text" problem, taking text format as input and producing new text format as output. T5 is essentially an encoder-decoder Transformer , with the encoder and decoder similar in configuration and size to a BERTBase but with some architectural modifications. Modifications include applying a normalization layer before a sub-block and adding a pre-norm (i.e., initial input to the sub-block output).
IndT5 model (pytorch checkpoint) is available on Huggingface website for direct download and use exclusively for research.
We build IndCorpus, a collection of 10 Indigeous languages and Spanish comprising 1.17GB of text, from both Wikipedia and the Bible.
Language | Language Code | Main Location | Number of Speakers |
---|---|---|---|
Aymara | aym | Bolivia | 1,677,100 |
Asháninka | cni | Peru | 35,200 |
Bribri | bzd | Costa Rica | 7,000 |
Guarani | gn | Paraguay | 6,652,790 |
Hñähñu | oto | Mexico | 88,500 |
Nahuatl | nah | Mexico | 410,000 |
Quechua | quy | Peru | 7,384,920 |
Rarámuri | tar | Mexico | 9,230 |
Shipibo-Konibo | shp | Peru | 22,500 |
Wixarika | hch | Mexico | 52,500 |
Target Language | Wiki Size (MB) | Wiki #Sentences | Bible Size (MB) | Bible #Sentences |
---|---|---|---|---|
Hñähñu | - | - | 1.4 | 7.5K |
Wixarika | - | - | 1.3 | 7.5K |
Nahuatl | 5.8 | 61.1K | 1.5 | 7.5K |
Guarani | 3.7 | 28.2K | 1.3 | 7.5K |
Bribri | - | - | 1.5 | 7.5K |
Rarámuri | - | - | 1.9 | 7.5K |
Quechua | 5.9 | 97.3K | 4.9 | 31.1K |
Aymara | 1.7 | 32.9K | 5 | 30.7K |
Shipibo-Konibo | - | - | 1 | 7.9K |
Asháninka | - | - | 1.4 | 7.8K |
Spanish | 1.13K | 5M | - | - |
Total | 1.15K | 5.22M | 19.8 | 125.3K |
The datasets are provided by AmericasNLP 2021 Shared Task on Open Machine Translation (https://github.com/AmericasNLP/americasnlp2021).
Language Pair | Train | Dev | Test |
---|---|---|---|
es-aym | 6,531 | 996 | 1,003 |
es-cni | 3,883 | 883 | 1,003 |
es-bzd | 7,506 | 996 | 1,003 |
es-gn | 26,032 | 995 | 1,003 |
es-oto | 4,889 | 599 | 1,003 |
es-nah | 16,145 | 672 | 1,003 |
es-quy | 125,008 | 996 | 1,003 |
es-tar | 14,720 | 995 | 1,003 |
es-shp | 14,592 | 996 | 1,003 |
es-hch | 8,966 | 994 | 1,003 |
title={IndT5: A Text-to-Text Transformer for 10 Indigenous Languages},
author={Chen, Wei-Rui and Abdul-Mageed, Muhammad and Cavusoglu, Hasan and others},
booktitle={Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas},
pages={265--271},
year={2021}
}