Puebla_Nahuatl_Split

This is a copy of the official split for the Puebla Nahuatl dataset. For detailed references and instructions on using these files, see espnet/egs/puebla_nahuatl or espnet/egs2/puebla_nahuatl.

The dataset can be cited as:

@inproceedings{shi-etal-2021-highland,
    title = "{H}ighland {P}uebla {N}ahuatl Speech Translation Corpus for Endangered Language Documentation",
    author = "Shi, Jiatong  and
      Amith, Jonathan D.  and
      Chang, Xuankai  and
      Dalmia, Siddharth  and
      Yan, Brian  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.americasnlp-1.7",
    doi = "10.18653/v1/2021.americasnlp-1.7",
    pages = "53--63",
    abstract = "Documentation of endangered languages (ELs) has become increasingly urgent as thousands of languages are on the verge of disappearing by the end of the 21st century. One challenging aspect of documentation is to develop machine learning tools to automate the processing of EL audio via automatic speech recognition (ASR), machine translation (MT), or speech translation (ST). This paper presents an open-access speech translation corpus of Highland Puebla Nahuatl (glottocode high1278), an EL spoken in central Mexico. It then addresses machine learning contributions to endangered language documentation and argues for the importance of speech translation as a key element in the documentation process. In our experiments, we observed that state-of-the-art end-to-end ST models could outperform a cascaded ST (ASR {\textgreater} MT) pipeline when translating endangered language documentation materials.",
}
