In this work, we introduce sequence-to-sequence language resources for Catalan, a moderately under-resourced language, towards two tasks, namely: Summarization and Machine Translation (MT). We present two new summarization datasets in the domain of newswire. We also introduce a parallel Catalan to English corpus, paired with three different brand new test sets. Finally, we evaluate the data presented with competing state of the art models, and we develop baselines for these tasks using a newly created Catalan BART. We release the resulting resources of this work under open license to encourage the development of language technology in Catalan.
We openly release the outcome materials produced in the framework of this publication:
- BART-base-ca, a BART-based Catalan language model
- CaSum, a Catalan abstrative summaritzation dataset
- VilaSum, a Catalan abstrative summaritzation testsets
- BART-base-ca-casum, a Catalan abstractive summarization model
- GEnCaTA, a Catalan-English high quality corpus for MT
- Evaluation Resources for Catalan-English MT
If you use any of these resources (datasets or models) in your work, please cite our latest preprint:
@misc{degibert2022sequencetosequence,
title={Sequence-to-Sequence Resources for Catalan},
author={Ona de Gibert and Ksenia Kharitonova and Blanca Calvo Figueras and Jordi Armengol-Estapé and Maite Melero},
year={2022},
eprint={2202.06871},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
MIT License
Copyright (c) 2022 Text Mining Unit at BSC