Sequence-to-sequence Resources for Catalan

In this work, we introduce sequence-to-sequence language resources for Catalan, a moderately under-resourced language, towards two tasks, namely: Summarization and Machine Translation (MT). We present two new summarization datasets in the domain of newswire. We also introduce a parallel Catalan to English corpus, paired with three different brand new test sets. Finally, we evaluate the data presented with competing state of the art models, and we develop baselines for these tasks using a newly created Catalan BART. We release the resulting resources of this work under open license to encourage the development of language technology in Catalan.

Materials

We openly release the outcome materials produced in the framework of this publication:

BART-base-ca, a BART-based Catalan language model

Summarization

CaSum, a Catalan abstrative summaritzation dataset
VilaSum, a Catalan abstrative summaritzation testsets
BART-base-ca-casum, a Catalan abstractive summarization model

Machine Translation (soon)

GEnCaTA, a Catalan-English high quality corpus for MT
Evaluation Resources for Catalan-English MT

Citation

If you use any of these resources (datasets or models) in your work, please cite our latest preprint:

@misc{degibert2022sequencetosequence,
      title={Sequence-to-Sequence Resources for Catalan}, 
      author={Ona de Gibert and Ksenia Kharitonova and Blanca Calvo Figueras and Jordi Armengol-Estapé and Maite Melero},
      year={2022},
      eprint={2202.06871},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License