Post-editing Datasets by Rakuten (PEDRa)

PEDRa contains publicly available post-editing datasets for neural machine translation (NMT), collected and compiled by Rakuten Institute of Technology. The datasets are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Datasets

  1. SubEdits (English-German, 160k triplets): A human-annotated post-editing dataset of NMT outputs, compiled from in-house NMT outputs and human post-edits of subtitles from Rakuten Viki. Details about dataset collection and preprocessing can be found in the paper.

  2. SubEscape (English-German, 5.6M triplets): An artificial post-editing dataset created by translating the OpenSubtitles2016 corpus (Lison and Tiedemann, 2016), collected from www.opensubtitles.org/, with the same in-house NMT system used for SubEdits. The references serve as synthetic post-edits, following the procedure used to compile eSCAPE (Negri et al., 2018).
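The SubEscape construction described above (translate each source sentence, then treat the existing reference as the post-edit) can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: `Triplet`, `build_synthetic_triplets`, and `toy_translate` are hypothetical names, the toy lexicon stands in for the in-house NMT system, and skipping pairs with an empty side is an assumption rather than a documented preprocessing step.

```python
from typing import Callable, Iterable, List, NamedTuple


class Triplet(NamedTuple):
    src: str  # source sentence (English)
    mt: str   # machine translation output (German)
    pe: str   # post-edit; for synthetic data the reference stands in for it


def build_synthetic_triplets(
    sources: Iterable[str],
    references: Iterable[str],
    translate: Callable[[str], str],
) -> List[Triplet]:
    """Pair each source with its MT output and use the reference as the post-edit."""
    triplets = []
    for src, ref in zip(sources, references):
        src, ref = src.strip(), ref.strip()
        if not src or not ref:
            continue  # assumption: drop pairs where either side is empty
        triplets.append(Triplet(src=src, mt=translate(src), pe=ref))
    return triplets


def toy_translate(sentence: str) -> str:
    # Hypothetical stand-in for an NMT system: a tiny lookup table that
    # falls back to copying the source sentence unchanged.
    lexicon = {"thank you.": "Danke.", "good morning.": "Guten Morgen."}
    return lexicon.get(sentence.lower(), sentence)


if __name__ == "__main__":
    english = ["Thank you.", "Good morning."]
    german_refs = ["Vielen Dank.", "Guten Morgen."]
    for t in build_synthetic_triplets(english, german_refs, toy_translate):
        print(t.src, "|", t.mt, "|", t.pe)
```

In practice `translate` would wrap a real NMT model, and the sources and references would be read from the aligned files of a parallel corpus.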

License

The datasets are licensed under CC BY-NC-SA 4.0 (see LICENSE_DATA.md). The scripts provided with this repository are licensed under the MIT License (see LICENSE.md).

Citation

If you use these datasets, please cite the following paper:

@inproceedings{chollampatt2020pedra,
    title = "Can Automatic Post-editing Improve NMT?",
    author = "Chollampatt, Shamil  and
      Susanto, Raymond  and
      Tan, Liling and
      Szymanska, Ewa",
    booktitle = "Proceedings of EMNLP",
    year = "2020",
}

If you use SubEscape, which is derived from the OpenSubtitles2016 corpus, please also cite Lison and Tiedemann (2016) and add a link to www.opensubtitles.org/.