This repository contains a novel dataset for semantically appropriate application of lexical constraints in NMT. Detailed descriptions are provided in the paper:
Towards Accurate Translation via Semantically Appropriate Application of Lexical Constraints
- Yujin Baek*, Koanho Lee*, Dayeon Ki, Cheonbok Park, Hyoung-Gyu Lee, and Jaegul Choo
- Findings of ACL 2023
Each test example (in data/holly_test.json) consists of 7 items. Here, we provide a brief explanation for each item based on the following images.
- guid: a unique identifier for each example
- homograph_word: a homograph word identified in the source_sentence (양수)
- source_sentence: a source sentence which should be translated
- lexical_constraint:
  - source_term: 양수
  - target_term: amniotic fluid
- positive_references: two positive examples demonstrating the specific use of the homograph_word
- translation: a desirable translation of the source_sentence
- label: 1 (for positive examples) or 0 (for negative examples)
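As a rough illustration of how the seven items listed above might be read in practice, here is a minimal Python sketch. It assumes that data/holly_test.json holds a single JSON array of objects and that lexical_constraint is a nested object with source_term and target_term keys; neither assumption is confirmed here, so adjust the loader if the file is laid out differently. The helper names (load_examples, missing_keys, constraint_pair) are hypothetical and not part of this repository.

```python
import json

# Field names taken from the item list above.
EXPECTED_KEYS = {
    "guid",
    "homograph_word",
    "source_sentence",
    "lexical_constraint",
    "positive_references",
    "translation",
    "label",
}


def load_examples(path):
    """Read one split, ASSUMING the file is a single JSON array of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def missing_keys(example):
    """Return the expected field names absent from one example, sorted."""
    return sorted(EXPECTED_KEYS - set(example))


def constraint_pair(example):
    """Return (source_term, target_term).

    ASSUMES lexical_constraint is a JSON object with exactly these two keys.
    """
    constraint = example["lexical_constraint"]
    return constraint["source_term"], constraint["target_term"]


# Usage (path as given in this README):
# examples = load_examples("data/holly_test.json")
# first = examples[0]
# print(missing_keys(first), constraint_pair(first))
```

For the example described above, constraint_pair would be expected to return the pair 양수 and amniotic fluid, provided the nesting assumption holds.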
We also provide train/validation examples for developing a homograph disambiguation module:
- data/holly_train.json
- data/holly_valid.json
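Before training a disambiguation module, it can help to check how positive and negative examples are balanced in each split. The sketch below is one way to do that. It assumes that the train/validation files are JSON arrays of objects and that each object carries the same label field (1 or 0) described for the test examples; this README does not state either point, so treat both as assumptions. The function names are hypothetical.

```python
import json
from collections import Counter


def load_split(path):
    """Read one split, ASSUMING the file is a single JSON array of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def label_counts(examples):
    """Count examples per label value.

    ASSUMES every example has a "label" field (1 = positive, 0 = negative).
    """
    return Counter(example["label"] for example in examples)


# Usage (paths as given in this README):
# for split in ("data/holly_train.json", "data/holly_valid.json"):
#     print(split, dict(label_counts(load_split(split))))
```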
Licensed under CC BY-SA KR 2.0
HOLLY-benchmark
Copyright (c) 2023-present NAVER Cloud Corp.
Creative Commons Attribution-ShareAlike 2.0 Generic license
A summary of the CC BY-SA 2.0 license is located here:
https://creativecommons.org/licenses/by-sa/2.0/
https://creativecommons.org/licenses/by-sa/2.0/kr/ (KR)
We modified the following sources to create our benchmark dataset. Specifically, source sentences and positive references were collected from three open-source dictionaries from the National Institute of Korean Language: