How to create a custom dataset for machine translation
BabylenMagnus opened this issue · 4 comments
What should I do if a phrase has several translations?
Will this dataset be correct?
train.src
I like the color green.
I like the color green.
train.tgt
Мне нравится зелёный цвет.
Я люблю зелёный цвет.
How will the BLEU metric work on it?
For training, you can create train.src and train.tgt by splitting the translations into one pair per line, just as you did.
For validation and testing, BLEU can be computed against multiple references. Write each unique source sentence once in valid/test.src, and write its corresponding translations in valid/test.tgt as the repr of a Python list. For example:
valid/test.src
abc
edf
valid/test.tgt
"['ABC', 'Abc']"
"['DEF', 'Def', 'DEf']"
The number of target translations does not need to be the same for every source sentence.
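A minimal sketch of how the valid/test files could be built from one-pair-per-line data like the train files above. The helper names `group_references` and `parse_references` are hypothetical and not part of the library. The sketch writes each target line as a bare list repr; the example above shows an extra pair of double quotes around the list, and this thread does not confirm whether the loader expects them, so adjust the output if needed.

```python
import ast


def group_references(src_lines, tgt_lines):
    """Collapse duplicate source sentences into one source plus a list of references."""
    grouped = {}  # dicts keep insertion order, so the first-seen source order is preserved
    for src, tgt in zip(src_lines, tgt_lines):
        grouped.setdefault(src.strip(), []).append(tgt.strip())
    sources = list(grouped.keys())
    # Each target line becomes the repr of a Python list, e.g. ['ABC', 'Abc'].
    targets = [repr(refs) for refs in grouped.values()]
    return sources, targets


def parse_references(tgt_line):
    """Turn one multi-reference target line back into a list of strings."""
    value = ast.literal_eval(tgt_line.strip())
    if isinstance(value, str):
        # The line was wrapped in an extra pair of quotes; unwrap once more.
        value = ast.literal_eval(value)
    return value


src = ["I like the color green.", "I like the color green.", "abc"]
tgt = ["Мне нравится зелёный цвет.", "Я люблю зелёный цвет.", "ABC"]
sources, targets = group_references(src, tgt)
```

Here `sources` holds each unique sentence once, and every entry in `targets` is a list repr with one or more references, so the number of references can differ per line.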
Ok, thanks.
Would it be correct to swap them?
valid/test.src
"['ABC', 'Abc']"
"['DEF', 'Def', 'DEf']"
valid/test.tgt
abc
edf
Can I do this from the config, without changing the file extensions, if I have a dataset A->B but need a model B->A?
No, you cannot swap them as you showed. You need to split them into one pair per line. The format supports only one-to-one or one-to-many, not many-to-one.
If you have a dataset A->B but need a model B->A, you need to create the dataset yourself. Our config does not support that.
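One way to create the reversed dataset yourself, as a rough sketch. The function name `reverse_pairs` is hypothetical, and the code assumes that any target line starting with `[` is a list repr as in the valid/test example above. Because many-to-one is not supported, each reference becomes its own source line and the original source sentence is repeated as the target.

```python
import ast


def reverse_pairs(src_lines, tgt_lines):
    """Build B->A lines from A->B lines, one pair per line."""
    new_src, new_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Assumption: a line starting with "[" holds a list repr of references;
        # anything else is treated as a single plain translation.
        refs = ast.literal_eval(tgt) if tgt.startswith("[") else [tgt]
        for ref in refs:
            new_src.append(ref)  # each B-side sentence becomes a source line
            new_tgt.append(src)  # paired with the original A-side sentence
    return new_src, new_tgt


b_side, a_side = reverse_pairs(["abc", "edf"], ["['ABC', 'Abc']", "DEF"])
```

The two returned lists can then be written line by line to the new .src and .tgt files.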
Thanks for the help.