RUCAIBox/TextBox

How to create a custom dataset for machine translation

What should I do if a phrase has several translations?
Would this dataset be correct?
train.src:

I like the color green.
I like the color green.

train.tgt

Мне нравится зелёный цвет.
Я люблю зелёный цвет.

How will the BLEU metric work on it?

For training, you can create train.src and train.tgt by splitting the translations into separate one-to-one pairs, just as you did.
For validation and testing, BLEU supports scoring against multiple references. Write each unique source sentence once in valid/test.src, and put its corresponding translations, as the repr of a list, on the matching line of valid/test.tgt. For example:

valid/test.src

abc
edf

valid/test.tgt

"['ABC', 'Abc']"
"['DEF', 'Def', 'DEf']"

The number of target translations does not need to be the same on every line.
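
A minimal Python sketch of how such validation/test files could be produced from one-to-one pairs. The helper names (`group_references`, `format_lines`, `write_split`) are hypothetical and not part of TextBox. It writes the bare repr of each list; whether the outer double quotes shown in the example above belong in the file is not settled by this thread, so check the result against TextBox's data loader.

```python
from pathlib import Path


def group_references(pairs):
    """Collect all translations of each unique source sentence, keeping first-seen order."""
    grouped = {}
    for src, tgt in pairs:
        refs = grouped.setdefault(src, [])
        if tgt not in refs:  # skip verbatim duplicate references
            refs.append(tgt)
    return grouped


def format_lines(grouped):
    """Return (src_lines, tgt_lines); each target line is the repr of a list of references."""
    src_lines = list(grouped.keys())
    tgt_lines = [repr(refs) for refs in grouped.values()]
    return src_lines, tgt_lines


def write_split(out_dir, split, src_lines, tgt_lines):
    """Write <split>.src and <split>.tgt (e.g. valid.src / valid.tgt) into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{split}.src").write_text("\n".join(src_lines) + "\n", encoding="utf-8")
    (out_dir / f"{split}.tgt").write_text("\n".join(tgt_lines) + "\n", encoding="utf-8")


# One-to-one pairs, as they would appear in a training-style file.
pairs = [
    ("abc", "ABC"),
    ("abc", "Abc"),
    ("edf", "DEF"),
    ("edf", "Def"),
    ("edf", "DEf"),
]
src_lines, tgt_lines = format_lines(group_references(pairs))
```

Calling `write_split("my_dataset", "valid", src_lines, tgt_lines)` would then write the two files; the directory name here is only a placeholder.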

OK, thanks.
Would it be correct to swap them?
valid/test.src

"['ABC', 'Abc']"
"['DEF', 'Def', 'DEf']"

valid/test.tgt

abc
edf

Can I do this from the config, without changing the file extensions?
I have an A->B dataset but need a B->A model.

No, you cannot swap them as you showed. You need to split them into one-to-one pairs. The format only supports one-to-one or one-to-many, not many-to-one.
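
Splitting them one-by-one could look roughly like the following Python sketch. The function name `expand_pairs` is hypothetical, and it assumes each multi-variant line is the bare repr of a list of strings (no outer double quotes).

```python
import ast


def expand_pairs(list_lines, single_lines):
    """Turn (list of variants, single sentence) lines into one-to-one pairs.

    Each entry of list_lines is assumed to be the repr of a Python list of
    strings, e.g. "['ABC', 'Abc']"; each variant becomes its own source line
    and the single sentence is repeated as its target.
    """
    src_out, tgt_out = [], []
    for list_line, single in zip(list_lines, single_lines):
        for variant in ast.literal_eval(list_line):  # safe parsing of the list literal
            src_out.append(variant)
            tgt_out.append(single)
    return src_out, tgt_out


list_lines = ["['ABC', 'Abc']", "['DEF', 'Def', 'DEf']"]
single_lines = ["abc", "edf"]
new_src, new_tgt = expand_pairs(list_lines, single_lines)
```

`ast.literal_eval` is used instead of `eval` so that only plain Python literals in the file are accepted.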

If you have an A->B dataset but need a B->A model, you need to create the reversed dataset yourself. Our config does not support that.
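
For one-to-one files such as the training split, creating the reversed dataset can be as simple as copying the files with the extensions swapped. The sketch below is only an illustration: `reverse_dataset` and the directory names are hypothetical, and it assumes the `<split>.src` / `<split>.tgt` naming used in this thread. Validation/test files whose target lines hold lists would first have to be split into one-to-one pairs, as explained above.

```python
import shutil
import tempfile
from pathlib import Path


def reverse_dataset(src_dir, dst_dir, splits=("train",)):
    """Copy an A->B dataset into dst_dir as B->A by swapping .src and .tgt."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for split in splits:
        shutil.copyfile(src_dir / f"{split}.src", dst_dir / f"{split}.tgt")
        shutil.copyfile(src_dir / f"{split}.tgt", dst_dir / f"{split}.src")


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    a_to_b = Path(tmp) / "a_to_b"
    b_to_a = Path(tmp) / "b_to_a"
    a_to_b.mkdir()
    (a_to_b / "train.src").write_text("I like the color green.\n", encoding="utf-8")
    (a_to_b / "train.tgt").write_text("Мне нравится зелёный цвет.\n", encoding="utf-8")

    reverse_dataset(a_to_b, b_to_a)

    reversed_src = (b_to_a / "train.src").read_text(encoding="utf-8")
    reversed_tgt = (b_to_a / "train.tgt").read_text(encoding="utf-8")
```

The original directory is left untouched, so both translation directions remain available.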

Thanks for the help.