How to create a custom dataset for machine translation

Question

BabylenMagnus opened this issue 2 years ago · 4 comments

What should I do if a phrase has several translations?
Would this dataset be correct?
train.src:

I like the color green.
I like the color green.

train.tgt:

Мне нравится зелёный цвет.
Я люблю зелёный цвет.

How will the BLEU metric work on it?

Answer 1 · 2023-01-20T08:03:54.000Z

For training, you can create train.src and train.tgt by splitting the translations into one-to-one pairs, just as you did.
For validation and testing, BLEU supports scoring against multiple references. Write each unique source sentence once in valid/test.src, and write its corresponding translations as the repr of a Python list in valid/test.tgt. For example:

valid/test.src

abc
edf

valid/test.tgt

"['ABC', 'Abc']"
"['DEF', 'Def', 'DEf']"

The number of target translations does not need to be the same for every source sentence.
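
To illustrate how multiple references affect the score, here is a minimal sketch, not the toolkit's actual implementation. It parses one valid/test.tgt line in the format shown above and computes a clipped unigram precision, which is the part of BLEU where several references interact: each hypothesis word is credited up to the largest count it has in any single reference. The function names, the whitespace tokenization, and the handling of the outer quotes are assumptions made for illustration. Full BLEU also uses higher-order n-grams and a brevity penalty.

```python
import ast
from collections import Counter


def parse_references(line):
    """Parse one valid/test.tgt line into a list of reference strings."""
    value = ast.literal_eval(line.strip())
    # Assumption: if the line carries an extra pair of outer quotes, the first
    # pass returns a string that still has to be evaluated into a list.
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return list(value)


def clipped_unigram_precision(hypothesis, references):
    """Unigram precision with counts clipped by the best single reference."""
    hyp_counts = Counter(hypothesis.split())  # assumption: whitespace tokens
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    matched = sum(min(count, max_ref_counts[word])
                  for word, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


refs = parse_references("\"['the cat sat', 'a cat sat down']\"")
score = clipped_unigram_precision("a cat sat", refs)
```

A hypothesis that matches any one of the references word for word gets full credit here, so listing both Russian translations as references for the same source sentence would not penalize a model for producing either of them.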

Answer 2 · 2023-01-20T10:56:13.000Z

OK, thanks.
Would it be correct to swap them?
valid/test.src

"['ABC', 'Abc']"
"['DEF', 'Def', 'DEf']"

valid/test.tgt

abc
edf

Can I do this from the config, without changing the file extensions?
I have a dataset A->B, but I need a model for B->A.

Answer 3 · 2023-01-20T11:27:20.000Z

No, you cannot swap them as you showed. You need to split them into one-to-one pairs. Only one-to-one and one-to-many are supported, not many-to-one.

If you have a dataset A->B but need a model for B->A, you need to create the dataset yourself. Our config does not support that yet.
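
One possible way to build the reversed dataset by hand is sketched below. It expands the one-to-many data into one-to-one pairs, swaps each pair, and regroups by the new source so that the result is again one-to-one or one-to-many. All function names are hypothetical and not part of the toolkit. Whether the target lines need an additional pair of outer quotes, as in the example earlier in this thread, should be checked against the toolkit's data loader.

```python
def expand_pairs(sources, reference_lists):
    """Turn one-to-many data into a flat list of (source, target) pairs."""
    return [(src, ref)
            for src, refs in zip(sources, reference_lists)
            for ref in refs]


def reverse_pairs(pairs):
    """Swap direction: A->B pairs become B->A pairs."""
    return [(tgt, src) for src, tgt in pairs]


def group_by_source(pairs):
    """Collect all distinct targets of each unique source, keeping order."""
    grouped = {}
    for src, tgt in pairs:
        targets = grouped.setdefault(src, [])
        if tgt not in targets:
            targets.append(tgt)
    return grouped


a_to_b = expand_pairs(["abc", "edf"],
                      [["ABC", "Abc"], ["DEF", "Def", "DEf"]])
b_to_a = reverse_pairs(a_to_b)

# Training files: one pair per line.
train_src = [src for src, _ in b_to_a]
train_tgt = [tgt for _, tgt in b_to_a]

# Validation/test files: unique sources, targets as the repr of a list.
valid = group_by_source(b_to_a)
valid_src = list(valid)
valid_tgt = [repr(refs) for refs in valid.values()]
```

In this example every reversed source ends up with a single reference, because each B sentence came from exactly one A sentence.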

Answer 4 · 2023-01-20T11:34:33.000Z

Thanks for the help.