Prepare data for NLLB finetuning - what is the format of data files?
edchengg opened this issue · 2 comments
edchengg commented
Hello, @kauterry! Thanks a lot for the detailed answers on other issues.
I would like to prepare data for NLLB finetuning using the pre-trained SPM model.
My question is about the README: what is the format of these files?
For example, I am assuming the format is one sentence per line?
$ tree $DATA_PATH
my_corpora
├── arb_Arab-eng_Latn
│   ├── mycorpus.arb_Arab.gz
│   └── mycorpus.eng_Latn.gz
└── eng_Latn-lij_Latn
    ├── nllbseed.eng_Latn.gz
    ├── nllbseed.lij_Latn.gz
    ├── tatoeba.eng_Latn.gz
    └── tatoeba.lij_Latn.gz
kauterry commented
Yes, that is correct. The src file holds the source sentences and the tgt file holds their translations, one sentence per line, so line N of the src file pairs with line N of the tgt file.
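For illustration, here is a minimal Python sketch of how such a pair of files could be written and sanity-checked. It assumes the layout from the tree above (a directory per language pair, files named `<corpus>.<lang>.gz`) and UTF-8 text. The helper names `write_parallel` and `count_lines` are hypothetical and not part of the NLLB codebase.

```python
import gzip
from pathlib import Path


def write_parallel(pair_dir, corpus_name, src_lang, tgt_lang, pairs):
    """Write (src, tgt) sentence pairs to two gzipped files, one sentence per line.

    Hypothetical helper: the file naming follows the tree shown in this issue.
    """
    pair_dir = Path(pair_dir)
    pair_dir.mkdir(parents=True, exist_ok=True)
    src_path = pair_dir / f"{corpus_name}.{src_lang}.gz"
    tgt_path = pair_dir / f"{corpus_name}.{tgt_lang}.gz"
    with gzip.open(src_path, "wt", encoding="utf-8") as src_f, \
         gzip.open(tgt_path, "wt", encoding="utf-8") as tgt_f:
        for src, tgt in pairs:
            # Collapse all whitespace runs (including newlines) to single
            # spaces so that each sentence occupies exactly one line.
            src_f.write(" ".join(src.split()) + "\n")
            tgt_f.write(" ".join(tgt.split()) + "\n")
    return src_path, tgt_path


def count_lines(path):
    """Count the lines in a gzipped text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)
```

Comparing `count_lines` on the two files is a cheap check that they are still aligned: if the counts differ, line N of one file no longer corresponds to line N of the other.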
edchengg commented
Thanks!