facebookresearch/stopes

Prepare data for NLLB finetuning - what is the format of data files?

edchengg opened this issue · 2 comments

Hello, @kauterry! Thanks a lot for the detailed answers on other issues.
I would like to prepare data for NLLB finetuning using the pre-trained SPM model.
My question is about the README: what is the format of these files?
For example, am I right to assume the format is one sentence per line?

$ tree $DATA_PATH
my_corpora
├── arb_Arab-eng_Latn
│   ├── mycorpus.arb_Arab.gz
│   └── mycorpus.eng_Latn.gz
└── eng_Latn-lij_Latn
    ├── nllbseed.eng_Latn.gz
    ├── nllbseed.lij_Latn.gz
    ├── tatoeba.eng_Latn.gz
    └── tatoeba.lij_Latn.gz

Yes, that is correct. The source (src) sentences go in the src file and their translations (tgt) go in the tgt file. One sentence per line.
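For anyone preparing files this way, here is a minimal Python sketch of writing a pair of files in that layout and checking that they line up. It is not code from stopes: the helper names `write_parallel` and `count_lines` are hypothetical, the directory and file naming simply mirrors the tree in the question, and it assumes that line N of the tgt file is the translation of line N of the src file.

```python
import gzip
import tempfile
from pathlib import Path


def write_parallel(root, corpus, src_lang, tgt_lang, pairs):
    """Write (src, tgt) sentence pairs as two gzip files, one sentence per line.

    Hypothetical helper: the layout <root>/<src>-<tgt>/<corpus>.<lang>.gz
    copies the tree shown in the question.
    """
    pair_dir = Path(root) / f"{src_lang}-{tgt_lang}"
    pair_dir.mkdir(parents=True, exist_ok=True)
    src_path = pair_dir / f"{corpus}.{src_lang}.gz"
    tgt_path = pair_dir / f"{corpus}.{tgt_lang}.gz"
    with gzip.open(src_path, "wt", encoding="utf-8") as fs, \
         gzip.open(tgt_path, "wt", encoding="utf-8") as ft:
        for src, tgt in pairs:
            # Collapse whitespace runs (including newlines) to single spaces
            # so that each sentence occupies exactly one line.
            fs.write(" ".join(src.split()) + "\n")
            ft.write(" ".join(tgt.split()) + "\n")
    return src_path, tgt_path


def count_lines(path):
    """Count the lines in a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


if __name__ == "__main__":
    pairs = [("Hello.", "Bonjour."), ("Thank you.", "Merci.")]
    with tempfile.TemporaryDirectory() as tmp:
        src, tgt = write_parallel(tmp, "mycorpus", "eng_Latn", "fra_Latn", pairs)
        # Both files should hold the same number of sentences.
        assert count_lines(src) == count_lines(tgt)
```

Comparing the line counts of the two files before running the pipeline is a cheap way to catch a stray newline inside a sentence, which would otherwise shift every later pair out of alignment.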

Thanks!