facebookresearch/stopes

Which pipeline is used specifically to preprocess input to NLLB model for inference?

pluiez opened this issue · 5 comments

Hi, I'm wondering exactly which pipeline is used to preprocess the input text fed to the NLLB model for inference?

@jeanm can the prepare_data pipeline be used to prepare data for inference too?

@pluiez, I see that you wrote https://github.com/pluiez/NLLB-inference, is the code that you've come up with not working as you expect?

Hi, I broke the preprocessing down into Moses punctuation normalization and SentencePiece encoding, and I use https://github.com/facebookresearch/stopes/blob/main/stopes/utils/map_token_lang.tsv to get the language argument passed to the Moses normalize-punctuation.perl script. However, some language codes are missing from that TSV file, as mentioned in pluiez/NLLB-inference#3.

So I want to check whether I'm using the correct steps to preprocess raw text inputs to the NLLB model for inference. I'm not sure which pipeline or config file is used explicitly for this purpose.

jeanm commented

Hi @pluiez!

@jeanm can the prepare_data pipeline be used to prepare data for inference too?

The prepare_data pipeline will be most useful for training large models (when needing to train a new sentencepiece model or get the data sharded). I wouldn't use it for translating interactively. The fairseq-interactive script is the way to go here.

However, some language codes are missing from that TSV file, as mentioned in pluiez/NLLB-inference#3.
So I want to check whether I'm using the correct steps to preprocess raw text inputs to the NLLB model for inference. I'm not sure which pipeline or config file is used explicitly for this purpose.

Regarding the language codes, you'll probably have noticed that the MOSES normalize-punctuation.perl script has very few language-specific rules. The TSV file that you reference tries to map language codes to the closest two-letter language code supported by the script. In cases where we don't have a specific mapping, you can default that argument to en. The rest looks good!
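
The fallback described above could be sketched roughly as follows. This is only an illustration, not code from stopes: it assumes the TSV has two tab-separated columns (NLLB language code, then the two-letter Moses code), the helper names `load_lang_map`, `moses_lang_for`, and `normalizer_command` are hypothetical, and the sample rows are made up for the example rather than copied from map_token_lang.tsv.

```python
DEFAULT_MOSES_LANG = "en"  # fallback when a language code has no mapping


def load_lang_map(tsv_text):
    """Parse TSV text into {nllb_code: moses_code}.

    Assumes two tab-separated columns per line; malformed lines are skipped.
    """
    mapping = {}
    for line in tsv_text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2 and parts[0].strip() and parts[1].strip():
            mapping[parts[0].strip()] = parts[1].strip()
    return mapping


def moses_lang_for(nllb_code, mapping):
    """Return the mapped Moses language code, defaulting to 'en'."""
    return mapping.get(nllb_code, DEFAULT_MOSES_LANG)


def normalizer_command(nllb_code, mapping, script="normalize-punctuation.perl"):
    """Build the argument list for the Moses script (-l selects the language)."""
    return ["perl", script, "-l", moses_lang_for(nllb_code, mapping)]


# Illustrative rows only; the real file may differ in layout and content.
SAMPLE_TSV = "fra_Latn\tfr\ndeu_Latn\tde\n"

if __name__ == "__main__":
    lang_map = load_lang_map(SAMPLE_TSV)
    print(normalizer_command("fra_Latn", lang_map))
    print(normalizer_command("xxx_Latn", lang_map))  # unmapped, falls back to en
```

The resulting list can be passed to `subprocess.run` with the raw text on stdin. SentencePiece encoding would then be applied to the normalized output as a separate step.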

Thank you for your explanation!