facebookresearch/SONAR

Is it necessary to supply the language to the tokenizer?

tstandley opened this issue · 2 comments

I've noticed that in the demo code the tokenizer is supplied with a language for every input. Is this necessary, and how does that affect what tokens are produced?

Yes, this is necessary.

Indicating the language results in the language code token being prepended to the sequence of tokens, which in turn slightly affects the resulting embedding.
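A minimal sketch of what this looks like at the tokenizer level; the function and parameter names (`load_sonar_tokenizer`, `create_encoder`, `lang`) are assumptions based on the SONAR/fairseq2 tokenizer interface rather than verbatim demo code:

```python
# Hypothetical sketch: inspecting how the language tag changes tokenization.
# Names below (load_sonar_tokenizer, create_encoder, lang) are assumptions.
from sonar.models.sonar_text import load_sonar_tokenizer

tokenizer = load_sonar_tokenizer("text_sonar_basic_encoder")

# Each encoder prepends the language code token of the language it was created
# with, so the same sentence yields sequences that differ in their first token.
eng_encoder = tokenizer.create_encoder(lang="eng_Latn")
fra_encoder = tokenizer.create_encoder(lang="fra_Latn")

print(eng_encoder("Hello, world!"))  # starts with the eng_Latn language token
print(fra_encoder("Hello, world!"))  # starts with the fra_Latn language token
```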

If you do not know the language, a good default option would be to make the model figure out a suitable language on its own. Currently, SONAR models don't have such an option, but you could approximate it by setting as the default language one with a unique script (such as Greek), which makes it very easy for the model to figure out that the language tag is probably wrong and should be ignored. A sketch of this workaround is shown below.
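A sketch of the workaround described above, based on the README-style pipeline usage; treat the exact class and argument names as assumptions if they differ in your SONAR version:

```python
# Sketch (not verbatim demo code): embedding text of unknown language by
# passing a language tag with a unique script so the model can tell the tag
# is probably wrong and mostly ignore it.
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

t2vec_model = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)

embeddings = t2vec_model.predict(
    ["Some text in an unknown language."],
    source_lang="ell_Grek",  # Greek as a deliberately mismatched default
)
print(embeddings.shape)
```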

Thanks for the info!