Arabic tokenization is broken
gahdritz opened this issue · 2 comments
The T5 tokenizer doesn't seem to know what to do with all of the Arabic text in the dataset (along with some other non-Latin scripts). Here's an example from Natural Instructions followed by its tokenization:
Example:
Teacher:You are given a sentence in Arabic. Your job is to translate the Arabic sentence into English.
Teacher: Now, understand the problem? Solve this instance: الممرضة في عيادة ذات ضغط نرى فيها من 50 إلى 100 مريض يومياً ، يترك لها فقط بضع دقائق لتقضيها مع كل مريض — دقائق لكل مريض.
Student:
Tokenization:
[17476 10 3774 33 787 3 9 7142 16 19248 5 696
613 19 12 13959 8 19248 7142 139 1566 5 17476 10
852 6 734 8 682 58 5175 162 48 3421 10 3
2 3 2 3 2 3 2 3 2 3 2 3
2 3 2 943 3 2 910 3 2 3 2 3
2 3 2 3 2 3 2 3 2 3 2 3
2 3 2 3 2 3 2 3 318 3 2 3
2 3 2 5 6341 10 1]
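For reference, the repeated 2 in the dump above is the T5 tokenizer's `<unk>` id (and 3 appears to be the SentencePiece word-boundary piece "▁"), so essentially every Arabic word is collapsing to unknown. Here's a minimal sketch that reproduces this with the Hugging Face tokenizer; it assumes `transformers` and `sentencepiece` are installed, and `t5-base` is just one plausible checkpoint, not necessarily the exact one used above:

```python
# Minimal reproduction sketch (assumes `transformers` and `sentencepiece`
# are installed; "t5-base" is an assumption, not necessarily the exact
# checkpoint used above).
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")

arabic = "دقائق لكل مريض"  # "minutes per patient", from the example above
ids = tok(arabic).input_ids

print(tok.unk_token_id)                # 2 -- the id repeated in the dump above
print(ids)                             # mostly alternating 2 (<unk>) and 3 ("▁")
print(tok.convert_ids_to_tokens(ids))  # every Arabic word comes back as <unk>
```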
Is this known? Were examples in scripts not covered by the T5 tokenizer excluded from the FLAN training runs?
Hello gahdritz, any luck finding a solution to this problem?
@gahdritz Sorry for the delay in answering this -- I missed it. The T5 tokenizer is English-only, so it cannot reliably handle other languages/scripts. The Flan Collection does include multiple languages (especially within the NIv2 submixture), but we did not exclude this training data from our runs (either for Flan-T5 or Flan-PaLM; the latter is multilingual).
To accommodate multilingual datasets, you can either swap out the tokenizer used in the Flan Collection, or save the raw (pre-tokenization) inputs and outputs yourself and tokenize them later with a tokenizer of your choice.
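For the first route, here's a minimal sketch of what swapping in a multilingual tokenizer looks like, using mT5's tokenizer (whose vocabulary covers Arabic) as one possible stand-in -- the checkpoint name is an assumption, not a recommendation from the Flan authors:

```python
# Sketch of the first workaround: use a tokenizer with multilingual coverage.
# mT5's SentencePiece vocabulary was trained on mC4 (101 languages, Arabic
# included); "google/mt5-base" is an assumed checkpoint, pick whichever fits.
from transformers import AutoTokenizer

mt5_tok = AutoTokenizer.from_pretrained("google/mt5-base")

arabic = "دقائق لكل مريض"
ids = mt5_tok(arabic).input_ids
print(mt5_tok.convert_ids_to_tokens(ids))  # real subword pieces, not a run of <unk>
```

The second route (saving the raw text and tokenizing later) avoids the problem entirely, since nothing is lost before you choose a tokenizer.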