Arabic tokenization is broken
gahdritz opened this issue · 2 comments
The T5 tokenizer doesn't seem to know what to do with all of the Arabic text in the dataset (along with some other non-Latin scripts). Here's an example from Natural Instructions followed by its tokenization:
Example:
Teacher:You are given a sentence in Arabic. Your job is to translate the Arabic sentence into English.
Teacher: Now, understand the problem? Solve this instance: الممرضة في عيادة ذات ضغط نرى فيها من 50 إلى 100 مريض يومياً ، يترك لها فقط بضع دقائق لتقضيها مع كل مريض — دقائق لكل مريض.
Student:
Tokenization:
[17476 10 3774 33 787 3 9 7142 16 19248 5 696
613 19 12 13959 8 19248 7142 139 1566 5 17476 10
852 6 734 8 682 58 5175 162 48 3421 10 3
2 3 2 3 2 3 2 3 2 3 2 3
2 3 2 943 3 2 910 3 2 3 2 3
2 3 2 3 2 3 2 3 2 3 2 3
2 3 2 3 2 3 2 3 318 3 2 3
2 3 2 5 6341 10 1]
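For reference, the repeated 2 in the dump above is the T5 tokenizer's `<unk>` id (and 3 appears to be the SentencePiece word-boundary piece "▁"), so essentially every Arabic word is collapsing to unknown. Here's a minimal sketch that reproduces this with the Hugging Face tokenizer; it assumes `transformers` and `sentencepiece` are installed, and `t5-base` is just one plausible checkpoint, not necessarily the exact one used above:

```python
# Minimal reproduction sketch (assumes `transformers` and `sentencepiece`
# are installed; "t5-base" is an assumption, not necessarily the exact
# checkpoint used above).
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")

arabic = "دقائق لكل مريض"  # "minutes per patient", from the example above
ids = tok(arabic).input_ids

print(tok.unk_token_id)                # 2 -- the id repeated in the dump above
print(ids)                             # mostly alternating 2 (<unk>) and 3 ("▁")
print(tok.convert_ids_to_tokens(ids))  # every Arabic word comes back as <unk>
```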
Is this known? Were examples in scripts not covered by the T5 tokenizer excluded from the FLAN training runs?
Hello gahdritz, any luck finding a solution to this problem?
@gahdritz Sorry for the delay in answering this -- I missed it. The T5 tokenizer is English-only, so it cannot reliably handle other languages/scripts. The Flan Collection does include multiple languages (especially within the NIv2 submixture), but we did not exclude this training data from our runs (either for Flan-T5 or Flan-PaLM; the latter is multilingual).
To accommodate multilingual datasets, you can either swap out the tokenizer used in the Flan Collection, or save the raw (pre-tokenization) inputs and outputs yourself and tokenize them later with a tokenizer of your choice.
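For the first route, here's a minimal sketch of what swapping in a multilingual tokenizer looks like, using mT5's tokenizer (whose vocabulary covers Arabic) as one possible stand-in -- the checkpoint name is an assumption, not a recommendation from the Flan authors:

```python
# Sketch of the first workaround: use a tokenizer with multilingual coverage.
# mT5's SentencePiece vocabulary was trained on mC4 (101 languages, Arabic
# included); "google/mt5-base" is an assumed checkpoint, pick whichever fits.
from transformers import AutoTokenizer

mt5_tok = AutoTokenizer.from_pretrained("google/mt5-base")

arabic = "دقائق لكل مريض"
ids = mt5_tok(arabic).input_ids
print(mt5_tok.convert_ids_to_tokens(ids))  # real subword pieces, not a run of <unk>
```

The second route (saving the raw text and tokenizing later) avoids the problem entirely, since nothing is lost before you choose a tokenizer.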