karpathy/minbpe

What is the difference between the BBPE vocab decode method in minbpe and the one in HuggingFace Transformers?


Thanks for your nice work! I have a question after reading basic.py, and I want to figure out what is going on.

  • In the save function implementation of basic.py, the BBPE vocab is saved through the decode() method. However, many tokens are not valid UTF-8 on their own, so they cannot be decoded and come out as '�' (see the first sketch after this list).

  • But in HuggingFace Transformers, the vocab file of a BBPE tokenizer is not decoded at all. Take the BloomZ model, whose tokenizer adopts the BBPE method. If we have the Chinese string string = "我爱中华！" ("I love China!"), the result of tokenization is ['æĪijçĪ±', 'ä¸Ńåįİ', 'ï¼ģ'], so the BloomZ vocab evidently contains tokens like 'æĪijçĪ±', 'ä¸Ńåįİ', and 'ï¼ģ'. The UTF-8 encoding of the string is b"\xe6\x88\x91\xe7\x88\xb1\xe4\xb8\xad\xe5\x8d\x8e\xef\xbc\x81", and the first token 'æĪijçĪ±' corresponds to the byte sequence b'\xe6\x88\x91\xe7\x88\xb1', which should decode to "我爱". I guessed that Transformers might map each byte through the chr method, and I tried it out: list(b'\xe6\x88\x91\xe7\x88\xb1') gives [230, 136, 145, 231, 136, 177], and mapping those through chr gives ['æ', '\x88', '\x91', 'ç', '\x88', '±']. Some characters do not match, though; for example, 'Ī' is not the same as '\x88'. So there must be some further mapping rule for the chr results that are still not printable characters, like '\x88' (see the second sketch after this list).
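To make the first point concrete, here is a minimal sketch of what I understand basic.py's save path to be doing (the vocab entry is a hypothetical merged token I made up, not one from a real model):

```python
# Sketch of the behavior I mean: each token's raw bytes are rendered
# with decode(errors="replace"), so a token that is not valid UTF-8
# on its own prints as the replacement character '�'.
vocab = {300: b"\xe6\x88"}  # hypothetical token: the first two bytes of "我"

for idx, token_bytes in vocab.items():
    print(idx, token_bytes.decode("utf-8", errors="replace"))  # 300 �
```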
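And here is the experiment behind the second point, together with my current guess: plain chr() does not reproduce the HuggingFace tokens, but the byte-to-unicode table used by GPT-2-style byte-level BPE (shipped in transformers as bytes_to_unicode) does. I am assuming the BloomZ tokenizer relies on the same mapping:

```python
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

raw = "我爱".encode("utf-8")           # b'\xe6\x88\x91\xe7\x88\xb1'

# Naive per-byte chr(): control bytes like 0x88 stay unprintable.
print([chr(b) for b in raw])           # ['æ', '\x88', '\x91', 'ç', '\x88', '±']

# GPT-2's table remaps every byte to a *printable* character:
# printable bytes map to themselves, the rest to code points >= 256,
# e.g. 0x88 -> 'Ī' (U+012A) instead of '\x88'.
table = bytes_to_unicode()
print("".join(table[b] for b in raw))  # æĪijçĪ± -- matches the BloomZ token
```

If that is right, the vocab file just stores these printable stand-ins, and decoding a token back to bytes only has to invert the table, which would explain why the HuggingFace vocab never needs '�'.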

Do you know why this is? Thanks again~