huggingface/tokenizers

Access utf-8 byte sequence for each token

DanielHesslow opened this issue · 2 comments

Hi,

It would be great if it were possible to get the UTF-8 byte sequence corresponding to each token id.
Since tokenizers return strings, tokens which are not valid Unicode strings by themselves will contain � (U+FFFD, the replacement character) on decode.

This makes streaming and constrained generation, for example, much more difficult and error-prone than they need to be.

Additionally, if we can get the UTF-8 byte sequence, decoding also gets much easier and faster, as it's simply a matter of concatenating the corresponding bytes.
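
To make the problem concrete, here is a minimal sketch in plain Python. It does not use any tokenizers API; the per-token byte split in `token_bytes` is a made-up stand-in for what a byte-level tokenizer might produce when a multi-byte character straddles two tokens.

```python
import codecs

# Hypothetical per-token byte sequences for the text "héllo".
# The two-byte UTF-8 encoding of "é" (0xC3 0xA9) is split across two tokens.
token_bytes = [b"h\xc3", b"\xa9l", b"lo"]

# Decoding each token on its own produces U+FFFD for the broken halves.
per_token = [t.decode("utf-8", errors="replace") for t in token_bytes]

# With access to the raw bytes, decoding is just concatenation.
full = b"".join(token_bytes).decode("utf-8")

# For streaming, an incremental decoder buffers incomplete sequences
# and only emits text once a character is complete.
decoder = codecs.getincrementaldecoder("utf-8")()
streamed = [decoder.decode(t) for t in token_bytes]

print(per_token)
print(full)
print(streamed)
```

The incremental decoder is the part that matters for streaming: nothing is emitted for a dangling lead byte until the continuation byte arrives.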

Cheers,

Hey,

I ran into this issue, and wrote a blog post about it: https://stephantul.github.io/python/tokenizers/2023/03/16/bpe/

You can't take the byte representation of a token directly from the vocabulary. In a byte-level BPE vocabulary, each byte is stored as a printable stand-in character, so you first have to map every character of the token string back to its byte using a specific char map. Once you have the raw bytes, you can keep concatenating them across tokens and decode the result.
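
As a rough sketch of that remapping, assuming a GPT-2-style byte-level BPE vocabulary: the table below follows the byte-to-character scheme used by GPT-2, and the names `bytes_to_unicode` and `token_to_bytes` are illustrative helpers, not part of the tokenizers library. It will not be right for tokenizers that use a different scheme.

```python
def bytes_to_unicode():
    """Build the GPT-2-style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # All remaining bytes (control characters, space, ...) are shifted to 256 and up.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Invert the table: stand-in character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token: str) -> bytes:
    """Map a vocabulary token string back to the raw bytes it stands for."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


# "Ġ" is the stand-in for the space byte; "Ã©" stands for the two bytes of "é".
tokens = ["Ġcaf", "Ã©"]
raw = b"".join(token_to_bytes(t) for t in tokens)
print(raw.decode("utf-8"))
```

A character that is not in the table raises a `KeyError`, which is a useful signal that the vocabulary is not using this mapping.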

I hope this helps!

Unfortunately, this remapping is not correct for all tokenizers, and there isn't actually a single mapping. Doing it correctly requires treating each internal decoder separately. It's entirely possible, but it is error-prone and liable to break when the library changes. It really needs to be part of the library.