huggingface/tokenizers

Special token handling breaks idempotency of sentencepiece due to extra spaces

cat-state opened this issue · 4 comments

SentencePiece tokenizers have the property that Decode(Encode(Normalize(input))) == Normalize(input). This property is very useful when combining and re-inferring prompts. However, when used through `tokenizers` with special tokens added for BOS/EOS etc., the tokenizer injects an extra space around special tokens when decoding: i.e., `<s>A` becomes `<s> A`, which when encoded and decoded again becomes `<s>  A`, then `<s>   A`, and so on, gaining a space on every round trip.
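To make the property concrete, here is a minimal sketch using the sentencepiece package directly (the model file path is illustrative, e.g. a Llama `tokenizer.model`):

```python
import sentencepiece as spm

# Illustrative path: any SentencePiece model file, e.g. from a Llama checkpoint.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "Hello world"
ids = sp.encode(text)
# Raw SentencePiece round-trips cleanly: decode(encode(x)) == x
# for already-normalized input.
assert sp.decode(ids) == text
```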

A previous issue was raised about this but was incorrectly closed as intended behavior/unfixable: #1237. Although not all tokenizers have this property, SentencePiece is now very widely used due to Llama and Mistral, so it would make sense to preserve it.

There are two possible fixes for this: either don't add the extra space, or tokenize `<s> A` the same as `<s>A` (I think the latter could be accomplished by changing the AddedToken params for these tokens).
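For the second option, a sketch of what tweaking the AddedToken params might look like; treat `rstrip=True` here as an unverified assumption that the special-token match would absorb the whitespace to its right, making `<s> A` and `<s>A` encode identically:

```python
from tokenizers import AddedToken
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-13b-delta-v1.1")

# Assumption: with rstrip=True, the space that decoding injects after the
# special token is consumed by the token match on the next encode.
tokenizer.add_special_tokens({
    "bos_token": AddedToken("<s>", lstrip=False, rstrip=True),
    "eos_token": AddedToken("</s>", lstrip=False, rstrip=True),
})
```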

Do you have a reproducer?
I'd love to fix it, but I'm not sure this is still happening.

Llama-based tokenizers don't have this issue anymore; it was fixed by the Metaspace refactoring.

Are you using `legacy=False`? (Mistral does not.)
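For reference, the flag is passed at load time; a sketch (the checkpoint name is illustrative):

```python
from transformers import LlamaTokenizer

# legacy=False opts into the post-refactor Metaspace handling of spaces
# around special tokens; omitting it falls back to the legacy behavior.
tokenizer = LlamaTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", legacy=False)
```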

Also, the snippet shared:

```python
from transformers import LlamaTokenizer

model_id = "lmsys/vicuna-13b-delta-v1.1"
tokenizer = LlamaTokenizer.from_pretrained(model_id, add_bos_token=False)

message = "<s>hello</s>"
# Round-trip the message through encode + decode and compare.
decoded = tokenizer.decode(tokenizer(message)["input_ids"])
print(decoded, decoded == message)
```

this is on the transformers side, not tokenizers. I'll open a PR right away; it's super weird that it was not caught until now.