huggingface/tokenizers

Special token handling breaks idempotency of sentencepiece due to extra spaces

cat-state opened this issue · 4 comments

SentencePiece tokenizers have the property that Decode(Encode(Normalize(input))) == Normalize(input). This property is very useful when combining and re-inferring prompts. However, when used through `tokenizers` with special tokens added for BOS/EOS etc., the tokenizer injects an extra space around special tokens when decoding: i.e., `<s>A` becomes `<s> A`, which when encoded and decoded again becomes `<s>  A`, then `<s>   A`, and so on, gaining a space on every round trip.
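To make the property concrete, here is a minimal sketch using the sentencepiece package directly (the model file path is illustrative, e.g. a Llama `tokenizer.model`):

```python
import sentencepiece as spm

# Illustrative path: any SentencePiece model file, e.g. from a Llama checkpoint.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "Hello world"
ids = sp.encode(text)
# Raw SentencePiece round-trips cleanly: decode(encode(x)) == x
# for already-normalized input.
assert sp.decode(ids) == text
```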

A previous issue was raised about this but was incorrectly closed as intended behavior/unfixable: #1237. Although not all tokenizers have this property, SentencePiece is now very widely used due to Llama and Mistral, so it would make sense to preserve it.

There are two possible fixes for this: either don't add the extra space, or tokenize `<s> A` the same as `<s>A` (I think the latter could be accomplished by changing the AddedToken params for these tokens).
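For the second option, a sketch of what tweaking the AddedToken params might look like; treat `rstrip=True` here as an unverified assumption that the special-token match would absorb the whitespace to its right, making `<s> A` and `<s>A` encode identically:

```python
from tokenizers import AddedToken
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-13b-delta-v1.1")

# Assumption: with rstrip=True, the space that decoding injects after the
# special token is consumed by the token match on the next encode.
tokenizer.add_special_tokens({
    "bos_token": AddedToken("<s>", lstrip=False, rstrip=True),
    "eos_token": AddedToken("</s>", lstrip=False, rstrip=True),
})
```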

Do you have a reproducer?
I'd love to fix it, but I'm not sure this is still happening.

Llama-based tokenizers don't have this issue anymore; it was fixed by the Metaspace refactoring.

Are you using `legacy=False`? (Mistral does not.)
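For reference, the flag is passed at load time; a sketch (the checkpoint name is illustrative):

```python
from transformers import LlamaTokenizer

# legacy=False opts into the post-refactor Metaspace handling of spaces
# around special tokens; omitting it falls back to the legacy behavior.
tokenizer = LlamaTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", legacy=False)
```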

Also, the snippet shared:

```python
from transformers import LlamaTokenizer

model_id = "lmsys/vicuna-13b-delta-v1.1"
tokenizer = LlamaTokenizer.from_pretrained(model_id, add_bos_token=False)

message = "<s>hello</s>"
# Round-trip the message through encode + decode and compare.
decoded = tokenizer.decode(tokenizer(message)["input_ids"])
print(decoded, decoded == message)
```

this is on the transformers side, not tokenizers. I'll open a PR right away; it's super weird that it was not caught until now.