huggingface/tokenizers

Different output of AutoTokenizer from that of T5Tokenizer

sm745052 opened this issue · 1 comments

  • transformers version: 4.38.1
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.8.1 (cpu)
  • Jax version: 0.4.23
  • JaxLib version: 0.4.23
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Hi, I found out that after adding a new token, say `<tk>`, the two tokenizers behave differently.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-base")
tokenizer.add_tokens("<tk>")
text = 'hello<tk>'
encoded = tokenizer.encode(text, return_tensors='pt')
decoded_text = tokenizer.decode(encoded[0])
print("Decoded text:", decoded_text)

gives

Decoded text: hello<tk></s>

whereas

from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
tokenizer.add_tokens("<tk>")
text = 'hello<tk>'
encoded = tokenizer.encode(text, return_tensors='pt')
decoded_text = tokenizer.decode(encoded[0])
print("Decoded text:", decoded_text)

gives

Decoded text: hello <tk> </s>

Hey, you should see the following warning:

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False) should be used. Also:

- decoded_text = tokenizer.decode(encoded[0])
+ decoded_text = tokenizer.decode(encoded[0], spaces_between_special_tokens=False)

Closing, as this is related to transformers, not tokenizers.