allenai/unifiedqa

AutoTokenizer broken for v2 models

shwang opened this issue · 2 comments

T5ForConditionalGeneration.from_pretrained(model_name) works fine, but AutoTokenizer.from_pretrained(model_name) fails with this error:

~/apps/miniconda3/envs/safe-gpt/lib/python3.8/site-packages/transformers/models/t5/tokenization_t5_fast.py in __init__(self, vocab_file, tokenizer_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, **kwargs)
    126                 )
    127 
--> 128         super().__init__(
    129             vocab_file,
    130             tokenizer_file=tokenizer_file,

~/apps/miniconda3/envs/safe-gpt/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
    106         elif fast_tokenizer_file is not None and not from_slow:
    107             # We have a serialization from tokenizers which let us directly build the backend
--> 108             fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    109         elif slow_tokenizer is not None:
    110             # We need to convert a slow tokenizer to build the backend

Exception: No such file or directory (os error 2)

To reproduce the error, run this snippet:

from transformers import AutoTokenizer, T5ForConditionalGeneration
model_name = "allenai/unifiedqa-v2-t5-small-1251000"

model = T5ForConditionalGeneration.from_pretrained(model_name)  # OK
tokenizer = AutoTokenizer.from_pretrained(model_name)  # Fails
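
The fast_tokenizer_file in the traceback is the serialized tokenizer.json, so my guess is that the v2 repos don't host that file (os error 2 is "file not found"). One way to check what the repo actually contains — a sketch using huggingface_hub's list_repo_files, which is part of the hub client library, not this repo:

    from huggingface_hub import list_repo_files

    # If "tokenizer.json" is missing from this list, the fast tokenizer
    # backend has no serialized file to load, which would explain os error 2.
    files = list_repo_files("allenai/unifiedqa-v2-t5-small-1251000")
    print("tokenizer.json" in files, files)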


Thanks for bringing this to our attention.

I am actually not sure why this is failing, since it seems to work just fine with T5Tokenizer:

    from transformers import T5Tokenizer, T5ForConditionalGeneration
    model_name = "allenai/unifiedqa-v2-t5-small-1251000"

    model = T5ForConditionalGeneration.from_pretrained(model_name)  # OK
    tokenizer = T5Tokenizer.from_pretrained(model_name)  # Works okay too  ¯\_(ツ)_/¯

For now, I'd suggest using T5Tokenizer instead.
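
If you'd rather keep the AutoTokenizer call site, passing use_fast=False should select that same slow T5Tokenizer under the hood — a minimal sketch, not specifically tested against this model:

    from transformers import AutoTokenizer

    model_name = "allenai/unifiedqa-v2-t5-small-1251000"

    # use_fast=False makes AutoTokenizer return the slow, SentencePiece-based
    # T5Tokenizer, bypassing the fast backend that raises the error above.
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
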
Re: making plain AutoTokenizer work, I'm not sure whether there is something we can do on our end or whether this is something the HF folks should fix.
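
If the root cause really is a missing serialized tokenizer.json in the model repos, one thing we could try on our end is to rebuild the fast tokenizer from the slow vocab and upload the files it saves. A hedged sketch — it relies on transformers' from_slow keyword and a local sentencepiece install, and the output directory name is just an example:

    from transformers import T5TokenizerFast

    model_name = "allenai/unifiedqa-v2-t5-small-1251000"

    # from_slow=True forces a fresh conversion from the slow SentencePiece
    # vocab instead of loading a serialized tokenizer.json from the repo.
    tokenizer = T5TokenizerFast.from_pretrained(model_name, from_slow=True)

    # save_pretrained writes out tokenizer.json (among other files); uploading
    # that to the model repo should let plain AutoTokenizer work again.
    tokenizer.save_pretrained("unifiedqa-v2-t5-small-fixed")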