gsarti/t5-flax-gcp

Tokenizer suggestion

TinoRed opened this issue · 1 comment

Hi @gsarti !

I am currently using your model t5-base-it for a project, and I was wondering if you could suggest to me (and future users) which tokenizer class to use, since from your code I understand you are using a custom class, SentencePieceUnigramTokenizer.

My current setup is the following:

  • tokenizer = AutoTokenizer.from_pretrained('gsarti/t5-base-it')
  • model = T5ForConditionalGeneration.from_pretrained('gsarti/t5-base-it')

I have started retraining the model for my task and have already obtained some initial results with this setup.

Your suggestions would be very valuable!

Thanks in advance!

Hi @TinoRed, thanks for reaching out!

First of all, the project is still a work in progress, but I highly suggest using it5-base instead of the t5-base-it model you mentioned.1 I will update the model cards in the near future with more information, but the it5-base version of the model was trained for much longer and on much more text than its counterpart, so it should yield considerably better performance. Other sizes are also available on my HF Hub page.

The setup you mention is actually the optimal one: the SentencePieceUnigramTokenizer class is used only for training, but the result is saved as a standard T5Tokenizer model, so it can be loaded with the AutoTokenizer.from_pretrained method, which is automatically rerouted to T5TokenizerFast.from_pretrained. You can safely ignore any related warnings.
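To make this concrete, here is a minimal sketch of the recommended setup with the it5-base checkpoint (the example sentence is arbitrary; loading the model weights requires a sizeable download):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# AutoTokenizer resolves to T5TokenizerFast, backed by the SentencePiece
# vocabulary saved with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-base")
print(type(tokenizer).__name__)  # T5TokenizerFast

model = T5ForConditionalGeneration.from_pretrained("gsarti/it5-base")

# Round-trip an Italian sentence to sanity-check the vocabulary.
ids = tokenizer("Buongiorno a tutti!", return_tensors="pt").input_ids
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

Any warnings emitted during loading (e.g. about the slow-to-fast tokenizer conversion) can be ignored, as noted above.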

Footnotes

  1. Please note that as mentioned on the gsarti/t5-base-it model card, the current model name will be deprecated and the new model will be found at gsarti/it5-base-oscar starting from October 23rd, 2021.