Tokenizer suggestion
TinoRed opened this issue · 1 comment
Hi @gsarti !
I am currently using your model t5-base-it for a project, and I was wondering if you could suggest to me (and to future users) which tokenizer class to use, since from your code I understand you are using a custom class, SentencePieceUnigramTokenizer.
My current setup is the following:
- tokenizer = AutoTokenizer.from_pretrained('gsarti/t5-base-it')
- model = T5ForConditionalGeneration.from_pretrained('gsarti/t5-base-it')
I have started retraining the model for my task and have already obtained some initial results with this setup.
Your suggestions would be very valuable!
Thanks in advance!
Hi @TinoRed, thanks for reaching out!
First of all, the project is still a work in progress, but I highly suggest using it5-base instead of the t5-base-it model you mentioned [1]. I will update the model cards in the near future with more information, but the it5-base version of the model was trained for much longer and on much more text than its counterpart, so it should perform considerably better. Other sizes are also available on my HF Hub page.
The setup you mention is actually the optimal one: the model uses the SentencePieceUnigramTokenizer class only for training, but we save the tokenizer as a standard T5Tokenizer model, so it can be loaded with the AutoTokenizer.from_pretrained method, which is rerouted automatically to T5TokenizerFast.from_pretrained. You can safely ignore any resulting warnings.
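For future readers, here is a minimal sketch of that loading setup in Python. It only wraps the two from_pretrained calls from the question. The helper name load_it5 and the constant DEFAULT_MODEL_ID are hypothetical, and the ID "gsarti/it5-base" assumes the recommended checkpoint sits under the same gsarti/ namespace as the model in the question, so check the Hub page for the exact name before relying on it.

```python
# Assumption: the recommended checkpoint uses the same "gsarti/" namespace
# as the model in the original question; verify the exact ID on the Hub.
DEFAULT_MODEL_ID = "gsarti/it5-base"


def load_it5(model_id: str = DEFAULT_MODEL_ID) -> tuple:
    """Return (tokenizer, model) for a T5-style checkpoint on the HF Hub.

    Hypothetical helper: it does nothing beyond the two calls shown in the
    question, with the checkpoint ID made configurable.
    """
    # Imported here so that merely importing this module does not require
    # transformers to be installed.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    # AutoTokenizer picks the concrete tokenizer class from the saved
    # tokenizer files, so no custom tokenizer class is needed at load time.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = T5ForConditionalGeneration.from_pretrained(model_id)
    return tokenizer, model
```

Calling load_it5() downloads the checkpoint on first use, so it needs network access plus the transformers and PyTorch packages. Printing type(tokenizer).__name__ on the returned tokenizer is a quick way to see which class AutoTokenizer actually resolved to in your environment.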
Footnotes
1. Please note that, as mentioned on the gsarti/t5-base-it model card, the current model name will be deprecated and the new model will be found at gsarti/it5-base-oscar starting from October 23rd, 2021.