Helsinki-NLP/Tatoeba-Challenge

Batch-mode prediction


Hi,

Thank you for providing these tremendous resources. I'm currently trying to use the models that were uploaded to Hugging Face (e.g., this one).

Is it expected that tokenizing/generating in batch mode does not work?
See the example below:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
inputs = tokenizer.encode("mango manzana y pera", return_tensors="pt")
inputs

tensor([[34090, 29312, 11, 306, 75, 0]])

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
inputs = tokenizer.encode(["mango manzana y pera"], return_tensors="pt")
inputs

tensor([[1, 0]])
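
For comparison, this is the batched call I would have expected to work. It is only a rough sketch: it calls the tokenizer directly with padding=True instead of using encode() on a list, and the second sentence ("hola mundo") is just a placeholder input I added for illustration.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")

# Call the tokenizer itself on a list of sentences; padding=True pads the
# batch to a common length and returns an attention mask alongside input_ids.
batch = tokenizer(["mango manzana y pera", "hola mundo"],
                  return_tensors="pt", padding=True)

# generate() takes the padded input_ids and attention_mask as keyword arguments.
outputs = model.generate(**batch)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))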


I am not sure how compatible the Hugging Face tokenizers are with the SentencePiece unigram models that we provide for the models that have been converted to their interface. This would be a question to ask at Hugging Face. Good luck!