[Error Message] Improve error message in SentencepieceTokenizer when arguments are not expected.
preeyank5 opened this issue · 7 comments
Description
While using tokenizers.create with the model and vocab file for a custom corpus, the code throws an error and is unable to generate the BERT vocab file.
Error Message
ValueError: Mismatch vocabulary! All special tokens specified must be control tokens in the sentencepiece vocabulary.
To Reproduce
from gluonnlp.data import tokenizers
tokenizers.create('spm', model_path='lsw1/spm.model', vocab_path='lsw1/spm.vocab')
Actually I can load the model:
import gluonnlp
from gluonnlp.data.tokenizers import SentencepieceTokenizer
tokenizer = SentencepieceTokenizer(model_path='spm.model', vocab='spm.vocab')
print(tokenizer)
Output:
SentencepieceTokenizer(
model_path = /home/ubuntu/spm.model
lowercase = False, nbest = 0, alpha = 0.0
vocab = Vocab(size=3500, unk_token="<unk>", bos_token="<s>", eos_token="</s>", pad_token="<pad>")
)
@preeyank5 Would you try again?
I find that the root cause is that we need better error handling of the **kwargs here. Basically, the argument should be vocab instead of vocab_path, so vocab_path ends up silently absorbed into **kwargs.
The way to fix the issue is to revise gluon-nlp/src/gluonnlp/data/tokenizers/sentencepiece.py, lines 99 to 101 (commit 08dc6ed).
Marked it as a "good first issue" because it is a good starting point for early contributors. We just need to ensure that a clear error is raised when kwargs contains unexpected values.
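A minimal sketch of such a check (function and argument names here are illustrative, not the actual gluon-nlp implementation):

```python
def create_tokenizer(model_path, vocab=None, **kwargs):
    """Hypothetical constructor that rejects unexpected keyword arguments
    instead of silently swallowing them in **kwargs."""
    if kwargs:
        unexpected = ', '.join(sorted(kwargs))
        raise ValueError(
            f'Unexpected keyword argument(s): {unexpected}. '
            "Supported keyword arguments are 'model_path' and 'vocab'.")
    return model_path, vocab
```

With a check like this, calling the function with vocab_path would fail immediately with a message naming the bad argument, rather than producing a confusing "Mismatch vocabulary!" error later on.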
Thanks Xingjian, I am now able to load the model.
Let's keep this issue to track the error message. We should raise the error if the user has specified some unexpected kwargs.
Hi, I am new to this project and would like to tackle this issue.
Have you solved it yet?