VinAIResearch/PhoBERT

Non-consecutive added token '{token}' found.

chicuong209 opened this issue · 10 comments

As the title says, I get the error below when using PhoBertTokenizer for a Vietnamese question answering task. Could you please help me fix it? Thank you.
f"Non-consecutive added token '{token}' found. "
AssertionError: Non-consecutive added token '<mask>' found. Should have index 5 but has index 64000 in saved vocabulary.
By the way, I tried setting self.encoder[self.mask_token] = 4, and the training process then runs normally, but that doesn't seem like the right way to fix it.
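
For anyone reading the assertion above: the message looks like a check that added tokens must continue the index sequence of the vocabulary already loaded. The sketch below is only an assumption about that logic, not the transformers source; `check_added_tokens` and the sample values are hypothetical.

```python
def check_added_tokens(vocab_size, added_tokens):
    """Verify that added-token indices continue directly after the base vocabulary.

    vocab_size: number of tokens already in the tokenizer.
    added_tokens: mapping of token string -> index stored in the saved vocabulary.
    Returns the vocabulary size after all added tokens are accepted.
    """
    next_index = vocab_size
    # Walk the added tokens in order of their saved index.
    for token, index in sorted(added_tokens.items(), key=lambda item: item[1]):
        if index != next_index:
            raise AssertionError(
                f"Non-consecutive added token '{token}' found. "
                f"Should have index {next_index} but has index {index} in saved vocabulary."
            )
        next_index += 1
    return next_index
```

In this sketch, `<mask>` saved at index 64000 is accepted when 64000 tokens are already loaded, and rejected with the message above when only 5 are. That would be consistent with the reloaded dictionary being incomplete, though that is a guess.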

Could you please provide more details (data, scripts, etc., as much as you can)?
Did you perhaps save a dictionary and then try to reload it?

  1. The data is the SQuAD v1.1 dataset translated into Vietnamese. I use run_squad.py from the huggingface examples, but I call PhoBertConfig, PhoBertTokenizer and PhoBertModelForQuestionAnswering directly instead of using AutoConfig, AutoTokenizer, and AutoModelForQuestionAnswering.
  2. Yes, the dictionary is saved and reloaded.

Then you should skip step 2 (saving and reloading the dictionary).
Download the dictionary and bpe files from https://huggingface.co/vinai/phobert-base#list-files and load the tokenizer using: tokenizer=PhoBertTokenizer(path-to-dictionary-file, path-to-bpe-file)
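
A minimal sketch of that loading step, assuming the two downloaded files are named vocab.txt and bpe.codes and are stored locally. The file names, the placeholder directory, and the helper `load_phobert_tokenizer` are assumptions; adjust them to whatever you actually downloaded.

```python
from pathlib import Path


def load_phobert_tokenizer(vocab_path, bpe_path):
    """Build a PhoBertTokenizer directly from the downloaded dictionary and bpe files."""
    vocab_path, bpe_path = Path(vocab_path), Path(bpe_path)
    # Fail early with a clear message if either file is missing.
    for path in (vocab_path, bpe_path):
        if not path.is_file():
            raise FileNotFoundError(f"Expected tokenizer file not found: {path}")
    # Imported here so the path check runs before transformers is loaded.
    from transformers import PhoBertTokenizer
    return PhoBertTokenizer(str(vocab_path), str(bpe_path))


# Example call (paths are placeholders):
# tokenizer = load_phobert_tokenizer("phobert-base/vocab.txt", "phobert-base/bpe.codes")
```

Constructing the tokenizer from the original files this way avoids the save-and-reload step entirely.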

Why would you need to save and reload the dictionary? It's pretty weird :|

ok. I'll try it now

I have the same issue. The PhoBERT model loads fine, but the tokenizer is not found.
The error is as below:
OSError: Model name 'vinai/phobert-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/phobert-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
I think many people will run into this issue, so I'm posting it here :D Thanks for your kind response :D

Please install transformers from its latest source:

git clone https://github.com/huggingface/transformers.git
cd transformers
pip3 install --upgrade .

And also clean/remove your transformers folder in ~/.cache/torch, so it'd automatically re-download PhoBERT properly. It should work.
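
If you prefer to clear the cache from Python, here is a small sketch. It assumes the cache lives at ~/.cache/torch/transformers, as described above; the helper `remove_cache_dir` is hypothetical, and the actual call is left commented out so nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def remove_cache_dir(cache_dir):
    """Delete cache_dir if it exists; return True if something was removed."""
    cache_dir = Path(cache_dir).expanduser()
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


# Location taken from the comment above; uncomment to actually delete it.
# remove_cache_dir("~/.cache/torch/transformers")
```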

@chicuong209 if you run into any problem, you might want to follow the instructions above. I'm pretty sure PhoBERT will then work without any loading issue.

Thank you, it works for me.

Thank you, it works for me. The problem was that I thought PhoBERT would be available automatically once I installed transformers from pip.