Non-consecutive added token '{token}' found.

Question

chicuong209 opened this issue 4 years ago · 10 comments

As in the title: I get the error below when using PhoBertTokenizer for a Vietnamese question answering task. Could you please help me fix it? Thank you.
    f"Non-consecutive added token '{token}' found. "
AssertionError: Non-consecutive added token '<mask>' found. Should have index 5 but has index 64000 in saved vocabulary.
By the way, I tried setting self.encoder[self.mask_token] = 4. Training then runs normally, but it doesn't seem like the right fix.
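
For readers who hit the same assertion, here is a minimal plain-Python sketch of the kind of consistency check that produces this message. It is not the transformers source. The function name check_added_tokens is hypothetical, and the base vocabulary size of 64000 is an assumption inferred from the index in the error text. The idea is that every added token must take the next free index after the base vocabulary, so if the base vocabulary loads with far fewer entries than when it was saved, a token stored at index 64000 no longer lines up.

```python
def check_added_tokens(base_vocab_size, added_tokens):
    """Raise AssertionError if added-token indices do not continue the base vocab.

    Hypothetical helper for illustration only, not part of transformers.
    """
    expected = base_vocab_size
    # Walk the added tokens in index order; each must take the next free slot.
    for token, index in sorted(added_tokens.items(), key=lambda kv: kv[1]):
        assert index == expected, (
            f"Non-consecutive added token '{token}' found. "
            f"Should have index {expected} but has index {index} in saved vocabulary."
        )
        expected += 1


# Assumed case 1: the base vocabulary loaded with 64000 entries, so <mask> fits.
check_added_tokens(64000, {"<mask>": 64000})

# Assumed case 2: only 5 entries were loaded, so the same saved index is rejected.
try:
    check_added_tokens(5, {"<mask>": 64000})
except AssertionError as err:
    print(err)
```

Under these assumptions, forcing the mask token to a small index only hides the mismatch. The likelier fix is to make sure the full dictionary file is actually loaded.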

Answer 1 · 2020-09-16T04:13:25.000Z

Could you please provide more details (data, scripts, ... as much as you can)?
You probably saved a dictionary and then tried to reload it?

Answer 2 · 2020-09-16T04:36:52.000Z

The data is the SQuAD v1.1 dataset translated into Vietnamese. I use run_squad.py from the Hugging Face examples, but I call PhoBertConfig, PhoBertTokenizer and PhoBertModelForQuestionAnswering directly instead of using AutoConfig, AutoTokenizer, and AutoModelForQuestionAnswering.
Yes, the dictionary is saved and reloaded.

Answer 3 · 2020-09-16T04:42:24.000Z

Then you should skip step 2.
Download the dictionary and BPE files from https://huggingface.co/vinai/phobert-base#list-files and load the tokenizer using: tokenizer=PhoBertTokenizer(path-to-dictionary-file, path-to-bpe-file)
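
A minimal sketch of that suggestion, assuming the two downloaded files are named vocab.txt and bpe.codes (the names used in the comment at the end are placeholders, not confirmed in this thread). The helper load_phobert_tokenizer is hypothetical. It checks that both files exist before building the tokenizer, so a wrong path fails early with a clear message.

```python
from pathlib import Path


def load_phobert_tokenizer(vocab_path, bpe_path):
    """Build a PhoBertTokenizer from a local dictionary file and BPE codes file.

    Hypothetical convenience wrapper, not part of transformers.
    """
    vocab_path, bpe_path = Path(vocab_path), Path(bpe_path)
    for path in (vocab_path, bpe_path):
        if not path.is_file():
            raise FileNotFoundError(f"Missing tokenizer file: {path}")
    # Imported lazily so the path check also works where transformers is absent.
    from transformers import PhoBertTokenizer

    # Positional arguments: dictionary file first, BPE file second.
    return PhoBertTokenizer(str(vocab_path), str(bpe_path))


# Example call (assumed local paths; adjust to where the files were saved):
# tokenizer = load_phobert_tokenizer("phobert-base/vocab.txt", "phobert-base/bpe.codes")
```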

Answer 4 · 2020-09-16T04:43:24.000Z

How come you need to save and reload the dictionary? That's pretty weird :|

Answer 5 · 2020-09-16T05:01:34.000Z

OK, I'll try it now.

Answer 6 · 2020-09-21T02:48:04.000Z

I have the same issue. The PhoBERT model loads fine, but the tokenizer is not found.
The error is below:
OSError: Model name 'vinai/phobert-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/phobert-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
I think many people will run into this issue, so I'm posting it here :D Thanks for your kind response :D

Answer 7 · 2020-09-21T03:07:27.000Z

Please install transformers from its latest source:

git clone https://github.com/huggingface/transformers.git
cd transformers
pip3 install --upgrade .

And also clean/remove your transformers folder in ~/.cache/torch, so it'd automatically re-download PhoBERT properly. It should work.
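
The cache cleanup can also be sketched in Python. This assumes the cache lives at ~/.cache/torch/transformers, which is read from the sentence above and may differ on other transformers versions or platforms. The helper clear_transformers_cache is hypothetical and is deliberately not called automatically, because it deletes files.

```python
import shutil
from pathlib import Path


def clear_transformers_cache(cache_dir=None):
    """Delete the cached downloads; return True if a directory was removed.

    Hypothetical helper for illustration only. The default path is an assumption.
    """
    if cache_dir is None:
        cache_dir = Path.home() / ".cache" / "torch" / "transformers"
    cache_dir = Path(cache_dir)
    if cache_dir.is_dir():
        # Removes the directory and everything inside it.
        shutil.rmtree(cache_dir)
        return True
    return False


# Example call (destructive, so left commented out):
# clear_transformers_cache()
```

After the cache is removed, the next load of PhoBERT should fetch fresh copies of the files.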

Answer 8 · 2020-09-21T03:17:55.000Z

@chicuong209 if you still have any problem, you might want to follow the instructions above. I'm pretty sure PhoBERT will then work without any loading issue.

Answer 9 · 2020-09-21T06:22:53.000Z

Thank you, it works for me.

Answer 10 · 2020-09-21T06:24:22.000Z

Thank you, it works for me. The problem was that I had assumed PhoBERT would be downloaded automatically when I installed transformers with pip.