Warning when saving vocabulary

Question

Closed this issue 5 years ago · 1 comments

The following warning appears when saving the vocabulary:

>>> tokenizer.save_vocabulary(model_dir)
Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!

Answer 1 · 2019-07-03T07:41:54.000Z

The vocabulary contains some special whitespace characters. The "pytorch-pretrained-bert" package treats these characters as empty strings, so we suggest that you either write your own tokenizer or simply ignore the warning. We will remove these characters from vocab.txt later.
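
To see which entries are affected, the following is a minimal sketch rather than the package's own code. It assumes the loader strips whitespace from each line of vocab.txt and uses the result as a dictionary key, so lines that hold only whitespace all end up as the same empty token. The function names `find_colliding_tokens` and `find_colliding_tokens_in_file` are hypothetical helpers introduced here for illustration.

```python
from collections import defaultdict


def find_colliding_tokens(lines):
    """Group line indices by the token left after str.strip().

    Assumption: the vocabulary loader strips each line before using it
    as a key, so two lines with the same stripped text share one entry.
    """
    seen = defaultdict(list)
    for index, line in enumerate(lines):
        seen[line.strip()].append(index)
    # Keep only tokens that more than one line maps to.
    return {token: idxs for token, idxs in seen.items() if len(idxs) > 1}


def find_colliding_tokens_in_file(path):
    """Run the same check on a vocabulary file, one token per line."""
    with open(path, encoding="utf-8") as handle:
        return find_colliding_tokens(list(handle))


if __name__ == "__main__":
    # Two different Unicode spaces both strip down to the empty string.
    sample = ["[PAD]\n", "\u3000\n", "\u2003\n", "the\n"]
    print(find_colliding_tokens(sample))
```

If the check reports several line indices under the empty string, those lines would collapse into a single vocabulary entry under the assumption above, which would leave gaps in the saved indices.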