thunlp/OpenCLaP

Warning when saving vocabulary

Closed · 1 comment

The following warning appears when saving the vocabulary:

>>> tokenizer.save_vocabulary(model_dir)
Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!

The vocabulary contains some special whitespace characters. The "pytorch-pretrained-bert" package treats these as empty strings, so we suggest you either write your own tokenizer or simply ignore the warning. We will remove these characters from vocab.txt later.
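
For anyone who wants to see which entries are affected, here is a minimal sketch that scans vocabulary lines and flags those that become empty, or that duplicate an earlier token, once surrounding whitespace is stripped. It assumes the loader strips each line with Python's `str.strip()` before storing it, which I have not confirmed against the package source. The helper name `find_problem_lines` and the sample tokens are hypothetical.

```python
def find_problem_lines(lines):
    """Return (line_index, raw_token) pairs for lines whose stripped token
    is empty or repeats an earlier token.

    Assumption: the vocabulary loader calls str.strip() on every line, so
    whitespace-only tokens all collapse to the same empty string.
    """
    seen = set()
    problems = []
    for index, raw in enumerate(lines):
        raw = raw.rstrip("\n")   # drop only the line terminator
        token = raw.strip()      # strips Unicode whitespace such as U+3000
        if token == "" or token in seen:
            problems.append((index, raw))
        else:
            seen.add(token)
    return problems


# Hypothetical sample; a real run would iterate over
# open("vocab.txt", encoding="utf-8") instead.
sample = ["[PAD]\n", "\u3000\n", "的\n", "\u00a0\n", "的\n"]
problems = find_problem_lines(sample)
for index, raw in problems:
    print(index, repr(raw))
```

If several lines collapse to the same string, they would share a single dictionary key in a token-to-index mapping, leaving gaps in the index sequence. That is consistent with the "indices are not consecutive" message, though the exact cause should be checked against the installed version of the package.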