`BertWordPieceTokenizer` not saving with `sep_token` marked
AngledLuffa opened this issue · 2 comments
If I run the following, then try to reload the tokenizer using `from_file`, I get an error of `sep_token` not being part of the vocabulary:
```python
from tokenizers import BertWordPieceTokenizer

paths = [...]

tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=paths, vocab_size=32000, min_frequency=3,
                special_tokens=["[UNK]", "[PAD]", "[SEP]", "[MASK]", "[CLS]"])
tokenizer.save(OUT_FILE)
tokenizer.save_model("./zzzzz", "test")
tokenizer = BertWordPieceTokenizer.from_file(OUT_FILE)
```
```
Traceback (most recent call last):
    tokenizer = BertWordPieceTokenizer.from_file(tokenizer_checkpoint)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nlp/scr/horatio/miniconda3/lib/python3.11/site-packages/tokenizers/implementations/bert_wordpiece.py", line 84, in from_file
    return BertWordPieceTokenizer(vocab, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nlp/scr/horatio/miniconda3/lib/python3.11/site-packages/tokenizers/implementations/bert_wordpiece.py", line 57, in __init__
    raise TypeError("sep_token not found in the vocabulary")
TypeError: sep_token not found in the vocabulary
```
I can see that some of the necessary steps, such as adding a `post_processor`, occur if the `vocab` is already specified. Surely I'm not supposed to pass in a `vocab` before training, though... What about adding a `post_processor` in some way? Except I don't see any way to get the special token ids out of the tokenizer, either before or after it's created. Am I expected to pass in the `vocab` before creating the `Tokenizer`? Am I supposed to add the `sep_token` manually, or manually create the `post_processor`?
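For reference, here is the kind of thing I was hoping to do: look the special token ids up with `token_to_id` and wire up a BERT-style post-processor by hand with `TemplateProcessing`. This is only a sketch; the tiny vocab below is made up for illustration, standing in for a trained one:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocab, made up for illustration (a trained vocab would be much larger).
vocab = {"[UNK]": 0, "[PAD]": 1, "[SEP]": 2, "[MASK]": 3, "[CLS]": 4,
         "hello": 5, "world": 6}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Once a vocab exists, the special token ids can be looked up directly.
sep_id = tok.token_to_id("[SEP]")
cls_id = tok.token_to_id("[CLS]")

# Manually recreate the BERT-style post-processing template.
tok.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)

enc = tok.encode("hello world")
print(enc.tokens)  # ['[CLS]', 'hello', 'world', '[SEP]']
```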
If I look at the `.json` file written out, I can see there even is a `[SEP]` token in there, but it's not listed as `sep_token`:
```json
{
  "id": 2,
  "content": "[SEP]",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
```
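That `[SEP]` entry sits in the file's `added_tokens` list, so the information is at least present. A quick stdlib-only sketch of pulling the special token ids back out of the saved JSON (the snippet below uses a made-up, minimal structure, not the full file):

```python
import json

# Minimal stand-in for the relevant slice of tokenizer.json
# (made up for illustration; the real file has many more fields).
saved = json.loads("""
{
  "added_tokens": [
    {"id": 0, "content": "[UNK]", "special": true},
    {"id": 2, "content": "[SEP]", "special": true}
  ]
}
""")

special = {t["content"]: t["id"] for t in saved["added_tokens"] if t["special"]}
print(special)  # {'[UNK]': 0, '[SEP]': 2}
```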
Even if I manually add `sep_token` to the JSON, it doesn't work. So if I then read the tokenizer file back in with the same code path used in `from_file`, such as

```python
>>> vocab = WordPiece.read_file(OUT_FILE)
```

it doesn't properly read the tokenizer vocabulary, so presumably this isn't how I'm supposed to do it.
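(As far as I can tell, `WordPiece.read_file` expects a plain `vocab.txt`, whereas `tokenizer.save` writes a full JSON description; that mismatch would explain the garbage vocab. The generic `Tokenizer.from_file` does seem to reload the full JSON. A small round-trip sketch, using a toy vocab made up for illustration:)

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Toy vocab, made up for illustration.
vocab = {"[UNK]": 0, "[SEP]": 1, "hello": 2}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))

# Save the full tokenizer JSON to a temp file.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tok.save(path)

# The generic loader restores the whole tokenizer, vocab included.
reloaded = Tokenizer.from_file(path)
print(reloaded.token_to_id("[SEP]"))  # 1
```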
If I use the results of `save_model` from above, the file `zzzzz/test-vocab.txt`, then it successfully loads the tokenizer back in. However, that directory only has the one file in it, and interesting pieces like the post-processor have all been lost in the process.
Is there something I'm doing wrong, or some bug with the save / load process in this tokenizer?
As a random aside, after creating the tokenizer, `BertTokenizerFast` and the like can be called directly on a text, such as `tokenizer(text)`. With the `BertWordPieceTokenizer`, that simply isn't possible. What should I call instead?
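(It looks like `encode` is the call that works here; it returns an `Encoding` object with `.ids`, `.tokens`, `.type_ids`, and so on, rather than the dict of tensors the `transformers` tokenizers return. A sketch with a toy vocab made up for illustration:)

```python
from tokenizers import BertWordPieceTokenizer

# Toy vocab, made up for illustration; a trained vocab would be much larger.
vocab = {"[UNK]": 0, "[PAD]": 1, "[SEP]": 2, "[MASK]": 3, "[CLS]": 4,
         "hello": 5, "world": 6}
tok = BertWordPieceTokenizer(vocab=vocab, lowercase=False)

# tok(text) is not supported; use encode() instead.
enc = tok.encode("hello world")
print(enc.tokens)  # ['[CLS]', 'hello', 'world', '[SEP]']
print(enc.ids)     # [4, 5, 6, 2]
```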
Is it possible I'm just not supposed to use `BertWordPieceTokenizer`, and `BertTokenizerFast` should be used instead? If so, how do I train that? `AutoTokenizer` doesn't have any way to load the `zzzzz` directory from above, since it only has the `vocab.txt` file in it and no config file:

```python
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
```

```
OSError: zzzzz does not appear to have a file named config.json. Checkout 'https://huggingface.co/zzzzz/tree/None' for available files.
```
I may have figured out how to build a `BertTokenizerFast`. Basically, you just need to wrap the `BertWordPieceTokenizer` in a `BertTokenizerFast` before saving:

```python
from transformers import BertTokenizerFast

new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
new_tokenizer.save_pretrained("zzzzz")
```

Now the `zzzzz` directory can be loaded for training the new Bert model.
Glad that you found the answer and sorry for not helping earlier!