huggingface/tokenizers

`BertWordPieceTokenizer` not saving with `sep_token` marked

AngledLuffa opened this issue · 2 comments

If I run the following and then try to reload the tokenizer using from_file, I get an error saying sep_token is not part of the vocabulary:

from tokenizers import BertWordPieceTokenizer

paths = [...]
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=paths, vocab_size=32000, min_frequency=3,
                special_tokens=["[UNK]", "[PAD]", "[SEP]", "[MASK]", "[CLS]"])
tokenizer.save(OUT_FILE)                 # writes a single tokenizer .json
tokenizer.save_model("./zzzzz", "test")  # writes zzzzz/test-vocab.txt

tokenizer = BertWordPieceTokenizer.from_file(OUT_FILE)

Traceback (most recent call last):
    tokenizer = BertWordPieceTokenizer.from_file(tokenizer_checkpoint)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nlp/scr/horatio/miniconda3/lib/python3.11/site-packages/tokenizers/implementations/bert_wordpiece.py", line 84, in from_file
    return BertWordPieceTokenizer(vocab, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nlp/scr/horatio/miniconda3/lib/python3.11/site-packages/tokenizers/implementations/bert_wordpiece.py", line 57, in __init__
    raise TypeError("sep_token not found in the vocabulary")
TypeError: sep_token not found in the vocabulary

Looking at the implementation, some of the necessary steps, such as adding a post_processor, only happen when a vocab is passed to the constructor. Surely I'm not supposed to pass in a vocab before training, though... Could I add a post_processor myself in some way? I don't see any obvious way to get the special token ids out of the tokenizer, either before or after it's created.

Am I expected to pass in the vocab before creating the Tokenizer? Am I supposed to add the sep_token manually, or create the post_processor by hand?
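
The closest thing to a workaround I can see is a sketch like the one below (an untested guess on my part, and it reaches into the private _tokenizer attribute, so it doesn't feel like a supported path): look the ids up with token_to_id after training and attach a BertProcessing post-processor by hand.

from tokenizers.processors import BertProcessing

# assuming `tokenizer` is the BertWordPieceTokenizer trained above
sep_id = tokenizer.token_to_id("[SEP]")
cls_id = tokenizer.token_to_id("[CLS]")
# BertProcessing takes the sep pair first, then the cls pair;
# _tokenizer is the underlying tokenizers.Tokenizer inside the wrapper
tokenizer._tokenizer.post_processor = BertProcessing(("[SEP]", sep_id), ("[CLS]", cls_id))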

If I look at the .json file that save() wrote, there is indeed a [SEP] token in it, but nothing marks it as the sep_token:

    {
      "id": 2,
      "content": "[SEP]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },

Even if I manually add sep_token to the json, it doesn't work.

so...

if I then read the tokenizer file back in with the same code path that from_file uses, such as

>>> from tokenizers.models import WordPiece
>>> vocab = WordPiece.read_file(OUT_FILE)

it doesn't properly read the tokenizer vocabulary (read_file expects a plain vocab.txt, one token per line, not the .json that save() writes), so this probably isn't how I'm supposed to do it.

If I instead use the result of save_model from above, the file zzzzz/test-vocab.txt, the tokenizer does load back in successfully. However, that directory only has the one file in it, and interesting pieces like the post_processor have been lost in the process.
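
For reference, this is the reload that works for me; as far as I can tell the special-token handling then gets rebuilt from the constructor defaults rather than read from anything that was saved.

tokenizer = BertWordPieceTokenizer.from_file("./zzzzz/test-vocab.txt", lowercase=False)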

Is there something I'm doing wrong, or some bug with the save / load process in this tokenizer?

As a random aside, after creating the tokenizer, BertTokenizerFast and the like can be called directly on a text, such as tokenizer(text). With the BertWordPieceTokenizer, it simply isn't possible. What should I call instead?
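
The closest equivalent I've found is the lower-level encode API, which returns an Encoding object rather than the BatchEncoding that BertTokenizerFast gives back, e.g.

encoding = tokenizer.encode("some sample text")
print(encoding.tokens)
print(encoding.ids)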

Is it possible I'm just not supposed to use BertWordPieceTokenizer, and BertTokenizerFast should be used instead? If so, how do I train that? AutoTokenizer doesn't have any way to load the zzzzz directory from above, since it only has the vocab.txt file in it and no config file:

tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

OSError: zzzzz does not appear to have a file named config.json. Checkout 'https://huggingface.co/zzzzz/tree/None' for available files.

I may have figured out how to build a BertTokenizerFast

Basically, you just need to wrap the trained BertWordPieceTokenizer in a BertTokenizerFast before saving:

from transformers import BertTokenizerFast

new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
new_tokenizer.save_pretrained("zzzzz")

Now the zzzzz directory can be loaded for training the new Bert model
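
For what it's worth, save_pretrained also writes a tokenizer_config.json next to the vocab, so (at least in my understanding of transformers) loading it back no longer needs a model config.json:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("zzzzz")  # AutoTokenizer.from_pretrained("zzzzz") should work too now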

Glad that you found the answer and sorry for not helping earlier!