ThilinaRajapakse/BERT_binary_text_classification

Help with BertTokenizer.from_pretrained

ChrisPalmerNZ opened this issue · 4 comments

I am having difficulty using BertTokenizer.from_pretrained. This could be related to the fact that I am using pytorch_transformers rather than pytorch_pretrained_bert (I haven't wanted to install the older library); maybe they are functionally different here? Is this something you are happy to look at? By the way, if you would like a version of the BERT notebook reconfigured to use the new library, I am happy to send it through; it's just this one that stumps me...

If I use the suggested default, BertTokenizer.from_pretrained(OUTPUT_DIR + 'vocab.txt', do_lower_case=False), in BERT_eval, I get a JSON decoder error. I believe it could be related to the method expecting a model shortcut name as the first parameter, since I get INFO messages to this effect:

INFO:pytorch_transformers.tokenization_utils:Model name 'outputs/yelp/vocab.txt' not found in model shortcut name list 
(bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, 
bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, 
bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, 
bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). 
Assuming 'outputs/yelp/vocab.txt' is a path or url to a directory containing tokenizer files.
INFO:pytorch_transformers.tokenization_utils:loading file outputs/yelp/vocab.txt
INFO:pytorch_transformers.tokenization_utils:loading file outputs/yelp/vocab.txt
INFO:pytorch_transformers.tokenization_utils:loading file outputs/yelp/vocab.txt

The error:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-18-152bb8aa9c91> in <module>
      1 # Load pre-trained model tokenizer (vocabulary)
----> 2 tokenizer = BertTokenizer.from_pretrained(OUTPUT_DIR + 'vocab.txt', do_lower_case=False)  #'bert-base-cased',

G:\Anaconda3\lib\site-packages\pytorch_transformers\tokenization_bert.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    198                 kwargs['do_lower_case'] = True
    199 
--> 200         return super(BertTokenizer, cls)._from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    201 
    202 

G:\Anaconda3\lib\site-packages\pytorch_transformers\tokenization_utils.py in _from_pretrained(cls, pretrained_model_name_or_path, cache_dir, *inputs, **kwargs)
    232                 kwargs[args_name] = file_path
    233         if special_tokens_map_file is not None:
--> 234             special_tokens_map = json.load(open(special_tokens_map_file, encoding="utf-8"))
    235             for key, value in special_tokens_map.items():
    236                 if key not in kwargs:

G:\Anaconda3\lib\json\__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    297         cls=cls, object_hook=object_hook,
    298         parse_float=parse_float, parse_int=parse_int,
--> 299         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    300 
    301 

G:\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    352             parse_int is None and parse_float is None and
    353             parse_constant is None and object_pairs_hook is None and not kw):
--> 354         return _default_decoder.decode(s)
    355     if cls is None:
    356         cls = JSONDecoder

G:\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
    337 
    338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    340         end = _w(s, end).end()
    341         if end != len(s):

G:\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
    355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 2 (char 1)

Maybe it's due to something different on Windows?

However, if I supply 'bert-base-cased' as my first parameter, then I get the following log output (no exception is raised, although note the final ERROR line):

INFO:pytorch_transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt not found in cache, downloading to C:\Users\User\AppData\Local\Temp\tmp7ywr01h0
100%|███████████████████████████████████████████████████████████████████████| 213450/213450 [00:00<00:00, 223147.77B/s]
INFO:pytorch_transformers.file_utils:copying C:\Users\User\AppData\Local\Temp\tmp7ywr01h0 to cache at outputs/yelp/vocab.txt\5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
ERROR:pytorch_transformers.tokenization_utils:Couldn't reach server to download vocabulary.

In this form the function does not seem to understand that I am trying to load vocab.txt from disk.

The signature for the function in pytorch_transformers is:

Signature:
BertTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    *inputs,
    **kwargs,
)
Docstring:
Instantiate a BertTokenizer from pre-trained vocabulary files.
        
Source:   
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
        """ Instantiate a BertTokenizer from pre-trained vocabulary files.
        """
        if pretrained_model_name_or_path in PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES:
            if '-cased' in pretrained_model_name_or_path and kwargs.get('do_lower_case', True):
                logger.warning("The pre-trained model you are loading is a cased model but you have not set "
                               "`do_lower_case` to False. We are setting `do_lower_case=False` for you but "
                               "you may want to check this behavior.")
                kwargs['do_lower_case'] = False
            elif '-cased' not in pretrained_model_name_or_path and not kwargs.get('do_lower_case', True):
                logger.warning("The pre-trained model you are loading is an uncased model but you have set "
                               "`do_lower_case` to False. We are setting `do_lower_case=True` for you "
                               "but you may want to check this behavior.")
                kwargs['do_lower_case'] = True

        return super(BertTokenizer, cls)._from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File:      g:\anaconda3\lib\site-packages\pytorch_transformers\tokenization_bert.py
Type:      method

The updated library does have some differences (see here). When working with the updated version, I just let it download the vocab file since it's a tiny file. E.g.:

tokenizer = tokenizer_class.from_pretrained('bert-base-cased')

Loading the model is exactly the same procedure:
config = config_class.from_pretrained('bert-base-cased')
model = model_class.from_pretrained('bert-base-cased')

The classes are BertConfig, BertForSequenceClassification, and BertTokenizer.
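
If you want to set the number of labels explicitly, something along these lines should work for a binary task (the num_labels and config= keyword arguments are how the pytorch-transformers examples do it, but treat this as a sketch and check it against your installed version):

config = BertConfig.from_pretrained('bert-base-cased', num_labels=2)  # 2 labels for binary classification
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
model = BertForSequenceClassification.from_pretrained('bert-base-cased', config=config)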

You will also need to modify your training code. To get the loss from the model output in the new library, you'll need something like this.

outputs = model(**inputs)
loss = outputs[0] # model outputs are always tuples in pytorch-transformers (see doc)
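
To be explicit about what's in that tuple (worth double-checking against the docs for your version): when you pass labels, the first element is the loss and the second is the logits; without labels, the logits come first. The input_ids, attention_mask, and labels below are placeholders for your own tensors.

outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss, logits = outputs[:2]  # with labels: (loss, logits, ...); without labels: (logits, ...)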

When loading a fine-tuned model, you don't need to compress it into a tar.gz archive. You can simply provide the path to the directory containing config.json and pytorch_model.bin.
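
Putting that together, reloading everything from the output directory should look roughly like this (assuming OUTPUT_DIR contains config.json, pytorch_model.bin, and the saved vocab.txt; passing the directory rather than the vocab.txt path itself should also sidestep the JSON error above, although I haven't verified that on Windows):

config = BertConfig.from_pretrained(OUTPUT_DIR)
tokenizer = BertTokenizer.from_pretrained(OUTPUT_DIR, do_lower_case=False)
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, config=config)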

Thanks @ThilinaRajapakse - I noticed that the vocab.txt exported by tokenizer.save_vocabulary was identical to the bert-base-cased-vocab.txt file, so just using the default makes sense. But I was surprised; I had thought the vocab would have expanded to include new words found in the Yelp data. I guess this is where BERT differs in how it handles out-of-vocabulary words... But if you wanted to load a customized vocab, how would you do it?

For loading the model, yes, I can see that just supplying the directory is all that's required - it was the vocab loading that I was hung up on...
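
(My untested guess for the customized-vocab case: save the vocab into the output directory and then point from_pretrained at the directory rather than at the vocab.txt file itself, e.g.:)

tokenizer.save_vocabulary(OUTPUT_DIR)  # writes OUTPUT_DIR/vocab.txt
tokenizer = BertTokenizer.from_pretrained(OUTPUT_DIR, do_lower_case=False)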

Regarding training (and for reference for any readers), the other difference is that AdamW must be used instead of BertAdam, in conjunction with a scheduler:

Preparing the optimizer and scheduler

from pytorch_transformers import AdamW, WarmupLinearSchedule

num_warmup_steps = int(WARMUP_PROPORTION * num_train_optimization_steps)
optimizer = AdamW(optimizer_grouped_parameters,
                  lr=LEARNING_RATE,
                  correct_bias=False)  # to reproduce BertAdam-specific behavior, set correct_bias=False
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_train_optimization_steps)

In the training loop, scheduler.step() comes after optimizer.step() if using PyTorch 1.1.0 or above, otherwise before it - see https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
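
A rough sketch of that ordering inside the loop (assuming PyTorch 1.1.0+, that train_dataloader yields batches of (input_ids, attention_mask, token_type_ids, labels), and that model, device, optimizer, and scheduler are set up as above):

model.train()
for step, batch in enumerate(train_dataloader):
    batch = tuple(t.to(device) for t in batch)
    inputs = {'input_ids':      batch[0],
              'attention_mask': batch[1],
              'token_type_ids': batch[2],
              'labels':         batch[3]}
    outputs = model(**inputs)
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    scheduler.step()       # after optimizer.step() on PyTorch 1.1.0+
    optimizer.zero_grad()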

There's a vocabulary class as well that you can use just like the config and the model classes. I think it's BertVocabulary or something similar. I'll check it later and edit this with the correct name.

I do have a working notebook from my research that uses the updated library. I could adapt it to the Yelp dataset, but I was unsure whether adding it to this repo would help or whether it would confuse people even more. Maybe I'll write a separate article and a repo for it.

Thanks. Because I've only just come to evaluating BERT, I haven't had any experience with the previous library, so I find it confusing to work with approaches that use it! So, for me at least, a version that uses the latest approach would be welcome :)