Help with BertTokenizer.from_pretrained
ChrisPalmerNZ opened this issue · 4 comments
I am having difficulty using BertTokenizer.from_pretrained. This could be related to the fact that I am using pytorch_transformers rather than pytorch_pretrained_bert (I haven't wanted to install the older library); maybe the two are functionally different here? Is this something you are happy to look at? BTW, if you would like a version of the BERT notebook reconfigured to use the new library I am happy to send it through; it's just this one issue that stumps me.
If I use the suggested default, BertTokenizer.from_pretrained(OUTPUT_DIR + 'vocab.txt', do_lower_case=False), in BERT_eval then I get a JSON decoder error. I believe it could be related to the method expecting a model name as the first parameter, since I get INFO output to this effect:
INFO:pytorch_transformers.tokenization_utils:Model name 'outputs/yelp/vocab.txt' not found in model shortcut name list
(bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,
bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking,
bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad,
bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc).
Assuming 'outputs/yelp/vocab.txt' is a path or url to a directory containing tokenizer files.
INFO:pytorch_transformers.tokenization_utils:loading file outputs/yelp/vocab.txt
INFO:pytorch_transformers.tokenization_utils:loading file outputs/yelp/vocab.txt
INFO:pytorch_transformers.tokenization_utils:loading file outputs/yelp/vocab.txt
The error:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-18-152bb8aa9c91> in <module>
1 # Load pre-trained model tokenizer (vocabulary)
----> 2 tokenizer = BertTokenizer.from_pretrained(OUTPUT_DIR + 'vocab.txt', do_lower_case=False) #'bert-base-cased',
G:\Anaconda3\lib\site-packages\pytorch_transformers\tokenization_bert.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
198 kwargs['do_lower_case'] = True
199
--> 200 return super(BertTokenizer, cls)._from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
201
202
G:\Anaconda3\lib\site-packages\pytorch_transformers\tokenization_utils.py in _from_pretrained(cls, pretrained_model_name_or_path, cache_dir, *inputs, **kwargs)
232 kwargs[args_name] = file_path
233 if special_tokens_map_file is not None:
--> 234 special_tokens_map = json.load(open(special_tokens_map_file, encoding="utf-8"))
235 for key, value in special_tokens_map.items():
236 if key not in kwargs:
G:\Anaconda3\lib\json\__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
297 cls=cls, object_hook=object_hook,
298 parse_float=parse_float, parse_int=parse_int,
--> 299 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
300
301
G:\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder
G:\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):
G:\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 2 (char 1)
Maybe it's due to something different on Windows?
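One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the three identical "loading file outputs/yelp/vocab.txt" lines suggest that, when given a path to a single file, the loader uses that same file for every tokenizer file it looks for, including the special tokens map, which it then tries to parse as JSON. The standard-library sketch below does not call pytorch_transformers at all. It only shows that running json.load on a BERT-style vocab file (hypothetical contents, first line "[PAD]") fails in the same way as the traceback.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the first few lines of a BERT vocab.txt:
# one token per line, starting with "[PAD]".
vocab_lines = ["[PAD]", "[unused1]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab_lines) + "\n")

    # Parse the plain-text vocab as if it were a JSON special tokens map.
    # "[" opens a JSON array, and the "P" that follows is not a valid value.
    try:
        with open(vocab_path, encoding="utf-8") as f:
            json.load(f)
        error_message = None
    except json.JSONDecodeError as err:
        error_message = str(err)

print(error_message)  # Expecting value: line 1 column 2 (char 1)
```

If this reading is right, the cause is not Windows-specific, and passing the directory that contains vocab.txt (rather than the file itself) would be the first thing to try.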
However, if I supply 'bert-base-cased' as my first parameter, then I get the following log output (but no exception):
INFO:pytorch_transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt not found in cache, downloading to C:\Users\User\AppData\Local\Temp\tmp7ywr01h0
100%|███████████████████████████████████████████████████████████████████████| 213450/213450 [00:00<00:00, 223147.77B/s]
INFO:pytorch_transformers.file_utils:copying C:\Users\User\AppData\Local\Temp\tmp7ywr01h0 to cache at outputs/yelp/vocab.txt\5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
ERROR:pytorch_transformers.tokenization_utils:Couldn't reach server to download vocabulary.
In this format the function does not seem to understand that I am trying to load a vocab.txt from disk.
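The cache path in the log above ("copying ... to cache at outputs/yelp/vocab.txt\...") hints that the vocab path was passed as a second positional argument and ended up bound to cache_dir. The exact call is not shown in this thread, so that is an assumption. The sketch below uses two stand-in functions (not the library's code) whose parameter lists mirror the signatures that appear in this issue, with cache_dir given a hypothetical default of None, to show how Python binds such a call.

```python
def _from_pretrained(pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):
    """Stand-in that only reports how its arguments were bound."""
    return {
        "name": pretrained_model_name_or_path,
        "cache_dir": cache_dir,
        "inputs": inputs,
        "kwargs": kwargs,
    }


def from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs):
    """Stand-in for the public method: forwards everything unchanged."""
    return _from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)


# Assumed call: model shortcut name first, vocab path second.
bound = from_pretrained("bert-base-cased", "outputs/yelp/vocab.txt", do_lower_case=False)

print(bound["cache_dir"])  # outputs/yelp/vocab.txt
print(bound["inputs"])     # ()
```

Under that assumption the vocab path never reaches the tokenizer as a vocabulary file. It only changes where the downloaded files are cached.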
The signature of the function in pytorch_transformers is this:
Signature:
BertTokenizer.from_pretrained(
pretrained_model_name_or_path,
*inputs,
**kwargs,
)
Docstring:
Instantiate a BertTokenizer from pre-trained vocabulary files.
Source:
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
    """ Instantiate a BertTokenizer from pre-trained vocabulary files.
    """
    if pretrained_model_name_or_path in PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES:
        if '-cased' in pretrained_model_name_or_path and kwargs.get('do_lower_case', True):
            logger.warning("The pre-trained model you are loading is a cased model but you have not set "
                           "`do_lower_case` to False. We are setting `do_lower_case=False` for you but "
                           "you may want to check this behavior.")
            kwargs['do_lower_case'] = False
        elif '-cased' not in pretrained_model_name_or_path and not kwargs.get('do_lower_case', True):
            logger.warning("The pre-trained model you are loading is an uncased model but you have set "
                           "`do_lower_case` to False. We are setting `do_lower_case=True` for you "
                           "but you may want to check this behavior.")
            kwargs['do_lower_case'] = True
    return super(BertTokenizer, cls)._from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File: g:\anaconda3\lib\site-packages\pytorch_transformers\tokenization_bert.py
Type: method
The updated library does have some differences (see here). When working with the updated version, I just let it download the vocab file, since it's tiny. E.g.:
tokenizer = tokenizer_class.from_pretrained('bert-base-cased')
Loading the model file is the exact same procedure.
config = config_class.from_pretrained('bert-base-cased')
model = model_class.from_pretrained('bert-base-cased')
Here config_class, model_class and tokenizer_class are BertConfig, BertForSequenceClassification and BertTokenizer, respectively.
You will also need to modify your training code. To get the loss from the model output in the new library, you'll need something like this.
outputs = model(**inputs)
loss = outputs[0] # model outputs are always tuples in pytorch-transformers (see doc)
When loading a fine-tuned model, you don't need to compress it into a tar.gz. You can simply provide the path to the directory containing config.json and pytorch_model.bin.
Thanks @ThilinaRajapakse - I noticed that the vocab.txt exported by tokenizer.save_vocabulary was identical to the bert-base-cased-vocab.txt file, so just using the default makes sense. But I was surprised: I had thought the vocab would have expanded to include new words found in the Yelp data. I guess this is where BERT might differ in how it handles out-of-vocab words... But if you wanted to load a customized vocab, how would you do it?
For loading the model, yes, I can see that supplying the directory is all that's required; it was the vocab loading that I was hung up on.
Regarding training (and for reference for any readers), the other difference is that AdamW must be used instead of BertAdam, in conjunction with a scheduler:
# Preparing the optimizer and scheduler
num_warmup_steps = int(WARMUP_PROPORTION * num_train_optimization_steps)
optimizer = AdamW(optimizer_grouped_parameters,
                  lr=LEARNING_RATE,
                  correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_train_optimization_steps)
In the training loop, scheduler.step() comes after optimizer.step() if using PyTorch 1.1.0 or above, otherwise before it; see https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
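To make the warmup arithmetic concrete, here is a standard-library sketch of a linear warmup followed by a linear decay, which is the shape WarmupLinearSchedule is understood to produce. The function warmup_linear_multiplier and the values chosen for WARMUP_PROPORTION, LEARNING_RATE and num_train_optimization_steps are illustrative assumptions, not taken from the library or the notebook.

```python
def warmup_linear_multiplier(step, warmup_steps, t_total):
    """Learning-rate multiplier: ramps 0 -> 1 over warmup, then decays 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (t_total - step) / max(1, t_total - warmup_steps))


# Illustrative values only (assumptions, not from the notebook).
WARMUP_PROPORTION = 0.1
LEARNING_RATE = 2e-5
num_train_optimization_steps = 1000

num_warmup_steps = int(WARMUP_PROPORTION * num_train_optimization_steps)

# Effective learning rate after each scheduler step.
schedule = [
    LEARNING_RATE * warmup_linear_multiplier(step, num_warmup_steps, num_train_optimization_steps)
    for step in range(num_train_optimization_steps + 1)
]

peak_step = max(range(len(schedule)), key=lambda s: schedule[s])
print(num_warmup_steps, peak_step)  # 100 100
```

The rate starts at zero, peaks at LEARNING_RATE when warmup ends, and returns to zero at the final optimization step. In a real loop on PyTorch 1.1.0 or above, each batch would call optimizer.step() and then scheduler.step(), as described above.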
There's a vocabulary class as well that you can use just like the config and the model classes. I think it's BertVocabulary or something similar. I'll check it later and edit this with the correct name.
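For the customized-vocab question, one option is to skip from_pretrained and hand the vocab file straight to the BertTokenizer constructor, which takes the vocab file path as its first argument. The sketch below is hedged accordingly: the token list and file location are hypothetical, the import is guarded so that the file-writing part runs even without pytorch_transformers installed, and the tokenizer call has not been checked against this thread's setup.

```python
import os
import tempfile

# Hypothetical custom vocabulary: one token per line, special tokens first.
custom_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "good", "food", "##ie"]

vocab_dir = tempfile.mkdtemp()
vocab_path = os.path.join(vocab_dir, "vocab.txt")
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(custom_tokens) + "\n")

try:
    from pytorch_transformers import BertTokenizer
except ImportError:
    BertTokenizer = None  # Library not installed; only the vocab file is written.

if BertTokenizer is not None:
    # Build the tokenizer directly from the file, bypassing the
    # model-shortcut lookup that from_pretrained performs.
    tokenizer = BertTokenizer(vocab_path, do_lower_case=False)
    print(tokenizer.tokenize("good foodie"))
```

Passing the directory (here vocab_dir) to BertTokenizer.from_pretrained should also work, since it looks for vocab.txt inside a directory. Note that a model fine-tuned with the stock vocabulary has no trained embeddings for tokens added this way.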
I do have a working notebook from my research that uses the updated library. I could adapt that one to the Yelp dataset, but I was unsure whether adding it to this repo would help or just confuse people even more. Maybe I'll write a separate article and a repo for it.
Thanks. Because I've only just come to evaluating BERT, I haven't had any experience with the previous class, so I find it confusing to work with approaches that use it! So, for me at least, a version that uses the latest approach would be welcome :)