cased issue on Huggingface transformers tokenizer
monologg opened this issue · 4 comments
Hi:) I was using the scibert_scivocab_cased model with the Huggingface library, and I've found that AutoTokenizer doesn't set the do_lower_case option to False automatically.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
>>> tokenizer.tokenize("Hello World")
['hel', '##lo', 'world']
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased', do_lower_case=False)
>>> tokenizer.tokenize("Hello World")
['Hel', '##lo', 'World']
For AutoTokenizer or BertTokenizer to set do_lower_case=False automatically, it seems that a tokenizer_config.json file should also be uploaded to the model directory (reference from a Transformers library issue). The file should be written as below.
{"do_lower_case": false}
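As a quick pure-Python sanity check (no transformers required), the JSON boolean false in this file deserializes to Python False, which is the value from_pretrained ends up merging into the tokenizer's init kwargs:

```python
import json

# The tokenizer_config.json contents suggested above
config_text = '{"do_lower_case": false}'

# from_pretrained reads this file and merges its keys into the
# tokenizer's init kwargs; JSON false becomes Python False
init_kwargs = json.loads(config_text)
print(init_kwargs["do_lower_case"])  # False
```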
Alternatively, the method below (tokenizer.save_pretrained) will generate tokenizer_config.json (and also special_tokens_map.json).
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_cased', do_lower_case=False)
>>> tokenizer.basic_tokenizer.do_lower_case
False
>>> tokenizer.save_pretrained('./dir_to_save')
>>> tokenizer = BertTokenizer.from_pretrained('./dir_to_save')
>>> tokenizer.basic_tokenizer.do_lower_case
False
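A minimal sketch of that round trip, with hypothetical simplified file handling rather than the real transformers implementation: saving persists do_lower_case into tokenizer_config.json (and writes special_tokens_map.json), and loading reads it back.

```python
import json
import os
import tempfile

def save_pretrained_sketch(do_lower_case, dirname):
    # Persist the tokenizer setting, as save_pretrained does
    with open(os.path.join(dirname, "tokenizer_config.json"), "w") as f:
        json.dump({"do_lower_case": do_lower_case}, f)
    # save_pretrained also writes the special tokens map
    with open(os.path.join(dirname, "special_tokens_map.json"), "w") as f:
        json.dump({"unk_token": "[UNK]", "pad_token": "[PAD]"}, f)

def from_pretrained_sketch(dirname):
    # Read the persisted setting back, as from_pretrained does
    with open(os.path.join(dirname, "tokenizer_config.json")) as f:
        return json.load(f)["do_lower_case"]

save_dir = tempfile.mkdtemp()
save_pretrained_sketch(False, save_dir)
print(from_pretrained_sketch(save_dir))  # False
```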
Can you please check this issue? Thank you for sharing the model:)
Hey @monologg, can you try using allenai/scibert_scivocab_uncased? These two models actually have different vocabularies/weights, so it's not just a matter of a different tokenizer setting.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
>>> tokenizer.basic_tokenizer.do_lower_case
True
>>> tokenizer.tokenize("Hello World")
['hell', '##o', 'world']
# Forcing uncased model tokenizer not to lowercase the sentence
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', do_lower_case=False)
>>> tokenizer.tokenize("Hello World")
['[UNK]', '[UNK]']
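The [UNK] output is expected: the uncased vocabulary only contains lowercase pieces, so a capitalized token never matches. A toy greedy longest-match WordPiece sketch (illustrative vocab, not SciBERT's real one) shows the effect:

```python
def wordpiece(token, vocab, unk="[UNK]"):
    # Greedy longest-match-first WordPiece, as used in BERT tokenizers
    pieces, start = [], 0
    while start < len(token):
        end, cur = len(token), None
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are ##-prefixed
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matches -> whole token is unknown
        pieces.append(cur)
        start = end
    return pieces

# Toy uncased vocab: only lowercase pieces exist
vocab = {"hell", "##o", "world"}
print(wordpiece("hello", vocab))  # ['hell', '##o']
print(wordpiece("Hello", vocab))  # ['[UNK]'] -- 'H' never matches
```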
It seems that unless the argument is given explicitly, or the model is listed in PRETRAINED_INIT_CONFIGURATION in tokenization_bert.py, BertTokenizer sets do_lower_case to True, which is the default value.
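A hedged sketch of that resolution order (the function and table contents here are hypothetical, modeled on the PRETRAINED_INIT_CONFIGURATION table in tokenization_bert.py): explicit user kwargs override the per-model table, which overrides the class default of True, so a model absent from the table silently falls back to lowercasing.

```python
# Hypothetical per-model table; SciBERT is not listed in the real one
PRETRAINED_INIT_CONFIGURATION = {
    "bert-base-cased": {"do_lower_case": False},
}

def resolve_init_kwargs(model_name, **user_kwargs):
    # Class default: do_lower_case=True
    kwargs = {"do_lower_case": True}
    # Per-model overrides from the hardcoded table, if present
    kwargs.update(PRETRAINED_INIT_CONFIGURATION.get(model_name, {}))
    # Explicit user arguments win
    kwargs.update(user_kwargs)
    return kwargs

print(resolve_init_kwargs("allenai/scibert_scivocab_cased"))
# {'do_lower_case': True} -- not in the table, so the default wins
print(resolve_init_kwargs("allenai/scibert_scivocab_cased", do_lower_case=False))
# {'do_lower_case': False} -- the explicit argument fixes it
```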
I ran into the same problem, and it ruined 6 days of pre-training, because I wrongly assumed that do_lower_case=False would be set given that I was using the cased version of SciBERT...
I've opened a PR on Huggingface to solve this issue; please have a look: https://huggingface.co/allenai/scibert_scivocab_cased/discussions/3