allenai/scibert

cased issue on Huggingface transformers tokenizer

monologg opened this issue · 4 comments

Hi:)

I was using the scibert_scivocab_cased model with the Huggingface library, and I found that AutoTokenizer doesn't set the do_lower_case option to False automatically.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
>>> tokenizer.tokenize("Hello World")
['hel', '##lo', 'world']
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased', do_lower_case=False)
>>> tokenizer.tokenize("Hello World")
['Hel', '##lo', 'World']

For AutoTokenizer or BertTokenizer to set do_lower_case=False automatically, it seems that a tokenizer_config.json file should also be uploaded to the model directory (reference: a Transformers library issue). The file should contain the following:

{"do_lower_case": false}
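A minimal sketch of creating that file with the standard library (the directory path here is a placeholder for wherever your local copy of the model lives):

```python
import json
import os

# Placeholder path for a local copy of the model directory
model_dir = "./scibert_scivocab_cased"
os.makedirs(model_dir, exist_ok=True)

# Write the config so from_pretrained can pick up do_lower_case=False
config_path = os.path.join(model_dir, "tokenizer_config.json")
with open(config_path, "w") as f:
    json.dump({"do_lower_case": False}, f)

# Sanity check: reading it back yields the expected setting
with open(config_path) as f:
    print(json.load(f)["do_lower_case"])  # False
```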

Alternatively, the method below (tokenizer.save_pretrained) will generate tokenizer_config.json (and special_tokens_map.json as well):

>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_cased', do_lower_case=False)
>>> tokenizer.basic_tokenizer.do_lower_case
False
>>> tokenizer.save_pretrained('./dir_to_save')
>>> tokenizer = BertTokenizer.from_pretrained('./dir_to_save')
>>> tokenizer.basic_tokenizer.do_lower_case
False

Can you please check this issue? Thank you for sharing the model:)

Hey @monologg, can you try using allenai/scibert_scivocab_uncased? These two models actually have different vocabularies/weights, so it's not just a matter of different Tokenizer setting.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
>>> tokenizer.basic_tokenizer.do_lower_case
True
>>> tokenizer.tokenize("Hello World")
['hell', '##o', 'world']

# Forcing uncased model tokenizer not to lowercase the sentence
>>> tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', do_lower_case=False)
>>> tokenizer.tokenize("Hello World")
['[UNK]', '[UNK]']
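The [UNK] output above follows from how WordPiece matches sub-tokens: the uncased vocabulary contains only lowercase pieces, so an un-lowercased word matches nothing at all. A toy greedy WordPiece matcher illustrates this (the tiny vocabulary here is illustrative, not the real SciBERT vocab):

```python
# Toy greedy WordPiece lookup to illustrate the [UNK] result above.
# The tiny vocabulary is illustrative, not the real SciBERT vocab.
VOCAB = {"hell", "##o", "world", "[UNK]"}

def wordpiece(word):
    pieces, start = [], 0
    while start < len(word):
        # Greedily match the longest vocab entry from this position.
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
    return pieces

# With lowercasing, the pieces exist in the (uncased) vocab:
print(wordpiece("hello"))  # ['hell', '##o']
# Without lowercasing, "Hello" matches nothing, so the word is [UNK]:
print(wordpiece("Hello"))  # ['[UNK]']
```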

https://github.com/huggingface/transformers/blob/cf72479bf1/src/transformers/tokenization_bert.py#L163-L176

It seems that if no additional argument is given and the model is not listed in PRETRAINED_INIT_CONFIGURATION in tokenization_bert.py, BertTokenizer sets do_lower_case to True, which is the default value.
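A rough sketch of that fallback order, assuming my reading of the linked code is right (the function and the dict contents below are illustrative, not the actual transformers internals):

```python
# Rough sketch of the fallback described above; this mirrors the idea,
# not the actual transformers code path.
PRETRAINED_INIT_CONFIGURATION = {
    # Models listed here get do_lower_case set explicitly
    "bert-base-cased": {"do_lower_case": False},
}

def resolve_do_lower_case(model_name, user_kwargs):
    # 1. An explicit user argument always wins.
    if "do_lower_case" in user_kwargs:
        return user_kwargs["do_lower_case"]
    # 2. Otherwise fall back to the hard-coded per-model config.
    if model_name in PRETRAINED_INIT_CONFIGURATION:
        return PRETRAINED_INIT_CONFIGURATION[model_name]["do_lower_case"]
    # 3. SciBERT is in neither, so it ends up with the default: True.
    return True

print(resolve_do_lower_case("allenai/scibert_scivocab_cased", {}))  # True
print(resolve_do_lower_case("allenai/scibert_scivocab_cased",
                            {"do_lower_case": False}))              # False
```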

I ran into the same problem, and it ruined six days of pre-training, because I wrongly assumed that do_lower_case=False would be set, given that I was using the cased version of SciBERT...

I've opened a PR on huggingface to solve this issue, please have a look: https://huggingface.co/allenai/scibert_scivocab_cased/discussions/3