huggingface/tokenizers

Treatment of hyphenated words

rattle99 opened this issue · 1 comment

It seems that huggingface tokenizers treats the parts of a hyphenated word as separate words, with the hyphen itself counted as its own word, as reflected in the output of the word_ids() function.

For example, in the sentence

'To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth .'

Using

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

text = "To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth ."

text_tokenized = tokenizer(text, padding='longest', truncation=True, return_tensors="pt", is_split_into_words=False)
print(text_tokenized.word_ids())

returns
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, None]

but changing two-week to twoweek changes it to
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, None]
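Counting the distinct word ids in the two outputs above makes the difference concrete: the hyphenated sentence has two more words than the unhyphenated one, consistent with "two-week" being treated as three words ("two", "-", "week") while "twoweek" is one. A small sketch over the reported outputs:

```python
# word_ids() output reported above for the hyphenated sentence ("two-week")
with_hyphen = [None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
               15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, None]
# and for the unhyphenated variant ("twoweek")
without_hyphen = [None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                  15, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, None]

def word_count(ids):
    # None marks special tokens ([CLS]/[SEP]); everything else is a word index
    return len({i for i in ids if i is not None})

print(word_count(with_hyphen), word_count(without_hyphen))  # 28 26
```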

This behavior likely applies to punctuation symbols other than the hyphen as well, which is worth keeping in mind when aligning word-level labels for NER tasks. Or perhaps this is by design?
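A sketch of where this comes from, assuming distilbert-base-uncased uses BERT-style pre-tokenization (as BERT-family fast tokenizers do): the pre-tokenizer splits on whitespace and isolates every punctuation character into its own piece before any subword tokenization runs, so "two-week" is already three words by the time word ids are assigned. This is visible directly with the tokenizers library:

```python
from tokenizers.pre_tokenizers import BertPreTokenizer

# BERT-style pre-tokenization splits on whitespace and isolates punctuation,
# so the hyphen becomes its own "word" before any subword merging happens.
pre = BertPreTokenizer()
pieces = pre.pre_tokenize_str("a two-week period")
print([word for word, span in pieces])  # ['a', 'two', '-', 'week', 'period']
```

So this is the pre-tokenizer's punctuation-splitting at work rather than anything specific to word_ids() itself.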

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.