huggingface/tokenizers

Treatment of hyphenated words

rattle99 opened this issue · 1 comment

It seems that huggingface tokenizers treats the parts of a hyphenated word as separate words, with the hyphen itself counted as its own word, as reflected in the output of the word_ids() function.

For example, in the sentence

'To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth .'

Using

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

text = "To win the money , SpaceShipOne had to blast off into space twice in a two-week period and fly at least 100 kilometers above Earth ."

text_tokenized = tokenizer(text, padding='longest', truncation=True, return_tensors="pt", is_split_into_words=False)
print(text_tokenized.word_ids())

returns
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, None]

but changing two-week to twoweek changes it to
[None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, None]
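Counting the distinct word ids in the two outputs above makes the difference concrete: the hyphenated sentence has two more words than the unhyphenated one, consistent with "two-week" being treated as three words ("two", "-", "week") while "twoweek" is one. A small sketch over the reported outputs:

```python
# word_ids() output reported above for the hyphenated sentence ("two-week")
with_hyphen = [None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
               15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, None]
# and for the unhyphenated variant ("twoweek")
without_hyphen = [None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                  15, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, None]

def word_count(ids):
    # None marks special tokens ([CLS]/[SEP]); everything else is a word index
    return len({i for i in ids if i is not None})

print(word_count(with_hyphen), word_count(without_hyphen))  # 28 26
```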

This behavior likely applies to punctuation symbols other than the hyphen as well, which is worth keeping in mind when aligning word-level labels for NER tasks. Or perhaps this is by design?
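A sketch of where this comes from, assuming distilbert-base-uncased uses BERT-style pre-tokenization (as BERT-family fast tokenizers do): the pre-tokenizer splits on whitespace and isolates every punctuation character into its own piece before any subword tokenization runs, so "two-week" is already three words by the time word ids are assigned. This is visible directly with the tokenizers library:

```python
from tokenizers.pre_tokenizers import BertPreTokenizer

# BERT-style pre-tokenization splits on whitespace and isolates punctuation,
# so the hyphen becomes its own "word" before any subword merging happens.
pre = BertPreTokenizer()
pieces = pre.pre_tokenize_str("a two-week period")
print([word for word, span in pieces])  # ['a', 'two', '-', 'week', 'period']
```

So this is the pre-tokenizer's punctuation-splitting at work rather than anything specific to word_ids() itself.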

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.