Is there a way to manage hyphenated compound words as a whole?

Question

fbottazzoli opened this issue 3 years ago · 3 comments

Hello,
I'm using PositionRank to extract keyphrases from a set of sentences that often contain hyphenated compound words (e.g. state-of-the-art, business-as-usual, end-of-life). I would like them to be handled as a whole, but they are treated as separate words (I think this is because the - character is part of string.punctuation and is therefore removed).
Is there a way to consider them as a single word?
Thank you
Francesca

Answer 1 · 2022-05-24T14:04:53.000Z

Hi @fbottazzoli,

This comes from the spaCy tokenizer, which by default splits on hyphens between letters (see https://spacy.io/usage/linguistic-features#native-tokenizer-additions).
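
To make the effect concrete, here is a minimal standard-library sketch that contrasts splitting on hyphens between letters with keeping the compound as one token. It is only an illustration: the regular expressions and the `tokenize` helper are hypothetical, and this is neither spaCy's tokenizer nor the code used in pke.

```python
import re

# Hypothetical patterns for illustration only (not spaCy or pke code).
# Splits on every hyphen: words and punctuation marks become separate tokens.
SPLIT_ON_HYPHEN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")
# Keeps hyphens that sit between alphanumeric runs inside a single token.
KEEP_HYPHEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text, keep_hyphens):
    """Return the tokens of `text` under one of the two toy rules."""
    pattern = KEEP_HYPHEN if keep_hyphens else SPLIT_ON_HYPHEN
    return pattern.findall(text)


sentence = "BERT is a state-of-the-art model."
split_tokens = tokenize(sentence, keep_hyphens=False)
whole_tokens = tokenize(sentence, keep_hyphens=True)

print(split_tokens)  # 'state', '-', 'of', '-', 'the', '-', 'art' are separate
print(whole_tokens)  # 'state-of-the-art' is a single token
```

With the first rule, a candidate selector working on tokens never sees the compound as one unit, which matches the behavior described in the question.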

I believe I recently resolved this by modifying the tokenizer behavior in pke (in readers.py, commit a262f98):

>>> import pke
>>> extractor = pke.unsupervised.TopicRank()
>>> extractor.load_document(input='BERT is a state-of-the-art model.', language='en')
>>> extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")
>>> print(extractor.candidates.keys())
dict_keys(['bert', 'state-of-the-art model'])

Can you please update pke and tell me if it works for you?

Best,

f.

Answer 2 · 2022-05-25T04:44:36.000Z

Hello @boudinfl, yes, it works!
Thank you
Francesca

Answer 3 · 2022-05-25T12:20:29.000Z

Great,

I am closing this issue then.

f.