Is there a way to manage hyphenated compound words as a whole?
fbottazzoli opened this issue · 3 comments
Hello,
I'm using PositionRank to extract keyphrases from a set of sentences that often contain hyphenated compound words (e.g. state-of-the-art, business-as-usual, end-of-life). I would like them to be handled as a whole, but they are treated as separate words (I think this is because the - character is part of string.punctuation and is therefore removed).
Is there a way to consider them as a single word?
Thank you
Francesca
Hi @fbottazzoli,
This is an issue with the spacy tokenizer, which by default splits on hyphens between letters (see https://spacy.io/usage/linguistic-features#native-tokenizer-additions).
I believe I recently resolved this by modifying the tokenizer behavior in pke (in readers.py, commit a262f98):
>>> import pke
>>> extractor = pke.unsupervised.TopicRank()
>>> extractor.load_document(input='BERT is a state-of-the-art model.', language='en')
>>> extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")
>>> print(extractor.candidates.keys())
dict_keys(['bert', 'state-of-the-art model'])
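To see the difference in isolation, here is a minimal standard-library sketch. It is not the code from readers.py, nor spacy's actual tokenizer rules: the two regular expressions and the `tokenize` helper are hypothetical and only contrast splitting on hyphens between letters with keeping hyphenated compounds as one token.

```python
import re

# Hypothetical patterns for illustration only; these are NOT spacy's or pke's rules.
# Treats every hyphen as its own token, so compounds are broken apart.
SPLIT_ON_HYPHENS = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")
# Keeps alphanumeric runs joined by single hyphens together as one token.
KEEP_HYPHENS = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text, keep_hyphenated=True):
    """Return the tokens of `text` using one of the two illustrative patterns."""
    pattern = KEEP_HYPHENS if keep_hyphenated else SPLIT_ON_HYPHENS
    return pattern.findall(text)


sentence = "BERT is a state-of-the-art model."
split_tokens = tokenize(sentence, keep_hyphenated=False)
joined_tokens = tokenize(sentence, keep_hyphenated=True)

print(split_tokens)   # 'state', '-', 'of', '-', 'the', '-', 'art' appear separately
print(joined_tokens)  # 'state-of-the-art' appears as a single token
```

With the second pattern the compound survives tokenization, so a noun-phrase grammar like the one above can pick up 'state-of-the-art model' as one candidate.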
Can you please update pke and tell me if it works for you?
Best,
f.
Hello @boudinfl, yes, it works!
Thank you
Francesca
Great,
I am closing this issue then.
f.