Why does YAKE miss the COVID-19 keyword in its output?
gvalchca opened this issue · 4 comments
Hi,
Why does YAKE not return COVID-19 in any of the keywords for the following example?
occupational stress and mental health among anesthetists during the COVID-19 pandemic.
With default parameters, the output looks like this:
pandemic 0.04491197687864554
occupational stress 0.04940384002065631
stress and mental 0.09700399286574239
mental health 0.09700399286574239
health among anesthetists 0.09700399286574239
occupational 0.15831692877998726
stress 0.29736558256021506
mental 0.29736558256021506
health 0.29736558256021506
anesthetists 0.29736558256021506
Hi @gvalchca
This is currently a limitation. YAKE does not handle tokens that contain special characters such as - or digits well enough.
In this case I would recommend normalising all COVID-19 mentions to simply COVID and it will work just fine.
Ideally we should improve the algorithm to manage this better. If you have ideas, please send us a PR :)
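For reference, the normalisation suggested above could look roughly like the sketch below. The regular expression and the helper names `normalise_covid` and `extract_keywords` are only illustrative assumptions, not part of YAKE; the extraction step uses the standard `yake.KeywordExtractor().extract_keywords(text)` call with default parameters.

```python
import re

# Hypothetical pattern: matches "COVID-19", "Covid 19", "covid19", etc.
# Adjust it to whatever spellings occur in your corpus.
_COVID_RE = re.compile(r"\bcovid[\s\-]?19\b", re.IGNORECASE)


def normalise_covid(text: str) -> str:
    """Replace every COVID-19 variant with the plain token COVID."""
    return _COVID_RE.sub("COVID", text)


def extract_keywords(text: str):
    """Run YAKE with default parameters on the normalised text.

    yake is imported lazily so the normalisation helper can be used
    (and tested) without the package installed.
    """
    import yake

    extractor = yake.KeywordExtractor()
    return extractor.extract_keywords(normalise_covid(text))


text = ("occupational stress and mental health among anesthetists "
        "during the COVID-19 pandemic.")
print(normalise_covid(text))
```

Calling `extract_keywords(text)` then returns YAKE's usual list of keyword and score pairs, computed on the text with COVID in place of COVID-19.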
Hey, thanks for your answer, and sorry for duplicating the thread. However, that solution would not work for me, because in biology/medicine there are plenty of abbreviations that contain meaningful numbers (e.g. IL2, IL6).
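One possible workaround that keeps the numbers, offered only as an untested idea and not as something YAKE provides: swap every token that mixes letters and digits for a letters-only placeholder before extraction, then map the placeholders back in the returned keywords. The names `protect_tokens` and `restore_tokens` and the `zzkw` prefix are made up for this sketch. It assumes YAKE treats the placeholder like an ordinary word; since the placeholders are lowercase, any scoring that depends on casing may differ from what the original abbreviation would have received.

```python
import re

# Candidate tokens: alphanumeric runs, optionally joined by single hyphens.
_TOKEN_RE = re.compile(r"\b[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\b")


def _alpha_id(n: int) -> str:
    """Encode a non-negative integer as a unique letters-only string."""
    s = ""
    while True:
        n, r = divmod(n, 26)
        s = chr(ord("a") + r) + s
        if n == 0:
            return s


def protect_tokens(text: str):
    """Replace tokens such as IL2 or COVID-19 with letters-only placeholders.

    Returns the rewritten text and a dict mapping placeholder -> original.
    """
    mapping = {}  # placeholder -> original token
    seen = {}     # original token -> placeholder

    def repl(match):
        tok = match.group(0)
        has_alpha = any(c.isalpha() for c in tok)
        has_digit = any(c.isdigit() for c in tok)
        if not (has_alpha and has_digit):
            return tok  # ordinary word or plain number: leave untouched
        if tok not in seen:
            placeholder = "zzkw" + _alpha_id(len(seen))
            seen[tok] = placeholder
            mapping[placeholder] = tok
        return seen[tok]

    return _TOKEN_RE.sub(repl, text), mapping


def restore_tokens(keyword: str, mapping: dict) -> str:
    """Map placeholders in a keyword back to the original tokens.

    Matching is case-insensitive in case the extractor changes casing.
    """
    for placeholder, tok in mapping.items():
        keyword = re.sub(r"\b" + re.escape(placeholder) + r"\b",
                         lambda _m, t=tok: t, keyword, flags=re.IGNORECASE)
    return keyword


protected, mapping = protect_tokens(
    "IL2 and IL6 levels during the COVID-19 pandemic")
print(protected)
print(mapping)
```

The protected text would be passed to YAKE, and `restore_tokens(keyword, mapping)` applied to each keyword it returns.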