
Why does YAKE miss the COVID-19 keyword in its output?

gvalchca opened this issue · 4 comments

Why would YAKE not return COVID-19 in any of the keywords for the following example:

occupational stress and mental health among anesthetists during the COVID-19 pandemic.

With default parameters, the output looks like this:

pandemic 0.04491197687864554
occupational stress 0.04940384002065631
stress and mental 0.09700399286574239
mental health 0.09700399286574239
health among anesthetists 0.09700399286574239
occupational 0.15831692877998726
stress 0.29736558256021506
mental 0.29736558256021506
health 0.29736558256021506
anesthetists 0.29736558256021506

Hi @gvalchca
This is currently a limitation. YAKE does not handle tokens containing special characters such as - or digits well enough.

In this case I would recommend normalising all COVID-19 mentions to simply COVID and it will work just fine.
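
A minimal sketch of that normalisation step, using only the standard library. The helper name `normalise_covid` and the exact regular expression are assumptions made for illustration; they are not part of YAKE. The commented-out lines show how the cleaned text could then be passed to YAKE, assuming the package is installed.

```python
import re

# Hypothetical pattern: matches "COVID-19", "covid 19" and "Covid19" as whole tokens.
_COVID_PATTERN = re.compile(r"\bCOVID[-\s]?19\b", flags=re.IGNORECASE)


def normalise_covid(text: str) -> str:
    """Replace every COVID-19 variant with the plain token 'COVID'."""
    return _COVID_PATTERN.sub("COVID", text)


text = (
    "occupational stress and mental health among anesthetists "
    "during the COVID-19 pandemic."
)
normalised = normalise_covid(text)
print(normalised)

# Optional, requires `pip install yake`:
# import yake
# print(yake.KeywordExtractor().extract_keywords(normalised))
```

Normalising before extraction keeps the workaround outside the library, so it does not depend on YAKE internals. It only covers the variants the pattern lists.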

Ideally we should improve the algorithm to manage this better. If you have ideas, please send us a PR :)

Related issue and explanation by @rncampos here

Hi @gvalchca
This is not exposed by the public API, but you could experiment with DataCore's tagsToDiscard parameter. By default it discards digits.

Further explanation can be found here

Hey, thanks for your answer, and sorry for duplicating the thread. However, that solution would not work for me, because in biology/medicine there are plenty of abbreviations with meaningful numbers (e.g. IL2, IL6).