Deduplication threshold changes the order of the response tuples
josemarcosrf opened this issue · 2 comments
josemarcosrf commented
I've noticed the following behavior of the .extract_keywords function:
When using a deduplication threshold (dedupLim) lower than 1, the response tuples are of the form (word, score), e.g.:
('non-profit', 0.18087033619667015)
('social', 0.21178928326651927)
('media', 0.21178928326651927)
('handle', 0.28189161752425324)
However, when it is equal to or greater than 1, the form becomes (score, word):
(0.18087033619667015, 'non-profit')
(0.21178928326651927, 'social')
(0.21178928326651927, 'media')
(0.28189161752425324, 'handle')
Below is the sample code that produces the above output:
import yake
text = 'I handle social media for a non-profit. Should I start going to social media networking events? Are there any good ones in the bay area?'
kw_extractor = yake.KeywordExtractor(lan="en", n=1, dedupLim=1, top=4, features=None)
keywords = kw_extractor.extract_keywords(text)
for kw in keywords:
    print(kw)
The issue seems to stem from the difference between these two lines: yake.py#L71 and yake.py#L85
Happy to submit a PR to fix it if that would be of any help.
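Until the order is consistent, a caller-side workaround could look like the sketch below. It is not part of yake: `normalize_keywords` is a hypothetical helper name, and it assumes each returned tuple holds exactly one string (the keyword) and one number (the score), as in the outputs shown above.

```python
def normalize_keywords(pairs):
    """Return every pair as (word, score), whichever order it arrived in.

    Hypothetical helper, not a yake function. Assumes each pair contains
    exactly one str (the keyword) and one numeric score.
    """
    normalized = []
    for first, second in pairs:
        if isinstance(first, str):
            # Already in (word, score) order.
            normalized.append((first, second))
        else:
            # Arrived as (score, word); swap the two elements.
            normalized.append((second, first))
    return normalized


# Sample values copied from the two outputs reported in this issue.
swapped = [(0.18087033619667015, 'non-profit'), (0.21178928326651927, 'social')]
expected = [('non-profit', 0.18087033619667015), ('social', 0.21178928326651927)]

for word, score in normalize_keywords(swapped):
    print(word, score)
```

Checking the element type rather than the dedupLim value keeps the helper independent of which code path produced the tuples, so it should keep working once the library returns a single order.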
josemarcosrf commented
arianpasquali commented
Thank you for the help @jmrf. We included Jake's PR.