Deduplication threshold changes the order of the response tuples

Question

josemarcosrf opened this issue 2 years ago · 2 comments

I've noticed the following behavior of the .extract_keywords function:

When using a deduplication threshold (dedupLim) lower than 1, the response tuples are of the form (word, score). For example:

('non-profit', 0.18087033619667015)
('social', 0.21178928326651927)
('media', 0.21178928326651927)
('handle', 0.28189161752425324)

However, when dedupLim is equal to or greater than 1, the tuples are of the form (score, word) instead:

(0.18087033619667015, 'non-profit')
(0.21178928326651927, 'social')
(0.21178928326651927, 'media')
(0.28189161752425324, 'handle')

Below is the sample code that produces the second output above (with dedupLim=1):

import yake

text = 'I handle social media for a non-profit. Should I start going to social media networking events? Are there any good ones in the bay area?'

kw_extractor = yake.KeywordExtractor(lan="en", n=1, dedupLim=1, top=4, features=None)
keywords = kw_extractor.extract_keywords(text)
for kw in keywords:
    print(kw)

The issue seems to stem from the difference between these two lines: yake.py#L71 and yake.py#L85

Happy to submit a PR to fix it if that would be of any help.
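
Until a fix lands, one possible workaround is to normalize the tuples on the caller's side. The sketch below is only an illustration: normalize_keywords is a hypothetical helper, not part of yake, and it assumes every keyword is a str and every score is a real number, so the types alone reveal which order a tuple is in.

```python
from numbers import Real


def normalize_keywords(pairs):
    """Return (word, score) tuples regardless of the order used in the input.

    Hypothetical helper, not part of yake. Assumes keywords are strings
    and scores are real numbers.
    """
    normalized = []
    for first, second in pairs:
        if isinstance(first, str) and isinstance(second, Real):
            # Already in (word, score) order.
            normalized.append((first, second))
        elif isinstance(first, Real) and isinstance(second, str):
            # In (score, word) order, so swap the elements.
            normalized.append((second, first))
        else:
            raise TypeError(f"unexpected keyword tuple: {(first, second)!r}")
    return normalized


# Sample tuples copied from the two outputs reported above.
word_first = [("non-profit", 0.18087033619667015), ("social", 0.21178928326651927)]
score_first = [(0.18087033619667015, "non-profit"), (0.21178928326651927, "social")]

for word, score in normalize_keywords(score_first):
    print(word, score)
```

With this in place, the result of extract_keywords could be passed through normalize_keywords before use, so downstream code can always unpack (word, score) whatever the dedupLim value is.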

Answer 1 · 2022-10-17T19:50:26.000Z

For reference: commit 16698b9 and PR #66 solve the issue.

Answer 2 · 2022-10-19T21:12:13.000Z

Thank you for the help @jmrf. We included Jake's PR.