TimSchopf/KeyphraseVectorizers

use of custom stop words

gboyega1 opened this issue · 2 comments

When I use KeyphraseCountVectorizer with the KeyBERT library, passing a list of custom stop words doesn't appear to have any effect.

Without a custom stop word list:

vectorizer = KeyphraseCountVectorizer()
kw_model.extract_keywords(strip_html(course[2]), vectorizer=vectorizer, top_n=15, use_mmr=True, diversity=0.45)

Output:

[('hrm msc students', 0.5963), ('human resources', 0.5466), ('organisational development', 0.505), ('student experience', 0.4352), ('business school', 0.4336), ('people profession', 0.4273), ('pwc staff', 0.416), ('london offices', 0.4049), ('research leaders', 0.3931), ('professional stream skills workshop satisfy requirements', 0.3907), ('quality education', 0.3792), ('cipd accreditation', 0.3522), ('dissertation', 0.3428), ('relevant programmes', 0.3143), ('edge practice', 0.2627)]

With a custom stop word list that includes 'msc':

# stpwrds is my custom stop word list; it includes 'msc'
vectorizer = KeyphraseCountVectorizer(stop_words=stpwrds)
kw_model.extract_keywords(strip_html(course[2]), vectorizer=vectorizer, top_n=15, use_mmr=True, diversity=0.45)

The output contains the same keyphrases with identical scores:

[('hrm msc students', 0.5963), ('human resources', 0.5466), ('organisational development', 0.505), ('student experience', 0.4352), ('business school', 0.4336), ('people profession', 0.4273), ('pwc staff', 0.416), ('london offices', 0.4049), ('research leaders', 0.3931), ('professional stream skills workshop satisfy requirements', 0.3907), ('quality education', 0.3792), ('cipd accreditation', 0.3522), ('dissertation', 0.3428), ('relevant programmes', 0.3143), ('edge practice', 0.2627)]

Note that 'hrm msc students' is still included even though 'msc' is in the stop word list.

Any help with this would be greatly appreciated.

I have the same issue. It does not seem to be removing the custom stop words.

Solved with the v0.0.12 release.
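
For anyone who can't upgrade yet, one possible workaround is to filter the list returned by extract_keywords afterwards. The snippet below is only a sketch in plain Python, not how the library handles stop words internally. The helper name drop_keyphrases_with_stop_words is made up for illustration, and it assumes you want to discard any keyphrase that contains a stop word as a whole whitespace-separated token. The sample pairs are copied from the output reported above.

```python
from typing import Iterable, List, Tuple

Keyword = Tuple[str, float]


def drop_keyphrases_with_stop_words(
    keywords: Iterable[Keyword], stop_words: Iterable[str]
) -> List[Keyword]:
    """Keep only keyphrases that contain none of the given stop words.

    Hypothetical helper, not part of KeyBERT or KeyphraseVectorizers.
    Matching is case-insensitive and works on whitespace-separated tokens.
    """
    blocked = {word.lower() for word in stop_words}
    return [
        (phrase, score)
        for phrase, score in keywords
        if blocked.isdisjoint(phrase.lower().split())
    ]


# A few (keyphrase, score) pairs copied from the output reported above.
keywords = [
    ("hrm msc students", 0.5963),
    ("human resources", 0.5466),
    ("organisational development", 0.505),
]

filtered = drop_keyphrases_with_stop_words(keywords, ["msc"])
print(filtered)
```

Because this removes phrases after ranking, you may end up with fewer than top_n results, so requesting a slightly larger top_n before filtering may help.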