use of custom stop words
gboyega1 opened this issue · 2 comments
When I use KeyphraseCountVectorizer with the KeyBERT library, passing a list of custom stop words doesn't appear to have any effect.
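Without a custom stop word list:
from keybert import KeyBERT  # kw_model below is assumed to be a KeyBERT() instance
from keyphrase_vectorizers import KeyphraseCountVectorizer
vectorizer = KeyphraseCountVectorizer()
kw_model.extract_keywords(strip_html(course[2]), vectorizer=vectorizer, top_n=15, use_mmr=True, diversity=0.45)
Output: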
[('hrm msc students', 0.5963), ('human resources', 0.5466), ('organisational development', 0.505), ('student experience', 0.4352), ('business school', 0.4336), ('people profession', 0.4273), ('pwc staff', 0.416), ('london offices', 0.4049), ('research leaders', 0.3931), ('professional stream skills workshop satisfy requirements', 0.3907), ('quality education', 0.3792), ('cipd accreditation', 0.3522), ('dissertation', 0.3428), ('relevant programmes', 0.3143), ('edge practice', 0.2627)]
With a custom stop word list (stpwrds) that includes 'msc':
vectorizer = KeyphraseCountVectorizer(stop_words=stpwrds)
kw_model.extract_keywords(strip_html(course[2]), vectorizer=vectorizer, top_n=15, use_mmr=True, diversity=0.45)
The output contains the same keyphrases with identical scores:
[('hrm msc students', 0.5963), ('human resources', 0.5466), ('organisational development', 0.505), ('student experience', 0.4352), ('business school', 0.4336), ('people profession', 0.4273), ('pwc staff', 0.416), ('london offices', 0.4049), ('research leaders', 0.3931), ('professional stream skills workshop satisfy requirements', 0.3907), ('quality education', 0.3792), ('cipd accreditation', 0.3522), ('dissertation', 0.3428), ('relevant programmes', 0.3143), ('edge practice', 0.2627)]
Note that 'hrm msc students' is still included even though 'msc' is in the stop word list.
Any help with this would be greatly appreciated.
I have the same issue. It does not seem to be removing the custom stop words.
Solved with the v0.0.12 release.
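For anyone who cannot upgrade yet, one possible workaround is to filter the extracted keyphrases after the fact. The sketch below is plain Python and does not touch the library: `drop_stop_word_phrases` is a hypothetical helper, the contents of `stpwrds` are assumed, and the sample pairs are copied from the output reported above.

```python
def drop_stop_word_phrases(keywords, stop_words):
    """Remove (keyphrase, score) pairs whose keyphrase contains a stop word."""
    blocked = {word.lower() for word in stop_words}
    return [
        (phrase, score)
        for phrase, score in keywords
        # Split on whitespace so only whole-word matches are dropped.
        if not blocked.intersection(phrase.lower().split())
    ]


# A few pairs copied from the output reported in this issue.
keywords = [
    ("hrm msc students", 0.5963),
    ("human resources", 0.5466),
    ("organisational development", 0.505),
]
stpwrds = ["msc"]  # assumed contents of the custom stop word list

filtered = drop_stop_word_phrases(keywords, stpwrds)
print(filtered)
# [('human resources', 0.5466), ('organisational development', 0.505)]
```

Because the filtering happens after extraction, the result can contain fewer than `top_n` keyphrases, so it may help to request a few extra and truncate afterwards.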