boudinfl/pke

SnowballStemmer and Spacy-model use different langcodes

MarvinYork opened this issue · 1 comments

Hi, I recently started using this package and I have come across an issue.
In the code below I use the basic structure, with language detection added (my documents can be German or English) and 'stemming' enabled in the extractor.load_document() call.
The problem is that when a German document is detected (detect(filecontent) returns 'de'), 'de' is passed to load_document() and a stemming error occurs, because the SnowballStemmer lookup doesn't recognize 'de' as German (it expects 'ge' for German). Setting normalization to 'none' didn't solve the issue either.
To work around this, I tried changing 'de' to 'ge' before passing it to load_document().
But that causes a different error, since "there is no spacy-model for 'ge' language".
Ultimately my solution was to edit the lang.py file of the pke package and change the langcode for German from "ge" : "german" to "de" : "german". With this change I was able to pass 'de' to load_document() with stemming enabled and had no further issues.
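To make the mismatch concrete, here is a minimal sketch of the lookup problem and the fix, independent of pke itself. The table and function names (OLD_TABLE, FIXED_TABLE, stemmer_language) are hypothetical and are not copied from pke's lang.py; only the "ge"/"de" keys and the "german" value come from the description above.

```python
# Hypothetical lookup tables, for illustration only (not pke's actual lang.py).
OLD_TABLE = {"en": "english", "ge": "german"}    # German keyed by 'ge'
FIXED_TABLE = {"en": "english", "de": "german"}  # German keyed by ISO 639-1 'de'


def stemmer_language(code, table):
    """Return the stemmer language name for a language code.

    Raises ValueError if the code is not present in the table.
    """
    key = code.strip().lower()
    if key not in table:
        raise ValueError(f"no stemmer language for code {code!r}")
    return table[key]


# 'de' (what langdetect and spaCy use) fails against the old table ...
try:
    stemmer_language("de", OLD_TABLE)
    old_table_accepts_de = True
except ValueError:
    old_table_accepts_de = False

# ... but resolves once the table is keyed by 'de'.
fixed_name = stemmer_language("de", FIXED_TABLE)
```

With a single table keyed by the same codes that the language detector and the spaCy models use, one `language` argument can serve both the tokenizer and the stemmer.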
I hope that I described the issue clearly. If I made a programming mistake, please let me know.
Also thank you for providing this package, it is very useful :)

import pke
from langdetect import detect

# detect the language; filecontent holds the text of a .txt file that I want to extract keywords from
lang = detect(filecontent)

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be a simple
# text string and preprocessing is carried out using spacy
extractor.load_document(input=filecontent, language=lang, normalization='stemming')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. (Noun|Adj)*)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)
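
Since detect() can also return codes other than 'de' and 'en' (including region-tagged ones such as 'zh-cn'), a small guard in front of load_document() may be useful. This is only a sketch: pick_language and SUPPORTED_LANGUAGES are hypothetical names, and falling back to English is an assumption, not something pke does.

```python
# Assumption: only the two languages mentioned in this issue are supported.
SUPPORTED_LANGUAGES = ("de", "en")


def pick_language(detected, supported=SUPPORTED_LANGUAGES, default="en"):
    """Return the detected code if it is supported, otherwise the default."""
    # Normalize case and drop any region suffix, e.g. 'zh-cn' -> 'zh'.
    base = detected.strip().lower().split("-")[0]
    return base if base in supported else default


# Intended use in the script above (needs langdetect and pke installed):
#   lang = pick_language(detect(filecontent))
#   extractor.load_document(input=filecontent, language=lang,
#                           normalization='stemming')
```

The guard keeps an unexpected detection result from reaching load_document() with a code that has no matching spaCy model or stemmer.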

ygorg commented

Hi, thanks for this issue. It is linked to #215, #216 and #219.
It is now fixed in #225.