more stopwords support
Closed this issue · 2 comments
TatsuyaShirakawa commented
Thank you for providing a cool library!
I'm trying to apply pke to a Japanese corpus, but I couldn't find how to handle Japanese stopwords.
Is there any good way to handle languages for which NLTK does not provide stopwords?
Thanks.
ygorg commented
Hi, thanks for the feedback! And sorry for the slow reply!
Generally you can provide stopwords via the `stoplist` parameter of every function in `utils` (except for `utils.compute_lda_model`) and in `candidate_filtering`.
I assume you are using `candidate_selection`. Depending on the extractor you use, stopwords might not even be used.
So instead of:
```python
extractor = MultipartiteRank()
extractor.load_document(doc)
extractor.candidate_selection()
```
you can do:
```python
with open('my_stoplist') as f:
    my_stoplist = [l.lower().strip() for l in f]

extractor = MultipartiteRank()
extractor.load_document(doc)
extractor.longest_pos_sequence_selection(['NOUN', 'ADJ', 'ADV'])
# or any other candidate selection function, e.g. extractor.ngram_selection(3)
extractor.candidate_filtering(stoplist=my_stoplist)
# you might need to change the parameters of this function to account
# for Japanese word lengths
```
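For reference, the stoplist-loading step above can be sketched without pke installed. This is a minimal illustration, not pke's actual filtering logic: the file is simulated with `io.StringIO`, the example Japanese words are made up, and the simple "drop any candidate containing a stopword" check is an assumption standing in for what `candidate_filtering` does internally.

```python
import io

# Simulate a one-word-per-line stoplist file (as opened in the snippet above).
stoplist_file = io.StringIO("これ\nそれ\nあれ\n")

# Same normalization as in the comment: lowercase and strip each line.
# (lower() is a no-op for Japanese script but harmless.)
my_stoplist = [line.lower().strip() for line in stoplist_file]

# Hypothetical candidates, each a list of words, to show the effect of filtering.
candidates = [["自然", "言語"], ["これ"], ["処理"]]

# Assumed filtering rule for illustration: discard candidates that contain a stopword.
filtered = [c for c in candidates if not any(w in my_stoplist for w in c)]
print(filtered)  # [['自然', '言語'], ['処理']]
```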
TatsuyaShirakawa commented
Hi @ygorg , thank you for your help! I'll try it.
Again, thank you for your cool lib.