boudinfl/pke

more stopwords support

Closed this issue · 2 comments

Thank you for providing a cool library!

I'm now trying to apply pke to a Japanese corpus, but I couldn't find how to handle Japanese stopwords.
Is there a good way to handle languages for which NLTK does not provide stopwords?

Thanks.

ygorg commented

Hi, thanks for the feedback, and sorry for the slow reply!
Generally, you can provide stopwords via the stoplist parameter of every function in utils (except utils.compute_lda_model) and of candidate_filtering.
I assume you are using candidate_selection. Depending on the extractor you use, stopwords might not even be used.
So instead of:

from pke.unsupervised import MultipartiteRank

extractor = MultipartiteRank()
extractor.load_document(doc)  # doc: your input document
extractor.candidate_selection()

you can do:

from pke.unsupervised import MultipartiteRank

# one stopword per line; read as UTF-8 so Japanese text is decoded correctly
with open('my_stoplist', encoding='utf-8') as f:
    my_stoplist = [line.strip().lower() for line in f if line.strip()]

extractor = MultipartiteRank()
extractor.load_document(doc)
extractor.longest_pos_sequence_selection(['NOUN', 'ADJ', 'ADV'])
# or any other candidate selection function (extractor.ngram_selection(3))
extractor.candidate_filtering(stoplist=my_stoplist)
# you might need to change the parameters of this function to account for Japanese word lengths
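
To make the last comment concrete, here is a rough sketch in plain Python of how a stoplist and length thresholds can interact with short Japanese words. It does not call pke: `load_stoplist` and `filter_candidates` are hypothetical helpers written for illustration, and the `minimum_length` and `minimum_word_size` parameters and their defaults are assumptions that only approximate this kind of filtering, so check the `candidate_filtering` signature in your installed pke version.

```python
import os
import tempfile


def load_stoplist(path, encoding="utf-8"):
    """Read one stopword per line, skipping blank lines."""
    with open(path, encoding=encoding) as f:
        return [line.strip().lower() for line in f if line.strip()]


def filter_candidates(candidates, stoplist, minimum_length=3, minimum_word_size=2):
    """Hypothetical helper (not part of pke).

    Keeps a candidate (a list of words) only if it contains no stopword,
    its joined length is at least minimum_length characters, and every
    word has at least minimum_word_size characters.
    """
    stopwords = set(stoplist)
    kept = []
    for words in candidates:
        lowered = [w.lower() for w in words]
        if not lowered:
            continue  # ignore empty candidates
        if stopwords.intersection(lowered):
            continue  # contains a stopword
        if len("".join(lowered)) < minimum_length:
            continue  # candidate too short overall
        if min(len(w) for w in lowered) < minimum_word_size:
            continue  # contains a word that is too short
        kept.append(words)
    return kept


# Write a tiny stoplist file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("こと\n\nもの\n")
    stoplist_path = tmp.name

my_stoplist = load_stoplist(stoplist_path)
os.remove(stoplist_path)

# Already-tokenized candidates; tokenization itself is out of scope here.
candidates = [["自然", "言語", "処理"], ["こと"], ["木"], ["形態素", "解析"]]

strict = filter_candidates(candidates, my_stoplist)
relaxed = filter_candidates(
    candidates, my_stoplist, minimum_length=1, minimum_word_size=1
)
```

With the stricter thresholds, the single-character word 木 is dropped along with the stopword こと; lowering both thresholds to 1 keeps 木 while still removing the stopword. Thresholds tuned for English tend to be too strict for Japanese, where one- or two-character words are common.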

Hi @ygorg, thank you for your help! I'll try it.
Again, thank you for your cool lib.