Stopwords being ignored
chaturv3di opened this issue · 5 comments
I am passing the set of English stopwords, which I create from yake/StopwordsList/stopwords_en.txt.
import os
import yake
text = "YAKE! is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text. Our system does not need to be trained on a particular set of documents, neither it depends on dictionaries, external-corpus, size of the text, language or domain. To demonstrate the merits and the significance of our proposal, we compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank), and one supervised method (KEA). Experimental results carried out on top of twenty datasets (see Benchmark section below) show that our methods significantly outperform state-of-the-art methods under a number of collections of different sizes, languages or domains. In addition to the python package here described, we also make available a demo, an API and a mobile app."
language = "en"
max_ngram_size = 5
deduplication_thresold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 5
# Location of the file downloaded from https://github.com/LIAAD/yake/blob/master/yake/StopwordsList/stopwords_en.txt
stopwords_file = os.path.join(home_dir, "data_txt", "yake_stopwords_en.txt")
with open(stopwords_file, 'r', encoding='utf-8') as sw_f:
    # Skip blank lines so the empty string does not end up in the stopword set
    yake_stopwords = {line.strip().lower() for line in sw_f if line.strip()}
yake_kw_extractor = yake.KeywordExtractor(lan=language,
                                          n=max_ngram_size,
                                          dedupLim=deduplication_thresold,
                                          dedupFunc=deduplication_algo,
                                          windowsSize=windowSize,
                                          top=numOfKeywords,
                                          features=None,
                                          stopwords=yake_stopwords)
yake_kw_extractor.extract_keywords(text)
And the results end up containing stopwords like "of", "a", "from", etc.
[('trained on a particular set', -60.326928913747196),
('keywords of a text', -0.665864990295941),
('important keywords of a text', -0.31206738772455755),
('light-weight unsupervised automatic keyword extraction', 0.00029233948201177757),
('statistical features extracted from single', 0.0008477866813335354)]
If I invoke the method with parameter stopwords=None, the results don't change. Am I doing something silly here?
Thanks a lot.
I guess stopword removal happens only in the last step, i.e.:
- split words
- extract candidates
- score, dedup and remove stopwords.
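One way to test that guess against the output posted above is to check where the stopwords actually sit inside each returned phrase. The sketch below is only a diagnostic: `stopword_positions` and `has_boundary_stopword` are hypothetical helper names, and `sample_stopwords` is a tiny hand-picked stand-in for the full stopwords_en.txt list. For the five phrases in the original post, no stopword appears as the first or last word. That would be consistent with the list constraining phrase boundaries rather than being ignored, but it is an inference from this one output, not something confirmed from the YAKE source.

```python
def stopword_positions(phrase, stopwords):
    """Return the 0-based token positions in `phrase` that are stopwords."""
    tokens = phrase.lower().split()
    return [i for i, tok in enumerate(tokens) if tok in stopwords]


def has_boundary_stopword(phrase, stopwords):
    """True if the first or last token of `phrase` is a stopword."""
    tokens = phrase.lower().split()
    return bool(tokens) and (tokens[0] in stopwords or tokens[-1] in stopwords)


# Assumption: a small sample, not the full YAKE English stopword file.
sample_stopwords = {"of", "a", "from", "on"}

# Phrases copied from the results in the original post.
reported = [
    "trained on a particular set",
    "keywords of a text",
    "important keywords of a text",
    "light-weight unsupervised automatic keyword extraction",
    "statistical features extracted from single",
]

for phrase in reported:
    print(phrase,
          stopword_positions(phrase, sample_stopwords),
          has_boundary_stopword(phrase, sample_stopwords))
```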
@chaturv3di I am running into the same issue, have you found a solution?
Unfortunately not.
Not sure if secsilm was referring to this, but I am thinking about applying my stopwords in a post-processing step outside of the YAKE class.
That's not elegant, but it works. E.g., if I wanted phrases of up to 4 words without stopwords and removed the stopwords in post-processing, I'd need to fetch phrases of up to 6 words, hoping that up to 2 of those words are stopwords. That is clunky and increases the compute time.
OTOH, there doesn't seem to be another option right now.
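A minimal sketch of that post-processing workaround, for anyone who wants to try it. It does not touch YAKE itself: it takes the (keyword, score) list that extract_keywords returns, drops stopword tokens from each phrase, and keeps only phrases that end up within the desired length. `strip_stopwords` and `max_words` are hypothetical names, and `sample_stopwords` is a small stand-in for the full stopword file. Input order is preserved, so if two phrases collapse to the same text after stripping, the first one wins.

```python
def strip_stopwords(keywords, stopwords, max_words):
    """Remove stopword tokens from each (phrase, score) pair.

    Phrases that become empty or stay longer than `max_words` are dropped.
    If several phrases collapse to the same text, only the first is kept.
    """
    seen = set()
    cleaned = []
    for phrase, score in keywords:
        tokens = [t for t in phrase.split() if t.lower() not in stopwords]
        if not tokens or len(tokens) > max_words:
            continue
        text = " ".join(tokens)
        if text in seen:
            continue
        seen.add(text)
        cleaned.append((text, score))
    return cleaned


# Assumption: a small sample, not the full YAKE English stopword file.
sample_stopwords = {"of", "a", "from", "on"}

# Results copied from the original post (extracted with n=5).
results = [
    ("trained on a particular set", -60.326928913747196),
    ("keywords of a text", -0.665864990295941),
    ("important keywords of a text", -0.31206738772455755),
    ("light-weight unsupervised automatic keyword extraction", 0.00029233948201177757),
    ("statistical features extracted from single", 0.0008477866813335354),
]

for keyword, score in strip_stopwords(results, sample_stopwords, max_words=4):
    print(keyword, score)
```

With max_words=4, the five-word phrase that contains no stopwords is dropped, which illustrates the cost mentioned above: over-fetching longer n-grams means some candidates are computed only to be thrown away.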