fabianvf/python-rake

Add More Stopword Lists

Closed this issue · 8 comments

After the current round of PRs is worked out, we should build in more stopword lists. I vote for adding all the ones here, along with any others people ask for: http://www.ranks.nl/stopwords . Also, @fabianvf, one of these is what I used as a test file; think that'll cause a problem?

Hmm, I wonder if there's a good way to pull those down and cache them when they're requested, rather than adding them all to the repository. Or, more generally, we could add the ability to pull a stopword list from a URL...

Well, if you went the URL route, I'd think you'd provide a URL and a separator regex, something like

RAKE.load_stopwords('http://example.com/beststopwords', re.compile('super-cool-regex'))

so it wouldn't matter how they formatted it so long as it was a list of some kind. Just feel like it would be convenient, especially if you were just hacking/prototyping and wanted to experiment with different stoplists, without requiring you to download/format them manually.
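A minimal sketch of what such a helper might look like, assuming the URL serves plain text rather than HTML. The names `parse_stopwords` and `load_stopwords_from_url` are hypothetical and not part of python-rake; the fetch uses only `urllib.request` from the standard library, and the default separator (whitespace or commas) is an assumption.

```python
import re
from urllib.request import urlopen

# Assumed default: words separated by any run of whitespace and/or commas.
DEFAULT_SEPARATOR = re.compile(r"[\s,]+")


def parse_stopwords(text, separator=DEFAULT_SEPARATOR):
    """Split raw text into a lowercased stopword list, dropping duplicates."""
    seen = set()
    words = []
    for token in separator.split(text):
        word = token.strip().lower()
        if word and word not in seen:  # skip empty tokens and repeats
            seen.add(word)
            words.append(word)
    return words


def load_stopwords_from_url(url, separator=DEFAULT_SEPARATOR, encoding="utf-8"):
    """Hypothetical helper: fetch a plain-text page and split it with a regex."""
    with urlopen(url) as response:
        text = response.read().decode(encoding)
    return parse_stopwords(text, separator)
```

Keeping the parsing step separate from the fetch means the same regex can be tried on pasted text before pointing it at a live URL.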

Interesting. That may be a useful feature that I'm just not seeing, but I've never seen a data scientist who wanted to do that. It would also require more than a regex for the vast majority of sites; you'd need to play around in Beautiful Soup or something too. What I've seen everyone do, because it's always been the fastest, is copy and paste the list into IPython and run a quick for loop.
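For comparison, the copy-and-paste approach amounts to something like the following in an IPython session. The pasted text is placeholder content for illustration, not an actual list from any site.

```python
# Text pasted from a web page (placeholder content for illustration only).
pasted = """
a
about
above
after
"""

stopwords = []
for line in pasted.splitlines():
    word = line.strip().lower()
    if word:  # skip blank lines
        stopwords.append(word)
```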

It looks like this project has amassed a large collection of stopword lists from a variety of sources. Do you think we could leverage this work?
https://github.com/igorbrigadir/stopwords

For posterity's sake:

Hi Justin,

Thanks for asking.
Yes you can use our stopword lists if you credit 'ranks.nl'

Does your script work with HTML documents or text without markup only?

If HTML, I'm curious if you've had a chance to test the results from the Page Analyzer tool on ranks.nl?
It is basically a tool for Automatic Keyword Extraction from Individual HTML Documents.

Kind regards,
Damian Doyle
Ranks NL

On Tue, Aug 1, 2017 at 10:02 PM, Justin Terry justinkterry@gmail.com wrote:
Hello, I'm working on an MIT-licensed open source natural language processing tool in Python: https://github.com/fabianvf/python-rake

Can I include your stopword lists in the package by default if I credit you?

--
Thank you for your time,
Justin Terry

@fabianvf please close this; I fixed it in my last PR, which you merged, and forgot to mention it.

Never mind, apparently I can now.