explosion/spaCy

stopwords

rajhans opened this issue · 7 comments

I have observed that spaCy treats many common verbs like 'call' as stopwords (as indicated by IS_STOP), which is a little out of the ordinary. Is there any information that describes how spaCy determines stopwords? Is there a way to change the stopword criteria?

ines commented

Thanks! I'm actually in the process of finally reorganising the language data, so there will be an update soon that fixes this problem, among other things.

We're not very happy with the current stopword lists (or most other standard stopword lists that are available tbh). They're outdated and full of pre-processing artifacts, custom hacks and other stuff that's not relevant for spaCy (like don for "don't" etc.)

It's probably okay for information extraction, but not very useful for Machine Learning at the moment. So we want to use a slightly different and non-standard approach to determine what spaCy considers a stopword and how the language data is organised in the codebase.

We're always happy about input and suggestions – although there obviously won't be a 100% perfect solution, because in the end, it's always sort of arbitrary.

In the meantime, here's how you can customise the stopword behaviour. You can set attributes in the vocabulary, and tokens will inherit these attributes:

lex = nlp.vocab[u'call']
lex.is_stop = False
doc = nlp(u'Call me!')
[(w.text, w.is_stop) for w in doc]
# [(u'Call', False), (u'me', True), (u'!', False)]
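For readers on a more recent release, the same idea can be sketched as below. This is a minimal sketch, not an official recipe: it assumes a spaCy version where `spacy.blank("en")` is available (so no model download is needed), and it uses lowercase input on the assumption that the flag is stored on the lexeme for the exact string form.

```python
import spacy

# Assumption: a blank English pipeline is enough here, since stopword
# flags come from the language data rather than a trained model.
nlp = spacy.blank("en")

# Default behaviour: both 'call' and 'me' are flagged as stopwords.
before = [(t.text, t.is_stop) for t in nlp("call me")]

# Clear the flag on the vocabulary entry; tokens created afterwards
# for the same string pick up the changed attribute.
nlp.vocab["call"].is_stop = False
after = [(t.text, t.is_stop) for t in nlp("call me")]

print(before)
print(after)
```

If capitalised forms such as 'Call' matter for your text, it may be necessary to clear the flag on those strings as well.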

fmailhot commented

It would be helpful if the docs eventually included an explanation of the decision-making that went into whichever words end up being considered stopwords.

In my experience, it's better to err on the side of fewer rather than more stopwords, and to get a linguist's input (the NLTK list is actually a pretty decent starting place, notwithstanding some of its flaws). You've shown that it's easy to customise stopword behaviour, so stopword-ifying e.g. very frequent words should be straightforward.

+1 to fmailhot's comment. An explanation of the stopword decisions would be helpful. IMO (or at least for my case), it is probably better to err on the conservative side when labeling stopwords: for most applications it is easier for users to explicitly add what they consider stopwords (e.g. company names in a corpus of company documents) than to explicitly 'unlist' words from the stopword list.
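The "start conservative, then add your own" approach can be sketched in plain Python. Everything here is hypothetical and for illustration only: the `build_stopwords` helper, its `min_doc_ratio` parameter, the tiny base list and the toy corpus are assumptions, not part of spaCy or NLTK.

```python
from collections import Counter


def build_stopwords(docs, base, extra=(), min_doc_ratio=0.8):
    """Combine a small base list, user-supplied terms, and very frequent words.

    A word counts as "very frequent" if it occurs in at least
    min_doc_ratio of the documents (document frequency, not raw count).
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (naive whitespace tokenisation).
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_ratio * len(docs)
    frequent = {word for word, n in doc_freq.items() if n >= threshold}
    return set(base) | {word.lower() for word in extra} | frequent


# Toy corpus: 'acme' stands in for a company name that dominates the corpus.
docs = [
    "acme will call the supplier",
    "acme shipped the order",
    "the board of acme met",
]
stops = build_stopwords(docs, base={"the", "of"}, extra={"Acme"}, min_doc_ratio=1.0)
print(sorted(stops))
```

With a short base list, a verb like 'call' stays a content word unless the user or the frequency threshold adds it.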

So where is the explanation/justification for the stopword list? This got closed, so I assume the explanation was written somewhere. There are some words in there that don't make sense, like 'call' and 'well'. I think it could use some improvement.

I am also interested, since the list seems to be several times the size of the NLTK list. I'm not sure where the explanation is, but I thought it might be helpful to link to the English stop words list currently employed.

@nateGeorge Actually, after digging through the git history, it looks like the list may have come from Stone, Dennis, and Kwantes (2010), as seen in this line from the repository.
