quanteda/quanteda

strange behavior for remove argument to dfm

Closed this issue · 6 comments

With quanteda version 0.9.9.50:

> dfm("110th", remove=stopwords())
NULL

> "110th" %in% stopwords()
[1] FALSE

> dfm("110th")
Document-feature matrix of: 1 document, 1 feature (0% sparse).
1 x 1 sparse Matrix of class "dfmSparse"
       features
docs    110th
  text1     1

"110th" is not in the default list of English stopwords, therefore the dfm keeps "110th" as a feature. I don't think dfm behaves strangely in this case. Do you want to add "110th" as a stopword? Then the code would look like this:

dfm("He is the 110th visitor this day.", remove=c("110th", stopwords("english")))

> dfm("He is the 110th visitor this day.", remove=c("110th", stopwords("english")))
Document-feature matrix of: 1 document, 3 features (0% sparse).
1 x 3 sparse Matrix of class "dfmSparse"
       features
docs    visitor day .
  text1       1   1 1

If "110th" is not a stop word, then why does dfm("110th", remove=stopwords()) return NULL?

dfm("110th", remove=stopwords()) does not return NULL; it returns a dfm which in your case only includes "110th" (see features), which occurs once in text1 (see docs). I think this is the correct behaviour. If you want to add stopwords, see my post above.

> dfm("110th", remove=stopwords()) 
Document-feature matrix of: 1 document, 1 feature (0% sparse).
1 x 1 sparse Matrix of class "dfmSparse"
       features
docs    110th
  text1     1

Maybe you fixed this issue in the development version, but it's there (at least on my system) in the latest release. The following is with R 3.4.0, in RStudio:

[screenshot, 2017-05-23, 10:47:18 am]

Just installed the latest development version of quanteda from github. This fixed the issue. Sorry for the spam.

Thanks! I just tried version 0.9.9.50 as well, and your problem occurs on my system too. It does not occur in the development version, so there seems to be a problem in the latest CRAN release.
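
For anyone landing here later, a rough sketch of how one might check whether the installed quanteda is the release discussed in this thread. The helper `is_affected` is hypothetical (it is not part of quanteda), and the only version this thread confirms as affected is 0.9.9.50; whether earlier releases behave the same way is not established here. Installing from GitHub via `devtools::install_github()` is one common route to the development version and assumes devtools is available.

```r
# Release in which dfm("110th", remove = stopwords()) was reported to return NULL.
affected_version <- package_version("0.9.9.50")

# Hypothetical helper (not part of quanteda): TRUE only for the exact reported release.
is_affected <- function(v) package_version(v) == affected_version

# Only inspect the installed version if quanteda is actually present.
if (requireNamespace("quanteda", quietly = TRUE)) {
  installed <- utils::packageVersion("quanteda")
  if (is_affected(as.character(installed))) {
    message("quanteda ", installed, " is the release reported in this thread.")
    message("The development version can be installed with:")
    message('  devtools::install_github("quanteda/quanteda")')
  } else {
    message("quanteda ", installed, " is not the release reported in this thread.")
  }
}
```

After updating, rerunning dfm("110th", remove=stopwords()) should show whether the 1 x 1 dfm is returned on your system.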