strange behavior for remove argument to dfm

Question

Closed this issue 7 years ago · 6 comments

With quanteda version 0.9.9.50:

> dfm("110th", remove=stopwords())
NULL

> "110th" %in% stopwords()
[1] FALSE

> dfm("110th")
Document-feature matrix of: 1 document, 1 feature (0% sparse).
1 x 1 sparse Matrix of class "dfmSparse"
       features
docs    110th
  text1     1

Answer 1 · 2017-05-23T14:36:25.000Z

"110th" is not in the default list of English stopwords, so the dfm keeps "110th" as a feature. I don't think dfm behaves strangely in this case. Do you want to add "110th" as a stopword? Then the code would look like this:

dfm("He is the 110th visitor this day.", remove=c("110th", stopwords("english")))

> dfm("He is the 110th visitor this day.", remove=c("110th", stopwords("english")))
Document-feature matrix of: 1 document, 3 features (0% sparse).
1 x 3 sparse Matrix of class "dfmSparse"
       features
docs    visitor day .
  text1       1   1 1

Answer 2 · 2017-05-23T14:37:51.000Z

If "110th" is not a stop word, then why does dfm("110th", remove=stopwords()) return NULL?

Answer 3 · 2017-05-23T14:44:24.000Z

dfm("110th", remove=stopwords()) does not return NULL; it returns a dfm which in your case only includes "110th" (see features), which occurs once in text1 (see docs). I think this is the correct behaviour. If you want to add stopwords, see my post above.

> dfm("110th", remove=stopwords()) 
Document-feature matrix of: 1 document, 1 feature (0% sparse).
1 x 1 sparse Matrix of class "dfmSparse"
       features
docs    110th
  text1     1

Answer 4 · 2017-05-23T14:50:06.000Z

Maybe you fixed this issue in the development version, but it's there (at least on my system) in the latest release. The following is with R 3.4.0, in RStudio:

Answer 5 · 2017-05-23T14:55:18.000Z

Just installed the latest development version of quanteda from GitHub. This fixed the issue. Sorry for the spam.

Answer 6 · 2017-05-23T14:55:29.000Z

Thanks! I just tried version 0.9.9.50 as well, and your problem occurs on my system too. It does not occur in the development version, so there seems to be a problem in the latest CRAN release.
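
For anyone else landing here, a quick way to tell whether an installation is still on the affected release is to compare version numbers. This is only a minimal sketch using base R: the helper name is_newer_than_affected is made up for illustration, and it assumes (based on this thread only) that 0.9.9.50 is the affected release and that anything newer contains the fix.

```r
# Release reported as affected in this thread.
affected_release <- package_version("0.9.9.50")

# Hypothetical helper: TRUE if version string `v` is newer than the affected release.
is_newer_than_affected <- function(v) {
  package_version(v) > affected_release
}

# Check the locally installed quanteda, if there is one.
if (requireNamespace("quanteda", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("quanteda"))
  message("quanteda ", installed, " newer than 0.9.9.50: ",
          is_newer_than_affected(installed))
}
```

If this reports FALSE for your installation, updating quanteda (as in Answer 5) should be the first thing to try.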