strange behavior for remove argument to dfm
With quanteda version 0.9.9.50:
> dfm("110th", remove=stopwords())
NULL
> "110th" %in% stopwords()
[1] FALSE
> dfm("110th")
Document-feature matrix of: 1 document, 1 feature (0% sparse).
1 x 1 sparse Matrix of class "dfmSparse"
features
docs 110th
text1 1
"110th" is not in the default list of English stopwords, therefore the dfm
keeps "110th" as a feature. I don't think dfm
behaves strangely in this case. Do you want to add "110th" as a stopword? Then the code would look like this:
dfm("He is the 110th visitor this day.", remove=c("110th", stopwords("english")))
> dfm("He is the 110th visitor this day.", remove=c("110th", stopwords("english")))
Document-feature matrix of: 1 document, 3 features (0% sparse).
1 x 3 sparse Matrix of class "dfmSparse"
features
docs visitor day .
text1 1 1 1
If "110th" is not a stop word, then why does dfm("110th", remove=stopwords()) return NULL?
dfm("110th", remove=stopwords()) does not return NULL; it returns a dfm which in your case only includes 110th (see features), which occurs once in text1 (see docs). I think this is the correct behaviour. If you want to add stopwords, see my post above.
> dfm("110th", remove=stopwords())
Document-feature matrix of: 1 document, 1 feature (0% sparse).
1 x 1 sparse Matrix of class "dfmSparse"
features
docs 110th
text1 1
I just installed the latest development version of quanteda from GitHub, and that fixed the issue. Sorry for the spam.
Thanks! I just tried version 0.9.9.50 as well, and your problem occurs on my system too. It does not occur in the development version, so there seems to be a problem in the latest CRAN release.
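For anyone landing here with the same symptom, here is a minimal sketch of how to check whether the installed quanteda is the release discussed in this thread. The helper is_newer_than_affected is hypothetical (it is not part of quanteda). Treating any version above 0.9.9.50 as fixed is an assumption based only on the comments above. The GitHub repository path in the message is also an assumption, since the thread does not state it.

```r
# Hypothetical helper (not part of quanteda): is `installed` newer than the
# release reported as affected in this thread?
is_newer_than_affected <- function(installed, affected = "0.9.9.50") {
  package_version(installed) > package_version(affected)
}

# Only inspect the installed package if quanteda is actually available.
if (requireNamespace("quanteda", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("quanteda"))
  if (!is_newer_than_affected(installed)) {
    message("quanteda ", installed, " may show the NULL result reported here.")
    # Assumption: the development version lives at this GitHub path.
    message('Consider: devtools::install_github("quanteda/quanteda")')
  }
}
```

After updating, re-running dfm("110th", remove=stopwords()) from the original report is a quick way to confirm that a dfm is returned rather than NULL.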