datashield/dsBase

subsetByClass and friends


Hello,

I think I found a problem in the subsetByClassHelper functions. Either that or I'm missing something obvious. Here's a little test case:

opals <- datashield.login(logindata)
# a factor with 7 levels:
datashield.assign(opals, 'fact', quote(rep(c('a', 'b', 'c', 'd', 'e', 'f', 'g'), 10)))
ds.asFactor('fact', 'fact')
# check:
ds.levels('fact')
ds.subsetByClass('fact')
ds.summary('subClasses')

And the result:

....
[1] "fact.level_a_EMPTY" "fact.level_b_EMPTY" "fact.level_c_EMPTY" "fact.level_d_EMPTY" "fact.level_e_EMPTY" "fact.level_f_EMPTY" "fact.level_g_EMPTY"

I had a quick look in the code and I see this in all subsetByClassHelpers:

...
        for (j in 1:length(categories)) {
            indices <- which(var == as.numeric(categories[j]))
...

Why as.numeric? I didn't look too closely, but if I remove as.numeric it seems to work: the subsets get populated. Am I missing something here about how R handles factors?
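To illustrate what I suspect is happening, here is a minimal plain-R sketch that runs without DataSHIELD. It assumes that `categories` holds the level labels as character strings; I have not confirmed that from the helper source, so `var` and `categories` below only mirror the names in the snippet above.

```r
# Plain-R sketch (no DataSHIELD needed). `var` and `categories` mirror the
# names in the helper; their real contents there are an assumption.
var <- factor(rep(c("a", "b", "c", "d", "e", "f", "g"), 10))
categories <- levels(var)

# With as.numeric(): a non-numeric label such as "a" coerces to NA (with a
# warning), the comparison is NA for every element, and which() drops NAs.
idx_numeric <- suppressWarnings(which(var == as.numeric(categories[1])))
length(idx_numeric)  # 0

# Without as.numeric(): the factor is compared against the label itself.
idx_label <- which(var == categories[1])
length(idx_label)    # 10
```

If the labels happen to look like numbers (for example "0", "1", "2"), the as.numeric version still matches, which might be why this was not noticed earlier.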

Thanks,
Iulian

Hi Iulian,

Thank you for spotting this. R does not do arithmetic on factors: an arithmetic operator applied to a factor returns NA. Here is a simple example:

set.seed(10)
a <- rbinom(100, 2, 0.3)
a.f <- as.factor(a)
a.n <- as.numeric(a)
a.n^2 # returns the squares of the elements
a.f^2 # returns NAs, with a warning that '^' is not meaningful for factors

For that reason we used to convert factors to numeric for any calculations and then convert the results back to factors. But you are right that in this specific function the use of as.numeric is not necessary. I will go through the code and make any appropriate modifications for the next release.
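As a rough sketch of that convert, calculate, convert back pattern in plain R (the variable names are only illustrative and are not taken from dsBase): note that as.numeric applied directly to a factor returns the internal integer codes, so the labels have to go through as.character first.

```r
# Illustrative only: a small factor whose labels are numbers.
a.f <- factor(c(0, 2, 1, 2, 0))

# as.numeric() on a factor gives the internal level codes, not the labels.
codes <- as.numeric(a.f)                 # 1 3 2 3 1

# Going through as.character() recovers the numeric values of the labels.
values <- as.numeric(as.character(a.f))  # 0 2 1 2 0

# Do the calculation on the numeric values, then convert back to a factor.
squared <- values^2                      # 0 4 1 4 0
squared.f <- as.factor(squared)
levels(squared.f)                        # "0" "1" "4"
```

This only works when every label is a number. For labels such as 'a' to 'g' the conversion yields NA, which is what produced the empty subsets reported above.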

Thank you very much. Comments like this are very useful to us for improving our code.

Many thanks,
Demetris