edwindj/ffbase

ffdfdply returns NA if only one split exists in the data

Closed this issue · 1 comments

When applying a function to an ffdf with ffdfdply, if the split argument has only one level, that level comes back as NA and the function seemingly receives all of the data as NA. I would expect ffdfdply to work with a single split level as well as with several. For example:

rastVals <- 1:10000
zoneVals <- rep(1, 10000)

vals <- ff::ff(initdata = rastVals, finalizer = "delete", overwrite = TRUE)
zones <- ff::ff(initdata = zoneVals, finalizer = "delete", overwrite = TRUE)
rDT <- ff::ffdf(zones, vals)

result <- ffbase::ffdfdply(x=rDT,
                           split=as.character(zones),
                           trace=TRUE,
                           BATCHBYTES = 80.85*2^20,
                           FUN = function(dta){
                             ## This happens in RAM, containing **several** split
                             ## elements, so here we can use data.table, which works
                             ## fine for in-RAM computing
                             dta <- data.table::as.data.table(dta)
                             
                             # Calculate aggregations: sum and mean of each column, per zone
                             result <- dta[, as.list(unlist(lapply(.SD, function(x) {
                               list(sum = sum(x, na.rm = TRUE), mean = mean(x, na.rm = TRUE))
                             }))), by = zones]
                             
                             as.data.frame(result)
                           })

This returns the following result:

> result
ffdf (all open) dim=c(1,3), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
          PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol
zones            zones       double        double FALSE           FALSE            FALSE                 1                1
vals.sum      vals.sum       double        double FALSE           FALSE            FALSE                 2                1
vals.mean    vals.mean       double        double FALSE           FALSE            FALSE                 3                1
          PhysicalLastCol PhysicalIsOpen
zones                   1           TRUE
vals.sum                1           TRUE
vals.mean               1           TRUE
ffdf data
  zones vals.sum vals.mean
1    NA        0       NaN

This is easy to confirm by adding a second split level to the zones. For example, changing zoneVals to
zoneVals <- c(rep(1, 5000), rep(2, 5000))
gives the following result:

> result
ffdf (all open) dim=c(2,3), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
          PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol
zones            zones       double        double FALSE           FALSE            FALSE                 1                1
vals.sum      vals.sum       double        double FALSE           FALSE            FALSE                 2                1
vals.mean    vals.mean       double        double FALSE           FALSE            FALSE                 3                1
          PhysicalLastCol PhysicalIsOpen
zones                   1           TRUE
vals.sum                1           TRUE
vals.mean               1           TRUE
ffdf data
       zones   vals.sum  vals.mean
1        1.0 12502500.0     2500.5
2        2.0 37502500.0     7500.5
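
As a cross-check on the two outputs above, the same aggregations can be recomputed entirely in RAM with base R. This is only a minimal sketch for comparison; `zone_stats` is a hypothetical helper name and is not part of ff or ffbase.

```r
# Recompute the per-zone sum and mean in RAM with base R.
# zone_stats is a hypothetical helper, used here only for comparison.
zone_stats <- function(zones, vals) {
  data.frame(
    zones     = sort(unique(zones)),
    vals.sum  = as.numeric(tapply(vals, zones, sum, na.rm = TRUE)),
    vals.mean = as.numeric(tapply(vals, zones, mean, na.rm = TRUE))
  )
}

rastVals <- 1:10000

# Single split level: the case that came back as NA / 0 / NaN
one_zone <- zone_stats(rep(1, 10000), rastVals)

# Two split levels: the case that worked
two_zones <- zone_stats(c(rep(1, 5000), rep(2, 5000)), rastVals)

print(one_zone)
print(two_zones)
```

For the single zone this gives a sum of 50005000 and a mean of 5000.5, which is what I would expect ffdfdply to return instead of NA, 0 and NaN. The two-zone values match the ffdfdply output above.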

Did I miss something in how ffdfdply works? I can calculate a single split directly, of course, but with dynamic aggregation it makes sense for ffdfdply to handle both single and multiple split levels.

You are completely right. Thanks for reporting. I've updated the package with a small fix that also allows a single split level.
Although ffdfdply is intended for cases with many more split levels, where it brings groups of split levels into RAM, it now also works correctly with only 1 split level.
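
To illustrate the general idea of bringing groups of split levels into RAM, here is a minimal base-R sketch. It is not ffbase's actual implementation: `batch_apply`, `split_by`, `max_rows` and `sum_by_zone` are hypothetical names, and a row budget stands in for the byte budget that BATCHBYTES expresses.

```r
# Illustrative only: greedily pack split levels into batches of at most
# max_rows rows, then apply FUN to each batch as an ordinary data.frame.
# A single level larger than max_rows still gets a batch of its own.
batch_apply <- function(df, split_by, FUN, max_rows = 5000) {
  counts <- table(split_by)              # number of rows per split level
  batch_id <- integer(length(counts))
  current <- 1L
  used <- 0L
  for (i in seq_along(counts)) {
    # Start a new batch when adding this level would exceed the budget
    if (used > 0L && used + counts[[i]] > max_rows) {
      current <- current + 1L
      used <- 0L
    }
    batch_id[i] <- current
    used <- used + counts[[i]]
  }
  # Map every row to the batch that holds its split level
  row_batch <- batch_id[match(as.character(split_by), names(counts))]
  pieces <- lapply(split(df, row_batch), FUN)
  do.call(rbind, unname(pieces))
}

sum_by_zone <- function(d) aggregate(vals ~ zones, data = d, FUN = sum)

# One split level: a single batch, and a single row in the result
df1 <- data.frame(zones = rep(1, 10000), vals = 1:10000)
single <- batch_apply(df1, df1$zones, sum_by_zone)

# Two split levels of 5000 rows each: two batches with the default budget
df2 <- data.frame(zones = c(rep(1, 5000), rep(2, 5000)), vals = 1:10000)
double <- batch_apply(df2, df2$zones, sum_by_zone)
```

In this sketch a lone split level is simply the degenerate case of one batch holding one level, so the same code path serves both single and multiple splits.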
Thanks again.