ffdfdply returns NA if only one split exists in the data
When applying a function to an ffdf using ffdfdply, if the split parameter has only one level, the split level is returned as NA and the applied functions appear to read all of the data as NA. I assume that ffdfdply should work with one split level as well as with multiple split levels? For example:
rastVals <- 1:10000
zoneVals <- rep(1, 10000)
vals <- ff::ff(initdata = rastVals, finalizer = "delete", overwrite = TRUE)
zones <- ff::ff(initdata = zoneVals, finalizer = "delete", overwrite = TRUE)
rDT <- ff::ffdf(zones, vals)
result <- ffbase::ffdfdply(
  x = rDT,
  split = as.character(zones),
  trace = TRUE,
  BATCHBYTES = 80.85 * 2^20,
  FUN = function(dta) {
    ## This happens in RAM - containing **several** split
    ## elements, so here we can use data.table, which works
    ## fine for in-RAM computing
    dta <- data.table::as.data.table(dta)
    ## calculate the aggregations per zone
    result <- dta[, as.list(unlist(lapply(.SD, function(x) {
      list(sum = sum(x, na.rm = TRUE), mean = mean(x, na.rm = TRUE))
    }))), by = zones]
    as.data.frame(result)
  })
This returns the following result:
> result
ffdf (all open) dim=c(1,3), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol
zones zones double double FALSE FALSE FALSE 1 1
vals.sum vals.sum double double FALSE FALSE FALSE 2 1
vals.mean vals.mean double double FALSE FALSE FALSE 3 1
PhysicalLastCol PhysicalIsOpen
zones 1 TRUE
vals.sum 1 TRUE
vals.mean 1 TRUE
ffdf data
zones vals.sum vals.mean
1 NA 0 NaN
The difference is easy to see after adding a second split level to the zones, e.g. changing the zoneVals entries to
zoneVals <- c(rep(1, 5000), rep(2, 5000))
gives the following result:
> result
ffdf (all open) dim=c(2,3), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol
zones zones double double FALSE FALSE FALSE 1 1
vals.sum vals.sum double double FALSE FALSE FALSE 2 1
vals.mean vals.mean double double FALSE FALSE FALSE 3 1
PhysicalLastCol PhysicalIsOpen
zones 1 TRUE
vals.sum 1 TRUE
vals.mean 1 TRUE
ffdf data
zones vals.sum vals.mean
1 1.0 12502500.0 2500.5
2 2.0 37502500.0 7500.5
Did I miss something in how ffdfdply works? I can of course calculate a single split directly, but when the aggregation is dynamic it makes sense for ffdfdply to handle both single and multiple splits.
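For reference, the values that a single-split run should produce can be worked out in base R without ff at all. This is only a sanity check on the toy data from the example above; the name `expected` is introduced here for illustration and is not part of the original code.

```r
# Same toy data as in the report, held in plain R vectors (no ff involved).
rastVals <- 1:10000
zoneVals <- rep(1, 10000)

# Per-zone sum and mean; with a single zone each result has exactly one entry.
expected_sum  <- tapply(rastVals, zoneVals, sum, na.rm = TRUE)
expected_mean <- tapply(rastVals, zoneVals, mean, na.rm = TRUE)

# Arrange the values in the same column layout as the ffdfdply result.
expected <- data.frame(
  zones     = as.numeric(names(expected_sum)),
  vals.sum  = as.numeric(expected_sum),
  vals.mean = as.numeric(expected_mean)
)
print(expected)
```

This gives a single row with zones 1, a sum of 50005000 and a mean of 5000.5, which is what one would expect in place of the NA / 0 / NaN row shown above.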
You are completely right. Thanks for reporting. I've updated the package with a small fix so that a single split level is also handled.
ffdfdply is intended for cases with many more split levels, where groups of split levels are loaded into RAM together, but with only 1 split level it now also works correctly.
Thanks again.
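To illustrate the idea of processing groups of split levels in RAM, here is a minimal base R sketch. It is not ffbase's implementation: `apply_by_split_batches`, `summarise_zones` and the `max_rows` budget are hypothetical names made up for this example, and a plain data.frame stands in for the ffdf. The point is only that whole split levels are packed into batches and that a single level simply ends up as one batch.

```r
# Hypothetical helper (not part of ffbase): apply FUN to batches made of
# whole split levels, where each batch holds roughly max_rows rows.
apply_by_split_batches <- function(df, split_values, FUN, max_rows = 5000) {
  split_values <- as.character(split_values)
  counts <- table(split_values)
  lvls <- names(counts)

  # Greedily assign each level to a batch; a level is never cut in two.
  batch <- integer(length(counts))
  current <- 1L
  used <- 0
  for (i in seq_along(counts)) {
    n <- counts[[i]]
    if (used > 0 && used + n > max_rows) {
      current <- current + 1L
      used <- 0
    }
    batch[i] <- current
    used <- used + n
  }

  # Run FUN on the rows of each batch and stack the per-batch results.
  pieces <- lapply(split(lvls, batch), function(lv) {
    FUN(df[split_values %in% lv, , drop = FALSE])
  })
  out <- do.call(rbind, unname(pieces))
  rownames(out) <- NULL
  out
}

# Hypothetical per-batch aggregation: sum and mean of vals for each zone.
summarise_zones <- function(dta) {
  sums  <- tapply(dta$vals, dta$zones, sum, na.rm = TRUE)
  means <- tapply(dta$vals, dta$zones, mean, na.rm = TRUE)
  data.frame(zones     = as.numeric(names(sums)),
             vals.sum  = as.numeric(sums),
             vals.mean = as.numeric(means))
}

# One split level: everything lands in a single batch.
one <- data.frame(zones = rep(1, 10000), vals = 1:10000)
res_one <- apply_by_split_batches(one, one$zones, summarise_zones)

# Two split levels of 5000 rows each: two batches with the default budget.
two <- data.frame(zones = c(rep(1, 5000), rep(2, 5000)), vals = 1:10000)
res_two <- apply_by_split_batches(two, two$zones, summarise_zones)
```

In this sketch the single-level case needs no special handling, because the batching loop always produces at least one batch.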