`na.rm` Argument to Deal with Missing Values
my-R-help opened this issue · 5 comments
Something like this:
> library(zoo)
> ( x <- c(2, 4, 6, NA, 8) )
[1] 2 4 6 NA 8
> rollapplyr(x, 2, mean, na.rm = F) # same as `roll_mean(x, 2)`
[1] 3 5 NA NA
> rollapplyr(x, 2, mean, na.rm = T) # `roll_mean(x, 2, na.rm = T)`
[1] 3 5 6 8
Problem with `sum` and `na.rm=T`
If such an `na.rm` argument were implemented, it might also be useful to add a special rolling sum version that returns NA, not zero, for a window consisting entirely of NAs. The code below illustrates the current behavior:
library(zoo)
x <- c(2, 4, NA, NA, 8)
rollapplyr(x, 2, sum, na.rm = T)
[1] 6 4 0 8
For many applications, I think it would be more natural to have the following output in this case:
[1] 6 4 NA 8
Background Information
This is because (as explained in `?sum`) the sum of an empty set is zero.
sum(NA, na.rm = T)
[1] 0
However, for many applications, I think it would be more natural for `sum` to return `NA` in this case. Here is an amended sum function that does what I have in mind:
s <- function(x, na.rm = FALSE) {
  if (!na.rm) return(sum(x))
  # All-NA input: return an NA of the same class as x instead of 0
  if (all(is.na(x))) {
    o <- NA
    class(o) <- class(x)
    return(o)
  }
  sum(x, na.rm = TRUE)
}
Using this with `zoo::rollapplyr` returns the desired result:
rollapplyr(x, 2, s, na.rm = T)
[1] 6 4 NA 8
I've added `na.rm`, but haven't yet made a decision about the behaviour of `sum`. I could imagine someone arguing for this behaviour for all rolling functions, really.
Thanks, that was quick!
My reasoning for the rolling sum is that if it gives `NA`, you can later convert the resulting `NA`s back to zero if needed. However, if it returns zero (as `base::sum` does), you don't know whether this zero is because the sum is really zero (e.g. `0+0`) or because the window was all-`NA`. In that sense, we are losing information that could be valuable.
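To make the information-loss point concrete, here is a minimal base-R sketch. The variable names are only illustrative:

```r
# Two different windows that an NA-removing sum cannot tell apart
zeros  <- c(0, 0)
all_na <- c(NA_real_, NA_real_)
sum(zeros, na.rm = TRUE)   # 0
sum(all_na, na.rm = TRUE)  # 0

# The reverse direction loses nothing: NA results can be mapped to 0 later
r <- c(6, 4, NA, 8)
r[is.na(r)] <- 0
r                          # 6 4 0 8
```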
I think it might be a good idea to keep both options, i.e. the one that works like `base::sum` and the one I suggested above. Maybe add an argument to `roll_sum` that switches between the two behaviors, with the default consistent with `base::sum`.
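A rough base-R sketch of what such a switch could look like. Both the function name `roll_sum_sketch` and the argument name `na.if.empty` are made up for illustration and are not part of any existing package; the real `roll_sum` might do this quite differently:

```r
# Hypothetical helper: right-aligned rolling sum over windows of width n.
# na.if.empty is an assumed argument name; FALSE mimics base::sum.
roll_sum_sketch <- function(x, n, na.rm = FALSE, na.if.empty = FALSE) {
  x <- as.numeric(x)
  starts <- seq_len(max(0, length(x) - n + 1))
  vapply(starts, function(i) {
    w <- x[i:(i + n - 1)]
    # Only deviate from base::sum when NAs are removed and nothing is left
    if (na.rm && na.if.empty && all(is.na(w))) return(NA_real_)
    sum(w, na.rm = na.rm)
  }, numeric(1))
}

x <- c(2, 4, NA, NA, 8)
roll_sum_sketch(x, 2, na.rm = TRUE)                      # 6 4 0 8
roll_sum_sketch(x, 2, na.rm = TRUE, na.if.empty = TRUE)  # 6 4 NA 8
```

With the default `na.if.empty = FALSE` the result matches the `rollapplyr(x, 2, sum, na.rm = T)` output shown earlier, so existing code would not change.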
Now that the number of values used to calculate the return value for a given window can vary (if `na.rm=T`), it would be helpful to have access to the `n` of a given window. This would also help solve the problem @my-R-help talked about. If the rolling function returns `0` when the input was all-`NA` and also reports `0` for `n`, you could change the return value to `NA` yourself by looking at `n`. Maybe this could be optional with `return.n=T` or similar, especially if it affects performance.
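Something along these lines, as a base-R sketch only. The function name `roll_sum_n_sketch` and the returned column names are assumptions for illustration, not an existing API:

```r
# Hypothetical helper: rolling sum that also reports, per window,
# how many non-NA values it contained.
roll_sum_n_sketch <- function(x, n, na.rm = FALSE) {
  x <- as.numeric(x)
  starts <- seq_len(max(0, length(x) - n + 1))
  windows <- lapply(starts, function(i) x[i:(i + n - 1)])
  data.frame(
    value = vapply(windows, sum, numeric(1), na.rm = na.rm),
    n     = vapply(windows, function(w) sum(!is.na(w)), integer(1))
  )
}

res <- roll_sum_n_sketch(c(2, 4, NA, NA, 8), 2, na.rm = TRUE)
res$n                         # 2 1 0 1
res$value[res$n == 0] <- NA   # caller decides what an empty window means
res$value                     # 6 4 NA 8
```

The caller keeps the `base::sum` behaviour by default and can still tell a genuine zero from an all-`NA` window.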