HenrikBengtsson/matrixStats

Improving precision of rowCumsums() ?

Opened this issue · 7 comments

Compared with base::cumsum() rowCumsums() seems to be less precise with floating point operations. As an example - calculating cumulative sums of repeated fractions that all should sum to one in the end. Every time cumsum() from base R returns 1 as a final result. And most of the time the result of rowCumsums() is either slightly below or slightly above 1.

rowCumsums(t(1/rep(11, 11))) == 1
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

cumsum(t(1/rep(11, 11))) == 1
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

This can have practical consequences: recently I was trying to speed up kolmogorov-smirnov test calculations and the statistic requires to calculate cumulative sums. I tried using rowCumsums() but psmirnov() then frequently returned NA values. Trouble is - the statistic should be bounded between 0 and 1, and rowCumsums() gets numbers that are tiny bit below zero or above 1.

Thanks for the report. If you search this issue tracker, I think you'll find other examples. If you search the R help and devel mailing list, you'll find similar precision issues with base R functions.

We might be able to improve this, but there will always be precision issues because of floating-point arithmetic. The only safe way to compare to 1.0 is to compare with a tolerance. For example, all.equal() for numeric values has a tolerance = sqrt(.Machine$double.eps) argument. So, something like:

> tolerance <- sqrt(.Machine$double.eps)
> abs(rowCumsums(t(1/rep(11, 11))) - 1) < tolerance
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

I tried using rowCumsums() but psmirnov() then frequently returned NA values.

Can you please give a reproducible example with stats::psmirnov() here? It could be that stats::psmirnov() is problematic should be fixed.

For any developer/contribute who wish to look into this, the C-level source code is in https://github.com/HenrikBengtsson/matrixStats/blob/develop/src/rowCumsums_lowlevel_template.h. I think the culprit is that although we're adding the current cumulative sum with the next value using long double (LDOUBLE), we're rounding to double in each iteration:

ans[kk] = (ANS_C_TYPE) ((LDOUBLE) ans[kk_prev] + (LDOUBLE) xvalue);

This can probably be improved.

FWIW, the following patch makes no difference for this use case. It still gives:

> matrixStats::rowCumsums(t(1/rep(11, 11))) - 1.0
           [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
[1,] -0.9090909 -0.8181818 -0.7272727 -0.6363636 -0.5454545 -0.4545455
           [,7]       [,8]       [,9]       [,10]        [,11]
[1,] -0.3636364 -0.2727273 -0.1818182 -0.09090909 2.220446e-16
$ git diff -w src/rowCumsums_lowlevel_template.h
diff --git a/src/rowCumsums_lowlevel_template.h b/src/rowCumsums_lowlevel_template.h
index 0260f92..29cb2bd 100644
--- a/src/rowCumsums_lowlevel_template.h
+++ b/src/rowCumsums_lowlevel_template.h
@@ -33,7 +33,7 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
   R_xlen_t ii, jj, kk, kk_prev, idx;
   R_xlen_t colBegin;
   X_C_TYPE xvalue;
-  LDOUBLE value;
+  LDOUBLE value, csum;
 
 #if ANS_TYPE == 'i'
   double R_INT_MIN_d = (double)R_INT_MIN,
@@ -86,7 +86,9 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
           ans[kk] = ANS_NA;
         }
 #else
-        ans[kk] = (ANS_C_TYPE) ((LDOUBLE) ans[kk_prev] + (LDOUBLE) xvalue);
+        if (kk_prev == 0) csum = (LDOUBLE) ans[0];
+        csum = csum + (LDOUBLE) xvalue;
+        ans[kk] = (ANS_C_TYPE) csum;
 #endif
 
         kk++;
@@ -127,7 +129,7 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
           ans[kk] = ANS_NA;
         }
 #else
-        value += xvalue;
+        value += (LDOUBLE) xvalue;
         ans[kk] = (ANS_C_TYPE) value;
 #endif

Thank you for such a quick reply. Here is a somewhat artificial demonstration with psmirnov().

a <- cumsum(t(rep(1/21, 21)))                                                                                                                                                                                                       
b <- rowCumsums(t(rep(1/21, 21)))
                                                           
all.equal(as.vector(a), as.vector(b))
[1] TRUE
                                                                                                                               
psmirnov(a[21], c(10,11))                                                                                                                                                                                                           
[1] 0.9999943                                                                                                                                                                                                                         
psmirnov(b[21], c(10,11))                                                                                                                                                                                                           
[1] NA

psmirnov() is quite a new function in R so might have problems. But somehow it never gives issues with base::cumsum:

aps <- bps <- numeric(100)
for(i in 5:100) {
  a <- cumsum(t(rep(1/i, i)))
  b <- rowCumsums(t(rep(1/i, i)))

  aps[i] <- psmirnov(a[i], c(floor(i/2),ceiling(i/2)))
  bps[i] <- psmirnov(b[i], c(floor(i/2),ceiling(i/2)))
}

sum(is.na(aps))  # how many NAs from psmirnov() with base::cumsum?
[1] 0
sum(is.na(bps))  # how many NAs from psmirnov() with rowCumsums?
[1] 39

8th line in psmirnov() probably makes the difference:

head(psmirnov, 10)
                                                                 
1  function (q, sizes, z = NULL, two.sided = TRUE, exact = TRUE,
2      simulate = FALSE, B = 2000, lower.tail = TRUE, log.p = FALSE) 
3  {
4      if (is.numeric(q))
5          q <- as.double(q)
6      else stop("argument 'q' must be numeric")
7      ret <- rep.int(0, length(q))
8      ret[is.na(q) | q < -1 | q > 1] <- NA
9      IND <- which(!is.na(ret))
10     if (!length(IND))

Maybe it should be fixed at psmirnov()side. But still it seems like, aside from the general issues of numerical precision in computing, base::cumsum and rowCumsums do not deviate randomly from one another, but rather base::cumsum seems to be more accurate.

More notes for clarifications. The same precision issue happens if we calculate the cumulative sum manually using vanilla addition in R;

my_cumsum <- function(x) {
  y <- double(length(x))
  y[1] <- x[1]
  for (kk in seq(from = 2, to =length(x))) y[kk] <- y[kk-1] + x[kk]
  y
}

For example:

x <- rep(1/11, times = 11L)

y1 <- cumsum(x)
y1 - 1
##  [1] -0.90909091 -0.81818182 -0.72727273 -0.63636364 -0.54545455
##  [6] -0.45454545 -0.36363636 -0.27272727 -0.18181818 -0.09090909
## [11]  0.00000000

y2 <- my_cumsum(x)
y2 - 1
##  [1] -9.090909e-01 -8.181818e-01 -7.272727e-01 -6.363636e-01 -5.454545e-01
##  [6] -4.545455e-01 -3.636364e-01 -2.727273e-01 -1.818182e-01 -9.090909e-02
## [11]  2.220446e-16

y2-y1
##  [1] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
##  [6] 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16
## [11] 2.220446e-16


y3 <- as.vector(matrixStats::rowCumsums(x, dim. = c(1, length(x))))
y3 - 1
##  [1] -9.090909e-01 -8.181818e-01 -7.272727e-01 -6.363636e-01 -5.454545e-01
##  [6] -4.545455e-01 -3.636364e-01 -2.727273e-01 -1.818182e-01 -9.090909e-02
## [11]  2.220446e-16
y3-y1
##  [1] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
##  [6] 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16
## [11] 2.220446e-16
y3-y2
##  [1] 0 0 0 0 0 0 0 0 0 0 0

Since y1 != y2 and both can be considered correct ways of calculating the cumulative sum, I think one could argue that psmirnov() is too sensitive to "numeric noise".

Here's an alternative version that increase the precision by using sum() instead of +:

my_cumsum2 <- function(x) {
  y <- double(length(x))
  y[1] <- x[1]
  for (kk in seq(from = 2, to = length(x))) {
    tmp <- x[seq_len(kk)]
    y[kk] <- sum(tmp)
  }
  y
}
> y4 <- my_cumsum2(x)
> y4-1
 [1] -0.90909091 -0.81818182 -0.72727273 -0.63636364 -0.54545455 -0.45454545
 [7] -0.36363636 -0.27272727 -0.18181818 -0.09090909  0.00000000
> y4-y1
 [1] 0 0 0 0 0 0 0 0 0 0 0