RcppCore/Rcpp

Integer overflow in `MatrixRow` offset calculations for large matrices

LTLA opened this issue · 8 comments

LTLA commented

There is a bug in accessing the elements of a MatrixRow of a large matrix:

b <- Rcpp::sourceCpp(code="
#include \"Rcpp.h\"

//[[Rcpp::export(rng=false)]]
double get_bottom_right(Rcpp::NumericMatrix x) {
    auto row = x.row(x.nrow() - 1);
    return row[x.ncol() - 1];
}
")

# Okay:
mat1 <- matrix(1:20000 + 0.0, nrow=20000, ncol=1e5)
get_bottom_right(mat1)
## [1] 20000
mat1[20000,1e5]
## [1] 20000

# Incorrect:
mat2 <- matrix(1:30000 + 0.0, nrow=30000, ncol=1e5)
get_bottom_right(mat2)
## [1] 13216
mat2[30000,1e5]
## [1] 30000

This is because the row[] indexing is done through the usual offset calculation:

inline reference operator[]( int i ) const {
return parent[ row + i * parent_nrow ] ;
}

but both i and parent_nrow are (32-bit) ints:

int parent_nrow ;

which causes an integer overflow when computing i * parent_nrow for large matrices. This causes us to access some random memory address, leading to the incorrect result observed above. We have also observed segfaults from similar code on other machines, though I don't have the details of those systems.

The fix seems relatively simple: just cast parent_nrow to a R_xlen_t before multiplication. This should cause promotion upon multiplication and avoid any issues, given that Matrix::operator[] already accepts a R_xlen_t.

Seems likely that MatrixColumn will need a similar fix in its constructor:

const_start(parent.begin() + i *n)

Session information
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.3.0 tools_4.3.0    Rcpp_1.0.11
Compiler information

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

That is possible. We extended to R_xlen_t in a number of places years ago but may have missed this. Do you think you have a moment to dive into a PR for this now-identified case?

(But yikes why are you on Ubuntu 18.04 with gcc/g++ 7.5.0 ?)

LTLA commented

Do you think you have a moment to dive into a PR for this now-identified case?

I'll try to make some time on Sunday if someone else doesn't get to it first.

(But yikes why are you on Ubuntu 18.04 with gcc/g++ 7.5.0 ?)

Had to use an old HPC to get enough memory for this example... though I must also admit that I only recently updated from 18.04 on my personal laptop because I forgot to enable LTS upgrade notifications.

LTLA commented

Alright, slapped a PR together. Not sure how to actually test it, though, unless your CI system has >16 GB of RAM to spare.

I'm happy to test normally. I have more here and could use AWS. But good point: CI at GitHub will likely need to skip such tests.

I did actually think about this a little over the last few days and my memory is that for matrices we did not switch to R_xlen_t because while vectors had the constrainted remove on indices, for matrices it was their total element number.

Does the (clean and simple) test you have showing a weakness in Rcpp types actually work with base R?

LTLA commented

Is the example in my original post sufficient? i.e. indexing in base R works correctly for mat2, but not via Rcpp:

# Incorrect:
mat2 <- matrix(1:30000 + 0.0, nrow=30000, ncol=1e5)
get_bottom_right(mat2)
## [1] 13216
mat2[30000,1e5]
## [1] 30000

I had been under the impression that it was possible to create R matrices up to .Machine$integer.max-by-.Machine$integer.max, so it should be possible to subscript them with indices up to .Machine$integer.max.

Note that at least someone else had the same idea in MatrixColumn, I guess they just missed a few spots:

const_start(const_cast<const MATRIX&>(parent).begin() + static_cast<R_xlen_t>(i) * n)

Yes, if this works with mat2 then Rcpp should absolutely fix this. I'll start a reverse depends run with your PR, it sadly takes a good moment on an old machine.

And as mentioned, I think this either may have forgotten in the earlier R_xlen_t conversion PR years ago (that was super useful) or we were mistaken in believing R didn't didn't do that part either. (I think I did not have easy access to that much RAM then.) So thanks so much for the PR. Will follow-up.

Fixed by #1281 a couple of days ago. Thanks again for the PR!