Integer overflow in `MatrixRow` offset calculations for large matrices
LTLA opened this issue · 8 comments
There is a bug in accessing the elements of a MatrixRow
of a large matrix:
b <- Rcpp::sourceCpp(code="
#include \"Rcpp.h\"
//[[Rcpp::export(rng=false)]]
double get_bottom_right(Rcpp::NumericMatrix x) {
auto row = x.row(x.nrow() - 1);
return row[x.ncol() - 1];
}
")
# Okay:
mat1 <- matrix(1:20000 + 0.0, nrow=20000, ncol=1e5)
get_bottom_right(mat1)
## [1] 20000
mat1[20000,1e5]
## [1] 20000
# Incorrect:
mat2 <- matrix(1:30000 + 0.0, nrow=30000, ncol=1e5)
get_bottom_right(mat2)
## [1] 13216
mat2[30000,1e5]
## [1] 30000
This is because the row[]
indexing is done through the usual offset calculation:
Rcpp/inst/include/Rcpp/vector/MatrixRow.h
Lines 171 to 173 in ae1eccc
but both i
and parent_nrow
are (32-bit) int
s:
Rcpp/inst/include/Rcpp/vector/MatrixRow.h
Line 206 in ae1eccc
which causes an integer overflow when computing i * parent_nrow
for large matrices. This causes us to access some random memory address, leading to the incorrect result observed above. We have also observed segfaults from similar code on other machines, though I don't have the details of those systems.
The fix seems relatively simple: just cast parent_nrow
to a R_xlen_t
before multiplication. This should cause promotion upon multiplication and avoid any issues, given that Matrix::operator[]
already accepts a R_xlen_t
.
Seems likely that MatrixColumn
will need a similar fix in its constructor:
Rcpp/inst/include/Rcpp/vector/MatrixColumn.h
Line 132 in ae1eccc
Session information
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.3.0 tools_4.3.0 Rcpp_1.0.11
Compiler information
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
That is possible. We extended to R_xlen_t
in a number of places years ago but may have missed this. Do you think you have a moment to dive into a PR for this now-identified case?
(But yikes why are you on Ubuntu 18.04 with gcc/g++ 7.5.0 ?)
Do you think you have a moment to dive into a PR for this now-identified case?
I'll try to make some time on Sunday if someone else doesn't get to it first.
(But yikes why are you on Ubuntu 18.04 with gcc/g++ 7.5.0 ?)
Had to use an old HPC to get enough memory for this example... though I must also admit that I only recently updated from 18.04 on my personal laptop because I forgot to enable LTS upgrade notifications.
Alright, slapped a PR together. Not sure how to actually test it, though, unless your CI system has >16 GB of RAM to spare.
I'm happy to test normally. I have more here and could use AWS. But good point: CI at GitHub will likely need to skip such tests.
I did actually think about this a little over the last few days and my memory is that for matrices we did not switch to R_xlen_t
because while vectors had the constrainted remove on indices, for matrices it was their total element number.
Does the (clean and simple) test you have showing a weakness in Rcpp types actually work with base R?
Is the example in my original post sufficient? i.e. indexing in base R works correctly for mat2
, but not via Rcpp:
# Incorrect: mat2 <- matrix(1:30000 + 0.0, nrow=30000, ncol=1e5) get_bottom_right(mat2) ## [1] 13216 mat2[30000,1e5] ## [1] 30000
I had been under the impression that it was possible to create R matrices up to .Machine$integer.max
-by-.Machine$integer.max
, so it should be possible to subscript them with indices up to .Machine$integer.max
.
Note that at least someone else had the same idea in MatrixColumn
, I guess they just missed a few spots:
Yes, if this works with mat2
then Rcpp should absolutely fix this. I'll start a reverse depends run with your PR, it sadly takes a good moment on an old machine.
And as mentioned, I think this either may have forgotten in the earlier R_xlen_t
conversion PR years ago (that was super useful) or we were mistaken in believing R didn't didn't do that part either. (I think I did not have easy access to that much RAM then.) So thanks so much for the PR. Will follow-up.
Fixed by #1281 a couple of days ago. Thanks again for the PR!