stack overflow warnings/errors when comparing large(ish?) vectors of integers
Running stringdist over two vectors of 100k integers produces stack imbalance warnings at best and aborts the R session at worst:
# Two vectors of 100k random integers 1-12
d1 <- sample(1:12, 100000, replace = TRUE)
d2 <- sample(1:12, 100000, replace = TRUE)
# Compare
v <- stringdist::stringdist(d1, d2, method = "dl")
> Warning: stack imbalance in '<-', 2 then 21342
Attempting three of these comparisons (one for each of three date components) within a function aborts the session with:
Error: protect(): protection stack overflow
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
This problem can be sidestepped by specifying nthread = 1. The default value of getOption("sd_num_thread") for me is 7.
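The workaround can be sketched as a small wrapper that always runs single-threaded. The helper name compare_single_thread is hypothetical and not part of stringdist; only the nthread argument and the sd_num_thread option come from the package. This is a minimal sketch, not a fix for the underlying bug.

```r
# Hypothetical helper (not part of stringdist): force single-threaded
# execution so the multi-threaded code path is never entered.
compare_single_thread <- function(a, b, method = "dl") {
  stringdist::stringdist(a, b, method = method, nthread = 1)
}

# Alternative: lower the session-wide default that stringdist reads
# when nthread is not given explicitly.
# options(sd_num_thread = 1)
```

With the vectors from the report, `v <- compare_single_thread(d1, d2)` would replace the original call.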
sessionInfo:
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Australia.utf8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] data.table_1.14.8
loaded via a namespace (and not attached):
[1] compiler_4.2.0 cli_3.6.1 parallel_4.2.0 tools_4.2.0
[5] jsonlite_1.8.5 rlang_1.1.1 renv_0.17.3 stringdist_0.9.10
Confirmed, that seems to be a bug. Seems to be independent of the chosen distance.
Edit. A bit confusing because I have worked with stringdist on millions of records before.
Edit. The bug is not reliably reproducible. Running the following script with R -f multiple times sometimes gives a stack imbalance, sometimes not.
library(stringdist)
set.seed(1)
n <- 1000
x <- sample(0:9, size=n, replace=TRUE)
y <- sample(0:9, size=n, replace=TRUE)
out <- stringdist(x,y, method="osa", nthread=2)
It does not seem to occur with nthread=1.
Edit. As stated in the bug report, this only occurs when stringdist is given an integer vector. That is odd, because stringdist does nothing special in that case: it casts all input to character before any further processing. Even adding a single "a" to x and y in the above script prevents the warning.
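The two input variants can be set side by side as follows. This is a sketch for reproducing the comparison, assuming the same seed and sizes as the script above. The names x_int, x_chr and compare_osa are hypothetical, and since the warning is intermittent, a single run proves nothing either way.

```r
set.seed(1)
n <- 1000
x_int <- sample(0:9, size = n, replace = TRUE)
y_int <- sample(0:9, size = n, replace = TRUE)

# Appending "a" makes c() coerce the whole vector to character,
# so stringdist receives character input instead of integer input.
x_chr <- c(x_int, "a")
y_chr <- c(y_int, "a")

typeof(x_int)  # "integer"
typeof(x_chr)  # "character"

# Hypothetical helper wrapping the call from the script above.
compare_osa <- function(x, y, nthread = 2) {
  stringdist::stringdist(x, y, method = "osa", nthread = nthread)
}

# compare_osa(x_int, y_int)  # integer input: reported to warn intermittently
# compare_osa(x_chr, y_chr)  # character input: reported not to warn
```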