stringdistmatrix() fails for vectors with a large number of elements when method = 'cosine'
Closed this issue · 9 comments
When my colleague and I pass vectors with > 7k elements to stringdistmatrix() using the cosine method, R crashes completely with a segfault saying some memory was not mapped. This is on my Mac. Here's the traceback and error:
*** caught segfault ***
address 0xbc9900000, cause 'memory not mapped'
Traceback:
1: .Call("R_lower_tri", a, methnr, as.double(weight), as.double(p), as.integer(q), as.integer(useBytes), as.integer(nthread))
2: lower_tri(a, method = method, useBytes = useBytes, weight = weight, useNames = useNames, nthread = nthread)
3: stringdistmatrix(path.exitURL$exitPagePath_TermPretty, method = "cosine")
4: eval(expr, envir, enclos)
5: eval(ei, envir)
6: withVisible(eval(ei, envir))
7: source("code/path_analysis/cluster_path.R")
Mac system info:
Model Name: MacBook Pro
Model Identifier: MacBookPro11,5
Processor Name: Intel Core i7
Processor Speed: 2.5 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB
Boot ROM Version: MBP114.0172.B09
SMC Version (system): 2.30f2
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin15.5.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
It also fails on a fresh Ubuntu and R installation. System info:
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
*-memory
description: System memory
physical id: 0
size: 29GiB
*-cpu
product: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
vendor: Intel Corp.
physical id: 1
bus info: cpu@0
width: 64 bits
When running this example from another issue on the command line in Ubuntu, the only message I get is Killed:
many_words <- sapply(1:30000, function(x) paste(sample(letters, 10, replace = TRUE),
                                                collapse = ""))
stringdist::stringdistmatrix(many_words, method = 'cosine')
Thanks for reporting. I can reproduce it. I also tried it with the 'soundex' algorithm, and that fails with a normal error because of memory limitations on my laptop. It also crashes when method='qgram', so it appears that an out-of-memory access takes place somewhere during q-gram computation. I'll look into it.
Hi there, I have done some testing and for now there seems to be a workaround. If you do
many_words <- sapply(1:30000, function(x) paste(sample(letters, 10, replace = TRUE),
                                                collapse = ""))
stringdist::stringdistmatrix(many_words,many_words, method = 'cosine')
The whole matrix gets computed.
On my laptop, things start failing from vectors of around length 7K. If I compute the whole matrix, there's no problem. Twice the time and twice the memory, but at least you can continue with your work for now.
Still looking for the real culprit..
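For anyone using this workaround downstream: the two-argument call returns a full symmetric matrix rather than the lower-triangle "dist" object the one-argument call would give, but base R's as.dist() converts it back. A small sketch of the workaround above (nothing stringdist-specific beyond the call already shown):

```r
many_words <- sapply(1:7000, function(x)
  paste(sample(letters, 10, replace = TRUE), collapse = ""))

# Two-argument form computes the full symmetric matrix (the workaround);
# as.dist() then keeps only the lower triangle, matching what the
# one-argument call would normally return.
full <- stringdist::stringdistmatrix(many_words, many_words, method = "cosine")
d <- as.dist(full)
```

This halves the storage again after the computation, though the full matrix is still materialized in between.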
Awesome, workarounds are fine by me while you do your detective work. Thanks!
Connor, it should be fixed now. If you install the latest version from my drat repo (see README.md on the front page) you can try the fix. I get no segfault with vectors of ~7k on my box now, but it would be great if you could confirm the fix (this is not exactly a test I can include in the test suite..).
Unfortunately, it is now returning an empty object on my Mac.
That's unfortunate. I'll check again. (I did not let the computation run all the way, since it started taking a long time on my laptop.) Thanks for the check!
Drat version crashed with the above example:
*** caught segfault ***
address 0x5b6076e000, cause 'memory not mapped'
Traceback:
1: .Call("R_lower_tri", a, methnr, as.double(weight), as.double(p), as.integer(q), as.integer(useBytes), as.integer(nthread))
2: lower_tri(a, method = method, useBytes = useBytes, weight = weight, useNames = useNames, nthread = nthread)
3: stringdist::stringdistmatrix(many_words, method = "cosine")
An irrecoverable exception occurred. R is aborting now ...
Will try it on Linux though...
Linux crashes silently when issued this command:
sudo Rscript -e 'many_words <- sapply(1:10000, function(x) paste(sample(letters, 10, replace = TRUE), collapse = "")); stringdist::stringdistmatrix(many_words, method = "cosine")'
Thanks for the heads up. Perhaps something went wrong when I uploaded to 'drat'. Will check (but later :)).
Ok, I fixed this on my way to Budapest for the satRday meeting (traveling is good for code quality :-)). Unfortunately, the fix made another bug reappear, which I have only now fixed as well.
The point with #46 was that when computing q-grams I try to reuse the memory that stores them. However, since I did not pass the correct structure to the internal q-gram generator, new memory was allocated for every combination of strings. The segfault occurred because memory eventually ran out (I use a custom doubling allocator), and memory is only freed at the end of the call, when all strings have been processed. This is now fixed. It also means you'll get a pretty good speedup.
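For background on what the q-gram computation does: the cosine method compares strings through their q-gram count vectors. A minimal base-R illustration of that distance (my own sketch, not the package's internals; the helper names are made up):

```r
# Illustration only: cosine distance between two strings, computed
# from their q-gram count vectors.
qgram_counts <- function(s, q = 1) {
  n <- nchar(s)
  # counts of all overlapping substrings of length q
  table(substring(s, 1:(n - q + 1), q:n))
}

cosine_qgram <- function(s1, s2, q = 1) {
  t1 <- qgram_counts(s1, q)
  t2 <- qgram_counts(s2, q)
  grams <- union(names(t1), names(t2))   # combined q-gram alphabet
  v1 <- as.numeric(t1[grams]); v1[is.na(v1)] <- 0
  v2 <- as.numeric(t2[grams]); v2[is.na(v2)] <- 0
  1 - sum(v1 * v2) / sqrt(sum(v1^2) * sum(v2^2))
}

cosine_qgram("leia", "leela", q = 2)
```

The expensive part for large inputs is building these count vectors for every pair of strings, which is why reusing (or failing to reuse) the q-gram memory matters so much here.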
New version is on my drat repo now.