Segfault after BLAS.gemm!; dlopen_e; set_num_threads; BLAS.gemm!
danluu opened this issue · 5 comments
The following code segfaults, but not every time:
a = rand(100,100)
a*a
dlopen_e("wat")
blas_set_num_threads(9)
a*a
If you want to reproduce this, you can't do it by pasting this code into the REPL one line at a time, since waiting between blas_set_num_threads and the second a*a makes the segfault go away. However, it sometimes segfaults if you paste the whole thing in at once, or if you wrap the dlopen in a try/catch and run the whole thing. Also, I can only get this to segfault on my Linux machine (64-bit, 3.13.0-35-generic), and not on my Mac (10.9.something).
I have a PR to OpenBLAS at OpenMathLib/OpenBLAS#447, but I don't understand the OpenBLAS API well enough to know whether this is an OpenBLAS bug or a Julia bug for violating API constraints.
If it's an OpenBLAS bug, then something (???) will have to be done to grab a new version once it's fixed. If not, I can debug this further after someone comes along and points out that you shouldn't do whatever is being done here.
I'm unable to reproduce this with 0.4.
Can reproduce on master:
julia> include("/home/kristoffer/Documents/seg.jl")
signal (11): Segmentation fault
signal (11): Segmentation fault
while loading /home/kristoffer/Documents/seg.jl, in expression starting on line 5
[1] 22876 segmentation fault (core dumped) ./julia
No segfault here. I've tried on my Mac (0.6.0-dev.2248) and Anubis (0.6.0-dev.1829). What is your platform?
I couldn't get this to crash on OS X either, but it does crash on Ubuntu 16.04.2 LTS on a quad-core (+ two-way hyperthreading) CPU (i7-6700K), in case that matters. It might: that's 8 hardware threads, and it crashes with set_num_threads(9) but not with set_num_threads(8), which looks suspicious.
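To compare a requested thread count against the machine's hardware threads on another box, something like the following Python sketch may help. The helper name `oversubscribed` is hypothetical, and the check only restates the observation above (9 > 8 crashes, 8 does not); it does not establish that oversubscription is the actual cause.

```python
import os


def oversubscribed(requested_threads, hardware_threads=None):
    """Return True if more threads are requested than the machine exposes.

    hardware_threads defaults to os.cpu_count(), which reports logical CPUs
    (hyperthreads included). Hypothetical helper, for illustration only.
    """
    if hardware_threads is None:
        hardware_threads = os.cpu_count() or 1
    return requested_threads > hardware_threads


# The i7-6700K described above exposes 8 hardware threads.
print(oversubscribed(9, 8))  # True
print(oversubscribed(8, 8))  # False
```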
The code snippet needs to be modified slightly for Julia 0.5 or master:
using Base.Libdl
a = rand(100,100)
a*a
dlopen_e("wat")
BLAS.set_num_threads(9)
a*a
Without the "using" directive it obviously won't run, and without changing the BLAS call to an un-deprecated one it won't crash. Apparently the timing issue is sensitive enough that printing out the deprecation warning is too long a delay (--depwarn=no also "works", meaning that it produces the crashes).
It segfaults with both 0.5.1-pre+55 and master (2cd3ff7), but not every time. Also, varying the number in set_num_threads(9) changes the results. I collected crash counts out of 100 runs per configuration into a table:
threads | crashes with 0.5.1 | crashes with master |
---|---|---|
6 | 0 | 0 |
7 | 0 | 0 |
8 | 0 | 0 |
9 | 80 | 23 |
10 | 99 | 67 |
11 | 71 | 69 |
12 | 78 | 68 |
20 | 89 | 75 |
100 | 87 | 72 |
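As a rough way to read the table, the counts can be turned into crash rates and the smallest crashing thread count picked out. This is only a sketch over the numbers above; the names `CRASHES`, `crash_rates`, and `first_crashing` are made up for illustration.

```python
# threads -> (crashes with 0.5.1, crashes with master), each out of 100 runs
CRASHES = {
    6: (0, 0), 7: (0, 0), 8: (0, 0),
    9: (80, 23), 10: (99, 67), 11: (71, 69),
    12: (78, 68), 20: (89, 75), 100: (87, 72),
}
RUNS = 100


def crash_rates(table, runs=RUNS):
    """Convert raw crash counts into fractions of runs that crashed."""
    return {t: (a / runs, b / runs) for t, (a, b) in table.items()}


def first_crashing(table):
    """Smallest thread count with at least one crash on either version."""
    return min(t for t, (a, b) in table.items() if a or b)


print(first_crashing(CRASHES))   # 9
print(crash_rates(CRASHES)[10])  # (0.99, 0.67)
```

On this data the crashes start at 9 threads, one more than the 8 hardware threads of the machine, and there is no obvious trend in the rate beyond that point.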
I used the following Python 3 script for the numbers in the table (K is the table value):
import subprocess

cmd = '/path/to/julia'
# k: runs that segfaulted, K: runs that printed anything, n: total runs
k, K, n = 0, 0, 0
for i in range(100):
    out = subprocess.getoutput("%s seg.jl" % cmd)
    k += 'Segmentation fault' in out
    K += len(out.strip()) > 0  # for non-SEGV crashes
    n += 1
print(k, K, n)
In one of the cases for master/100 a non-segfault crash occurred: Julia used up "300%" CPU as reported by top for 20 minutes, after which I killed it. Top reported 62.5% idle, i.e. 3/8 hyperthreads in use. Except for that table cell, k == K.
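Since one run hung rather than crashed, a variant of the script with a per-run timeout might be more convenient for unattended counting. The following is a sketch, not the script used for the table: `classify` and `run_trials` are hypothetical names, the 60-second timeout is an arbitrary assumption, and the `__main__` demo uses the Python interpreter as a stand-in command so that it runs without Julia.

```python
import signal
import subprocess
import sys


def classify(output, returncode, timed_out=False):
    """Label one run as 'hang', 'segfault', 'other', or 'ok'."""
    if timed_out:
        return "hang"
    # Either the message is in the output, or (on POSIX) the child was
    # killed by SIGSEGV, which subprocess reports as a negative return code.
    if "Segmentation fault" in output or returncode == -signal.SIGSEGV:
        return "segfault"
    if output.strip() or returncode != 0:
        return "other"
    return "ok"


def run_trials(argv, n=100, timeout=60.0):
    """Run argv n times and count outcomes; hung runs are killed on timeout."""
    counts = {"ok": 0, "segfault": 0, "other": 0, "hang": 0}
    for _ in range(n):
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # merge stderr, like getoutput
                text=True,
                timeout=timeout,
            )
            kind = classify(proc.stdout, proc.returncode)
        except subprocess.TimeoutExpired:
            kind = classify("", None, timed_out=True)
        counts[kind] += 1
    return counts


if __name__ == "__main__":
    # Stand-in command; replace with ['/path/to/julia', 'seg.jl'] to count crashes.
    print(run_trials([sys.executable, "-c", "pass"], n=3, timeout=30.0))
```

As in the original script, any run that produces output without the segfault message is counted separately, here under "other".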
I tried on 0.7 master and 0.6 on a few different machines (Mac and Linux) and can't reproduce. Suggest reopening if people still observe this.