JuliaLang/julia

Segfault after BLAS.gemm!; dlopen_e; set_num_threads; BLAS.gemm!

danluu opened this issue · 5 comments

The following code segfaults, but not every time:

a = rand(100,100)
a*a
dlopen_e("wat")
blas_set_num_threads(9)
a*a

If you want to reproduce this, pasting the code into the REPL one line at a time won't work, since any delay between blas_set_num_threads and the second a*a makes the segfault go away. It does sometimes segfault if you paste the whole thing in at once, or if you wrap the dlopen in a try/catch and run the whole thing as a script. Also, I can only get this to segfault on my Linux machine (64-bit, 3.13.0-35-generic), not on my Mac (10.9.something).

I have a PR to OpenBLAS at OpenMathLib/OpenBLAS#447, but I don't understand the OpenBLAS API well enough to know whether this is an OpenBLAS bug or a Julia bug caused by violating API constraints.

If it's an OpenBLAS bug, then something (???) will have to be done to grab a new version once it's fixed. If not, I can debug this further after someone comes along and points out that you shouldn't do whatever is being done here.

I'm unable to reproduce this with 0.4.

I can reproduce this on master:

julia> include("/home/kristoffer/Documents/seg.jl")

signal (11): Segmentation fault

signal (11): Segmentation fault
while loading /home/kristoffer/Documents/seg.jl, in expression starting on line 5
[1]    22876 segmentation fault (core dumped)  ./julia

No segfault here. I've tried on my Mac (0.6.0-dev.2248) and Anubis (0.6.0-dev.1829). What is your platform?

I couldn't get this to crash on OS X either, but it does crash on Ubuntu 16.04.2 LTS on a quad-core (+ two-way hyperthreading) CPU (i7-6700K) in case that matters. It might: that's 8 hardware threads and it crashes with set_num_threads(9) but not with set_num_threads(8), which looks suspicious.
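To make the suspicion above concrete, here is a minimal Python sketch that checks whether a requested BLAS thread count exceeds the number of logical CPUs. The helper name `exceeds_hardware_threads` is hypothetical, and the idea that this threshold is what triggers the crash is only a hypothesis from this thread, not something confirmed in OpenBLAS.

```python
import os

def exceeds_hardware_threads(requested, logical_cpus=None):
    """Return True if `requested` threads is more than the logical CPU count."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count()  # can be None if the count is unknown
    if logical_cpus is None:
        raise RuntimeError("could not determine the logical CPU count")
    return requested > logical_cpus

# Values for the i7-6700K mentioned above (8 logical CPUs), passed explicitly:
print(exceeds_hardware_threads(9, logical_cpus=8))  # True
print(exceeds_hardware_threads(8, logical_cpus=8))  # False
```

Called without `logical_cpus`, the helper uses whatever the local machine reports, which may help when comparing results across the different platforms mentioned in this thread.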

The code snippet needs to be modified slightly for Julia 0.5 or master:

using Base.Libdl
a = rand(100,100)
a*a
dlopen_e("wat")
BLAS.set_num_threads(9)
a*a

Without the "using" directive it won't run, and with the deprecated blas_set_num_threads call it won't crash. Apparently the timing is sensitive enough that printing the deprecation warning is already too long a delay (running the old snippet with --depwarn=no also "works", meaning that it produces the crashes).

It segfaults with both 0.5.1-pre+55 and master (2cd3ff7), but not every time. Varying the number passed to set_num_threads also changes the results. I collected crash counts out of 100 runs per configuration into a table:

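| threads | crashes with 0.5.1 | crashes with master |
|---------|--------------------|---------------------|
| 6       | 0                  | 0                   |
| 7       | 0                  | 0                   |
| 8       | 0                  | 0                   |
| 9       | 80                 | 23                  |
| 10      | 99                 | 67                  |
| 11      | 71                 | 69                  |
| 12      | 78                 | 68                  |
| 20      | 89                 | 75                  |
| 100     | 87                 | 72                  |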

I used the following Python 3 script for the numbers in the table (K, the number of runs that produced any output, is the table value):

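import subprocess

cmd = '/path/to/julia'  # Julia binary under test
# k: runs reporting a segfault, K: runs producing any output, n: total runs
k, K, n = 0, 0, 0
for i in range(100):
    # getoutput captures stdout and stderr of the shell command
    out = subprocess.getoutput("%s seg.jl" % cmd)
    k += 'Segmentation fault' in out
    K += len(out.strip()) > 0 # for non-SEGV crashes
    n += 1
    print(k, K, n)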

In one of the runs for master with 100 threads, a non-segfault failure occurred: Julia used "300%" CPU as reported by top for 20 minutes, after which I killed it. top reported 62.5% idle, i.e. 3 of 8 hyperthreads in use. Except for that table cell, k == K.
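Since a hang like that stalls a loop built on subprocess.getoutput, here is a sketch of a variant that sweeps thread counts and treats a run exceeding a timeout as its own category. Everything new here is an assumption for illustration: the helper names (`seg_script`, `classify`, `run_once`, `sweep`), the 60-second timeout, and the rule that any non-empty output other than a segfault message counts as "other".

```python
import subprocess
import tempfile
from pathlib import Path

# Julia source from the snippet above, with the thread count as a parameter.
SEG_TEMPLATE = """using Base.Libdl
a = rand(100,100)
a*a
dlopen_e("wat")
BLAS.set_num_threads({threads})
a*a
"""

def seg_script(threads):
    """Return the reproduction script for the given BLAS thread count."""
    return SEG_TEMPLATE.format(threads=threads)

def classify(output, timed_out):
    """Label one run as 'hang', 'segfault', 'other' or 'ok'."""
    if timed_out:
        return "hang"
    if "Segmentation fault" in output:
        return "segfault"
    if output.strip():
        return "other"  # some output, but no segfault message
    return "ok"

def run_once(julia, script_path, timeout_s=60):
    """Run the script once; exceeding timeout_s seconds counts as a hang."""
    try:
        proc = subprocess.run(
            [julia, str(script_path)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,  # subprocess.run kills the child on timeout
        )
    except subprocess.TimeoutExpired:
        return classify("", timed_out=True)
    return classify(proc.stdout, timed_out=False)

def sweep(julia, thread_counts, runs=100, timeout_s=60):
    """Return {threads: {label: count}} over `runs` runs per thread count."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "seg.jl"
        for threads in thread_counts:
            script.write_text(seg_script(threads))
            counts = {"ok": 0, "segfault": 0, "other": 0, "hang": 0}
            for _ in range(runs):
                counts[run_once(julia, script, timeout_s)] += 1
            results[threads] = counts
    return results
```

A call such as sweep('/path/to/julia', [6, 7, 8, 9, 10, 11, 12, 20, 100]) would produce one row of counts per thread count. Unlike the script above, this runs Julia without a shell, so it relies on Julia's own "signal (11): Segmentation fault" message appearing in the captured output.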

I tried on 0.7 master and 0.6 on a few different machines (Mac and Linux) and can't reproduce. I suggest reopening if people still observe this.