sklearn/gaussian_process/tests/test_gpr.py:test_sample_statistics segfaults with libopenblas 0.3.10
Steps to reproduce:
conda create -n cf -y -c conda-forge cython pillow numpy scipy pytest joblib threadpoolctl
conda activate cf
pip install -e . --no-build-isolation
Then:
$ pytest -vlk test_sample_statistics sklearn/gaussian_process
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.8.5, pytest-5.4.3, py-1.9.0, pluggy-0.13.1 -- /home/ogrisel/miniconda3/envs/cf/bin/python
cachedir: .pytest_cache
rootdir: /home/ogrisel/code/scikit-learn, inifile: setup.cfg
collected 422 items / 416 deselected / 6 selected
sklearn/gaussian_process/tests/test_gpr.py::test_sample_statistics[kernel0] Fatal Python error: Segmentation fault
Current thread 0x00007fe741e14740 (most recent call first):
File "<__array_function__ internals>", line 5 in dot
File "/home/ogrisel/code/scikit-learn/sklearn/gaussian_process/_gpr.py", line 410 in sample_y
File "/home/ogrisel/code/scikit-learn/sklearn/gaussian_process/tests/test_gpr.py", line 171 in test_sample_statistics
File "/home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/_pytest/python.py", line 182 in pytest_pyfunc_call
Segmentation fault (core dumped)
$ conda list
# packages in environment at /home/ogrisel/miniconda3/envs/cf:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 0_gnu conda-forge
attrs 19.3.0 py_0 conda-forge
ca-certificates 2020.6.20 hecda079_0 conda-forge
certifi 2020.6.20 py38h32f6830_0 conda-forge
cython 0.29.21 py38h950e882_0 conda-forge
freetype 2.10.2 he06d7ca_0 conda-forge
joblib 0.16.0 py_0 conda-forge
jpeg 9d h516909a_0 conda-forge
lcms2 2.11 hbd6801e_0 conda-forge
ld_impl_linux-64 2.34 h53a641e_7 conda-forge
libblas 3.8.0 17_openblas conda-forge
libcblas 3.8.0 17_openblas conda-forge
libffi 3.2.1 he1b5a44_1007 conda-forge
libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
libgfortran-ng 7.5.0 hdf63c60_6 conda-forge
libgomp 9.2.0 h24d8f2e_2 conda-forge
liblapack 3.8.0 17_openblas conda-forge
libopenblas 0.3.10 pthreads_hb3c22a3_3 conda-forge
libpng 1.6.37 hed695b0_1 conda-forge
libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
libtiff 4.1.0 hc7e4089_6 conda-forge
libwebp-base 1.1.0 h516909a_3 conda-forge
lz4-c 1.9.2 he1b5a44_1 conda-forge
more-itertools 8.4.0 py_0 conda-forge
ncurses 6.2 he1b5a44_1 conda-forge
numpy 1.19.1 py38h8854b6b_0 conda-forge
olefile 0.46 py_0 conda-forge
openssl 1.1.1g h516909a_0 conda-forge
packaging 20.4 pyh9f0ad1d_0 conda-forge
pillow 7.2.0 py38h9776b28_1 conda-forge
pip 20.1.1 py_1 conda-forge
pluggy 0.13.1 py38h32f6830_2 conda-forge
py 1.9.0 pyh9f0ad1d_0 conda-forge
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pytest 5.4.3 py38h32f6830_0 conda-forge
python 3.8.5 h425cb1d_1_cpython conda-forge
python_abi 3.8 1_cp38 conda-forge
readline 8.0 he28a2e2_2 conda-forge
scikit-learn 0.24.dev0 dev_0 <develop>
scipy 1.5.2 py38h8c5af15_0 conda-forge
setuptools 49.2.0 py38h32f6830_0 conda-forge
six 1.15.0 pyh9f0ad1d_0 conda-forge
sqlite 3.32.3 hcee41ef_1 conda-forge
threadpoolctl 2.1.0 pyh5ca1d4c_0 conda-forge
tk 8.6.10 hed695b0_0 conda-forge
wcwidth 0.2.5 pyh9f0ad1d_0 conda-forge
wheel 0.34.2 py_1 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zlib 1.2.11 h516909a_1006 conda-forge
zstd 1.4.5 h6597ccf_1 conda-forge
This can be worked around by switching the env to use MKL instead of OpenBLAS:
conda install -c conda-forge libblas=*=*mkl
I can reproduce the segfault with the main channel openblas 0.3.10:
conda create -n tmp -y cython pillow numpy scipy pytest joblib threadpoolctl blas=*=*openblas
$ conda list
# packages in environment at /home/ogrisel/miniconda3/envs/tmp:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
attrs 19.3.0 py_0
blas 1.0 openblas
ca-certificates 2020.6.24 0
certifi 2020.6.20 py38_0
cython 0.29.21 py38he6710b0_0
freetype 2.10.2 h5ab3b9f_0
joblib 0.16.0 py_0
jpeg 9b h024ee3a_2
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libopenblas 0.3.10 h5a2b251_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
lz4-c 1.9.2 he6710b0_0
more-itertools 8.4.0 py_0
ncurses 6.2 he6710b0_1
numpy 1.18.5 py38h7130bb8_0
numpy-base 1.18.5 py38h2f8d375_0
olefile 0.46 py_0
openssl 1.1.1g h7b6447c_0
packaging 20.4 py_0
pillow 7.2.0 py38hb39fc2d_0
pip 20.1.1 py38_1
pluggy 0.13.1 py38_0
py 1.9.0 py_0
pyparsing 2.4.7 py_0
pytest 5.4.3 py38_0
python 3.8.3 hcff3b4d_2
readline 8.0 h7b6447c_0
scipy 1.5.0 py38habc2bb6_0
setuptools 49.2.0 py38_0
six 1.15.0 py_0
sqlite 3.32.3 h62c20be_0
threadpoolctl 2.1.0 pyh5ca1d4c_0
tk 8.6.10 hbc83047_0
wcwidth 0.2.5 py_0
wheel 0.34.2 py38_0
xz 5.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h0b5b093_0
Pinning openblas to 0.3.9 fixes the issue. I tested with conda-forge using this env:
conda create -n cf -y -c conda-forge cython pillow numpy scipy pytest joblib threadpoolctl libopenblas=0.3.9
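After switching or pinning packages, it can help to confirm which OpenBLAS build the interpreter actually loaded, since conda list only shows what is installed. The sketch below uses threadpoolctl (already part of the env above) and filters its report down to OpenBLAS entries. The helper names openblas_entries and loaded_openblas are made up for illustration; the dictionary keys are the ones threadpool_info() returns.

```python
def openblas_entries(infos):
    """Keep only the OpenBLAS records from threadpoolctl-style info dicts."""
    return [
        (info.get("version"), info.get("num_threads"), info.get("filepath"))
        for info in infos
        if info.get("internal_api") == "openblas"
    ]


def loaded_openblas():
    """Report the OpenBLAS libraries loaded in the current process."""
    # Imported lazily on purpose: numpy has to be imported first so that its
    # BLAS is loaded before threadpoolctl scans the process.
    import numpy  # noqa: F401
    from threadpoolctl import threadpool_info

    return openblas_entries(threadpool_info())
```

Calling loaded_openblas() in the affected env should list one (version, num_threads, filepath) tuple per loaded OpenBLAS library; an empty list would suggest that numpy is linked against a different BLAS such as MKL.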
The segfault is happening in multivariate_normal:
import numpy as np
y_mean = np.ones((5))
y_cov = np.ones((5, 5))
rng = np.random.RandomState(0)
# segfaults
rng.multivariate_normal(y_mean, y_cov, 300000)
Thanks @thomasjpfan, I was trying to slowly narrow it down. Not sure how this is related to openblas. Will use a debugger to run step by step.
In sklearn, the call to multivariate_normal is made at line 410 of scikit-learn/sklearn/gaussian_process/_gpr.py (commit 1278181).
I used your code snippet to get a backtrace:
$ gdb python
(gdb) r /tmp/debug.py
Starting program: /home/ogrisel/miniconda3/envs/cf/bin/python /tmp/debug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5335700 (LWP 229282)]
[New Thread 0x7ffff4b34700 (LWP 229283)]
[New Thread 0x7ffff2333700 (LWP 229284)]
Thread 2 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5335700 (LWP 229282)]
0x00007ffff633cac3 in dgemm_oncopy_HASWELL () from /home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/numpy/core/../../../../libcblas.so.3
(gdb) bt
#0 0x00007ffff633cac3 in dgemm_oncopy_HASWELL () from /home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/numpy/core/../../../../libcblas.so.3
#1 0x00007ffff56a736a in inner_thread () from /home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/numpy/core/../../../../libcblas.so.3
#2 0x00007ffff57d65dd in blas_thread_server () from /home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/numpy/core/../../../../libcblas.so.3
#3 0x00007ffff7f8d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4 0x00007ffff7eb4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
So it's a double-precision matrix-matrix multiplication (dgemm) that's crashing...
If you disable OpenBLAS threading, the crash goes away (both for your script and the original test):
OPENBLAS_NUM_THREADS=1 pytest -vlk test_sample_statistics sklearn/gaussian_process
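The same workaround can be scripted, for instance in a CI helper. This is a minimal sketch using only the standard library; run_with_single_blas_thread is a hypothetical helper name. It sets OPENBLAS_NUM_THREADS in the child's environment because the variable has to be in place before OpenBLAS is loaded.

```python
import os
import subprocess
import sys


def run_with_single_blas_thread(cmd):
    """Run cmd in a child process with OpenBLAS limited to one thread."""
    # Copy the current environment and override only the OpenBLAS setting.
    env = dict(os.environ, OPENBLAS_NUM_THREADS="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```

For the test above, the call would look like run_with_single_blas_thread([sys.executable, "-m", "pytest", "-vlk", "test_sample_statistics", "sklearn/gaussian_process"]). Inside an already running interpreter, threadpoolctl's threadpool_limits(limits=1, user_api="blas") context manager is the analogous knob, though it was not tried in this thread.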
Looks like this only segfaults when size is large enough:
import numpy as np
y_mean = np.ones((5))
y_cov = np.ones((5, 5))
rng = np.random.RandomState(0)
# segfaults
rng.multivariate_normal(y_mean, y_cov, size=249033)
# does not segfault
rng.multivariate_normal(y_mean, y_cov, size=249032)
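A threshold like this can be located by bisection, running each candidate size in a child process so that a segfault does not take down the driver script. The sketch below is one way to do it, not necessarily how the numbers above were found; crashes and smallest_failing are hypothetical helper names, and the search assumes that every size above the threshold also crashes, which this thread does not establish.

```python
import subprocess
import sys

# Same reproducer as above, parameterized by the sample count.
SNIPPET = (
    "import numpy as np\n"
    "rng = np.random.RandomState(0)\n"
    "rng.multivariate_normal(np.ones(5), np.ones((5, 5)), size={size})\n"
)


def crashes(size):
    """Run the reproducer in a child process; True if a signal killed it."""
    proc = subprocess.run([sys.executable, "-c", SNIPPET.format(size=size)])
    # On POSIX, a negative return code means the child died from a signal
    # (for example -11 for SIGSEGV).
    return proc.returncode < 0


def smallest_failing(lo, hi, fails):
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in n (an assumption, see the note above).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

On an affected env, smallest_failing(1, 300000, crashes) would then need about 18 child runs. The exact threshold may well depend on the machine and on the number of OpenBLAS threads.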
I think we have enough context to raise an issue on the numpy issue tracker.
I simplified it down to:
import numpy as np
np.ones(shape=(300000, 5)) @ np.ones(shape=(5, 5))
I will try to write a minimal C program to report it to the OpenBLAS developers.
This was fixed upstream in OpenMathLib/OpenBLAS#2729 and the conda-forge package has already been updated with the fix. Closing.