scikit-learn/scikit-learn

sklearn/gaussian_process/tests/test_gpr.py:test_sample_statistics segfaults with libopenblas 0.3.10

Steps to reproduce:

conda create -n cf -y -c conda-forge cython pillow numpy scipy pytest joblib threadpoolctl
conda activate cf
pip install -e . --no-build-isolation

Then:

$ pytest -vlk test_sample_statistics sklearn/gaussian_process
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.8.5, pytest-5.4.3, py-1.9.0, pluggy-0.13.1 -- /home/ogrisel/miniconda3/envs/cf/bin/python
cachedir: .pytest_cache
rootdir: /home/ogrisel/code/scikit-learn, inifile: setup.cfg
collected 422 items / 416 deselected / 6 selected                                                                                                                                                               

sklearn/gaussian_process/tests/test_gpr.py::test_sample_statistics[kernel0] Fatal Python error: Segmentation fault

Current thread 0x00007fe741e14740 (most recent call first):
  File "<__array_function__ internals>", line 5 in dot
  File "/home/ogrisel/code/scikit-learn/sklearn/gaussian_process/_gpr.py", line 410 in sample_y
  File "/home/ogrisel/code/scikit-learn/sklearn/gaussian_process/tests/test_gpr.py", line 171 in test_sample_statistics
  File "/home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/_pytest/python.py", line 182 in pytest_pyfunc_call
Segmentation fault (core dumped)
$ conda list
# packages in environment at /home/ogrisel/miniconda3/envs/cf:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       0_gnu    conda-forge
attrs                     19.3.0                     py_0    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py38h32f6830_0    conda-forge
cython                    0.29.21          py38h950e882_0    conda-forge
freetype                  2.10.2               he06d7ca_0    conda-forge
joblib                    0.16.0                     py_0    conda-forge
jpeg                      9d                   h516909a_0    conda-forge
lcms2                     2.11                 hbd6801e_0    conda-forge
ld_impl_linux-64          2.34                 h53a641e_7    conda-forge
libblas                   3.8.0               17_openblas    conda-forge
libcblas                  3.8.0               17_openblas    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgfortran-ng            7.5.0                hdf63c60_6    conda-forge
libgomp                   9.2.0                h24d8f2e_2    conda-forge
liblapack                 3.8.0               17_openblas    conda-forge
libopenblas               0.3.10          pthreads_hb3c22a3_3    conda-forge
libpng                    1.6.37               hed695b0_1    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
libtiff                   4.1.0                hc7e4089_6    conda-forge
libwebp-base              1.1.0                h516909a_3    conda-forge
lz4-c                     1.9.2                he1b5a44_1    conda-forge
more-itertools            8.4.0                      py_0    conda-forge
ncurses                   6.2                  he1b5a44_1    conda-forge
numpy                     1.19.1           py38h8854b6b_0    conda-forge
olefile                   0.46                       py_0    conda-forge
openssl                   1.1.1g               h516909a_0    conda-forge
packaging                 20.4               pyh9f0ad1d_0    conda-forge
pillow                    7.2.0            py38h9776b28_1    conda-forge
pip                       20.1.1                     py_1    conda-forge
pluggy                    0.13.1           py38h32f6830_2    conda-forge
py                        1.9.0              pyh9f0ad1d_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pytest                    5.4.3            py38h32f6830_0    conda-forge
python                    3.8.5           h425cb1d_1_cpython    conda-forge
python_abi                3.8                      1_cp38    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
scikit-learn              0.24.dev0                 dev_0    <develop>
scipy                     1.5.2            py38h8c5af15_0    conda-forge
setuptools                49.2.0           py38h32f6830_0    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.32.3               hcee41ef_1    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge
zstd                      1.4.5                h6597ccf_1    conda-forge

This can be worked around by switching the env to use MKL instead of OpenBLAS:

conda install -c conda-forge libblas=*=*mkl

I can reproduce the segfault with the main channel openblas 0.3.10:

conda create -n tmp -y cython pillow numpy scipy pytest joblib threadpoolctl blas=*=*openblas
$ conda list
# packages in environment at /home/ogrisel/miniconda3/envs/tmp:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
attrs                     19.3.0                     py_0  
blas                      1.0                    openblas  
ca-certificates           2020.6.24                     0  
certifi                   2020.6.20                py38_0  
cython                    0.29.21          py38he6710b0_0  
freetype                  2.10.2               h5ab3b9f_0  
joblib                    0.16.0                     py_0  
jpeg                      9b                   h024ee3a_2  
lcms2                     2.11                 h396b838_0  
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20191231         h14c3975_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
libopenblas               0.3.10               h5a2b251_0  
libpng                    1.6.37               hbc83047_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.1.0                h2733197_1  
lz4-c                     1.9.2                he6710b0_0  
more-itertools            8.4.0                      py_0  
ncurses                   6.2                  he6710b0_1  
numpy                     1.18.5           py38h7130bb8_0  
numpy-base                1.18.5           py38h2f8d375_0  
olefile                   0.46                       py_0  
openssl                   1.1.1g               h7b6447c_0  
packaging                 20.4                       py_0  
pillow                    7.2.0            py38hb39fc2d_0  
pip                       20.1.1                   py38_1  
pluggy                    0.13.1                   py38_0  
py                        1.9.0                      py_0  
pyparsing                 2.4.7                      py_0  
pytest                    5.4.3                    py38_0  
python                    3.8.3                hcff3b4d_2  
readline                  8.0                  h7b6447c_0  
scipy                     1.5.0            py38habc2bb6_0  
setuptools                49.2.0                   py38_0  
six                       1.15.0                     py_0  
sqlite                    3.32.3               h62c20be_0  
threadpoolctl             2.1.0              pyh5ca1d4c_0  
tk                        8.6.10               hbc83047_0  
wcwidth                   0.2.5                      py_0  
wheel                     0.34.2                   py38_0  
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.5                h0b5b093_0

Pinning libopenblas to 0.3.9 makes the segfault go away. I tested with conda-forge using this env:

conda create -n cf -y -c conda-forge cython pillow numpy scipy pytest joblib threadpoolctl libopenblas=0.3.9

The segfault is happening in multivariate_normal:

import numpy as np

y_mean = np.ones((5))
y_cov = np.ones((5, 5))
rng = np.random.RandomState(0)

# segfaults
rng.multivariate_normal(y_mean, y_cov, 300000)

Thanks @thomasjpfan, I was trying to slowly narrow it down. Not sure how this is related to openblas. Will use a debugger to run step by step.

In sklearn, the call to multivariate_normal is made here:

y_samples = rng.multivariate_normal(y_mean, y_cov, n_samples).T

I used your code snippet to get a backtrace:

$ gdb python
(gdb) r /tmp/debug.py
Starting program: /home/ogrisel/miniconda3/envs/cf/bin/python /tmp/debug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5335700 (LWP 229282)]
[New Thread 0x7ffff4b34700 (LWP 229283)]
[New Thread 0x7ffff2333700 (LWP 229284)]

Thread 2 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5335700 (LWP 229282)]
0x00007ffff633cac3 in dgemm_oncopy_HASWELL () from /home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/numpy/core/../../../../libcblas.so.3
(gdb) bt
#0  0x00007ffff633cac3 in dgemm_oncopy_HASWELL () from /home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/numpy/core/../../../../libcblas.so.3
#1  0x00007ffff56a736a in inner_thread () from /home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/numpy/core/../../../../libcblas.so.3
#2  0x00007ffff57d65dd in blas_thread_server () from /home/ogrisel/miniconda3/envs/cf/lib/python3.8/site-packages/numpy/core/../../../../libcblas.so.3
#3  0x00007ffff7f8d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007ffff7eb4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

So it's a double-precision matrix-matrix multiplication (dgemm) that's crashing...

If you disable OpenBLAS threading, the crash goes away (both for your script and the original test):

OPENBLAS_NUM_THREADS=1 pytest -vlk test_sample_statistics sklearn/gaussian_process
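
For comparing the threaded and single-threaded behaviour from a script, here is a minimal sketch that runs a snippet in a fresh interpreter with OPENBLAS_NUM_THREADS set, so that a crash in the child does not take down the caller. The helper names run_with_blas_threads and describe are hypothetical and not part of numpy, OpenBLAS or threadpoolctl. The sketch assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import os
import signal
import subprocess
import sys


def run_with_blas_threads(code, num_threads=None):
    """Run `code` in a fresh interpreter and return the child's return code.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    if num_threads is not None:
        # The variable has to be in the environment before the child
        # imports numpy, which is why it is set here and not in `code`.
        env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    result = subprocess.run([sys.executable, "-c", code], env=env)
    return result.returncode


def describe(returncode):
    # On POSIX, subprocess reports death by signal N as return code -N.
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode
```

Calling describe(run_with_blas_threads(snippet)) and describe(run_with_blas_threads(snippet, num_threads=1)) with the reproducer as snippet should then show whether the threading setting alone changes the outcome.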

Looks like this only segfaults when size is large enough:

import numpy as np

y_mean = np.ones((5))
y_cov = np.ones((5, 5))
rng = np.random.RandomState(0)

# segfaults
rng.multivariate_normal(y_mean, y_cov, size=249033)

# does not segfault
rng.multivariate_normal(y_mean, y_cov, size=249032)
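
A boundary like this can be located by bisection rather than by hand. The following is only a sketch of one way to do it: find_threshold and crashes are hypothetical names, each trial runs in its own interpreter so a segfault cannot kill the search, and the search assumes the crash is monotonic in size (every size above the threshold crashes, every size below does not), which nothing in this thread proves.

```python
import subprocess
import sys

# Reproducer from this thread, parametrised on the sample size.
REPRO = (
    "import numpy as np\n"
    "rng = np.random.RandomState(0)\n"
    "rng.multivariate_normal(np.ones(5), np.ones((5, 5)), size={size})\n"
)


def crashes(size):
    # A negative return code means the child died from a signal (POSIX).
    result = subprocess.run([sys.executable, "-c", REPRO.format(size=size)])
    return result.returncode < 0


def find_threshold(crashes, lo, hi):
    """Return the smallest size in (lo, hi] for which crashes(size) is true.

    Assumes crashes(lo) is false, crashes(hi) is true, and that the
    outcome is monotonic in between.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

With an affected OpenBLAS build, find_threshold(crashes, 1, 300000) would need about 18 child runs. The exact threshold is likely to depend on the machine (for example the number of BLAS threads), so the value found elsewhere may differ from the one above.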

I think we have enough context to raise an issue on the numpy issue tracker.

I simplified it down to:

import numpy as np

np.ones(shape=(300000, 5)) @ np.ones(shape=(5, 5))

I will try to write a minimal C program to report it to the OpenBLAS developers.

This was fixed upstream in OpenMathLib/OpenBLAS#2729 and the conda-forge package has already been updated with the fix. Closing.