cuBLAS / 'SGEMM launch failed' error

Question

drichman opened this issue a year ago · 7 comments

Hi Ruben, I'm having this tricky error on one workstation but not on another. Both are installed through SBGrid (both are version 20220530_cu10), but the SBGrid team and I are both stumped at this point. TF_FORCE_GPU_ALLOW_GROWTH='true' has no effect.

On the system that fails (4x RTX A5000 24GB, but I'm only trying one at a time), here's the command and error:

/programs/x86_64-linux/deepemhancer/20220530_cu10/bin.capsules/deepemhancer -i /data/liuchuan/cryosparc_projects/CS-bill-dnab/J163/J163_005_volume_map_half_A.mrc -i2 /data/liuchuan/cryosparc_projects/CS-bill-dnab/J163/J163_005_volume_map_half_B.mrc -o /data/liuchuan/cryosparc_projects/CS-bill-dnab/J172/test1_DER_2023-06-08/J172_map_sharp_deepemhancer1.mrc -g 1 --deepLearningModelPath /home/exx/.local/share/deepEMhancerModels/production_checkpoints -p tightTarget

updating environment to select gpu: [1]
Using TensorFlow backend.
loading model /home/exx/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 86.60254037844386 % of volume side
DONE!. Shape at 1 A/voxel after padding-> (352, 352, 352)
Neural net inference
0%| | 0/361 [00:00<?, ?it/s]2023-06-22 16:51:15.240037: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/programs/x86_64-linux/deepemhancer/20220530_cu10/bin/deepemhancer", line 11, in <module>
sys.exit(commanLineFun())
File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 80, in commanLineFun
main( ** parseArgs() )
File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 73, in main
voxel_size=boxSize, apply_postprocess_cleaning=cleaningStrengh)
File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/applyProcessVol/processVol.py", line 186, in predict
batch_y_pred= self.model.predict_on_batch(np.expand_dims(batch_x, axis=-1))
File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/engine/training.py", line 1274, in predict_on_batch
outputs = self.predict_function(ins)
File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas SGEMM launch failed : m=2097152, n=1, k=8
[[{{node conv3d_21/convolution}}]]
(1) Internal: Blas SGEMM launch failed : m=2097152, n=1, k=8
[[{{node conv3d_21/convolution}}]]
[[activation_10/Identity/_609]]
0 successful operations.
0 derived errors ignored.
0%| | 0/361 [01:37<?, ?it/s]

On the system that works (2x 2080 Ti 11GB), this command completes on the non-display GPU (-g 1), and the map is improved as expected.

But on the display GPU alone or on both GPUs (-g 0 and -g 0,1), it fails with a similar error, except with CUBLAS_STATUS_NOT_INITIALIZED instead of CUBLAS_STATUS_EXECUTION_FAILED. Driver and CUDA versions are in the attached nvidia-smi screenshots of the two systems at rest, though I figure deepEMhancer is calling its preferred CUDA version installed via SBGrid.

Thanks for any insight --Dan

Answer 1 · 2023-06-23T21:45:44.000Z

Hi,

Can you report the deepEMhancer and TensorFlow versions that are installed? It would also help to check the CUDA version installed within the environment. Running conda env export should print all the installed packages.

If I had to bet, I would say that installing a newer TensorFlow (together with a newer CUDA within the environment) will help.

Let me know what you have so that I can prepare an updated installation recipe.

Ruben

Answer 2 · 2023-06-27T15:26:18.000Z

I've learned that the SBGrid-curated version is a little tricky to pull that info from. But here's what I've gathered are the versions of tensorflow, deepemhancer, and cuda:

Tensorflow is 1.14.0 based on what the environment's Python reports:
exx@hawk:~$ /programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/bin/python
Python 3.6.13 |Anaconda, Inc.| (default, Jun 4 2021, 14:25:59)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
>>> print(tf.__version__)
1.14.0

And here are the relevant parts of the conda-meta list:
/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda/envs/deepEMhancer_env/conda-meta

cudatoolkit-10.0.130-0.json
cudnn-7.6.5-cuda10.0_0.json

deepemhancer-0.13-py36_0.json

tensorboard-1.14.0-py36hf484d3e_0.json
tensorflow-1.14.0-gpu_py36h57aa796_0.json
tensorflow-base-1.14.0-gpu_py36h8d69cac_0.json
tensorflow-estimator-1.14.0-py_0.json
tensorflow-gpu-1.14.0-h0d30ee6_0.json
_tflow_select-2.1.0-gpu.json
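
The conda-meta file names above appear to follow a name-version-build.json pattern, so the versions can be pulled out mechanically when conda env export is awkward to run. A minimal sketch, assuming that pattern holds for every entry; parse_conda_meta_name is a hypothetical helper, not part of conda:

```python
from typing import NamedTuple


class CondaPackage(NamedTuple):
    name: str
    version: str
    build: str


def parse_conda_meta_name(filename: str) -> CondaPackage:
    """Split 'name-version-build.json' into its three parts.

    Assumes the version and build strings contain no hyphens, so the
    last two hyphens in the file name are the separators.
    """
    suffix = ".json"
    stem = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    name, version, build = stem.rsplit("-", 2)
    return CondaPackage(name, version, build)


# File names copied from the listing above.
listing = [
    "cudatoolkit-10.0.130-0.json",
    "cudnn-7.6.5-cuda10.0_0.json",
    "tensorflow-base-1.14.0-gpu_py36h8d69cac_0.json",
    "_tflow_select-2.1.0-gpu.json",
]
for entry in listing:
    pkg = parse_conda_meta_name(entry)
    print(f"{pkg.name:20s} {pkg.version:10s} {pkg.build}")
```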

And confirming the cuda libraries:
exx@hawk:/programs/x86_64-linux/deepemhancer/20220530_cu10/lib$ ls -l libcud*
lrwxrwxrwx 1 exx exx 21 Nov 4 2022 libcudart.so -> libcudart.so.10.0.130
lrwxrwxrwx 1 exx exx 21 Nov 4 2022 libcudart.so.10.0 -> libcudart.so.10.0.130
-rwxr-xr-x 1 exx exx 509104 Jan 23 2019 libcudart.so.10.0.130
lrwxrwxrwx 1 exx exx 17 Nov 4 2022 libcudnn.so -> libcudnn.so.7.6.5
lrwxrwxrwx 1 exx exx 17 Nov 4 2022 libcudnn.so.7 -> libcudnn.so.7.6.5
-rwxr-xr-x 1 exx exx 391638856 Dec 19 2019 libcudnn.so.7.6.5

Answer 3 · 2023-06-28T09:51:32.000Z

Thanks. TensorFlow 1.X does not work well on the newer GPUs, so you need to install TensorFlow 2.X. I would recommend installing the latest version of deepEMhancer (0.16), which should work out of the box. It should be as easy as following the Readme instructions.
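
One plausible reading of the GPU remark, offered as an assumption rather than something confirmed in this thread: the RTX A5000 is an Ampere card with compute capability 8.6, and the CUDA 10.0 toolkit bundled with this install predates that architecture, while the 2080 Ti (compute capability 7.5) is covered. The sketch below encodes that comparison with a hand-written lookup table limited to the toolkit versions and cards mentioned here; toolkit_targets_gpu is a hypothetical helper, not a TensorFlow or CUDA API.

```python
# Highest compute capability each CUDA toolkit release targets natively.
# Only the toolkit versions that appear in this thread are listed.
MAX_COMPUTE_CAPABILITY = {
    "10.0": (7, 5),  # up to Turing
    "11.8": (9, 0),  # includes Ampere (8.0, 8.6)
}

# Compute capabilities of the two cards discussed in this thread.
GPU_COMPUTE_CAPABILITY = {
    "RTX 2080 Ti": (7, 5),
    "RTX A5000": (8, 6),
}


def toolkit_targets_gpu(cuda_version: str, gpu_name: str) -> bool:
    """Return True if the toolkit release natively targets the GPU."""
    return GPU_COMPUTE_CAPABILITY[gpu_name] <= MAX_COMPUTE_CAPABILITY[cuda_version]


for cuda_version in MAX_COMPUTE_CAPABILITY:
    for gpu_name in GPU_COMPUTE_CAPABILITY:
        ok = toolkit_targets_gpu(cuda_version, gpu_name)
        print(f"CUDA {cuda_version} / {gpu_name}: {'ok' if ok else 'not targeted'}")
```

Under that assumption, the failing combination is CUDA 10.0 on the A5000, which matches the workstation that reports the SGEMM error.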

Answer 4 · 2023-06-28T20:16:13.000Z

create_attempt.txt
I'm attaching output from the 'conda env create -f deepEMhancer_env.yml -n deepEMhancer_env' attempt that's not working, with 'UnsatisfiableError: The following specifications were found to be incompatible with each other...' I checked that all other environments were deactivated.

Answer 5 · 2023-06-29T13:54:37.000Z

Hi,

Can you try the following yml file instead?

name: deepEMhancer_env
channels:
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.8
  - cudnn=8.8
  - h5py=3.1
  - hdf5=1.10
  - joblib=1.3
  - mrcfile=1.4
  - numpy=1.19
  - pip=23.1
  - python=3.9
  - requests=2.31
  - ruamel.yaml=0.17
  - scikit-image=0.19
  - scipy=1.9
  - tensorboard=2.11
  - tensorflow-gpu=2.6
  - tqdm=4.65
  - yaml=0.2
  - conda-build=3.25
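
Once the environment has been created, a quick check along these lines may confirm that TensorFlow 2.x sees the GPUs before running deepEMhancer. This is only a sketch: it relies on tf.config.list_physical_devices, which exists in TensorFlow 2.1 and later, and visible_gpus is a hypothetical helper that returns None when TensorFlow cannot be imported.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not importable in this environment")
elif not gpus:
    print("TensorFlow imported, but no GPU is visible")
else:
    print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```

Run it with the Python interpreter of the activated deepEMhancer_env so that it exercises the same TensorFlow and CUDA libraries as deepemhancer itself.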

Answer 6 · 2023-06-29T19:10:04.000Z

That yml works, I finished the installation, and DeepEMhancer runs and outputs a viable map file. Thanks!

It did give this error at the end, but the run still worked:
(deepEMhancer_env) exx@hawk:/data/liuchuan/cryosparc_projects/CS-bill-dnab/J176$ deepemhancer -i J176_006_volume_map_half_A.mrc -i2 J176_006_volume_map_half_B.mrc -o J176_006_volume_map_deep.mrc -g 0,1,2,3 --deepLearningModelPath /home/exx/.local/share/deepEMhancerModels/production_checkpoints -p tightTarget
updating environment to select gpu: [0, 1, 2, 3]
loading model /home/exx/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 86.60254037844386 % of volume side
DONE!. Shape at 1.00 A/voxel after padding-> (352, 352, 352)
Neural net inference
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 361/361 [05:56<00:00, 1.01it/s]
Exception ignored in: <function Pool.__del__ at 0x7fef154e4d30>
Traceback (most recent call last):
File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
self._change_notifier.put(None)
File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/queues.py", line 377, in put
self._writer.send_bytes(obj)
File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
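
The traceback above comes from the destructor of a multiprocessing Pool, that is, from cleanup that runs after the inference loop had already reached 100%. A general way to avoid this kind of shutdown noise is to close and join the pool explicitly before the interpreter exits. The sketch below shows that pattern in isolation; it is not taken from deepEMhancer's code, parallel_abs is a hypothetical stand-in for whatever work the pool does, and the "fork" start method assumes Linux, as on this workstation.

```python
import multiprocessing


def parallel_abs(values, workers=2):
    """Map abs() over values in worker processes, then shut the pool down cleanly."""
    ctx = multiprocessing.get_context("fork")  # assumption: Linux host
    pool = ctx.Pool(processes=workers)
    try:
        return pool.map(abs, values)
    finally:
        # Stop accepting work and wait for the workers to exit, so nothing
        # is left for the pool's destructor to clean up at interpreter exit.
        pool.close()
        pool.join()


print(parallel_abs([-3, 1, -2]))
```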

Answer 7 · 2023-08-23T10:48:09.000Z

I am closing this. If you face further problems, let me know.