google-deepmind/deepmind-research

[MeshGraphNets] cuda_blas.cc:428, failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

hjyu94 opened this issue · 4 comments

hjyu94 commented

Hi. I'm trying to run the MeshGraphNets model but encountered an error with this command:

python -m meshgraphnets.run_model --mode=train --model=cloth --checkpoint_dir=meshgraphnets/dataset/chk --dataset_dir=meshgraphnets/dataset/flag_simple

The error is:

2023-10-09 14:41:01.689006: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-10-09 14:41:01.689021: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2023-10-09 14:41:01.689035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2023-10-09 14:41:01.689048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2023-10-09 14:41:01.689061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2023-10-09 14:41:01.689074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2023-10-09 14:41:01.689088: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

...

2023-10-09 14:41:14.669500: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2023-10-09 14:41:14.669603: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Traceback (most recent call last):
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
         [[Model/loss/Mean/_6711]]
  (1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 130, in <module>
    app.run(main)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 125, in main
    learner(model, params)
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 82, in learner
    _, step, loss = sess.run([train_op, global_step, loss_op])
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/hyojeong/.local/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[Model/loss/Mean/_6711]]
  (1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

I searched Stack Overflow and found suggestions to upgrade TensorFlow to version 2.x,
but meshgraphnets/requirements.txt specifies tensorflow-gpu>=1.15,<2.

Has anyone faced this issue? Should I upgrade TensorFlow? I did try once, but it caused another problem.

Please let me know if you need the full error details or package versions.

Hi, I ran into the same problem when running Learning to Simulate.

Solution:

  1. Check that the GPU driver, CUDA, cuDNN, and TensorFlow versions are compatible with one another.
  2. Enable memory growth on the physical GPU device in TensorFlow.
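For step 2, here is a minimal sketch of one way to turn on memory growth without editing run_model.py. It assumes the installed TensorFlow build reads the TF_FORCE_GPU_ALLOW_GROWTH environment variable (to my knowledge this holds for 1.14 and later, but please verify on your setup). The helper name training_env is hypothetical and is not part of the repository.

```python
import os


def training_env(allow_growth=True, base=None):
    """Return a copy of the environment with the GPU memory setting applied.

    `training_env` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base is None else base)
    if allow_growth:
        # Assumption: this TensorFlow build honours TF_FORCE_GPU_ALLOW_GROWTH,
        # which asks it to grow GPU memory on demand instead of reserving
        # nearly all of it at start-up.
        env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


# Show the result on a tiny example environment.
print(training_env(base={"PATH": "/usr/bin"}))
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the training command from the original post. Inside a TF 1.x script, the usual equivalent is a tf.ConfigProto with gpu_options.allow_growth set to True, passed to the session.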

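I ran into a similar problem.
The explanation I found online is that tensorflow-gpu==1.15 is built for CUDA 10.0, but CUDA 10 only runs on RTX 20-series cards and older, and I have a 40-series card. So I can only train on the CPU.

Of course, the cause may also be something else.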
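To make that explanation concrete, here is a rough sketch of the check. The table values are my recollection of NVIDIA's release notes and should be treated as assumptions to verify. The function names are hypothetical. CUDA_VISIBLE_DEVICES is the standard CUDA variable for hiding GPUs, so setting it to an empty string should make TensorFlow fall back to the CPU.

```python
import os

# Highest compute capability each CUDA toolkit release targets natively.
# Assumption: recalled from NVIDIA's release notes, please double-check.
MAX_COMPUTE_CAPABILITY = {
    "10.0": (7, 5),  # up to Turing (RTX 20 series)
    "11.1": (8, 6),  # up to Ampere (RTX 30 series)
    "11.8": (9, 0),  # covers Ada Lovelace 8.9 (RTX 40 series) and Hopper 9.0
}


def toolkit_supports_gpu(cuda_version, compute_capability):
    """Return True if the toolkit natively targets this compute capability."""
    return compute_capability <= MAX_COMPUTE_CAPABILITY[cuda_version]


def env_for_gpu(cuda_version, compute_capability, base=None):
    """Return an environment that hides the GPUs when the toolkit is too old."""
    env = dict(os.environ if base is None else base)
    if not toolkit_supports_gpu(cuda_version, compute_capability):
        # An empty device list hides every GPU from CUDA applications.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env


print(toolkit_supports_gpu("10.0", (7, 5)))  # RTX 20-series class card
print(toolkit_supports_gpu("10.0", (8, 9)))  # RTX 40-series class card
```

With CUDA 10.0 and a 40-series card, env_for_gpu returns an environment that forces CPU training, which matches what I ended up doing.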


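Hello, I ran into the same problem on a 40-series GPU. Is there any way to solve it? Is it possible to upgrade to TF 2.0? I would be very grateful for a reply.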