Failed to reproduce with Docker

Question

masterdezign opened this issue 2 years ago · 2 comments

Hi!

I ran the environment using the latest pre-built Docker image. When training on an RTX 3080 Ti, I get the following failure:

python scripts/mbexp.py -env halfcheetah
2022-08-12 09:10:09.936602: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-08-12 09:10:10.312749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-12 09:10:10.312892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: NVIDIA GeForce RTX 3080 Ti major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:07:00.0                                   
totalMemory: 11.76GiB freeMemory: 11.52GiB                                                                        
2022-08-12 09:10:10.312909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2022-08-12 09:13:36.133367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-08-12 09:13:36.133405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2022-08-12 09:13:36.133414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2022-08-12 09:13:36.133525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11135 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:07:00.0, compute capability: 8.6)
{'ctrl_cfg': {'env': <dmbrl.env.half_cheetah.HalfCheetahEnv object at 0x7f8c91ede160>,
              'opt_cfg': {'ac_cost_fn': <function HalfCheetahConfigModule.ac_cost_fn at 0x7f8c32140488>,     
                          'cfg': {'alpha': 0.1,                                                                   
                                  'max_iters': 5,                                                                 
                                  'num_elites': 50,                                                               
                                  'popsize': 500},
                          'mode': 'CEM',                                                                          
                          'obs_cost_fn': <function HalfCheetahConfigModule.obs_cost_fn at 0x7f8c32140400>,
                          'plan_hor': 30},                                                                        
              'prop_cfg': {'mode': 'TSinf',           
                           'model_init_cfg': {'model_class': <class 'dmbrl.modeling.models.BNN.BNN'>,             
                                              'model_constructor': <bound method HalfCheetahConfigModule.nn_constructor of <halfcheetah.HalfCheetahConfigModule object at 0x7f8c32138748>>,
                                              'num_nets': 5},                                                                                                                                                                        
                           'model_train_cfg': {'epochs': 5},
                           'npart': 20,                                                                           
                           'obs_postproc': <function HalfCheetahConfigModule.obs_postproc at 0x7f8c321402f0>,
                           'obs_preproc': <function HalfCheetahConfigModule.obs_preproc at 0x7f8c32140268>,
                           'targ_proc': <function HalfCheetahConfigModule.targ_proc at 0x7f8c32140378>}},
 'exp_cfg': {'exp_cfg': {'nrollouts_per_iter': 1, 'ntrain_iters': 300},
             'log_cfg': {'logdir': 'log'},                                                                                                                                                                                           
             'sim_cfg': {'env': <dmbrl.env.half_cheetah.HalfCheetahEnv object at 0x7f8c91ede160>,                                                                                                                                    
                         'task_hor': 1000}}} 
Created an ensemble of 5 neural networks with variance predictions.
Created an MPC controller, prop mode TSinf, 20 particles. 
Trajectory prediction logging is disabled.
Average action selection time:  1.0358095169067383e-05
Rollout length:  1000
Network training:   0%|                                                               | 0/5 [00:00<?, ?epoch(s)/s]
2022-08-12 09:15:02.909158: E tensorflow/stream_executor/cuda/cuda_blas.cc:647] failed to run cuBLAS routine cublasSgemmBatched: CUBLAS_STATUS_EXECUTION_FAILED
2022-08-12 09:15:02.909199: E tensorflow/stream_executor/cuda/cuda_blas.cc:2505] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
         [[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/mbexp.py", line 44, in <module>
    main(args.env, "MPC", args.ctrl_arg, args.override, args.logdir)
  File "scripts/mbexp.py", line 29, in main
    exp.run_experiment()
  File "/workspace/dmbrl/misc/MBExp.py", line 96, in run_experiment
    [sample["rewards"] for sample in samples]
  File "/workspace/dmbrl/controllers/MPC.py", line 180, in train
    self.model.train(self.train_in, self.train_targs, **self.model_train_cfg)
  File "/workspace/dmbrl/modeling/models/BNN.py", line 260, in train
    feed_dict={self.sy_train_in: inputs[batch_idxs], self.sy_train_targ: targets[batch_idxs]}
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
         [[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]

Caused by op 'model_1/MatMul', defined at:
  File "scripts/mbexp.py", line 44, in <module>
    main(args.env, "MPC", args.ctrl_arg, args.override, args.logdir)
  File "scripts/mbexp.py", line 22, in main
    cfg.exp_cfg.exp_cfg.policy = MPC(cfg.ctrl_cfg)
  File "/workspace/dmbrl/controllers/MPC.py", line 90, in __init__
    )(params.prop_cfg.model_init_cfg)
  File "/workspace/dmbrl/config/halfcheetah.py", line 83, in nn_constructor
    model.finalize(tf.train.AdamOptimizer, {"learning_rate": 0.001})
  File "/workspace/dmbrl/modeling/models/BNN.py", line 180, in finalize
    train_loss = tf.reduce_sum(self._compile_losses(self.sy_train_in, self.sy_train_targ, inc_var_loss=True))
  File "/workspace/dmbrl/modeling/models/BNN.py", line 436, in _compile_losses
    mean, log_var = self._compile_outputs(inputs, ret_log_var=True)
  File "/workspace/dmbrl/modeling/models/BNN.py", line 408, in _compile_outputs
    cur_out = layer.compute_output_tensor(cur_out)
  File "/workspace/dmbrl/modeling/layers/FC.py", line 71, in compute_output_tensor
    raw_output = tf.matmul(input_tensor, self.weights) + self.biases
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 1976, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1236, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
         [[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]

Answer 1 · 2022-08-12T09:25:30.000Z

Then I tried the latest supported TensorFlow version:

pip install --upgrade pip
pip install tensorflow-gpu==1.15.0

Answer 2 · 2022-08-12T14:05:08.000Z

I believe the problem is a CUDA incompatibility. I have CUDA 11.2, which is compatible with neither tensorflow-gpu==1.9 nor tensorflow-gpu==1.15 (which requires CUDA 10).
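
For anyone hitting the same error, here is a rough sketch of that reasoning as a lookup. It is not part of the repository. The CUDA versions per TensorFlow release are assumed from TensorFlow's published tested build configurations, and the minimum CUDA toolkit per compute capability is assumed from NVIDIA's release notes, so check both tables before relying on this. The function name `wheel_supports_gpu` is made up for illustration, and the check ignores PTX JIT forward compatibility, so treat it as a first-pass sanity check only.

```python
# CUDA toolkit each prebuilt tensorflow-gpu wheel was built against.
# Assumption: values copied from TensorFlow's "tested build configurations" table.
TF_CUDA = {
    "1.9": (9, 0),
    "1.15": (10, 0),
}

# Oldest CUDA toolkit that can target a given compute capability.
# Assumption: values from NVIDIA's CUDA release notes; only a few entries listed.
MIN_CUDA_FOR_CC = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series)
}


def wheel_supports_gpu(tf_version, compute_capability):
    """Return True if the wheel's CUDA toolkit is new enough to target the GPU."""
    built_with = TF_CUDA[tf_version]
    needed = MIN_CUDA_FOR_CC[compute_capability]
    # Tuples compare element-wise, so (10, 0) >= (11, 1) is False.
    return built_with >= needed


# The log in the question reports "compute capability: 8.6" for the RTX 3080 Ti.
rtx_3080_ti = (8, 6)
for version in ("1.9", "1.15"):
    print("tensorflow-gpu", version, "->", wheel_supports_gpu(version, rtx_3080_ti))
```

Under these assumptions, both wheels come back unsupported for a compute capability 8.6 card, which fits the cuBLAS failure in the question. The same check passes for 1.15 on a Turing card, compute capability (7, 5).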