Failed to reproduce with Docker
masterdezign opened this issue · 2 comments
masterdezign commented
Hi!
I ran the environment using the latest pre-built Docker image. When training on an RTX 3080 Ti, I get the following failure:
python scripts/mbexp.py -env halfcheetah
2022-08-12 09:10:09.936602: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-08-12 09:10:10.312749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-12 09:10:10.312892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: NVIDIA GeForce RTX 3080 Ti major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:07:00.0
totalMemory: 11.76GiB freeMemory: 11.52GiB
2022-08-12 09:10:10.312909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2022-08-12 09:13:36.133367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-08-12 09:13:36.133405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2022-08-12 09:13:36.133414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2022-08-12 09:13:36.133525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11135 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:07:00.0, compute capability: 8.6)
{'ctrl_cfg': {'env': <dmbrl.env.half_cheetah.HalfCheetahEnv object at 0x7f8c91ede160>,
'opt_cfg': {'ac_cost_fn': <function HalfCheetahConfigModule.ac_cost_fn at 0x7f8c32140488>,
'cfg': {'alpha': 0.1,
'max_iters': 5,
'num_elites': 50,
'popsize': 500},
'mode': 'CEM',
'obs_cost_fn': <function HalfCheetahConfigModule.obs_cost_fn at 0x7f8c32140400>,
'plan_hor': 30},
'prop_cfg': {'mode': 'TSinf',
'model_init_cfg': {'model_class': <class 'dmbrl.modeling.models.BNN.BNN'>,
'model_constructor': <bound method HalfCheetahConfigModule.nn_constructor of <halfcheetah.HalfCheetahConfigModule object at 0x7f8c32138748>>,
'num_nets': 5},
'model_train_cfg': {'epochs': 5},
'npart': 20,
'obs_postproc': <function HalfCheetahConfigModule.obs_postproc at 0x7f8c321402f0>,
'obs_preproc': <function HalfCheetahConfigModule.obs_preproc at 0x7f8c32140268>,
'targ_proc': <function HalfCheetahConfigModule.targ_proc at 0x7f8c32140378>}},
'exp_cfg': {'exp_cfg': {'nrollouts_per_iter': 1, 'ntrain_iters': 300},
'log_cfg': {'logdir': 'log'},
'sim_cfg': {'env': <dmbrl.env.half_cheetah.HalfCheetahEnv object at 0x7f8c91ede160>,
'task_hor': 1000}}}
Created an ensemble of 5 neural networks with variance predictions.
Created an MPC controller, prop mode TSinf, 20 particles.
Trajectory prediction logging is disabled.
Average action selection time: 1.0358095169067383e-05
Rollout length: 1000
Network training: 0%| | 0/5 [00:00<?, ?epoch(s)/s]
2022-08-12 09:15:02.909158: E tensorflow/stream_executor/cuda/cuda_blas.cc:647] failed to run cuBLAS routine cublasSgemmBatched: CUBLAS_STATUS_EXECUTION_FAILED
2022-08-12 09:15:02.909199: E tensorflow/stream_executor/cuda/cuda_blas.cc:2505] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
[[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scripts/mbexp.py", line 44, in <module>
main(args.env, "MPC", args.ctrl_arg, args.override, args.logdir)
File "scripts/mbexp.py", line 29, in main
exp.run_experiment()
File "/workspace/dmbrl/misc/MBExp.py", line 96, in run_experiment
[sample["rewards"] for sample in samples]
File "/workspace/dmbrl/controllers/MPC.py", line 180, in train
self.model.train(self.train_in, self.train_targs, **self.model_train_cfg)
File "/workspace/dmbrl/modeling/models/BNN.py", line 260, in train
feed_dict={self.sy_train_in: inputs[batch_idxs], self.sy_train_targ: targets[batch_idxs]}
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
[[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]
Caused by op 'model_1/MatMul', defined at:
File "scripts/mbexp.py", line 44, in <module>
main(args.env, "MPC", args.ctrl_arg, args.override, args.logdir)
File "scripts/mbexp.py", line 22, in main
cfg.exp_cfg.exp_cfg.policy = MPC(cfg.ctrl_cfg)
File "/workspace/dmbrl/controllers/MPC.py", line 90, in __init__
)(params.prop_cfg.model_init_cfg)
File "/workspace/dmbrl/config/halfcheetah.py", line 83, in nn_constructor
model.finalize(tf.train.AdamOptimizer, {"learning_rate": 0.001})
File "/workspace/dmbrl/modeling/models/BNN.py", line 180, in finalize
train_loss = tf.reduce_sum(self._compile_losses(self.sy_train_in, self.sy_train_targ, inc_var_loss=True))
File "/workspace/dmbrl/modeling/models/BNN.py", line 436, in _compile_losses
mean, log_var = self._compile_outputs(inputs, ret_log_var=True)
File "/workspace/dmbrl/modeling/models/BNN.py", line 408, in _compile_outputs
cur_out = layer.compute_output_tensor(cur_out)
File "/workspace/dmbrl/modeling/layers/FC.py", line 71, in compute_output_tensor
raw_output = tf.matmul(input_tensor, self.weights) + self.biases
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 1976, in matmul
a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1236, in batch_mat_mul
"BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
[[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]
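For reference, the shapes reported in the error look consistent with a valid batched matrix multiply, so the failure does not appear to be a shape bug in the model. A minimal pure-Python sketch of that check follows; `batch_matmul_shape` is a hypothetical helper written for illustration, not part of dmbrl or TensorFlow. The batch dimension of 5 matches `'num_nets': 5` in the config dump above.

```python
def batch_matmul_shape(a_shape, b_shape):
    """Return the output shape of a batched matmul, or raise if the shapes clash.

    Hypothetical helper: mirrors the [batch, m, k] x [batch, k, n] rule only.
    """
    batch_a, m, k_a = a_shape
    batch_b, k_b, n = b_shape
    if batch_a != batch_b:
        raise ValueError("batch sizes differ: %d vs %d" % (batch_a, batch_b))
    if k_a != k_b:
        raise ValueError("inner dimensions differ: %d vs %d" % (k_a, k_b))
    return [batch_a, m, n]


# Shapes copied from the error message: a.shape=[5,32,24], b.shape=[5,24,200]
out_shape = batch_matmul_shape([5, 32, 24], [5, 24, 200])
print(out_shape)  # [5, 32, 200], i.e. m=32, n=200, batch_size=5
```

Since the shapes are compatible, the cuBLAS launch failure itself is the more likely place to look.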
masterdezign commented
Then I tried the latest supported TensorFlow version:
pip install --upgrade pip
pip install tensorflow-gpu==1.15.0
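After installing, it may help to confirm which TensorFlow build is actually active inside the container. A minimal sketch: `describe_tensorflow` is a hypothetical helper, and it relies only on `tf.__version__` and `tf.test.is_built_with_cuda()`.

```python
def describe_tensorflow():
    """Report whether TensorFlow is importable, its version, and CUDA build status."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this interpreter.
        return {"installed": False}
    return {
        "installed": True,
        "version": tf.__version__,
        # True for GPU wheels; says nothing about whether the GPU is usable.
        "built_with_cuda": tf.test.is_built_with_cuda(),
    }


info = describe_tensorflow()
print(info)
```

Note that a build compiled with CUDA support can still fail at run time if its kernels do not target the installed GPU.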
masterdezign commented
I believe the problem is due to a CUDA incompatibility. I have CUDA 11.2, which is compatible with neither tensorflow-gpu==1.9 nor tensorflow-gpu==1.15 (which requires CUDA 10).
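One way to sanity-check this is to compare the CUDA version each wheel was built against with the minimum CUDA release able to target the GPU. A minimal sketch, assuming tensorflow-gpu 1.9 was built against CUDA 9.0 and 1.15 against CUDA 10.0, and that compute capability 8.6 (reported in the log above) needs CUDA 11.1 or newer. The tables and the helper name are hypothetical and cover only the versions mentioned in this issue.

```python
# Assumption: CUDA toolkit version each pip wheel was built against.
TF_BUILT_WITH_CUDA = {
    "1.9": (9, 0),
    "1.15": (10, 0),
}

# Assumption: minimum CUDA toolkit that can compile for a compute capability.
MIN_CUDA_FOR_COMPUTE_CAPABILITY = {
    (8, 6): (11, 1),  # RTX 3080 Ti, per "compute capability: 8.6" in the log
}


def wheel_can_target_gpu(tf_version, compute_capability):
    """Return True if the wheel's CUDA toolkit is new enough for the GPU."""
    built_with = TF_BUILT_WITH_CUDA[tf_version]
    needed = MIN_CUDA_FOR_COMPUTE_CAPABILITY[compute_capability]
    return built_with >= needed  # tuples compare (major, minor) in order


for version in ("1.9", "1.15"):
    print(version, wheel_can_target_gpu(version, (8, 6)))
```

Under these assumptions both wheels come out as unable to target the card, which would be consistent with the cuBLAS failure above.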