openai/train-procgen

Cannot train with GPU

HadiSDev opened this issue · 1 comment

I am trying to train with tensorflow-gpu==1.14, and CUDA and cuDNN are loaded correctly. However, once TensorFlow has finished loading, it gets stuck here:
(screenshot of the console output)
After some time, I get this error:

2021-04-15 20:14:04.341576: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
	 [[{{node ppo2_model/pi_1/MatMul}}]]
	 [[ppo2_model/ArgMax/_443]]
  (1) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
	 [[{{node ppo2_model/pi_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 109, in <module>
    main()
  File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 106, in main
    comm=comm)
  File "/home/hadi/Downloads/train-procgen/train_procgen/train.py", line 75, in train_fn
    max_grad_norm=0.5,
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/ppo2.py", line 142, in learn
    obs, returns, masks, actions, values, neglogpacs, states, epinfos = runner.run() #pylint: disable=E0632
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/runner.py", line 29, in run
    actions, values, self.states, neglogpacs = self.model.step(self.obs, S=self.states, M=self.dones)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 93, in step
    a, v, state, neglogp = self._evaluate([self.action, self.vf, self.state, self.neglogp], observation, **extra_feed)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 75, in _evaluate
    return sess.run(variables, feed_dict)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/hadi/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
	 [[node ppo2_model/pi_1/MatMul (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:63) ]]
	 [[ppo2_model/ArgMax/_443]]
  (1) Internal: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 15), m=64, n=15, k=256
	 [[node ppo2_model/pi_1/MatMul (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:63) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node ppo2_model/pi_1/MatMul:
 ppo2_model/flatten_1/Reshape (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py:44)	
 ppo2_model/pi/w/read (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:61)

Input Source operations connected to node ppo2_model/pi_1/MatMul:
 ppo2_model/flatten_1/Reshape (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py:44)	
 ppo2_model/pi/w/read (defined at /anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py:61)

Original stack trace for 'ppo2_model/pi_1/MatMul':
  File "/anaconda3/envs/train/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/envs/train/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Downloads/train-procgen/train_procgen/train.py", line 109, in <module>
    main()
  File "/Downloads/train-procgen/train_procgen/train.py", line 106, in main
    comm=comm)
  File "/Downloads/train-procgen/train_procgen/train.py", line 75, in train_fn
    max_grad_norm=0.5,
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/ppo2.py", line 109, in learn
    max_grad_norm=max_grad_norm, comm=comm, mpi_rank_weight=mpi_rank_weight)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/ppo2/model.py", line 37, in __init__
    act_model = policy(nbatch_act, 1, sess)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 175, in policy_fn
    **extra_tensors
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/policies.py", line 49, in __init__
    self.pd, self.pi = self.pdtype.pdfromlatent(latent, init_scale=0.01)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/distributions.py", line 65, in pdfromlatent
    pdparam = _matching_fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/common/distributions.py", line 355, in _matching_fc
    return fc(tensor, name, size, init_scale=init_scale, init_bias=init_bias)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/baselines/a2c/utils.py", line 63, in fc
    return tf.matmul(x, w)+b
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 2647, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5925, in mat_mul
    name=name)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/anaconda3/envs/train/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

It seems like this is a general issue with TF 1.14, so I am wondering how you managed to get GPU training working. I am launching training with the following command:
mpiexec --mca opal_cuda_support 1 -np 2 python -m train_procgen.train --env_name starpilot --num_levels 200 --distribution_mode easy --test_worker_interval 2
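For reference, a bare MatMul with the same shapes as the failing node should show whether this is a general cuBLAS/driver problem on my machine rather than something specific to train-procgen. This is just a standalone sanity check (not from this repo), and enabling GPU memory growth here is only a guess, since Blas GEMM launch failures are often blamed on TensorFlow pre-allocating all GPU memory:

import tensorflow as tf  # tensorflow-gpu==1.14

# Standalone sanity check (not part of train-procgen): run the same
# (64, 256) x (256, 15) GEMM that fails in ppo2_model/pi_1/MatMul.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once

a = tf.random.normal((64, 256))
b = tf.random.normal((256, 15))
c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    print(sess.run(c).shape)  # expect (64, 15) if cuBLAS can run at all

If this only succeeds with allow_growth enabled, the same setting would presumably need to reach the session that baselines creates for PPO2, which I have not verified how to do here.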

I met the same problem as you. Have you solved it? My environment is: CUDA 10.0 + cuDNN 7.6.4 + TensorFlow 1.14.0.
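In case it helps narrow things down, a quick check (not specific to this repo, and not a fix) that the GPU build is actually the one being picked up, before assuming cuBLAS itself is broken:

import tensorflow as tf
from tensorflow.python.client import device_lib

# Confirm the installed TF is the GPU build and that it sees the GPU.
print(tf.__version__)                  # expect 1.14.0
print(tf.test.is_built_with_cuda())    # expect True for tensorflow-gpu
print([d.name for d in device_lib.list_local_devices()])  # should include '/device:GPU:0'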