NVlabs/cule

dqn_main.py hangs


Hi!

I installed CuLE with Python 3.7, PyTorch 1.1.0, and CUDA 10.0. Execution of dqn_main.py hangs after printing the following messages:

GeForce GTX 1080 Ti : 1632.500 Mhz   (Ordinal 0)
28 SMs enabled. Compute Capability sm_61
FreeMem: 10,687MB   TotalMem: 11,178MB   64-bit pointers.
Mem Clock: 5505.000 Mhz x 352 bits   (484.4 GB/s)
ECC Disabled

Selected optimization level O0:  Pure FP32 training.

Defaults for this optimization level are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : 1.0
DQN(
  (conv): Sequential(
    (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (3): ReLU()
    (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
  )
  (fc_a): Sequential(
    (0): Linear(in_features=3136, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=6, bias=True)
  )
)
Initializing evaluation memory with 500 entries...

There is no activity on either the CPU or the GPU.
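For reference, the model printed above is the standard DQN convolutional architecture; below is a minimal PyTorch sketch, assuming a 4x84x84 stacked-frame input and 6 actions (as for Pong). This is a reconstruction from the log, not the actual CuLE model class.

import torch
import torch.nn as nn

class DQN(nn.Module):
    # Reconstruction of the printed module: three conv layers followed by a
    # two-layer fully connected head producing one Q-value per action.
    def __init__(self, in_channels=4, num_actions=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.fc_a = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512),  # 3136 input features for 84x84 frames
            nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):
        x = self.conv(x)
        return self.fc_a(x.view(x.size(0), -1))

# Shape check: a batch of 2 stacked 84x84 observations -> (2, 6) Q-values.
print(DQN()(torch.zeros(2, 4, 84, 84)).shape)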

The same happens with a2c_main.py:

PyTorch  : 1.1.0
CUDA     : 10.0.130
CUDNN    : 7501
APEX     : 0.1.0

GeForce GTX 1080 Ti : 1632.500 Mhz   (Ordinal 0)
28 SMs enabled. Compute Capability sm_61
FreeMem: 10,687MB   TotalMem: 11,178MB   64-bit pointers.
Mem Clock: 5505.000 Mhz x 352 bits   (484.4 GB/s)
ECC Disabled

Looking into it. For the a2c case, can you try running it with --use-cuda-env --use-openai-test-env (example command below)? This will do two things:

  1. --use-cuda-env runs the CuLE environments on the GPU; otherwise they run on the CPU;
  2. --use-openai-test-env uses the OpenAI Gym environment (instead of the CuLE CPU environment) for testing.
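
For example (assuming a2c_main.py takes the same invocation as the other example scripts):

$ python a2c_main.py --use-cuda-env --use-openai-test-env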

I ran into a similar problem:

$ python ppo_main.py --use-cuda-env --use-openai-test-env

{'ale_start_steps': 400,
 'alpha': 0.99,
 'batch_size': 256,
 'clip_epsilon': 0.1,
 'conf_file': None,
 'entropy_coef': 0.01,
 'env_name': 'PongNoFrameskip-v4',
 'episodic_life': False,
 'eps': 1e-05,
 'evaluation_episodes': 10,
 'evaluation_interval': 1000000,
 'gamma': 0.99,
 'gpu': 0,
 'local_rank': 0,
 'log_dir': 'runs',
 'loss_scale': None,
 'lr': 0.00065,
 'lr_scale': False,
 'max_episode_length': 18000,
 'max_grad_norm': 0.5,
 'multiprocessing_distributed': False,
 'no_cuda_train': True,
 'normalize': False,
 'num_ales': 16,
 'num_gpus_per_node': -1,
 'num_stack': 4,
 'num_steps': 5,
 'opt_level': 'O0',
 'output_filename': None,
 'plot': False,
 'ppo_epoch': 3,
 'profile': False,
 'save_interval': 0,
 'seed': 1565658549,
 't_max': 50000000,
 'tau': 1.0,
 'use_adam': False,
 'use_cuda_env': True,
 'use_gae': False,
 'use_openai': False,
 'use_openai_test_env': True,
 'value_loss_coef': 0.5,
 'verbose': False}

PyTorch  : 1.0.0
CUDA     : 10.0.130
CUDNN    : 7401
APEX     : 0.1.0

GeForce GTX 1080 Ti :    0.000 Mhz   (Ordinal 0)
131072 SMs enabled. Compute Capability sm_00
FreeMem: 11,019MB   TotalMem: 11,178MB   64-bit pointers.
Mem Clock:   98.304 Mhz x 0 bits   (  0.0 GB/s)
ECC Enabled

GPUassert: invalid device symbol /home/lkh/Codes/cule/cule/atari/cuda/tables.hpp 43
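
The zeroed-out device info above (0.000 Mhz, 131072 SMs, sm_00) together with the invalid device symbol assert suggests the CuLE CUDA code may have been built for a compute capability that does not match the GPU (a guess, not a confirmed diagnosis). A quick way to check what PyTorch reports for the device:

import torch

# A GTX 1080 Ti should report compute capability (6, 1); if CuLE's kernels
# were compiled for a different architecture, device-side symbol lookups
# such as the one in cule/atari/cuda/tables.hpp can fail with
# "invalid device symbol".
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))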