facebookresearch/nle

[Colab][Ray] Ray actor dies launching NLE on Google Colab

adyomin opened this issue · 1 comment

I tried launching NLE using ray.rllib on Colab:

import gym
import nle
from nle.env.tasks import NetHackChallenge
import ray
from ray.rllib.agents.ppo.ppo_torch_policy import PPOTorchPolicy
from ray.rllib.evaluation import RolloutWorker


ray.init(ignore_reinit_error=True)

env_name = 'NetHackChallenge-v0'
worker_index = 1
num_workers = 1

config = ray.rllib.agents.ppo.ppo.DEFAULT_CONFIG.copy()  # copy so the module-level default is not mutated
config["framework"] = "torch"
config["num_workers"] = num_workers

rw = RolloutWorker.as_remote(num_cpus=1).remote(
    env_creator=lambda config: NetHackChallenge(config),
    policy_spec=PPOTorchPolicy,
    policy_config=config,
    worker_index=worker_index,
)

batch = ray.get(rw.sample.remote())
batch

And ran into a confusing error:

(pid=5575) 2021-08-27 08:03:17,161	ERROR worker.py:428 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=5575, ip=172.28.0.2)
(pid=5575)   File "/usr/local/lib/python3.7/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 401, in __init__
(pid=5575)     self.env = env_creator(env_context)
(pid=5575)   File "<ipython-input-13-b05b234e6312>", line 20, in <lambda>
(pid=5575)   File "/usr/local/lib/python3.7/dist-packages/nle/env/tasks.py", line 341, in __init__
(pid=5575)     **kwargs,
(pid=5575)   File "/usr/local/lib/python3.7/dist-packages/nle/env/tasks.py", line 54, in __init__
(pid=5575)     super().__init__(*args, actions=actions, **kwargs)
(pid=5575)   File "/usr/local/lib/python3.7/dist-packages/nle/env/base.py", line 334, in __init__
(pid=5575)     spawn_monsters=spawn_monsters,
(pid=5575)   File "/usr/local/lib/python3.7/dist-packages/nle/nethack/nethack.py", line 104, in __init__
(pid=5575)     "Couldn't find NetHack installation at '%s'." % hackdir
(pid=5575) FileNotFoundError: Couldn't find NetHack installation at '/tmp/nleio58b51w'.

I can run the same script fine on my local machine. My guess is that it has something to do with the way NLE launches in general. As far as I understand, all Ray does is launch some RLlib code plus NLE in a new process.
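One way to narrow this down would be to check what the failing path looks like from inside a given process. Below is a minimal, stdlib-only sketch; `probe_path` is a hypothetical helper name (not part of NLE or Ray), and the choice of `TMPDIR` as the environment variable to report is an assumption about what might matter here.

```python
import os
import tempfile


def probe_path(path, env_vars=("TMPDIR",)):
    """Collect basic facts about `path` as seen by the current process.

    Hypothetical diagnostic helper: it only reports state, it fixes nothing.
    """
    return {
        "pid": os.getpid(),                 # which process produced this report
        "path": path,
        "exists": os.path.exists(path),
        "is_dir": os.path.isdir(path),
        "tempdir": tempfile.gettempdir(),   # where this process creates temp files
        "env": {name: os.environ.get(name) for name in env_vars},
    }
```

Calling `probe_path("/tmp/nleio58b51w")` (the path from the traceback) in the notebook process, and again from inside a Ray task, should show whether the two processes disagree about the directory or about the temp location.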

If anyone knows how to fix this or is sure that this is a Ray-related bug, please let me know.

Here is the notebook for reference.

And this is the !python collect_env.py output:

Collecting environment information...
NLE version: 0.7.3
PyTorch version: 1.9.0+cu102
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.5 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.21.1

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.9.0+cu102
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0+cu102
[conda] Could not collect

NVM. Complaining publicly did its magic: restarting the VM for the third time somehow solved the issue.