[Colab][Ray] Ray actor dies launching NLE on Google Colab
adyomin opened this issue · 1 comments
adyomin commented
I tried launching NLE using ray.rllib on Colab:
import gym
import nle
from nle.env.tasks import NetHackChallenge
import ray
from ray.rllib.agents.ppo.ppo_torch_policy import PPOTorchPolicy
from ray.rllib.evaluation import RolloutWorker
ray.init(ignore_reinit_error=True)
env_name = 'NetHackChallenge-v0'
worker_index = 1
num_workers = 1
config = ray.rllib.agents.ppo.ppo.DEFAULT_CONFIG
config["framework"] = "torch"
config["num_workers"] = num_workers
rw = RolloutWorker.as_remote(num_cpus=1).remote(
    env_creator=lambda config: NetHackChallenge(config),
    policy_spec=PPOTorchPolicy,
    policy_config=config,
    worker_index=worker_index,
)
batch = ray.get(rw.sample.remote())
batch
And ran into a confusing error:
(pid=5575) 2021-08-27 08:03:17,161 ERROR worker.py:428 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=5575, ip=172.28.0.2)
(pid=5575) File "/usr/local/lib/python3.7/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 401, in __init__
(pid=5575) self.env = env_creator(env_context)
(pid=5575) File "<ipython-input-13-b05b234e6312>", line 20, in <lambda>
(pid=5575) File "/usr/local/lib/python3.7/dist-packages/nle/env/tasks.py", line 341, in __init__
(pid=5575) **kwargs,
(pid=5575) File "/usr/local/lib/python3.7/dist-packages/nle/env/tasks.py", line 54, in __init__
(pid=5575) super().__init__(*args, actions=actions, **kwargs)
(pid=5575) File "/usr/local/lib/python3.7/dist-packages/nle/env/base.py", line 334, in __init__
(pid=5575) spawn_monsters=spawn_monsters,
(pid=5575) File "/usr/local/lib/python3.7/dist-packages/nle/nethack/nethack.py", line 104, in __init__
(pid=5575) "Couldn't find NetHack installation at '%s'." % hackdir
(pid=5575) FileNotFoundError: Couldn't find NetHack installation at '/tmp/nleio58b51w'.
I can run the same script fine on my local machine. My guess is that it has something to do with the way NLE launches in general. As far as I understand, all Ray does is launch some RLlib code plus NLE in a new process.
If anyone knows how to fix this, or is sure that this is a Ray-related bug, please let me know.
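One way to narrow this down might be to check whether the directory named in the error actually exists on the Colab VM. The following is a minimal sketch using only the Python standard library. The helper name `describe_hackdir` is hypothetical, and it assumes the message format seen in the traceback above; it does not call into NLE or Ray.

```python
import os
import re


def describe_hackdir(error_message):
    """Report whether the directory named in the error message exists.

    Assumes the message format from the traceback:
    "Couldn't find NetHack installation at '<path>'."
    Returns None if the message does not match that format.
    """
    match = re.search(r"installation at '([^']+)'", error_message)
    if match is None:
        return None
    path = match.group(1)
    exists = os.path.isdir(path)
    return {
        "path": path,
        "exists": exists,
        # List the directory only when it is there, to see what was copied.
        "contents": sorted(os.listdir(path)) if exists else [],
    }


msg = "Couldn't find NetHack installation at '/tmp/nleio58b51w'."
print(describe_hackdir(msg))
```

Running the same check from the notebook process and from inside a function decorated with `@ray.remote` (fetched with `ray.get`) could show whether the two processes see different filesystem state, which would point at the environment rather than at NLE itself.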
Here is the notebook for reference.
And this is the output of !python collect_env.py:
Collecting environment information...
NLE version: 0.7.3
PyTorch version: 1.9.0+cu102
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Ubuntu 18.04.5 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.21.1
Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.9.0+cu102
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0+cu102
[conda] Could not collect
adyomin commented
Never mind. Complaining publicly did its magic: restarting the VM for the third time somehow solved the issue.