facebookresearch/minihack

[BUG] RLlib Baseline doesn't work with docker image

Closed this issue · 2 comments

๐Ÿ› Bug

Using the bionic docker image, attempting to install and run the rllib agents results in errors.

To Reproduce

Steps to reproduce the behavior:

  1. Download the Docker image and run it
  2. pip install minihack[rllib]
  3. python -m minihack.agent.rllib.train algo=dqn env=MiniHack-Room-5x5-v0 total_steps=1000000

I then get the following error:

First set of errors

/opt/conda/lib/python3.8/site-packages/ale_py/roms/__init__.py:94: DeprecationWarning: Automatic importing of atari-py roms won't be supported in future releases of ale-py. Please migrate over to using `ale-import-roms` OR an ALE-supported ROM package. To make this warning disappear you can run `ale-import-roms --import-from-pkg atari_py.atari_roms`.For more information see: https://github.com/mgbellemare/Arcade-Learning-Environment#rom-management
  _RESOLVED_ROMS = _resolve_roms()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/conda/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/opt/conda/lib/python3.8/site-packages/minihack/agent/rllib/__init__.py", line 3, in <module>
    from .train import train
  File "/opt/conda/lib/python3.8/site-packages/minihack/agent/rllib/train.py", line 8, in <module>
    import minihack.agent.rllib.models  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/minihack/agent/rllib/models.py", line 24, in <module>
    from ray.rllib.models import ModelCatalog
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/__init__.py", line 5, in <module>
    from ray.rllib.env.base_env import BaseEnv
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/env/__init__.py", line 6, in <module>
    from ray.rllib.env.policy_client import PolicyClient
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/env/policy_client.py", line 14, in <module>
    from ray.rllib.policy.sample_batch import MultiAgentBatch
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/__init__.py", line 1, in <module>
    from ray.rllib.policy.policy import Policy
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/policy.py", line 9, in <module>
    from ray.rllib.models.catalog import ModelCatalog
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/models/__init__.py", line 1, in <module>
    from ray.rllib.models.action_dist import ActionDistribution
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/models/action_dist.py", line 4, in <module>
    from ray.rllib.models.modelv2 import ModelV2
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/models/modelv2.py", line 7, in <module>
    from ray.rllib.models.preprocessors import get_preprocessor, \
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/models/preprocessors.py", line 2, in <module>
    import cv2
  File "/opt/conda/lib/python3.8/site-packages/cv2/__init__.py", line 9, in <module>
    from .cv2 import _registerMatType
ImportError: cannot import name '_registerMatType' from 'cv2.cv2' (/opt/conda/lib/python3.8/site-packages/cv2/cv2.cpython-38-x86_64-linux-gnu.so)

This appears to be an error with opencv-python-headless. In addition, I found that the following installs were needed:

pip install aiohttp==3.7.4
pip install async-timeout==3.0.1
pip install aioredis==1.3.1
pip install gym==0.15.3
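
Before pinning packages by hand, it can help to see what is actually installed in the container. The sketch below is only a diagnostic aid: the helper name `installed_versions` is hypothetical, and the package list is assembled from the traceback and the pins above. Having both `opencv-python` and `opencv-python-headless` installed at different versions is one commonly reported cause of the `_registerMatType` import error, but that is an assumption here and is not confirmed in this thread.

```python
from importlib import metadata

# Distribution names taken from the traceback and the pins above.
PACKAGES = [
    "opencv-python",
    "opencv-python-headless",
    "aiohttp",
    "async-timeout",
    "aioredis",
    "gym",
    "ray",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the container before and after the `pip install` commands shows whether the pins took effect and whether more than one OpenCV distribution is present.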

And then you get the following error:

Second set of errors

/opt/conda/lib/python3.8/runpy.py:127: RuntimeWarning: 'minihack.agent.rllib.train' found in sys.modules after import of package 'minihack.agent.rllib', but prior to execution of 'minihack.agent.rllib.train'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
2022-04-11 20:10:50,714 INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8265
2022-04-11 20:10:50,720 WARNING services.py:1716 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.12gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2022-04-11 20:10:52,380 ERROR syncer.py:72 -- Log sync requires rsync to be installed.
== Status ==
Memory usage on this node: 3.1/31.2 GiB
Using FIFO scheduling algorithm.
Resources requested: 9.0/9 CPUs, 0/0 GPUs, 0.0/18.39 GiB heap, 0.0/9.2 GiB objects
Result logdir: /root/ray_results/DQN_2022-04-11_20-10-52
Number of trials: 1/1 (1 RUNNING)
+-----------------------------+----------+-------+
| Trial name                  | status   | loc   |
|-----------------------------+----------+-------|
| DQN_RLlibNLE-v0_7ca8a_00000 | RUNNING  |       |
+-----------------------------+----------+-------+

2022-04-11 20:10:52,835 WARNING worker.py:1115 -- The agent on node 5de9f6604cc5 failed with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 326, in <module>
loop.run_until_complete(agent.run())
File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 138, in run
modules = self._load_modules()
File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
c = cls(self)
File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 148, in __init__
self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
File "/opt/conda/lib/python3.8/site-packages/ray/_private/metrics_agent.py", line 74, in __init__
prometheus_exporter.new_stats_exporter(
File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
exporter = PrometheusStatsExporter(
File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 266, in __init__
self.serve_http()
File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
start_http_server(
File "/opt/conda/lib/python3.8/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
TmpServer.address_family, addr = _get_best_family(addr, port)
File "/opt/conda/lib/python3.8/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
infos = socket.getaddrinfo(address, port)
File "/opt/conda/lib/python3.8/socket.py", line 918, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

(raylet) Traceback (most recent call last):
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 338, in <module>
(raylet) raise e
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 326, in <module>
(raylet) loop.run_until_complete(agent.run())
(raylet) File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
(raylet) return future.result()
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 138, in run
(raylet) modules = self._load_modules()
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
(raylet) c = cls(self)
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 148, in __init__
(raylet) self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/_private/metrics_agent.py", line 74, in __init__
(raylet) prometheus_exporter.new_stats_exporter(
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
(raylet) exporter = PrometheusStatsExporter(
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 266, in __init__
(raylet) self.serve_http()
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
(raylet) start_http_server(
(raylet) File "/opt/conda/lib/python3.8/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet) TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet) File "/opt/conda/lib/python3.8/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet) infos = socket.getaddrinfo(address, port)
(raylet) File "/opt/conda/lib/python3.8/socket.py", line 918, in getaddrinfo
(raylet) for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known
(pid=1119) 2022-04-11 20:10:53,571 INFO trainer.py:694 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2022-04-11 20:10:54,630 WARNING worker.py:1115 -- The agent on node 5de9f6604cc5 failed with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 326, in <module>
loop.run_until_complete(agent.run())
File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 138, in run
modules = self._load_modules()
File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
c = cls(self)
File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 148, in __init__
self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
File "/opt/conda/lib/python3.8/site-packages/ray/_private/metrics_agent.py", line 74, in __init__
prometheus_exporter.new_stats_exporter(
File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
exporter = PrometheusStatsExporter(
File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 266, in __init__
self.serve_http()
File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
start_http_server(
File "/opt/conda/lib/python3.8/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
TmpServer.address_family, addr = _get_best_family(addr, port)
File "/opt/conda/lib/python3.8/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
infos = socket.getaddrinfo(address, port)
File "/opt/conda/lib/python3.8/socket.py", line 918, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

(raylet) Traceback (most recent call last):
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 338, in <module>
(raylet) raise e
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 326, in <module>
(raylet) loop.run_until_complete(agent.run())
(raylet) File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
(raylet) return future.result()
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 138, in run
(raylet) modules = self._load_modules()
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
(raylet) c = cls(self)
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 148, in __init__
(raylet) self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/_private/metrics_agent.py", line 74, in __init__
(raylet) prometheus_exporter.new_stats_exporter(
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 333, in new_stats_exporter
(raylet) exporter = PrometheusStatsExporter(
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 266, in __init__
(raylet) self.serve_http()
(raylet) File "/opt/conda/lib/python3.8/site-packages/ray/_private/prometheus_exporter.py", line 320, in serve_http
(raylet) start_http_server(
(raylet) File "/opt/conda/lib/python3.8/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
(raylet) TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet) File "/opt/conda/lib/python3.8/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
(raylet) infos = socket.getaddrinfo(address, port)
(raylet) File "/opt/conda/lib/python3.8/socket.py", line 918, in getaddrinfo
(raylet) for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known

With the socket.gaierror repeating. Thanks for the help!
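
The repeated `socket.gaierror` is raised by `socket.getaddrinfo(address, port)` in the traceback, so a quick check is whether the names the container might use for itself resolve at all. The following is a minimal sketch using only the standard library. The helper name `can_resolve` is hypothetical, and which address Ray actually passes to the Prometheus exporter is not shown in the logs, so the candidate hosts below are assumptions.

```python
import socket


def can_resolve(host, port):
    """Return True if getaddrinfo succeeds for (host, port), False on gaierror."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as err:
        print(f"{host!r} did not resolve: {err}")
        return False
    return True


if __name__ == "__main__":
    # 8265 is the dashboard port printed in the log above; the hosts are guesses.
    for host in ("127.0.0.1", "localhost", socket.gethostname()):
        print(host, can_resolve(host, 8265))
```

If the container's own hostname fails to resolve while `127.0.0.1` works, the problem likely lies in the container's name resolution rather than in MiniHack or RLlib.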

Thanks for the issue. Currently, the Docker images only support the TorchBeast framework with MiniHack. For RLlib-related issues, please refer to the RLlib documentation.

Howuhh commented

Hi @samvelyan! I encountered a very similar error, probably the same one. It is not clear from the output above, apparently because HYDRA_FULL_ERROR=1 was not exported there. As far as I can tell, the real error does not come from RLlib:

ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=6979, ip=172.17.0.13)
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 383, in __init__
    self.env = _validate_env(env_creator(env_context))
  File "/opt/conda/lib/python3.8/site-packages/minihack/agent/rllib/envs.py", line 19, in __init__
    self.gym_env = create_env(env_config["flags"])
  File "/opt/conda/lib/python3.8/site-packages/minihack/agent/common/envs/tasks.py", line 186, in create_env
    env = env_class(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/minihack/envs/room.py", line 42, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.8/site-packages/minihack/envs/room.py", line 37, in __init__
    super().__init__(*args, des_file=lvl_gen.get_des(), **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/minihack/navigation.py", line 42, in __init__
    super().__init__(*args, des_file=des_file, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/minihack/base.py", line 302, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nle/env/tasks.py", line 53, in __init__
    super().__init__(*args, actions=actions, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'archivefile'

This is strange, as I do not see an archivefile argument in any env within minihack. Is there any way to fix this?
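
One way to narrow down a `TypeError` like this is to compare the keyword arguments being passed against what a constructor's signature accepts. The sketch below is a generic, standard-library illustration: `unexpected_kwargs`, `DemoEnv`, and `ForwardingEnv` are hypothetical names, and the demo classes are stand-ins, not the real MiniHack or NLE classes. Because the classes in the traceback forward `**kwargs` up the chain, the check is only informative when applied to the class whose `__init__` finally raised.

```python
import inspect


def unexpected_kwargs(cls, kwargs):
    """List the keys of `kwargs` that `cls.__init__` does not accept by name.

    Returns an empty list when the constructor takes **kwargs, since the
    signature alone cannot tell what a forwarding constructor will reject.
    """
    # Skip the first parameter (`self`).
    params = list(inspect.signature(cls.__init__).parameters.values())[1:]
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return []
    by_name = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    accepted = {p.name for p in params if p.kind in by_name}
    return [key for key in kwargs if key not in accepted]


class DemoEnv:
    """Hypothetical stand-in for an env whose constructor has fixed keywords."""

    def __init__(self, actions=None, max_episode_steps=100):
        self.actions = actions
        self.max_episode_steps = max_episode_steps


class ForwardingEnv(DemoEnv):
    """Hypothetical subclass that forwards **kwargs, as in the traceback."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


if __name__ == "__main__":
    flags = {"actions": None, "archivefile": None}
    print(unexpected_kwargs(DemoEnv, flags))
    print(unexpected_kwargs(ForwardingEnv, flags))
```

Applied to the installed base class, a check like this would show whether `archivefile` is still part of its signature, which helps tell a version mismatch between packages apart from a bug in the calling code.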