rstrivedi/Melting-Pot-Contest-2023

bugs when running the training code

2023-09-11 10:21:49,337 ERROR tune_controller.py:911 -- Trial task failed for trial PPO_meltingpot_fcb07_00000
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2495, in get
raise value
File "python/ray/_raylet.pyx", line 1787, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 1684, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1366, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1367, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1583, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 864, in ray._raylet.store_task_errors
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=13902, ip=172.28.0.12, actor_id=7e027fca141b6dc2cdd8f15501000000, repr=PPO)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 517, in __init__
super().__init__(
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 169, in __init__
self.setup(copy.deepcopy(self.config))
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 639, in setup
self.workers = WorkerSet(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/worker_set.py", line 179, in init
raise e.args[0].args[2]
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 525, in init
self._update_policy_map(policy_dict=self.policy_dict)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1727, in _update_policy_map
self._build_policy_map(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1838, in _build_policy_map
new_policy = create_policy_for_framework(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/policy.py", line 142, in create_policy_for_framework
return policy_class(observation_space, action_space, merged_config)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 64, in init
self._initialize_loss_from_dummy_batch()
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/policy.py", line 1418, in _initialize_loss_from_dummy_batch
actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/torch_policy_v2.py", line 571, in compute_actions_from_input_dict
return self._compute_action_helper(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/torch_policy_v2.py", line 1291, in _compute_action_helper
dist_inputs, state_out = self.model(input_dict, state_batches, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/modelv2.py", line 259, in call
res = self.forward(restored, state or [], seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 259, in forward
return super().forward(input_dict, state, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 98, in forward
output, new_state = self.forward_rnn(inputs, state, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 274, in forward_rnn
self._features, [h, c] = self.lstm(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 810, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 730, in check_forward_args
self.check_input(input, batch_sizes)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 218, in check_input
raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 148, got 24
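For context, the last frame is PyTorch's own shape check: the LSTM inside RLlib's recurrent wrapper was built for 148-dimensional features (presumably the flattened observation size RLlib computed when constructing the model), but the dummy batch fed to it has only 24. A minimal standalone sketch of the same failure, in plain PyTorch rather than the contest code:

```python
# Plain-PyTorch sketch of the failing check; the sizes mirror the error
# above but are otherwise arbitrary.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=148, hidden_size=256, batch_first=True)

# A (batch, time, features) tensor whose feature dim is 24 instead of 148.
bad_input = torch.zeros(1, 1, 24)

try:
    lstm(bad_input)
except RuntimeError as e:
    # -> input.size(-1) must be equal to input_size. Expected 148, got 24
    print(e)
```

In other words, the model RLlib built and the observations it is feeding disagree about the feature size, which usually points at a mismatch between the configured model and the environment's observation space rather than at PyTorch itself.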

Is this caused by the versions of the dependencies?
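One quick way to test the dependency-version theory is to print what is actually installed and compare it against the repo's pinned versions; a small sketch (the package list is just a guess at the relevant ones):

```python
# Print installed versions of the packages most likely involved, to compare
# against whatever versions the contest repo pins.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("ray", "torch", "dm-meltingpot", "numpy", "protobuf"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```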

Not sure, but this might help in the meantime: #5 (comment)

When I install the packages, it gives me the following errors (I'm running this in Colab):

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.24.3 which is incompatible.
tensorflow-datasets 4.9.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.
tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.

These conflicts are not related to the provided dependencies; they seem to stem from other packages already installed in your environment. Could you try a fresh virtual environment instead of the system Python and check if it works?
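For example, something along these lines on a local machine (the environment name is arbitrary, and the requirements file is a placeholder for whatever install step this repo documents):

```bash
python3 -m venv meltingpot-env
source meltingpot-env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt   # or the repo's documented install command
```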