rstrivedi/Melting-Pot-Contest-2023

bugs when running the training code

2023-09-11 10:21:49,337 ERROR tune_controller.py:911 -- Trial task failed for trial PPO_meltingpot_fcb07_00000
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2495, in get
raise value
File "python/ray/_raylet.pyx", line 1787, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 1684, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1366, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1367, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1583, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 864, in ray._raylet.store_task_errors
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=13902, ip=172.28.0.12, actor_id=7e027fca141b6dc2cdd8f15501000000, repr=PPO)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 517, in __init__
super().__init__(
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 169, in __init__
self.setup(copy.deepcopy(self.config))
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 639, in setup
self.workers = WorkerSet(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/worker_set.py", line 179, in init
raise e.args[0].args[2]
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 525, in init
self._update_policy_map(policy_dict=self.policy_dict)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1727, in _update_policy_map
self._build_policy_map(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1838, in _build_policy_map
new_policy = create_policy_for_framework(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/policy.py", line 142, in create_policy_for_framework
return policy_class(observation_space, action_space, merged_config)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 64, in init
self._initialize_loss_from_dummy_batch()
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/policy.py", line 1418, in _initialize_loss_from_dummy_batch
actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/torch_policy_v2.py", line 571, in compute_actions_from_input_dict
return self._compute_action_helper(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/torch_policy_v2.py", line 1291, in _compute_action_helper
dist_inputs, state_out = self.model(input_dict, state_batches, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/modelv2.py", line 259, in call
res = self.forward(restored, state or [], seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 259, in forward
return super().forward(input_dict, state, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 98, in forward
output, new_state = self.forward_rnn(inputs, state, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 274, in forward_rnn
self._features, [h, c] = self.lstm(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 810, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 730, in check_forward_args
self.check_input(input, batch_sizes)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 218, in check_input
raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 148, got 24
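For context, the last frame is PyTorch's own shape check: the LSTM inside RLlib's recurrent wrapper was built for 148-dimensional features (presumably the flattened observation size RLlib computed when constructing the model), but the dummy batch fed to it has only 24. A minimal standalone sketch of the same failure, in plain PyTorch rather than the contest code:

```python
# Plain-PyTorch sketch of the failing check; the sizes mirror the error
# above but are otherwise arbitrary.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=148, hidden_size=256, batch_first=True)

# A (batch, time, features) tensor whose feature dim is 24 instead of 148.
bad_input = torch.zeros(1, 1, 24)

try:
    lstm(bad_input)
except RuntimeError as e:
    # -> input.size(-1) must be equal to input_size. Expected 148, got 24
    print(e)
```

In other words, the model RLlib built and the observations it is feeding disagree about the feature size, which usually points at a mismatch between the configured model and the environment's observation space rather than at PyTorch itself.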

Is this caused by the versions of the dependencies?
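One quick way to test the dependency-version theory is to print what is actually installed and compare it against the repo's pinned versions; a small sketch (the package list is just a guess at the relevant ones):

```python
# Print installed versions of the packages most likely involved, to compare
# against whatever versions the contest repo pins.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("ray", "torch", "dm-meltingpot", "numpy", "protobuf"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```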

Not sure, but this might help in the meantime: #5 (comment)

When I install the packages, it gives me the following errors (I'm running this in Colab):

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.24.3 which is incompatible.
tensorflow-datasets 4.9.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.
tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.

These conflicts are not related to the provided dependencies; they seem to stem from other packages already installed in your environment. Could you try a fresh virtual environment instead of the system Python and check if it works?
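For example, something along these lines on a local machine (the environment name is arbitrary, and the requirements file is a placeholder for whatever install step this repo documents):

```bash
python3 -m venv meltingpot-env
source meltingpot-env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt   # or the repo's documented install command
```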