Resume training from a restored checkpoint
researchyw20 opened this issue · 0 comments
researchyw20 commented
I was trying to figure out how to resume training from a restored checkpoint with run_ray_train.py. Specifically:
- I was referring to this tutorial: https://docs.ray.io/en/latest/rllib/rllib-saving-and-loading-algos-and-policies.html
- The code I was using is shown below:
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune import registry
from baselines.train import make_envs

ray.init(local_mode=True, ignore_reinit_error=True)
registry.register_env("meltingpot", make_envs.env_creator)
# Train mode: two failed attempts (checkpoint_dir is the path to a
# previously saved checkpoint)
my_ppo_config = PPOConfig().environment("meltingpot")
my_ppo = my_ppo_config.build()

# Method 1: fails at the .build stage
PPOConfig().environment("meltingpot").build().restore(checkpoint_dir)

# Method 2: fails at the .train stage
Algorithm.from_checkpoint(checkpoint_dir).train()
Both attempts raise a KeyError; the traceback is shown below:
ray::RolloutWorker.__init__() (pid=180001, ip=10.0.0.182, actor_id=17cb813ab79e0c981feebd6e01000000, repr=<ray.rllib.evaluation.rollout_worker._modify_class.<locals>.Class object at 0x7f6b2de58850>)
File "anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 397, in __init__
self.env = env_creator(copy.deepcopy(self.env_context))
File "/home/researchyw20/meltingpot/code/Melting-Pot-Contest-2023/baselines/train/make_envs.py", line 10, in env_creator
env = substrate.build(env_config['substrate'], roles=env_config['roles'])
File "anaconda3/envs/mpc_main/lib/python3.10/site-packages/ml_collections/config_dict/config_dict.py", line 909, in __getitem__
raise KeyError(self._generate_did_you_mean_message(key, str(e)))
KeyError: "'substrate'"
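From the traceback, my guess is that when the workers are rebuilt, env_creator is called with an env_config that no longer contains the 'substrate' (and 'roles') keys, since PPOConfig().environment("meltingpot") attaches no env_config. A minimal stand-in using plain dicts (fake_env_creator is an illustrative name, not RLlib API) reproduces the mechanism:

```python
# Stand-in for the failure mode: the creator indexes keys that an empty
# env_config does not provide, mirroring make_envs.env_creator, which
# reads env_config['substrate'] and env_config['roles'].
def fake_env_creator(env_config):
    return {
        "substrate": env_config["substrate"],
        "roles": env_config["roles"],
    }

# An empty env_config raises the same KeyError class as in the traceback.
try:
    fake_env_creator({})
except KeyError as err:
    print("KeyError:", err)

# Supplying the expected keys succeeds.
env = fake_env_creator({"substrate": "some_substrate", "roles": ["default"]})
print(env["substrate"])
```

If that is the cause, passing the same env_config that the original training run used, e.g. PPOConfig().environment("meltingpot", env_config={"substrate": ..., "roles": ...}), before .build()/.restore() might avoid the error; the env_config argument of AlgorithmConfig.environment() is the standard way to hand per-environment settings to the env creator in RLlib.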
Any help on this is appreciated.