Resume training from a restored checkpoint
researchyw20 opened this issue · 0 comments
researchyw20 commented
I was trying to figure out how to resume training from a restored checkpoint with run_ray_train.py. Specifically:
- I was referring to this tutorial: https://docs.ray.io/en/latest/rllib/rllib-saving-and-loading-algos-and-policies.html
- The code I was using is shown below:
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune import registry
from baselines.train import make_envs

ray.init(local_mode=True, ignore_reinit_error=True)
registry.register_env("meltingpot", make_envs.env_creator)
# Train mode: two failed attempts (checkpoint_dir is the path to a
# previously saved checkpoint)
my_ppo_config = PPOConfig().environment("meltingpot")
my_ppo = my_ppo_config.build()

# Method 1: fails at the .build stage
PPOConfig().environment("meltingpot").build().restore(checkpoint_dir)

# Method 2: fails at the .train stage
Algorithm.from_checkpoint(checkpoint_dir).train()
Both attempts raise a KeyError; the traceback is shown below:
ray::RolloutWorker.__init__() (pid=180001, ip=10.0.0.182, actor_id=17cb813ab79e0c981feebd6e01000000, repr=<ray.rllib.evaluation.rollout_worker._modify_class.<locals>.Class object at 0x7f6b2de58850>)
File "anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 397, in __init__
self.env = env_creator(copy.deepcopy(self.env_context))
File "/home/researchyw20/meltingpot/code/Melting-Pot-Contest-2023/baselines/train/make_envs.py", line 10, in env_creator
env = substrate.build(env_config['substrate'], roles=env_config['roles'])
File "anaconda3/envs/mpc_main/lib/python3.10/site-packages/ml_collections/config_dict/config_dict.py", line 909, in __getitem__
raise KeyError(self._generate_did_you_mean_message(key, str(e)))
KeyError: "'substrate'"
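From the traceback, my guess is that when the workers are rebuilt, env_creator is called with an env_config that no longer contains the 'substrate' (and 'roles') keys, since PPOConfig().environment("meltingpot") attaches no env_config. A minimal stand-in using plain dicts (fake_env_creator is an illustrative name, not RLlib API) reproduces the mechanism:

```python
# Stand-in for the failure mode: the creator indexes keys that an empty
# env_config does not provide, mirroring make_envs.env_creator, which
# reads env_config['substrate'] and env_config['roles'].
def fake_env_creator(env_config):
    return {
        "substrate": env_config["substrate"],
        "roles": env_config["roles"],
    }

# An empty env_config raises the same KeyError class as in the traceback.
try:
    fake_env_creator({})
except KeyError as err:
    print("KeyError:", err)

# Supplying the expected keys succeeds.
env = fake_env_creator({"substrate": "some_substrate", "roles": ["default"]})
print(env["substrate"])
```

If that is the cause, passing the same env_config that the original training run used, e.g. PPOConfig().environment("meltingpot", env_config={"substrate": ..., "roles": ...}), before .build()/.restore() might avoid the error; the env_config argument of AlgorithmConfig.environment() is the standard way to hand per-environment settings to the env creator in RLlib.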
Any help on this is appreciated.