[RLlib|Tune|Train] ValueError: Could not recover from checkpoint as it does not exist anymore
ciroaceto opened this issue · 2 comments
What happened + What you expected to happen
Checkpoints from an RLlib Tune experiment (PBT scheduler) are being deleted before another trial can restore from them. The reproduction script below produced two incomplete trials (status: ERROR) with the following error.txt:
ValueError: Could not recover from checkpoint as it does not exist on storage anymore. Got storage fs type 'local' and path: /home/.../PPO_2024-05-07_13-41-09/PPO_Pendulum-v1_b3dc9_00005
The error message above is from trial PPO_Pendulum-v1_b3dc9_00004. I suppose the exploitation mechanism was being applied and the checkpoint it was supposed to restore from had not been updated.
Versions / Dependencies
ray: 2.10/2.20
gymnasium: 0.28.1
python: 3.10.13
OS: Ubuntu 22.04.4
Reproduction script
```python
import random

from ray import tune, train
from ray.tune.schedulers import PopulationBasedTraining
from ray.rllib.algorithms.ppo import PPOConfig


# Postprocess the perturbed config to ensure it's still valid
def explore(config):
    # ensure we collect enough timesteps to do sgd
    if config["train_batch_size"] < config["sgd_minibatch_size"] * 2:
        config["train_batch_size"] = config["sgd_minibatch_size"] * 2
    # ensure we run at least one sgd iter
    if config["num_sgd_iter"] < 1:
        config["num_sgd_iter"] = 1
    return config


pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    perturbation_interval=10,
    resample_probability=0.25,
    # Specifies the mutations of these hyperparams
    hyperparam_mutations={
        "lambda": lambda: random.uniform(0.9, 1.0),
        "clip_param": lambda: random.uniform(0.01, 0.5),
        "lr": [5e-4, 1e-4, 5e-5, 1e-5],
        "num_sgd_iter": lambda: random.randint(1, 10),
        "sgd_minibatch_size": lambda: random.randint(128, 2048),
        "train_batch_size": lambda: random.randint(1000, 5000),
    },
    custom_explore_fn=explore,
)

config_PPO = (
    PPOConfig()
    .environment('Pendulum-v1')
    .rollouts(
        num_rollout_workers=1,
        num_envs_per_worker=1,
        create_env_on_local_worker=True,
    )
    .framework('torch')
    .resources(
        num_gpus=1,
        num_cpus_per_worker=1,
    )
    .training(
        lr=tune.choice([5e-4, 1e-4, 5e-5, 1e-5]),
        kl_coeff=tune.choice([0.0, 0.5, 1.0]),
        lambda_=tune.choice([0.9, 0.95, 0.99]),
        clip_param=tune.choice([0.1, 0.2, 0.3]),
        num_sgd_iter=tune.choice([3, 5, 7, 10]),
        sgd_minibatch_size=tune.choice([128, 512, 1024]),
        train_batch_size=tune.choice([2048, 4096]),
        model={'fcnet_hiddens': [256, 256]},
    )
    .debugging(log_level="DEBUG")
)

tuner = tune.Tuner(
    "PPO",
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        scheduler=pbt,
        num_samples=10,
    ),
    param_space=config_PPO,
    run_config=train.RunConfig(
        stop={"training_iteration": 100},
        checkpoint_config=train.CheckpointConfig(
            num_to_keep=3,
            checkpoint_frequency=2,
            checkpoint_score_order="max",
            checkpoint_score_attribute="episode_reward_mean",
        ),
    ),
)

results = tuner.fit()
print("best hyperparameters: ", results.get_best_result().config)
```
Issue Severity
High: It blocks me from completing my task.
I have also tried increasing `num_to_keep` to 4 and 5 (the code mentions that a low `num_to_keep` can cause a similar issue), but the error remains. I also noticed that `checkpoint_frequency` does not seem to work as intended: in previous Ray versions checkpoints were only created every `checkpoint_frequency` iterations, whereas now a checkpoint is created on every iteration.
@ciroaceto PBT with (very frequent) time-based checkpointing and a low `num_to_keep` is not very stable, because trial scheduling is nondeterministic. Here are a few tips to get this working:
- Use `training_iteration` as the perturbation interval unit instead of `time_total_s`:
```python
checkpoint_frequency = 2

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=checkpoint_frequency,
    ...,
)

tuner = Tuner(
    ...,
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            # num_to_keep=4,  # if disk space is not an issue, keep all checkpoints; otherwise, increase this
            checkpoint_frequency=checkpoint_frequency,
        ),
    ),
)
```
- Another option is to set `synch=True` to make sure that all trials are in lock step, so the checkpoint assigned to a trial will never be missing. You should be able to set a lower `num_to_keep` in this scenario (see the sketch after this list).
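A minimal sketch of that synchronous setup (not part of the original reply; it reuses `explore` and a trimmed `hyperparam_mutations` from the reproduction script above):

```python
# Sketch: synchronous PBT. All trials pause at each perturbation boundary,
# so an exploiting trial never points at a checkpoint that has already been
# cleaned up by num_to_keep.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,  # keep aligned with checkpoint_frequency
    resample_probability=0.25,
    hyperparam_mutations={
        "lr": [5e-4, 1e-4, 5e-5, 1e-5],
        "train_batch_size": lambda: random.randint(1000, 5000),
    },
    custom_explore_fn=explore,
    synch=True,
)
```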
> In previous ray versions the checkpoints were only created every checkpoint_frequency iterations. Right now a checkpoint is created in each iteration.

This may be a combination of a checkpoint-folder naming change and the time-based perturbation interval you currently have:
- Checkpoint folders are now named by checkpoint index rather than by `training_iteration`, starting from 0 and incrementing by 1 each time.
- A checkpoint is forced on every perturbation interval for high-performing trials, which may make checkpointing more frequent.
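For illustration only, here is a hypothetical trial directory under the index-based naming (assuming `checkpoint_frequency=2` plus one perturbation-forced checkpoint; the folder index does not match the training iteration):

```
PPO_Pendulum-v1_b3dc9_00000/
├── checkpoint_000000/   # taken at iteration 2 (checkpoint_frequency=2)
├── checkpoint_000001/   # taken at iteration 4
└── checkpoint_000002/   # forced at a perturbation boundary
```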