[RLlib|Tune|Train] ValueError: Could not recover from checkpoint as it does not exist anymore
ciroaceto opened this issue · 2 comments
What happened + What you expected to happen
Checkpoints from an RLlib Tune experiment (PBT scheduler) are being deleted before another trial can restore from them. The reproduction script below produced two incomplete trials (status: ERROR) with the following error.txt:
ValueError: Could not recover from checkpoint as it does not exist on storage anymore. Got storage fs type 'local' and path: /home/.../PPO_2024-05-07_13-41-09/PPO_Pendulum-v1_b3dc9_00005
The error message above is from trial PPO_Pendulum-v1_b3dc9_00004. I suppose the exploitation mechanism was being applied and the checkpoint it was supposed to restore from had not been updated.
Versions / Dependencies
ray: 2.10/2.20
gymnasium: 0.28.1
python: 3.10.13
OS: Ubuntu 22.04.4
Reproduction script
```python
import random

from ray import tune, train
from ray.tune.schedulers import PopulationBasedTraining
from ray.rllib.algorithms.ppo import PPOConfig


# Postprocess the perturbed config to ensure it's still valid
def explore(config):
    # ensure we collect enough timesteps to do sgd
    if config["train_batch_size"] < config["sgd_minibatch_size"] * 2:
        config["train_batch_size"] = config["sgd_minibatch_size"] * 2
    # ensure we run at least one sgd iter
    if config["num_sgd_iter"] < 1:
        config["num_sgd_iter"] = 1
    return config


pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    perturbation_interval=10,
    resample_probability=0.25,
    # Specifies the mutations of these hyperparams
    hyperparam_mutations={
        "lambda": lambda: random.uniform(0.9, 1.0),
        "clip_param": lambda: random.uniform(0.01, 0.5),
        "lr": [5e-4, 1e-4, 5e-5, 1e-5],
        "num_sgd_iter": lambda: random.randint(1, 10),
        "sgd_minibatch_size": lambda: random.randint(128, 2048),
        "train_batch_size": lambda: random.randint(1000, 5000),
    },
    custom_explore_fn=explore,
)

config_PPO = (
    PPOConfig()
    .environment('Pendulum-v1')
    .rollouts(
        num_rollout_workers=1,
        num_envs_per_worker=1,
        create_env_on_local_worker=True,
    )
    .framework('torch')
    .resources(
        num_gpus=1,
        num_cpus_per_worker=1,
    )
    .training(
        lr=tune.choice([5e-4, 1e-4, 5e-5, 1e-5]),
        kl_coeff=tune.choice([0.0, 0.5, 1.0]),
        lambda_=tune.choice([0.9, 0.95, 0.99]),
        clip_param=tune.choice([0.1, 0.2, 0.3]),
        num_sgd_iter=tune.choice([3, 5, 7, 10]),
        sgd_minibatch_size=tune.choice([128, 512, 1024]),
        train_batch_size=tune.choice([2048, 4096]),
        model={'fcnet_hiddens': [256, 256]},
    )
    .debugging(log_level="DEBUG")
)

tuner = tune.Tuner(
    "PPO",
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        scheduler=pbt,
        num_samples=10,
    ),
    param_space=config_PPO,
    run_config=train.RunConfig(
        stop={"training_iteration": 100},
        checkpoint_config=train.CheckpointConfig(
            num_to_keep=3,
            checkpoint_frequency=2,
            checkpoint_score_order="max",
            checkpoint_score_attribute="episode_reward_mean",
        ),
    ),
)

results = tuner.fit()
print("best hyperparameters: ", results.get_best_result().config)
```
Issue Severity
High: It blocks me from completing my task.
I have also tried increasing `num_to_keep` to 4 and 5 (the code mentions that a low `num_to_keep` can cause a similar issue), but the error remains. I also noticed that `checkpoint_frequency` does not seem to work as intended: in previous Ray versions checkpoints were only created every `checkpoint_frequency` iterations, whereas now a checkpoint is created on every iteration.
@ciroaceto PBT with (very frequent) time-based checkpointing and a low `num_to_keep` is not very stable, because trial scheduling is nondeterministic. Here are a few tips to get this working:
- Use `training_iteration` as the perturbation interval unit instead of `time_total_s`:
```python
checkpoint_frequency = 2

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=checkpoint_frequency,
    ...,
)

tuner = Tuner(
    ...,
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            # num_to_keep=4,  # if disk space is not an issue, keep all checkpoints; otherwise, increase this
            checkpoint_frequency=checkpoint_frequency,
        ),
    ),
)
```
- Another option is to set `synch=True` to make sure that all trials are in lock step, so the checkpoint assigned to a trial will never be missing. You should be able to set a lower `num_to_keep` in this scenario (see the sketch after this list).
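A minimal sketch of that synchronous setup (not part of the original reply; it reuses `explore` and a trimmed `hyperparam_mutations` from the reproduction script above):

```python
# Sketch: synchronous PBT. All trials pause at each perturbation boundary,
# so an exploiting trial never points at a checkpoint that has already been
# cleaned up by num_to_keep.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,  # keep aligned with checkpoint_frequency
    resample_probability=0.25,
    hyperparam_mutations={
        "lr": [5e-4, 1e-4, 5e-5, 1e-5],
        "train_batch_size": lambda: random.randint(1000, 5000),
    },
    custom_explore_fn=explore,
    synch=True,
)
```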
> In previous ray versions the checkpoints were only created every checkpoint_frequency iterations. Right now a checkpoint is created in each iteration.

This may be a combination of a checkpoint-folder naming change and the time-based perturbation interval you currently have:
- Checkpoint folders are now named by checkpoint index rather than by `training_iteration`, starting from 0 and incrementing by 1 each time.
- A checkpoint is forced on every perturbation interval for high-performing trials, which may make checkpointing more frequent.
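For illustration only, here is a hypothetical trial directory under the index-based naming (assuming `checkpoint_frequency=2` plus one perturbation-forced checkpoint; the folder index does not match the training iteration):

```
PPO_Pendulum-v1_b3dc9_00000/
├── checkpoint_000000/   # taken at iteration 2 (checkpoint_frequency=2)
├── checkpoint_000001/   # taken at iteration 4
└── checkpoint_000002/   # forced at a perturbation boundary
```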