Denys88/rl_games

loading checkpoint doesn't work for resuming training and test

Opened this issue · 9 comments

Hi! I'm trying the humanoid SAC example, but loading a checkpoint doesn't seem to work for testing or for resuming training. Here is what I did:

  1. training: python train.py task=HumanoidSAC train=HumanoidSAC
  2. Test: python train.py task=HumanoidSAC train=HumanoidSAC test=True checkpoint=runs/HumanoidSAC_19-21-24-22/nn/HumanoidSAC.pth
  3. resume training: python train.py task=HumanoidSAC train=HumanoidSAC checkpoint=runs/HumanoidSAC_19-21-24-22/nn/HumanoidSAC.pth

The training itself works fine and the reward went up to >5000, but if I test or resume from the saved checkpoint, the weights don't seem to be initialized properly and the reward is around 40. I did a quick check and it does go through restore and set_full_state_weights, so I'm not sure where the problem might be. One thing I did change was weights['step'] -> weights['steps'], because of a KeyError.
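Roughly, my local workaround looks like this (just a sketch; read_step_counter is an illustrative helper of mine, not an rl_games function):

# Tolerate both 'step' and 'steps' when reading the training-step counter from
# the dict passed to set_full_state_weights, since the save and load paths
# appear to disagree on the key name in this version.
def read_step_counter(weights: dict) -> int:
    for key in ('steps', 'step'):
        if key in weights:
            return weights[key]
    return 0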

I'm using rl-games 1.6.0 and IsaacGym 1.0rc4.

Thank you!!

@qiuyuchen14 thank you for reporting the issue. I'll take a look.

Thanks, @ViktorM ! Any updates?

I reproduced the issue, also found another one and will push the fix tomorrow.

Hi:

I've designed an environment in IsaacGym and am currently training it with the A2C continuous PPO implementation. I am running into a similar error when trying to resume training from a checkpoint or use a checkpoint for evaluation: my training rewards have converged at around ~2000, while my evaluation rewards are ~200. As a more informative metric, the task terminates if the agent performs any unrecoverable behavior such as falling down, or if the episode length hits 700. My average episode length before termination is ~690-700 in training at convergence, but ~50-100 in testing. Nothing in my environment code changes the environment between training and testing, and my training reward has stayed consistently around 2000 over 3e7 environment steps, so I'm confident the issue is not overfitting. I train with the following configuration:

params:
  seed: ${...seed}

  algo:
    name: a2c_continuous

  model:
    name: continuous_a2c_logstd

  network:
    name: actor_critic
    separate: False
    space:
      continuous:
        mu_activation: None
        sigma_activation: None

        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0
        fixed_sigma: True
    mlp:
      units: [64, 32, 32]
      activation: elu
      
      initializer:
        name: default
      regularizer:
        name: None

  load_checkpoint: ${if:${...checkpoint},True,False} # flag which sets whether to load the checkpoint
  load_path: ${...checkpoint} # path to the checkpoint to load

  config:
    name: ${resolve_default:H1Reach,${....experiment}}
    full_experiment_name: ${.name}
    env_name: rlgpu
    multi_gpu: ${....multi_gpu}
    ppo: True
    mixed_precision: False
    normalize_input: True
    normalize_value: True
    reward_shaper:
      scale_value: 0.1
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95
    learning_rate: 2e-3
    lr_schedule: adaptive
    max_epochs: ${resolve_default:12000,${....max_iterations}}
    grad_norm: 1.0
    entropy_coef: 0.00
    truncate_grads: True
    horizon_length: 32
    bounds_loss_coef: 0.0001
    num_actors: ${....task.env.numEnvs}   
    schedule_type: standard
    kl_threshold: 0.008
    score_to_win: 1000000
    max_frames: 10_000_000_000
    save_best_after: 100
    save_frequency: 1000
    print_stats: True
    e_clip: 0.1
    minibatch_size: 32768
    mini_epochs: 4
    critic_coef: 4.0
    clip_value: True
    seq_len: 32

    player:
      deterministic: True
      games_num: 8000

I have tried setting player.deterministic to both True and False, but it does not change the issue for me. Since this thread already reports inconsistencies between training and testing from checkpoints for other agents, I wanted to follow up with the question above. I am using rl-games 1.6.1 and have tried both the best checkpoint and the checkpoints saved every save_frequency iterations that are labeled with the highest rewards. Thanks!
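For what it's worth, here is the kind of quick sanity check I run on a saved checkpoint (a generic inspection sketch; the exact keys depend on the rl_games version and algorithm, and the path is just a placeholder for my run):

import torch

ckpt = torch.load("runs/H1Reach/nn/H1Reach.pth", map_location="cpu")  # placeholder path
for key, value in ckpt.items():
    if isinstance(value, dict):
        print(f"{key}: dict with {len(value)} entries")
    elif torch.is_tensor(value):
        print(f"{key}: tensor of shape {tuple(value.shape)}")
    else:
        print(f"{key}: {value!r}")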

@sashwat-mahalingam could you try the regular Ant? Does it work?

The regular Ant has the same issue. Below is the reward curve over epochs for two runs. The magenta one is an Ant policy I trained with PPO from scratch. The purple one is the same Ant policy resumed from a checkpoint where the reward was 8039, but at epoch 0 it achieves a far lower reward despite having resumed from that checkpoint.

[Plot: reward curves over epochs for the two Ant runs]

However, the testing reward does fine, achieving 9015.

@sashwat-mahalingam this one could be related to the way I do reporting. When you restart, the first couple of total-reward reports come from the failed ants, because you don't get any results from the good ants until step 1000. I'll double-check it anyway.
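Roughly, the effect looks like this (a toy simulation of the reporting, not the actual rl_games code; the failure rate, episode lengths, and rewards are made-up numbers):

# Right after a restart, only environments that fail early have completed an
# episode, so a mean over finished episodes is dominated by low rewards until
# the long, successful episodes start terminating around step ~1000.
import numpy as np

rng = np.random.default_rng(0)
num_envs = 4096
fail = rng.random(num_envs) < 0.10  # assume 10% of envs fail early
ep_len = np.where(fail, rng.integers(20, 100, num_envs), 1000)
ep_reward = np.where(fail, rng.normal(40.0, 10.0, num_envs), rng.normal(8000.0, 500.0, num_envs))

for step in (100, 500, 1000):
    done = ep_len <= step
    mean_r = ep_reward[done].mean() if done.any() else float("nan")
    print(f"step {step:4d}: {done.sum():5d} finished episodes, mean reward {mean_r:8.1f}")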

I see; this makes sense. Do you have any pointers as to what else I could be missing that would cause my environment to show this discrepancy between training and testing results (since for the Ant environment that is clearly not the case)? I make sure not to apply randomizations to my custom environment, but I wonder if there is some other configuration I am missing that changes the testing environment drastically from the training one in the rl_games setup. I've already discussed trying stochastic vs. deterministic players above, but beyond that, from my read of the rl_games codebase, I am not sure what other configurations might cause the environment to behave so differently between training and testing.
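For concreteness, my understanding of the stochastic-vs-deterministic difference is just whether the Gaussian is sampled or its mean is used; a generic sketch under that assumption (not the actual rl_games player code), which shouldn't explain a 10x gap on its own:

import torch

def select_action(mu: torch.Tensor, log_std: torch.Tensor, deterministic: bool) -> torch.Tensor:
    # Deterministic evaluation takes the Gaussian mean; training samples around it.
    if deterministic:
        return mu
    return torch.distributions.Normal(mu, log_std.exp()).sample()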

Here is an example of how my yaml looks for the actual environment itself:

# used to create the object
name: H1Reach

physics_engine: ${..physics_engine}

# if given, will override the device setting in gym.

# set to True if you use camera sensors in the environment
enableCameraSensors: True

env:
  numEnvs: ${resolve_default:8192,${...num_envs}}
  envSpacing: 4.0

  clipActions: 1.0

  plane:
    staticFriction: 2.0
    dynamicFriction: 2.0
    restitution: 0.0

  asset:
    assetRoot: "../../assets"
    assetFileName: "urdf/h1_description/h1_model.urdf"
  
  stiffness: {'left_hip_yaw_joint': 32.0, 'left_hip_roll_joint': 32.0, 'left_hip_pitch_joint': 32.0, 'left_knee_joint': 16.0, 'left_ankle_joint': 4.0, 'right_hip_yaw_joint': 32.0, 'right_hip_roll_joint': 32.0, 'right_hip_pitch_joint': 32.0, 'right_knee_joint': 16.0, 'right_ankle_joint': 4.0, 'torso_joint': 0.0, 'left_shoulder_pitch_joint': 14.0, 'left_shoulder_roll_joint': 14.0, 'left_shoulder_yaw_joint': 12.0, 'left_elbow_joint': 4.0, 'right_shoulder_pitch_joint': 14.0, 'right_shoulder_roll_joint': 14.0, 'right_shoulder_yaw_joint': 12.0, 'right_elbow_joint': 4.0}
  damping: {'left_hip_yaw_joint': 5.0, 'left_hip_roll_joint': 9.3, 'left_hip_pitch_joint': 9.0, 'left_knee_joint': 2.5, 'left_ankle_joint': 0.24, 'right_hip_yaw_joint': 5.0, 'right_hip_roll_joint': 9.3, 'right_hip_pitch_joint': 9.0, 'right_knee_joint': 2.5, 'right_ankle_joint': 0.24, 'torso_joint': 0.0, 'left_shoulder_pitch_joint': 2.6, 'left_shoulder_roll_joint': 2.5, 'left_shoulder_yaw_joint': 1.2, 'left_elbow_joint': 0.48, 'right_shoulder_pitch_joint': 2.6, 'right_shoulder_roll_joint': 2.5, 'right_shoulder_yaw_joint': 1.2, 'right_elbow_joint': 0.48}

  epLen: 700

  fixRobot: False

  aliveReward: 5.0
  offAxisScale: 1.0
  lowerHeightScale: 1.0
  velScale: 0.0
  proximityScale: 0.1

sim:
  dt: 0.0166 # ~1/60 s
  substeps: 12
  up_axis: "z"
  use_gpu_pipeline: ${eq:${...pipeline},"gpu"}
  gravity: [0.0, 0.0, -9.81]
  physx:
    num_threads: ${....num_threads}
    solver_type: ${....solver_type}
    use_gpu: True # ${contains:"cuda",${....sim_device}} # set to False to run on CPU
    num_position_iterations: 4
    num_velocity_iterations: 0
    contact_offset: 0.002
    rest_offset: 0.0001
    bounce_threshold_velocity: 0.2
    max_depenetration_velocity: 1000.0
    default_buffer_size_multiplier: 5.0
    max_gpu_contact_pairs: 1048576 # 1024*1024
    num_subscenes: ${....num_subscenes}
    contact_collection: 0 # 0: CC_NEVER (don't collect contact info), 1: CC_LAST_SUBSTEP (collect only contacts on last substep), 2: CC_ALL_SUBSTEPS (broken - do not use!)

task:
  randomize: False

I think the issue is that the PhysX solver is only deterministic under the same sequence of operations and the same seed. It will not behave exactly the same way when I load a checkpoint under a fixed seed, or evaluate the policy, as it did when I trained the model from scratch. I had assumed that the distribution of physics steps would be the same given just the seed, but it seems this is not guaranteed either, so I can't expect the same expected performance across reruns. Since I do not do domain randomization, which the other tasks all seem to use, could adding it fix the problem?
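In case it helps others hitting this, here is the generic seeding I use to rule out the Python-side RNGs (standard PyTorch/NumPy calls, not rl_games-specific; it does not make the GPU PhysX pipeline itself bit-deterministic):

import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # Removes Python/NumPy/PyTorch sampling as a source of run-to-run variation.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Optional: trade speed for more deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False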