Loading a checkpoint doesn't work for resuming training or testing
Opened this issue · 9 comments
Hi! I'm trying the humanoid SAC example, but loading a checkpoint doesn't seem to work for testing or resuming training. Here is what I did:
- Training:
  `python train.py task=HumanoidSAC train=HumanoidSAC`
- Testing:
  `python train.py task=HumanoidSAC train=HumanoidSAC test=True checkpoint=runs/HumanoidSAC_19-21-24-22/nn/HumanoidSAC.pth`
- Resuming training:
  `python train.py task=HumanoidSAC train=HumanoidSAC checkpoint=runs/HumanoidSAC_19-21-24-22/nn/HumanoidSAC.pth`
Training itself works fine and the reward goes up to >5000, but if I test or resume from the saved checkpoint, the weights don't seem to be initialized properly and the reward is only around 40. A quick check showed that it does go through `restore` and `set_full_state_weights`, so I'm not sure where the problem might be. One thing I did have to change was `weights['step']` -> `weights['steps']`, due to a KeyError.
I'm using rl-games 1.6.0 and IsaacGym 1.0rc4.
Thank you!!
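For reference, here is a minimal pure-Python sketch of the `weights['step']` -> `weights['steps']` workaround mentioned above (the helper name is mine; the key names come from the KeyError described in the report):

```python
# Hypothetical helper for the 'step' vs 'steps' key mismatch hit when
# restoring a checkpoint; the key names come from the KeyError above.
def normalize_step_key(weights: dict) -> dict:
    """Return a copy with the epoch counter stored under both key names,
    so either spelling can be restored without a KeyError."""
    fixed = dict(weights)
    if 'step' in fixed and 'steps' not in fixed:
        fixed['steps'] = fixed['step']
    elif 'steps' in fixed and 'step' not in fixed:
        fixed['step'] = fixed['steps']
    return fixed

checkpoint = {'model': {}, 'steps': 12345}
print(normalize_step_key(checkpoint)['step'])  # 12345
```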
@qiuyuchen14 thank you for reporting the issue. I'll take a look.
Thanks, @ViktorM ! Any updates?
I reproduced the issue, also found another one and will push the fix tomorrow.
Hi:
I've designed an environment in IsaacGym and am currently training it with the A2C continuous PPO implementation. I am running into a similar error when trying to resume training from a checkpoint or use a checkpoint for evaluation: my training rewards have converged around ~2000, while my evaluation rewards are ~200. As a more informative metric, the task terminates if the agent performs any unrecoverable behavior such as falling down, or if the episode length hits 700. My average episode length before termination in training is ~690-700 upon convergence, while in testing it is ~50-100. Nothing in my environment code changes the environment between training and testing, and my training reward has consistently been around 2000 over 3e7 environment steps (so I'm confident the issue is not overfitting). I train with the following configuration:
```yaml
params:
  seed: ${...seed}

  algo:
    name: a2c_continuous

  model:
    name: continuous_a2c_logstd

  network:
    name: actor_critic
    separate: False
    space:
      continuous:
        mu_activation: None
        sigma_activation: None
        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0
        fixed_sigma: True
    mlp:
      units: [64, 32, 32]
      activation: elu
      initializer:
        name: default
      regularizer:
        name: None

  load_checkpoint: ${if:${...checkpoint},True,False} # flag which sets whether to load the checkpoint
  load_path: ${...checkpoint} # path to the checkpoint to load

  config:
    name: ${resolve_default:H1Reach,${....experiment}}
    full_experiment_name: ${.name}
    env_name: rlgpu
    multi_gpu: ${....multi_gpu}
    ppo: True
    mixed_precision: False
    normalize_input: True
    normalize_value: True
    reward_shaper:
      scale_value: 0.1
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95
    learning_rate: 2e-3
    lr_schedule: adaptive
    max_epochs: ${resolve_default:12000,${....max_iterations}}
    grad_norm: 1.0
    entropy_coef: 0.00
    truncate_grads: True
    horizon_length: 32
    bounds_loss_coef: 0.0001
    num_actors: ${....task.env.numEnvs}
    schedule_type: standard
    kl_threshold: 0.008
    score_to_win: 1000000
    max_frames: 10_000_000_000
    save_best_after: 100
    save_frequency: 1000
    print_stats: True
    e_clip: 0.1
    minibatch_size: 32768
    mini_epochs: 4
    critic_coef: 4.0
    clip_value: True
    seq_len: 32

    player:
      deterministic: True
      games_num: 8000
```
I have tried `player.deterministic` both True and False, but it does not change the issue for me. As this thread has cited inconsistencies between training and testing checkpoints for other agents, I wanted to follow up with the above question. I am using rl-games 1.6.1, and have tried the best checkpoint as well as the checkpoints normally saved every `save_frequency` iterations that are labeled with the highest rewards. Thanks!
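As a side note, the batch arithmetic in the config above works out cleanly; here is a quick sanity check (values copied from the config, variable names mine):

```python
# PPO batch arithmetic from the config above (values copied verbatim).
horizon_length = 32
num_actors = 8192       # numEnvs
minibatch_size = 32768

batch_size = horizon_length * num_actors  # transitions collected per epoch
assert batch_size % minibatch_size == 0   # batch must split evenly into minibatches
num_minibatches = batch_size // minibatch_size
print(batch_size, num_minibatches)  # 262144 8
```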
@sashwat-mahalingam could you try a regular Ant? Does it work?
The regular Ant has the same issue. Below is the reward curve over epochs for two runs. The magenta one is an Ant policy I trained with PPO from scratch. The purple one is me training the Ant policy again from a checkpoint where the reward was 8039 (but at epoch 0 it achieves far below that reward, in spite of having resumed from the checkpoint).
[Image: reward curves over epochs for the two Ant runs]
However, the testing reward does fine, achieving 9015.
@sashwat-mahalingam this one could be related to the way I do reporting. When you restart, the first couple of total reward reports come from the failed ants, because you don't get any results from the good ants until step 1000. I'll double-check it anyway.
I see; this makes sense. Do you have any pointers as to what else I could be missing that would cause my environment to show this discrepancy between training and testing results (since for the Ant environment this is clearly not the case)? While I make sure not to apply randomizations to my custom environment, I am wondering if there are other configurations I am missing that change the testing environment drastically from the training one in the RLGames setup. I've already discussed trying stochastic vs. deterministic players above, but beyond that, from my read of the RLGames codebase, I am not sure what other configuration might cause the environment to behave much differently between training and testing.
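One concrete thing worth ruling out with `normalize_input: True` is whether the running observation statistics are restored along with the model weights; if they are not, observations are normalized differently at test time and performance collapses even though the weights are correct. A minimal pure-Python sketch of such a running normalizer (illustrative only, not rl-games' actual class):

```python
# Minimal running mean/variance normalizer (Welford's online algorithm).
# If state like this is NOT saved/restored with the checkpoint, test-time
# observations are scaled differently than during training.
class RunningNorm:
    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x: float) -> float:
        var = self.m2 / self.count if self.count > 1 else 1.0
        return (x - self.mean) / (var ** 0.5 + 1e-8)

    def state_dict(self) -> dict:
        return {'count': self.count, 'mean': self.mean, 'm2': self.m2}

    def load_state_dict(self, state: dict) -> None:
        self.count = state['count']
        self.mean = state['mean']
        self.m2 = state['m2']
```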
Here is an example of how my yaml looks for the actual environment itself:
```yaml
# used to create the object
name: H1Reach
physics_engine: ${..physics_engine}

# if given, will override the device setting in gym.
# set to True if you use camera sensors in the environment
enableCameraSensors: True

env:
  numEnvs: ${resolve_default:8192,${...num_envs}}
  envSpacing: 4.0
  clipActions: 1.0

  plane:
    staticFriction: 2.0
    dynamicFriction: 2.0
    restitution: 0.0

  asset:
    assetRoot: "../../assets"
    assetFileName: "urdf/h1_description/h1_model.urdf"
    stiffness: {'left_hip_yaw_joint': 32.0, 'left_hip_roll_joint': 32.0, 'left_hip_pitch_joint': 32.0, 'left_knee_joint': 16.0, 'left_ankle_joint': 4.0, 'right_hip_yaw_joint': 32.0, 'right_hip_roll_joint': 32.0, 'right_hip_pitch_joint': 32.0, 'right_knee_joint': 16.0, 'right_ankle_joint': 4.0, 'torso_joint': 0.0, 'left_shoulder_pitch_joint': 14.0, 'left_shoulder_roll_joint': 14.0, 'left_shoulder_yaw_joint': 12.0, 'left_elbow_joint': 4.0, 'right_shoulder_pitch_joint': 14.0, 'right_shoulder_roll_joint': 14.0, 'right_shoulder_yaw_joint': 12.0, 'right_elbow_joint': 4.0}
    damping: {'left_hip_yaw_joint': 5.0, 'left_hip_roll_joint': 9.3, 'left_hip_pitch_joint': 9.0, 'left_knee_joint': 2.5, 'left_ankle_joint': 0.24, 'right_hip_yaw_joint': 5.0, 'right_hip_roll_joint': 9.3, 'right_hip_pitch_joint': 9.0, 'right_knee_joint': 2.5, 'right_ankle_joint': 0.24, 'torso_joint': 0.0, 'left_shoulder_pitch_joint': 2.6, 'left_shoulder_roll_joint': 2.5, 'left_shoulder_yaw_joint': 1.2, 'left_elbow_joint': 0.48, 'right_shoulder_pitch_joint': 2.6, 'right_shoulder_roll_joint': 2.5, 'right_shoulder_yaw_joint': 1.2, 'right_elbow_joint': 0.48}

  epLen: 700
  fixRobot: False
  aliveReward: 5.0
  offAxisScale: 1.0
  lowerHeightScale: 1.0
  velScale: 0.0
  proximityScale: 0.1

sim:
  dt: 0.0166 # 1/60 s
  substeps: 12
  up_axis: "z"
  use_gpu_pipeline: ${eq:${...pipeline},"gpu"}
  gravity: [0.0, 0.0, -9.81]
  physx:
    num_threads: ${....num_threads}
    solver_type: ${....solver_type}
    use_gpu: True # ${contains:"cuda",${....sim_device}} # set to False to run on CPU
    num_position_iterations: 4
    num_velocity_iterations: 0
    contact_offset: 0.002
    rest_offset: 0.0001
    bounce_threshold_velocity: 0.2
    max_depenetration_velocity: 1000.0
    default_buffer_size_multiplier: 5.0
    max_gpu_contact_pairs: 1048576 # 1024*1024
    num_subscenes: ${....num_subscenes}
    contact_collection: 0 # 0: CC_NEVER (don't collect contact info), 1: CC_LAST_SUBSTEP (collect only contacts on last substep), 2: CC_ALL_SUBSTEPS (broken - do not use!)

task:
  randomize: False
```
I think the issue is that the PhysX solver is only deterministic under the same sequence of operations and the same seed. It would not act exactly the same way when I load a checkpoint under a fixed seed and evaluate it as when I train the model from scratch. I had assumed that the distribution of physics steps would be the same given just the seed, but it seems this is not guaranteed either, so I can't expect the same expected performance during reruns. As I do not do domain randomization, which the other tasks all seem to do, could adding it in fix the problem?
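For what it's worth, even with every controllable seed fixed, the GPU PhysX pipeline is not guaranteed to be bit-deterministic, so a sketch like the following only pins down the Python-side sources of variation (the function name is mine, not an rl-games API):

```python
import os
import random

def seed_everything(seed: int) -> None:
    """Seed the Python-side RNGs; GPU PhysX can still diverge between runs."""
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    # In a full setup you would also seed numpy and torch here, e.g.:
    # numpy.random.seed(seed); torch.manual_seed(seed)

seed_everything(42)
a = [random.random() for _ in range(3)]
seed_everything(42)
b = [random.random() for _ in range(3)]
print(a == b)  # True
```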