Denys88/rl_games

PPO performance for humanoid


Hi, Nice work!
I noticed your work when I was looking at the Brax repository:)
In their paper, the Brax team mentioned that their PPO implementation didn't work well on humanoid, and this bug still exists now.
Previously I had suspected that there were some bugs with the Brax env.
But your performance on the humanoid seems to demonstrate that the problem may lie in their algorithm or hyperparameters.
I'd appreciate it if you could let me know if there's anything to watch out for when running humanoid with Brax.
Congratulations again on your excellent work.

I've run a few experiments with their PPO. You are right, it doesn't work.
I had a few ideas why:

  1. reward_scale - in most cases the scale should be something like 0.1 or 0.01 instead of 10 or 30. They use a separate neural network for the critic, so on its own it doesn't have a significant impact, but combined with gradient truncation and clip_value a different reward scale can significantly affect performance. There is a similar issue in SAC: I think they scale the reward to find a good balance between the two terms of the objective (value and temperature). As I remember, for SAC it is better to tune the temperature learning rate than to scale the reward.
  2. They have an entropy coefficient and it is pretty high. In 99% of cases you don't need it for a continuous action space: log_std is an independent vector and will go down over time automatically.
    But these changes didn't affect performance.
    I think I've found the reason why, but I haven't tested it yet:
    In Brax they are using NormalTanhDistribution:
    mu has no activation, like in my implementation, but sigma is represented with a softplus. My tests showed that it is better to represent it as log_std.
    And then they apply tanh over it. So if you replace softplus with exp(log_std), use a regular NormalDistribution, and truncate actions just before the environment step call, their implementation should work fine. A rough sketch of what I mean is below.
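To make that concrete, here is a minimal PyTorch sketch of the parameterization I mean (not the actual Brax/Flax code; the class and variable names are my own): mu comes straight out of a linear layer with no activation, sigma is exp(log_std) with log_std as an independent parameter vector, the distribution is a plain Normal, and the action is only clamped right before the environment step, outside the log-prob computation.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Gaussian policy with a state-independent log_std vector and no tanh squashing."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)           # no activation on mu
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # sigma = exp(log_std), not softplus

    def forward(self, obs):
        mu = self.mu_head(self.trunk(obs))
        sigma = self.log_std.exp()
        return torch.distributions.Normal(mu, sigma)


# Usage: keep the raw sample for the PPO log-prob/ratio and clamp only
# the copy that is sent to the environment.
# dist = policy(obs)
# action = dist.sample()
# logp = dist.log_prob(action).sum(-1)
# env_action = action.clamp(-1.0, 1.0)   # truncate just before env.step(env_action)
```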

Btw thanks! :)
To answer why my solution is much faster:
I used the same configuration as was used in IsaacGym:

  1. The learning rate is adaptive, based on the KL divergence between the old and the new policy: if it is more than 2 * kl_threshold I decrease it, if it is less than 0.5 * kl_threshold I increase it. It speeds up training a lot (for some reason it doesn't help as much for discrete action spaces). There is a short sketch of this rule after the list.
  2. I used a 512->256->128 neural network vs their
    policy_model = make_model([32, 32, 32, 32, policy_params_size], obs_size)
    value_model = make_model([256, 256, 256, 256, 256, 1], obs_size)
    and used ELU. It worked slightly better.
  3. I used only one network for both actor and critic. It speeds up training. But to do it you need to be sure that the value gradients are not very high. There are two possible ways: multiply the reward by 0.01 or use value normalization. Both work. In this case I used value normalization: I keep running mean/std statistics over the value predictions, so the value loss and its gradients stay close to zero. If you use their reward scale, one network for both value and policy will not work. A sketch of the shared network with value normalization is after the list.
  4. The entropy coef is zero and clip_value is False.
  5. I also have one more auxiliary loss which stabilizes training, though I'm not sure I need it in this case. It is called bound_loss. Since we don't have any activation on mu, in some cases the policy finds it useful to always output -1 or 1, and the predicted mu can drift far away from this range, so I added a loss to keep it near the bounds. It works better than tanh and has a regularizing effect. A sketch of this loss is also below.
    With it, I don't truncate mu by actions_low and actions_high inside the computational graph, so the gradients are not zeroed.
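Here is a rough sketch of the adaptive learning rate rule from point 1 (the kl_threshold, the bounds, and the 1.5 factor are illustrative placeholders, not the exact IsaacGym/rl_games config values):

```python
def adaptive_lr(lr, kl, kl_threshold=0.008, lr_min=1e-6, lr_max=1e-2):
    """Adjust the learning rate from the measured policy KL divergence."""
    if kl > 2.0 * kl_threshold:
        lr = max(lr / 1.5, lr_min)   # policy moved too far -> slow down
    elif kl < 0.5 * kl_threshold:
        lr = min(lr * 1.5, lr_max)   # policy barely moved -> speed up
    return lr


# After every update, measure the KL between the old and new policy and rescale:
# lr = adaptive_lr(lr, kl)
# for group in optimizer.param_groups:
#     group["lr"] = lr
```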
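And a sketch of points 2-3: a single 512->256->128 ELU trunk shared by actor and critic, with a running mean/std normalizer over the value targets. The class names and the exponential-moving-average update below are my own simplification for brevity, not the exact rl_games implementation.

```python
import torch
import torch.nn as nn


class ValueNormalizer(nn.Module):
    """Running mean/std over value targets so the value loss stays near zero."""

    def __init__(self, momentum=0.01, eps=1e-8):
        super().__init__()
        self.momentum, self.eps = momentum, eps
        self.register_buffer("mean", torch.zeros(1))
        self.register_buffer("var", torch.ones(1))

    @torch.no_grad()
    def update(self, targets):
        self.mean.lerp_(targets.mean(), self.momentum)
        self.var.lerp_(targets.var(unbiased=False), self.momentum)

    def normalize(self, x):
        return (x - self.mean) / torch.sqrt(self.var + self.eps)

    def denormalize(self, x):
        return x * torch.sqrt(self.var + self.eps) + self.mean


class SharedActorCritic(nn.Module):
    """One 512->256->128 ELU trunk feeding both the policy mean and the value head."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
        )
        self.mu = nn.Linear(128, act_dim)
        self.value = nn.Linear(128, 1)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value_norm = ValueNormalizer()

    def forward(self, obs):
        h = self.trunk(obs)
        return self.mu(h), self.log_std.exp(), self.value(h)


# Train the value head in normalized space so its gradients stay small:
# model.value_norm.update(returns)
# value_loss = 0.5 * (values - model.value_norm.normalize(returns)).pow(2).mean()
```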
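Finally, a sketch of the bound_loss idea from point 5 (the exact form and coefficient in rl_games may differ; coef and soft_bound here are placeholder values): it penalizes the squared amount by which mu exceeds [-1, 1], so the gradient pushes it back toward the range instead of clipping it inside the graph.

```python
import torch


def bound_loss(mu, soft_bound=1.0, coef=0.0001):
    """Penalize policy means that drift outside the action range.

    Squares the amount by which mu exceeds [-soft_bound, soft_bound], so the
    gradient pushes mu back toward the range without clipping it in the graph.
    """
    above = torch.clamp_min(mu - soft_bound, 0.0).pow(2)
    below = torch.clamp_min(-soft_bound - mu, 0.0).pow(2)
    return coef * (above + below).sum(dim=-1).mean()


# total_loss = policy_loss + value_coef * value_loss + bound_loss(mu)
```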

I am going to apply my changes and test them. I'm 99% sure it will work.

Thank you soooo much for your reply, very useful and insightful! Looking forward to your updates:)