Hardlygo opened this issue 3 years ago · 0 comments
I think that the advantage value here should be based on the old actor:

target_v = reward + args.gamma * self.critic_net(next_state)
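To illustrate the suggestion, here is a minimal torch-free sketch of computing the TD target and advantage from a snapshot of the value network taken *before* the update, so the advantage is not biased by the freshly updated parameters. The helper name `compute_advantage` and the callable-critic interface are assumptions for illustration, not the repository's actual API.

```python
import copy

def compute_advantage(critic_net, state, next_state, reward, gamma):
    # Hypothetical helper: freeze a pre-update copy of the critic so both
    # target_v and the advantage come from the *old* value estimates.
    old_critic = copy.deepcopy(critic_net)
    target_v = reward + gamma * old_critic(next_state)   # TD target
    advantage = target_v - old_critic(state)             # TD-error advantage
    return target_v, advantage

# Usage with a toy linear "critic" V(s) = 0.5 * s:
critic = lambda s: 0.5 * s
target_v, advantage = compute_advantage(critic, state=1.0, next_state=2.0,
                                        reward=1.0, gamma=0.9)
```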