sweetice/Deep-reinforcement-learning-with-pytorch

About the advantage values in PPO2

Hardlygo opened this issue · 0 comments

I think the advantage value here should be based on the old actor:
target_v = reward + args.gamma * self.critic_net(next_state)
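
To make the point concrete, here is a minimal sketch of what I mean, not the repository's actual code. It computes the one-step target and advantage once, from a snapshot of the critic taken before any update, and then reuses those stored values. `LinearCritic`, `one_step_advantages`, `GAMMA`, and the `dones` mask are hypothetical names and assumptions of mine; only `target_v`, `reward`, `args.gamma`, and `critic_net` come from the line above. Plain Python is used so it runs without dependencies; in PyTorch the same idea would be to compute these values under `torch.no_grad()` before the update epochs.

```python
import copy

GAMMA = 0.99  # assumed discount factor, standing in for args.gamma


class LinearCritic:
    """Toy critic V(s) = w * s + b, a hypothetical stand-in for critic_net."""

    def __init__(self, w, b):
        self.w = w
        self.b = b

    def __call__(self, state):
        return self.w * state + self.b


def one_step_advantages(critic, states, rewards, next_states, dones, gamma=GAMMA):
    """Return (targets, advantages) computed from a fixed critic.

    target_v  = reward + gamma * V(next_state)   (no bootstrap on terminal steps)
    advantage = target_v - V(state)
    """
    targets, advantages = [], []
    for s, r, s_next, done in zip(states, rewards, next_states, dones):
        # Assumption: terminal transitions do not bootstrap from next_state.
        bootstrap = 0.0 if done else critic(s_next)
        target_v = r + gamma * bootstrap
        targets.append(target_v)
        advantages.append(target_v - critic(s))
    return targets, advantages


# Snapshot the critic before any update, so the advantages reflect the
# networks that collected the data.
critic = LinearCritic(w=0.5, b=0.0)
old_critic = copy.deepcopy(critic)

states = [1.0, 2.0]
rewards = [1.0, 0.0]
next_states = [2.0, 3.0]
dones = [False, True]

targets, advantages = one_step_advantages(
    old_critic, states, rewards, next_states, dones
)

# Simulate a critic update during the PPO epochs. The stored targets and
# advantages are unaffected because they came from the snapshot.
critic.w = 0.9
```

If the advantages were instead recomputed from `critic` inside the update loop, they would drift as the critic changes from one epoch to the next, which is the behaviour I am questioning.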