Hardlygo opened this issue 3 years ago · 0 comments
I think that the advantage value here should be based on the old actor:

target_v = reward + args.gamma * self.critic_net(next_state)
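To illustrate the suggestion, here is a minimal torch-free sketch of computing the TD target and advantage from a snapshot of the value network taken *before* the update, so the advantage is not biased by the freshly updated parameters. The helper name `compute_advantage` and the callable-critic interface are assumptions for illustration, not the repository's actual API.

```python
import copy

def compute_advantage(critic_net, state, next_state, reward, gamma):
    # Hypothetical helper: freeze a pre-update copy of the critic so both
    # target_v and the advantage come from the *old* value estimates.
    old_critic = copy.deepcopy(critic_net)
    target_v = reward + gamma * old_critic(next_state)   # TD target
    advantage = target_v - old_critic(state)             # TD-error advantage
    return target_v, advantage

# Usage with a toy linear "critic" V(s) = 0.5 * s:
critic = lambda s: 0.5 * s
target_v, advantage = compute_advantage(critic, state=1.0, next_state=2.0,
                                        reward=1.0, gamma=0.9)
```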