A question in the deterministic case
roosephu opened this issue · 3 comments
https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L87
Should we use `new_action` or `self.policy(next_state_batch)` here?
- Correct, I also think we should use `self.policy.evaluate(next_state_batch)`.
- I am still using a Gaussian policy rather than the deterministic policy + fixed Gaussian noise given in the paper.
- Also, you will have to remove the entropy term from the policy loss, i.e., `policy_loss = -(expected_new_q_value).mean()` (the same as the DDPG policy loss). This means that we will no longer need the regularization loss.
(I have not given your question a lot of thought, but these 3 points seemed very clear to me when I read the paper again today. I am very busy at the moment, at least this week, so if you can give me a week's time, I might get back to you with a bit more information. Also, I have no idea why I made these mistakes -_- )
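To make the third point concrete, here is a minimal, framework-free sketch of the two policy losses. It is not the repository's code: the function name `policy_loss`, the `alpha` temperature, and the stochastic form `alpha * log_prob - q` are assumptions for illustration, and the stochastic loss actually used in `sac.py` at the time may have been written differently. Only the deterministic branch, `-(q).mean()`, comes from the comment above.

```python
from statistics import fmean


def policy_loss(q_values, log_probs=None, alpha=0.0):
    """Mean policy loss over a batch (illustrative sketch, plain Python lists).

    q_values:  Q(s, a) for actions produced by the current policy.
    log_probs: log pi(a|s) for those actions, or None for a deterministic policy.
    alpha:     entropy temperature (hypothetical parameter name).
    """
    if log_probs is None:
        # Deterministic case: no entropy term, same form as the DDPG policy loss.
        return -fmean(q_values)
    # Stochastic case (assumed form): entropy-regularized objective.
    return fmean([alpha * lp - q for q, lp in zip(q_values, log_probs)])


q = [1.0, 2.0, 3.0]
deterministic_loss = policy_loss(q)
stochastic_loss = policy_loss(q, log_probs=[-1.0, -1.0, -1.0], alpha=0.5)
```

With tensors, the deterministic branch corresponds to the `-(expected_new_q_value).mean()` expression quoted above. Setting `alpha` to zero in the stochastic branch gives the same value, which is one way to see why the entropy term simply drops out in the deterministic case.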
https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L90
I've made some changes according to your query.
Let me know if you think anything else is wrong in the implementation.
Nice! Your code is really helpful, thanks!