Monte Carlo Implementation of policy gradient
Thanks for sharing such a nice tutorial!
I just have a question about the implementation of policy gradient in tute2. Is this a Monte Carlo implementation, and would it be possible to convert it into a TD implementation?
The vanilla policy gradient method (tuto2) is somewhat incomplete, because the value optimization is not separated out. The next tutorial (tuto3), the actor-critic method, explicitly separates the policy (the advantage in the policy) and the value, and as I write in that tutorial notebook, you can take a TD (temporal difference) approach for the value optimization. Today's practical RL methods, such as PPO, also build on this actor-critic idea, so you can take a TD approach for the value optimization there as well.
There are many examples of applying a TD approach in actor-critic-based methods, but I'm sorry, I haven't seen TD applied in vanilla policy gradient.
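Just to illustrate the idea, here is a minimal sketch of a 1-step TD value update for the critic in an actor-critic setup. This is not the notebook's code: the state size, hidden size, learning rate, gamma, and the names value_net / td_value_update are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical critic; the state size (4), hidden size, lr, and gamma are illustrative.
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

def td_value_update(states, rewards, next_states, dones):
    """One TD(0) step: fit V(s) toward r + gamma * V(s')
    instead of the full Monte Carlo return."""
    values = value_net(states).squeeze(-1)                # V(s)
    with torch.no_grad():
        next_values = value_net(next_states).squeeze(-1)  # V(s'), no gradient
        # Bootstrapped 1-step target; zero out V(s') at episode ends
        targets = rewards + gamma * (1.0 - dones) * next_values
    loss = F.mse_loss(values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```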
Thanks for your clarification! However, I have another question regarding the KL term in the PPO notebook. The P_{\theta_old} and P_{\theta_new} terms are computed from the logits of the actor network directly, instead of using the log-probability computed with F.cross_entropy between the action and the logits. Is there a reason for doing this?
Sorry, I am a bit confused. One more question: what is the quantity that is subtracted when computing l_0 and l_1? Is this just some sort of normalization to stabilize the training?
Thanks for your assistance in advance!
For the first question
The reason is that the KL computation just needs the logits values (summed over all actions).
As you know, when the logits are (l_0, l_1, ..., l_{n-1}), the probabilities of the actions are:
(e^{l_0} / (e^{l_0} + e^{l_1} + ... + e^{l_{n-1}}), e^{l_1} / (e^{l_0} + e^{l_1} + ... + e^{l_{n-1}}), ..., e^{l_{n-1}} / (e^{l_0} + e^{l_1} + ... + e^{l_{n-1}}))
(Sorry, but LaTeX expressions cannot be displayed in GitHub comments, so this may be a little difficult to read ...)
When I need the probability of taking action a (i.e., P(a | \theta)), I have conveniently used the cross_entropy() function to compute this value.
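As a small illustration of that relationship (the logits and action values below are made up, not the notebook's), F.cross_entropy of the logits against the taken action is exactly the negative log-probability -ln P(a | \theta):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # shape (batch=1, n_actions=3), illustrative
action = torch.tensor([0])                  # the action actually taken

# cross_entropy with a class-index target is -log_softmax(logits)[action]
neg_logp_a = F.cross_entropy(logits, action, reduction="none")
logp_a = F.log_softmax(logits, dim=-1).gather(1, action.unsqueeze(1)).squeeze(1)

print(torch.allclose(neg_logp_a, -logp_a))  # True
```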
But in the KL computation:
- Firstly, it needs ln {P(a | \theta_new) / P(a | \theta_old)}, and for that it essentially needs the logits themselves. (Because each probability is the exponential of its logit up to normalization, its ln value is essentially the logit itself.)
- Secondly, we need the sum over all actions' values (the values above).
For this reason, it's very straightforward to calculate with the logits directly, instead of converting into probabilities with the cross_entropy() function.
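Putting those two points together, here is a minimal sketch of a logits-based KL computation. The names old_logits / new_logits and their values are illustrative, not the notebook's variables:

```python
import torch
import torch.nn.functional as F

old_logits = torch.tensor([[2.0, 0.5, -1.0]])  # logits under theta_old (made up)
new_logits = torch.tensor([[1.5, 0.7, -0.5]])  # logits under theta_new (made up)

old_logp = F.log_softmax(old_logits, dim=-1)   # ln P(a | theta_old) for every action
new_logp = F.log_softmax(new_logits, dim=-1)   # ln P(a | theta_new) for every action

# KL(P_old || P_new) needs the log-ratio for *all* actions, weighted by P_old(a)
# and summed, which is why the full logits are used rather than the single
# log-probability of the taken action that cross_entropy() returns.
kl = (old_logp.exp() * (old_logp - new_logp)).sum(dim=-1)
print(kl)
```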
For the second question
Again, when the logits are (l_0, l_1, ..., l_{n-1}), the probabilities of the actions are:
(e^{l_0} / (e^{l_0} + e^{l_1} + ... + e^{l_{n-1}}), e^{l_1} / (e^{l_0} + e^{l_1} + ... + e^{l_{n-1}}), ..., e^{l_{n-1}} / (e^{l_0} + e^{l_1} + ... + e^{l_{n-1}}))
Then the logits (l_0, l_1, ..., l_{n-1}) and (l_0 + x, l_1 + x, ..., l_{n-1} + x) (adding a value x to every logit) give the same probabilities for arbitrary x.
Thus I have just subtracted that value (torch.amax(...)) so that the exponentials do not overflow.
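A small numerical illustration of both points (the logit values here are made up to force overflow):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1000.0, 1001.0, 1002.0])

# 1) Softmax is invariant to adding (or subtracting) a constant from every logit.
shifted = logits - torch.amax(logits, dim=0)   # subtract the max, as described above
print(torch.allclose(F.softmax(shifted, dim=0),
                     F.softmax(logits, dim=0)))  # True

# 2) Subtracting the max keeps every exponential <= 1, avoiding overflow
#    if softmax were computed naively as e^{l_i} / sum_j e^{l_j}.
print(torch.exp(logits))    # inf, inf, inf  (naive exponentials overflow in float32)
print(torch.exp(shifted))   # finite values
```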
Does my answer make sense to you?
Thanks for your extremely clear explanation! It resolved all my concerns!!
Thanks again for sharing such wonderful material with the community!