geek-ai/irgan

About Variance Reduction Method

Closed this issue · 1 comment

In Section 2.1 of the paper, the authors mention that the reward term is replaced by its advantage function. I have read the source code, but I still have some questions about how the variance reduction method is implemented.

  1. How is it implemented in the code?
  2. If the baseline function is set to a constant as I found (0.5 in the code?), how is this constant obtained?
  3. Since the parameters are updated during training, should the baseline function be set to different values?
  4. What if we just use the plain policy gradient?

Looking forward to your reply. Thanks.

Yes, in this implementation we simply use a constant baseline (0.5), which is approximately the average reward over all actions. Although the parameters are continuously updated, the expectation of the rewards remains close to 0.5, and empirically using this expected reward as the baseline works well. Strictly speaking, the optimal baseline is the expected reward weighted by the gradient magnitudes, but for simplicity we did not use it. It is well known that the naive policy gradient suffers from high variance, and subtracting a baseline (i.e., using the advantage function) is an effective way to reduce it.
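For concreteness, here is a minimal sketch (plain NumPy, not the repository's TensorFlow code) of what subtracting a constant baseline from the reward looks like in a score-function (REINFORCE) update. The function name `policy_gradient`, the array shapes, and the reward range are illustrative assumptions, not the repository's API.

```python
import numpy as np

# Sketch only: the discriminator reward is replaced by an "advantage" obtained by
# subtracting a constant baseline (0.5 here, mirroring the value discussed above).

BASELINE = 0.5  # roughly the average discriminator reward over all actions

def policy_gradient(log_prob_grads, rewards, baseline=BASELINE):
    """Average the per-sample score-function gradients weighted by the advantage.

    log_prob_grads: array of shape (n_samples, n_params); each row is
                    d log pi(a_i | theta) / d theta for a sampled action a_i.
    rewards:        array of shape (n_samples,), discriminator rewards in [0, 1].
    """
    advantages = rewards - baseline          # variance reduction: subtract the baseline
    return (advantages[:, None] * log_prob_grads).mean(axis=0)

# Example: with rewards hovering around 0.5, the advantages are centred near 0,
# which lowers the variance of the gradient estimate without biasing it
# (a constant baseline does not change the gradient's expectation).
rng = np.random.default_rng(0)
grads = rng.normal(size=(64, 10))
rewards = rng.uniform(0.4, 0.6, size=64)
update = policy_gradient(grads, rewards)
print(update.shape)  # (10,)
```

Because the baseline is a constant, it leaves the expected gradient unchanged and only reduces its variance; the gradient-magnitude-weighted expected reward mentioned above would reduce the variance further, at the cost of extra bookkeeping.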