tristandeleu/pytorch-maml-rl

log_ratio problem

Closed this issue · 1 comment

ecada commented

Hello,
Thank you for sharing this wonderful implementation. In pytorch-maml-rl/maml_rl/metalearner.py line 142, when we are calculating the log ratio (which is essentially importance sampling) why do we put the old_pi in the denominator of the ratio? The advantages are sampled from pi, not old_pi so isn't the importance weight implementation wrong?

tristandeleu commented

Hi, thank you for the kind words!
This ratio comes from how the surrogate loss is defined in TRPO (for example, see this implementation of TRPO). However, the advantages are indeed computed with samples from old_pi. old_pi is actually the policy we are updating, and pi is a "candidate" policy we could choose for the update (this may be an unfortunate naming convention). But you are right, there is a kind of distribution mismatch happening here; it is an approximation that is made in TRPO itself. This great set of slides explains all of this in detail.