jcwleo/curiosity-driven-exploration-pytorch

Are you actually using the learned intrinsic reward for the agent?

Opened this issue · 6 comments

Hi,

I can only see that you optimize the intrinsic loss in your code. Can you point me to the line where you add the intrinsic rewards to the actual environment/extrinsic rewards?

In some areas of your code I can see comments like
# total reward = int reward
which would, according to the original paper, be wrong, no?

Thank you.

I'm also new to the repo, but the loss here is composed of both extrinsic-reward (actor/critic) and intrinsic (forward/inverse) terms:

loss = (actor_loss + 0.5 * critic_loss - 0.001 * entropy) + forward_loss + inverse_loss

Thanks @ruoshiliu. Yes, I saw the loss. But in addition to optimizing that loss, you also need to feed the intrinsic reward (the forward model's prediction error) back to the agent as part of its reward signal, as stated in the paper. Only optimizing the forward/inverse losses is not the same as actually rewarding the agent with that prediction error.
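Something like the following sketch is what I mean (hypothetical names and dummy tensors, not the repo's actual variables), assuming an A2C-style update like the one in this repo:

```python
# Sketch: the forward-model prediction error becomes the intrinsic reward and is
# added to the environment reward *before* returns/advantages are computed.
import torch

eta = 0.01  # intrinsic reward scale (the paper's eta); the value here is just an example

def intrinsic_reward(pred_phi_next, phi_next):
    # r^i_t = eta/2 * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2, computed per transition
    return 0.5 * eta * (pred_phi_next - phi_next).pow(2).sum(dim=1)

# dummy batch standing in for one rollout step
phi_next = torch.randn(8, 288)        # encoder features of s_{t+1}
pred_phi_next = torch.randn(8, 288)   # forward-model prediction of those features
ext_reward = torch.zeros(8)           # reward from the environment

int_reward = intrinsic_reward(pred_phi_next, phi_next).detach()
total_reward = ext_reward + int_reward  # r_t = r^e_t + r^i_t
# total_reward (not ext_reward alone) should then feed the return/advantage computation
```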

@ferreirafabio What do you mean by "use the intrinsic rewards"? Can you point out which section of the paper states that?

By that I mean reward = extrinsic reward + intrinsic reward. From the paper:

[Screenshot: the paper's reward definition]
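For reference, the reward definition I'm referring to, as I read the paper (please double-check against the original):

```latex
% Reward the agent is optimized on: intrinsic curiosity reward plus the
% (possibly sparse or absent) extrinsic reward from the environment.
r_t = r_t^i + r_t^e
```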

I now realize that the paper says the extrinsic reward is optional. I am wondering what is usually done (with or without the extrinsic reward) when ICM is used as a baseline.

Thank you for the clarification. Let me make sure I understand your question. What you are saying is that the code (referenced above) only minimizes the loss function, i.e. it maximizes the extrinsic reward and minimizes the intrinsic reward, whereas the correct implementation should instead reflect equation (7) below.

In other words, the correct implementation should find the policy π that maximizes both the intrinsic and extrinsic reward, and parameters for the inverse and forward models that minimize L_I and L_F.

Did I interpret your question correctly?

[Screenshot: equation (7) from the paper]
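For reference, my reconstruction of equation (7) (please double-check against the paper):

```latex
% Joint objective from the ICM paper (eq. 7), as I read it: the policy parameters
% maximize the expected (intrinsic + extrinsic) reward, while the inverse/forward
% model parameters minimize L_I and L_F.
\min_{\theta_P,\, \theta_I,\, \theta_F}
  \Big[ -\lambda \, \mathbb{E}_{\pi(s_t;\theta_P)}\Big[\textstyle\sum_t r_t\Big]
        + (1-\beta)\, L_I + \beta\, L_F \Big]
```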