ShangtongZhang/reinforcement-learning-an-introduction

Chapter 11

mattgithub1919 opened this issue · 12 comments

Hello,

Thank you for your work. I have a question about the semi_gradient_off_policy_TD function. It looks like it is using an on-policy update at line 79, since next_state is a uniform selection over the 7 states, while under off-policy it should only select the LOWER state. In my understanding, Figure 11.2 is off-policy, not on-policy. Correct me if I am wrong. Thank you.
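For context, this is how I understand the transition structure in Baird's counterexample (a rough sketch with my own names, not the repo's code): the dashed action lands uniformly in one of the six upper states, the solid action lands in the LOWER state, and the behavior policy mixes them so that next_state is uniform over all 7 states.

```python
import numpy as np

# Illustrative sketch of Baird's counterexample transitions (my names,
# not the repo's). The behavior policy b takes the dashed action with
# probability 6/7 (landing uniformly in one of the six upper states) and
# the solid action with probability 1/7 (landing in the LOWER state), so
# next_state ends up uniform over all 7 states.
LOWER = 6  # states 0..5 are the upper states

def behavior_next_state(rng):
    if rng.random() < 6 / 7:
        return int(rng.integers(0, 6))  # dashed: uniform over upper states
    return LOWER                        # solid: always the LOWER state

rng = np.random.default_rng(0)
counts = np.bincount([behavior_next_state(rng) for _ in range(7000)],
                     minlength=7)
# Each of the 7 states should appear roughly 1000 times out of 7000.
```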

Warm regards,
Matt

Thank you for your response. I think in Figure 11.2 the target policy is deterministic: it selects the LOWER state 100% of the time. You can check the highlighted sentences in the following picture.

[Screenshot of the highlighted passage from the book]

while under off-policy it should only select LOWER STATE.
If the agent follows the behavior policy b, why would it only select the LOWER state?

Under the behavior policy, the next state is selected uniformly from all 7 states, and that's how we get the reward. However, in my understanding, you should use the target policy (which selects the LOWER state only) when computing the TD error.

[Screenshot of the relevant passage from the book]

That's wrong. If you could sample next_state from the target policy, it would not be off-policy at all.
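Concretely (a hedged sketch with illustrative names, not the repo's code): sampling always follows the behavior policy b, and the off-policy correction is the importance ratio rho = pi(a|s) / b(a|s), not a resampling of s'.

```python
# Hedged sketch (illustrative names): the correction for following the
# behavior policy b instead of the target policy pi is the importance
# ratio rho = pi(a|s) / b(a|s), applied to the TD update. In Baird's
# counterexample pi always takes the solid action, while b takes dashed
# with probability 6/7 and solid with probability 1/7.
PI = {'dashed': 0.0, 'solid': 1.0}
B = {'dashed': 6 / 7, 'solid': 1 / 7}

def rho(action):
    return PI[action] / B[action]

# rho('solid') is 7 and rho('dashed') is 0: dashed transitions are simply
# ignored, solid transitions are up-weighted; s' itself is never resampled.
```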

Yes, I agree with you. The problem is that when you compute r + v(s', w), you use next_state as s'. But next_state comes from the behavior policy, not the target policy; s' should be the state under the target policy, which is 100% the LOWER state.

s' should be the state under target policy
This is wrong.

I am not sure why you thought that was wrong. I think we shouldn't use next_state as s', because doing so makes it on-policy learning. The reason it still diverges is that rho is computed according to the off-policy importance ratio.

When computing r + v(s', w), s' should be sampled from the behavior policy, and next_state in the code is indeed sampled from the behavior policy.
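To make that concrete, here is a minimal sketch of one semi-gradient off-policy TD(0) step on Baird's counterexample (illustrative names and constants, not the repo's exact code): next_state is always drawn from the behavior policy b, and the only off-policy ingredient is rho.

```python
import numpy as np

# Minimal sketch of semi-gradient off-policy TD(0) on Baird's
# counterexample (illustrative names/constants, not the repo's code).
GAMMA, ALPHA = 0.99, 0.01
LOWER = 6  # states 0..5 are the upper states

def feature(s):
    # Baird's features: v(s, w) = 2*w[s] + w[7] for upper states,
    # and v(LOWER, w) = w[6] + 2*w[7].
    x = np.zeros(8)
    if s == LOWER:
        x[6], x[7] = 1.0, 2.0
    else:
        x[s], x[7] = 2.0, 1.0
    return x

def td_step(w, s, rng):
    # Sample the action and next_state from the BEHAVIOR policy b.
    if rng.random() < 6 / 7:
        action, next_s = 'dashed', int(rng.integers(0, 6))
    else:
        action, next_s = 'solid', LOWER
    # Importance ratio rho = pi(a|s) / b(a|s); pi always takes solid.
    rho = 0.0 if action == 'dashed' else 7.0
    x, x_next = feature(s), feature(next_s)
    delta = 0.0 + GAMMA * (w @ x_next) - w @ x  # reward is 0 everywhere
    return w + ALPHA * rho * delta * x, next_s

rng = np.random.default_rng(0)
w = np.ones(8)
w[6] = 10.0  # Baird's standard initialization
s = 0
for _ in range(1000):
    w, s = td_step(w, s, rng)
# The weights drift upward without bound, reproducing the instability
# shown in Figure 11.2, even though every sample follows b.
```

Note that the update only ever uses next_state from b; the target policy pi appears exclusively through rho.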

Figure 11.2 is off-policy Q-learning, right? It would be on-policy if s' used the behavior policy's next_state.

Figure 11.2 is off-policy Q-learning, right?

It's off-policy TD.

It would be on-policy if s' used the behavior policy's next_state.

That's wrong. You have a fundamental misunderstanding of on-policy and off-policy learning.

OK. Thank you for your time. Have a great day!