Chapter 11
mattgithub1919 opened this issue · 12 comments
Hello,
Thank you for your work. I have a question about the semi_gradient_off_policy_TD function. It looks like it is doing an on-policy update at line 79, since next_state is a uniform selection over the 7 states, while under off-policy learning it should only select the LOWER STATE. In my understanding, Figure 11.2 is off-policy, not on-policy. Correct me if I am wrong. Thank you.
Warm regards,
Matt
https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter11/counterexample.py#L73
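(For context, here is a minimal sketch of Baird's counterexample as described in the book; the names and constants below are illustrative and only roughly mirror what counterexample.py does, not a verbatim copy of the linked code.)

```python
import numpy as np

# Baird's counterexample: 7 states (6 "upper" states plus 1 "lower" state)
# and two actions. The DASHED action moves to one of the six upper states
# with equal probability; the SOLID action always moves to the lower state.
DASHED, SOLID = 0, 1
UPPER_STATES = np.arange(6)
LOWER_STATE = 6

def behavior_policy():
    # the behavior policy b picks SOLID with probability 1/7, DASHED with 6/7
    return SOLID if np.random.rand() < 1.0 / 7 else DASHED

def target_policy():
    # the target policy pi always picks SOLID
    return SOLID

def step(action):
    # the reward is always zero in this counterexample
    if action == DASHED:
        return np.random.choice(UPPER_STATES), 0.0
    return LOWER_STATE, 0.0
```

With this convention, next_state is a uniform draw over the upper states whenever the behavior policy happens to pick DASHED, and it is the LOWER STATE only when the sampled action is SOLID, which is what the reply below ("It depends on the action") refers to.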
It depends on the action, doesn't it?
> while under off-policy learning it should only select the LOWER STATE.
If the agent follows the behavior policy (b), why would it only select the LOWER STATE?
That's wrong. If you can sample next_state using the target policy, then it is not off-policy at all.
Yes, I agree with you. The problem is that when you compute r + v(s', w), you use next_state as s'. next_state comes from the behavior policy, not the target policy. s' should be the state under the target policy, which is 100% the LOWER STATE.
> s' should be the state under the target policy
This is wrong.
I'm not sure why you think that is wrong. I don't think we should use next_state as s', because using next_state as s' makes it on-policy learning. The reason it still diverges is that rho is computed according to the off-policy setup.
When computing r + v(s', w), s' should be sampled from the behavior policy, and next_state in the code is indeed sampled from the behavior policy.
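(To make this concrete, here is a minimal sketch of the semi-gradient off-policy TD(0) update used for Figure 11.2, with the book's linear features; it builds on the illustrative sketch above and is not the repo's exact code.)

```python
# Feature matrix for Baird's counterexample: v(s, w) = FEATURES[s] @ w.
# Each upper state i has feature value 2 in component i and 1 in the last
# component; the lower state has 1 in component 6 and 2 in the last component.
FEATURES = np.zeros((7, 8))
for i in range(6):
    FEATURES[i, i] = 2.0
    FEATURES[i, 7] = 1.0
FEATURES[6, 6] = 1.0
FEATURES[6, 7] = 2.0

def semi_gradient_off_policy_td_step(state, w, alpha=0.01, gamma=0.99):
    action = behavior_policy()
    # s' (next_state) is the state actually visited while following b
    next_state, reward = step(action)

    # importance sampling ratio rho = pi(a|s) / b(a|s):
    # pi never takes DASHED, so rho = 0; pi always takes SOLID while b takes
    # it with probability 1/7, so rho = 1 / (1/7) = 7
    rho = 0.0 if action == DASHED else 7.0

    # semi-gradient TD(0) update scaled by rho; the TD target uses the
    # next_state sampled from the behavior policy, not a state chosen by pi
    td_error = reward + gamma * FEATURES[next_state] @ w - FEATURES[state] @ w
    w += alpha * rho * td_error * FEATURES[state]
    return next_state
```

Starting from the book's initial weights, w = np.array([1., 1., 1., 1., 1., 1., 10., 1.]), and looping this step, the weights diverge, which is the behavior Figure 11.2 shows, even though every next_state is generated by the behavior policy and rho provides the off-policy correction.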
Figure 11.2 is off-policy Q-learning, right? It would be on-policy if s' used the behavior policy's next_state.
> Figure 11.2 is off-policy Q-learning, right?
It's off-policy TD.
> It would be on-policy if s' used the behavior policy's next_state.
That's wrong. You have a fundamental misunderstanding of on-policy and off-policy learning.
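(For reference, the standard one-step importance-sampling identity behind this, in the book's notation: sampling S_{t+1} from the behavior policy and weighting by rho gives, in expectation, the target-policy TD target.)

```math
\mathbb{E}_b\!\left[\rho_t\,\bigl(R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w})\bigr) \mid S_t = s\right]
= \mathbb{E}_\pi\!\left[R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w}) \mid S_t = s\right],
\qquad
\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}
```

So using the behavior policy's next_state does not make the update on-policy; the correction lives in rho, and the divergence in Figure 11.2 happens despite that correction, because semi-gradient TD with function approximation is unstable off-policy.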
OK. Thank you for your time. Have a great day!