ShangtongZhang/reinforcement-learning-an-introduction

Chapter 11

mattgithub1919 opened this issue · 12 comments

Hello,

Thank you for your work. I have a question about the semi_gradient_off_policy_TD function. It looks like it is using an on-policy update at line 79, since next_state is a uniform selection over the 7 states, while under off-policy it should only select the LOWER state. In my understanding, Figure 11.2 is off-policy, not on-policy. Correct me if I am wrong. Thank you.
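For context, this is how I understand the transition structure in Baird's counterexample (a rough sketch with my own names, not the repo's code): the dashed action lands uniformly in one of the six upper states, the solid action lands in the LOWER state, and the behavior policy mixes them so that next_state is uniform over all 7 states.

```python
import numpy as np

# Illustrative sketch of Baird's counterexample transitions (my names,
# not the repo's). The behavior policy b takes the dashed action with
# probability 6/7 (landing uniformly in one of the six upper states) and
# the solid action with probability 1/7 (landing in the LOWER state), so
# next_state ends up uniform over all 7 states.
LOWER = 6  # states 0..5 are the upper states

def behavior_next_state(rng):
    if rng.random() < 6 / 7:
        return int(rng.integers(0, 6))  # dashed: uniform over upper states
    return LOWER                        # solid: always the LOWER state

rng = np.random.default_rng(0)
counts = np.bincount([behavior_next_state(rng) for _ in range(7000)],
                     minlength=7)
# Each of the 7 states should appear roughly 1000 times out of 7000.
```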

Warm regards,
Matt

Thank you for your response. I think in Figure 11.2 the target policy is deterministic: it selects the LOWER state 100% of the time. You can check the highlighted sentences in the following picture.

[Screenshot of the highlighted passage from the book]

while under off-policy it should only select LOWER STATE.
If the agent follows the behavior policy b, why would it only select the LOWER state?

Under the behavior policy, the next state is selected uniformly from all 7 states, and that's how we get the reward. However, in my understanding, you should use the target policy (which selects the LOWER state only) when computing the TD error.

[Screenshot of the relevant passage from the book]

That's wrong. If you could sample next_state from the target policy, it would not be off-policy at all.
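Concretely (a hedged sketch with illustrative names, not the repo's code): sampling always follows the behavior policy b, and the off-policy correction is the importance ratio rho = pi(a|s) / b(a|s), not a resampling of s'.

```python
# Hedged sketch (illustrative names): the correction for following the
# behavior policy b instead of the target policy pi is the importance
# ratio rho = pi(a|s) / b(a|s), applied to the TD update. In Baird's
# counterexample pi always takes the solid action, while b takes dashed
# with probability 6/7 and solid with probability 1/7.
PI = {'dashed': 0.0, 'solid': 1.0}
B = {'dashed': 6 / 7, 'solid': 1 / 7}

def rho(action):
    return PI[action] / B[action]

# rho('solid') is 7 and rho('dashed') is 0: dashed transitions are simply
# ignored, solid transitions are up-weighted; s' itself is never resampled.
```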

Yes, I agree with you. The problem is that when you compute r + v(s', w), you use next_state as s'. But next_state comes from the behavior policy, not the target policy; s' should be the state under the target policy, which is 100% the LOWER state.

s' should be the state under target policy
This is wrong.

I am not sure why you thought that was wrong. I think we shouldn't use next_state as s', because doing so makes it on-policy learning. The reason it still diverges is that rho is computed according to the off-policy importance ratio.

When computing r + v(s', w), s' should be sampled from the behavior policy, and next_state in the code is indeed sampled from the behavior policy.
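To make that concrete, here is a minimal sketch of one semi-gradient off-policy TD(0) step on Baird's counterexample (illustrative names and constants, not the repo's exact code): next_state is always drawn from the behavior policy b, and the only off-policy ingredient is rho.

```python
import numpy as np

# Minimal sketch of semi-gradient off-policy TD(0) on Baird's
# counterexample (illustrative names/constants, not the repo's code).
GAMMA, ALPHA = 0.99, 0.01
LOWER = 6  # states 0..5 are the upper states

def feature(s):
    # Baird's features: v(s, w) = 2*w[s] + w[7] for upper states,
    # and v(LOWER, w) = w[6] + 2*w[7].
    x = np.zeros(8)
    if s == LOWER:
        x[6], x[7] = 1.0, 2.0
    else:
        x[s], x[7] = 2.0, 1.0
    return x

def td_step(w, s, rng):
    # Sample the action and next_state from the BEHAVIOR policy b.
    if rng.random() < 6 / 7:
        action, next_s = 'dashed', int(rng.integers(0, 6))
    else:
        action, next_s = 'solid', LOWER
    # Importance ratio rho = pi(a|s) / b(a|s); pi always takes solid.
    rho = 0.0 if action == 'dashed' else 7.0
    x, x_next = feature(s), feature(next_s)
    delta = 0.0 + GAMMA * (w @ x_next) - w @ x  # reward is 0 everywhere
    return w + ALPHA * rho * delta * x, next_s

rng = np.random.default_rng(0)
w = np.ones(8)
w[6] = 10.0  # Baird's standard initialization
s = 0
for _ in range(1000):
    w, s = td_step(w, s, rng)
# The weights drift upward without bound, reproducing the instability
# shown in Figure 11.2, even though every sample follows b.
```

Note that the update only ever uses next_state from b; the target policy pi appears exclusively through rho.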

Figure 11.2 is off-policy Q-learning, right? It would be on-policy if s' used the behavior policy's next_state.

Figure 11.2 is off-policy Q-learning, right?

It's off-policy TD.

It would be on-policy if s' used the behavior policy's next_state.

That's wrong. You have a fundamental misunderstanding of on-policy and off-policy learning.

OK. Thank you for your time. Have a great day!