seungeunrho/minimalRL

Problem with `train_net()` in the REINFORCE algorithm.

fuyw opened this issue · 5 comments

fuyw commented

Thanks for the high-quality implementations. I have a question about `train_net()` in the REINFORCE algorithm:

    def train_net(self):
        R = 0
        # Walk backwards through the episode so the discounted return R
        # can be computed incrementally.
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = -torch.log(prob) * R
            # Note: the optimizer step happens inside the loop, i.e. once per time step.
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        self.data = []

Here the policy is updated at every step of an episode, but it is supposed to be updated only after a complete episode (or a batch of episodes).

After we do an update the policy has changed, so the data collected under the old policy is, according to the theory, no longer valid for further updates.
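
For reference (my notation, not from the repo), the episodic update REINFORCE is based on is

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right], \qquad G_t \;=\; \sum_{k \ge t} \gamma^{\,k-t}\, r_k,$$

where the whole trajectory $\tau$ is sampled from one fixed policy $\pi_\theta$ and the parameters are updated only after the episode ends.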

seungeunrho commented

That's right. The policy has to be updated after the for loop, and I have fixed it. You can find the changed code under this issue.
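
For anyone reading later, a minimal sketch of the per-episode version (assuming the same `self.data`, `gamma`, and `self.optimizer` attributes as the snippet above; the actual changed code is in the repo):

    def train_net(self):
        R = 0
        self.optimizer.zero_grad()
        # Accumulate gradients over the whole episode, iterating backwards
        # so the discounted return R can be built up incrementally.
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = -torch.log(prob) * R
            loss.backward()  # gradients sum up across time steps
        # Apply a single parameter update for the episode.
        self.optimizer.step()
        self.data = []

Summing the per-step losses into one tensor and calling `backward()` once would give the same gradient; repeated `backward()` calls simply accumulate it term by term.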

fuyw commented

Thanks for your reply.

fuyw commented

I found an interesting phenomenon: the policy converges more slowly after we fix the bug (comparing wall-clock time).

Is this problem so easy that the policy can converge faster even when it is updated per step? I cannot find a reasonable explanation for this (or maybe it is just a coincidence). Could you please give me some hints?

Thank you very much.

seungeunrho commented

I hadn't thought about that. Good question. I fixed the slow-convergence problem.

Still, the master branch correctly implemented the REINFORCE pseudocode.
We were confused about why it doesn't need to care about the change in the state distribution.
`prob` and `r` are already stored in the list (static values), so updating at every step doesn't affect the already-collected data.
My fixed code just applies the update all at once; I guess the existing code is easier to understand.
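
For context, roughly how that data gets collected (a sketch, not the exact repo code; the `Policy`/`put_data` interface and the classic 4-tuple Gym step API are assumptions based on the snippet above): one whole episode is rolled out with the current policy before `train_net()` is called, so every `(r, prob)` pair is already fixed at collection time.

    import gym
    import torch
    from torch.distributions import Categorical

    def run_episode(env, pi):
        # Roll out one full episode with the current (fixed) policy,
        # storing (reward, action-probability) pairs for train_net().
        s = env.reset()
        done = False
        while not done:
            prob = pi(torch.from_numpy(s).float())  # action distribution for state s
            a = Categorical(prob).sample()          # sample an action
            s, r, done, _ = env.step(a.item())      # classic (4-tuple) Gym API assumed
            pi.put_data((r, prob[a]))               # stored values train_net() will replay
        pi.train_net()                              # update once per episode

    # Usage sketch (Policy is the agent class the snippet above belongs to):
    # env = gym.make("CartPole-v1")
    # pi = Policy()
    # run_episode(env, pi)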

Anyway, it was so interesting that I'm giving up my exam next week for you, haha.
Thanks.

fuyw commented

Thanks for your help, and I'm really sorry for disturbing your exam preparation.
Good luck, and I hope you do well in the exam.
Many thanks.