seungeunrho/minimalRL

Problem with `train_net()` in the REINFORCE algorithm.

fuyw opened this issue · 5 comments

fuyw commented

Thanks for the high-quality implementations. I have a question about `train_net()` in the REINFORCE algorithm:

    def train_net(self):
        R = 0
        # Walk backwards through the episode so the discounted return R
        # can be computed incrementally.
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = -torch.log(prob) * R
            # Note: the optimizer step happens inside the loop, i.e. once per time step.
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        self.data = []

Here the policy is updated at every step of an episode, but it is supposed to be updated only after a complete episode (or a batch of episodes).

After we do an update the policy has changed, so the data collected under the old policy is, according to the theory, no longer valid for further updates.
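
For reference (my notation, not from the repo), the episodic update REINFORCE is based on is

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right], \qquad G_t \;=\; \sum_{k \ge t} \gamma^{\,k-t}\, r_k,$$

where the whole trajectory $\tau$ is sampled from one fixed policy $\pi_\theta$ and the parameters are updated only after the episode ends.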

seungeunrho commented

That's right. The policy has to be updated after the for loop, and I have fixed it. You can find the changed code under this issue.
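
For anyone reading later, a minimal sketch of the per-episode version (assuming the same `self.data`, `gamma`, and `self.optimizer` attributes as the snippet above; the actual changed code is in the repo):

    def train_net(self):
        R = 0
        self.optimizer.zero_grad()
        # Accumulate gradients over the whole episode, iterating backwards
        # so the discounted return R can be built up incrementally.
        for r, prob in self.data[::-1]:
            R = r + gamma * R
            loss = -torch.log(prob) * R
            loss.backward()  # gradients sum up across time steps
        # Apply a single parameter update for the episode.
        self.optimizer.step()
        self.data = []

Summing the per-step losses into one tensor and calling `backward()` once would give the same gradient; repeated `backward()` calls simply accumulate it term by term.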

fuyw commented

Thanks for your reply.

fuyw commented

I found an interesting phenomenon: the policy converges more slowly after we fix the bug (comparing wall-clock time).

Is this problem so easy that the policy can converge faster even when it is updated per step? I cannot find a reasonable explanation for this (or maybe it is just a coincidence). Could you please give me some hints?

Thank you very much.

seungeunrho commented

I hadn't thought about that. Good question. I fixed the slow-convergence problem.

Still, the master branch correctly implemented the REINFORCE pseudocode.
We were confused about why it doesn't need to care about the change in the state distribution.
`prob` and `r` are already stored in the list (static values), so updating at every step doesn't affect the already-collected data.
My fixed code just applies the update all at once; I guess the existing code is easier to understand.
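
For context, roughly how that data gets collected (a sketch, not the exact repo code; the `Policy`/`put_data` interface and the classic 4-tuple Gym step API are assumptions based on the snippet above): one whole episode is rolled out with the current policy before `train_net()` is called, so every `(r, prob)` pair is already fixed at collection time.

    import gym
    import torch
    from torch.distributions import Categorical

    def run_episode(env, pi):
        # Roll out one full episode with the current (fixed) policy,
        # storing (reward, action-probability) pairs for train_net().
        s = env.reset()
        done = False
        while not done:
            prob = pi(torch.from_numpy(s).float())  # action distribution for state s
            a = Categorical(prob).sample()          # sample an action
            s, r, done, _ = env.step(a.item())      # classic (4-tuple) Gym API assumed
            pi.put_data((r, prob[a]))               # stored values train_net() will replay
        pi.train_net()                              # update once per episode

    # Usage sketch (Policy is the agent class the snippet above belongs to):
    # env = gym.make("CartPole-v1")
    # pi = Policy()
    # run_episode(env, pi)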

Anyway, it was so interesting that I'm giving up my exam next week for you, haha.
Thanks.

fuyw commented

Thanks for your help, and I'm really sorry for disturbing your exam preparation.
Good luck, and I hope you do well in the exam.
Many thanks.