dchetelat/acer

Question about Actor & Critic

Closed this issue · 2 comments

Thank you for this very nice work. If you permit me, I would like to ask you a question. I have a similar network with multiple heads for action probabilities and values (and I would like to keep a single network, or two networks where most of the weights are shared, due to memory constraints). I am calculating the actor loss based on the minimum of the actual reward and the reward estimated by the critic. However, in that case, there is a possibility that the critic simply predicts very large rewards (despite a regularization term in the critic loss; in other words, the actor and critic losses are summed before the backward) just so that the actor gets the real reward in the worst case. Hence, I would like the critic not to be affected by the updates from the actor loss. Is there any way to do that without having two networks? How does ACER prevent the critic from estimating very high rewards? Would it be sufficient to take two subsequent optimizer steps, one for each loss (actor and critic), as in your code, rather than summing them? Or can you please suggest a solution to this problem? I tried to find the answer in your repository, since ACER also uses a single network, but because it combines several techniques, that was difficult for me. Thank you again for your time and consideration. Sincerely,

Hi,

I'm not one of the original ACER authors, so you might have better luck asking them directly, but I can try answering based on my own understanding and experience with PyTorch.

Hence, I would like the critic not to be affected by the updates from the actor loss. Is there any way to do that without having two networks?

Well, if you're sharing weights, no. I don't see how you can do that.
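To make that concrete, here is a minimal sketch of a shared-trunk actor-critic (the module and variable names are hypothetical, not code from this repository): any backward pass through the policy head also writes gradients into the shared trunk parameters, so stepping the optimizer on the actor loss inevitably moves the weights the critic depends on.

import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Toy actor-critic with a shared trunk and two heads (hypothetical example)."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

net = SharedActorCritic()
probs, value = net(torch.randn(1, 4))

# An actor-only loss still produces gradients on the shared trunk,
# so an optimizer step on it moves the critic's features too.
actor_loss = -torch.log(probs[0, 0])
actor_loss.backward()
print(net.trunk[0].weight.grad is not None)  # True: the shared weights receive gradients
print(net.value_head.weight.grad)            # None: only the value head itself is untouched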

How does ACER prevent the critic from estimating very high rewards?

I assume you mean returns. It doesn't?

Would it be sufficient to take two subsequent optimizer steps, one for each loss (actor and critic), as in your code, rather than summing them?

I don't think that would be equivalent, no, at least not with a momentum-based optimizer such as Adam. However, you could do two successive backward calls followed by a single optimizer step, because backward calls accumulate into the parameter gradient buffers. That is,

(actor_loss + critic_loss).backward()
optimizer.step()

and

actor_loss.backward()
critic_loss.backward()
optimizer.step()

should be equivalent, but not

actor_loss.backward()
optimizer.step()
critic_loss.backward()
optimizer.step()

in general.
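If it helps, here is a small self-contained check of that equivalence with made-up toy losses (none of these names come from the repository); the point is just that successive backward calls accumulate into the same .grad buffers that a single backward on the summed loss would fill.

import torch

w = torch.randn(3, requires_grad=True)

def toy_losses(p):
    h = torch.tanh(p)                      # shared intermediate, like a shared trunk
    actor_loss = (h ** 2).sum()            # stand-in for an actor loss
    critic_loss = (h - 1.0).pow(2).sum()   # stand-in for a critic loss
    return actor_loss, critic_loss

# Variant 1: a single backward on the summed loss.
a, c = toy_losses(w)
(a + c).backward()
grad_summed = w.grad.clone()

# Variant 2: two successive backwards; gradients accumulate in w.grad.
w.grad = None
a, c = toy_losses(w)
a.backward(retain_graph=True)  # retain_graph because both losses share the intermediate h
c.backward()
grad_accumulated = w.grad.clone()

print(torch.allclose(grad_summed, grad_accumulated))  # True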

Or can you please suggest a solution to this problem?

I'm not too sure why your critic predicts exaggeratedly large returns, but I would focus on that first. If you think it happens early in training, perhaps you could clip the gradients coming from the critic loss? E.g. doing something like

actor_critic.zero_grad()
critic_loss.backward(retain_graph=True)  # critic gradients only; keep the graph for the actor pass
torch.nn.utils.clip_grad_value_(actor_critic.parameters(), 1)  # clip what the critic contributed
actor_loss.backward()  # actor gradients accumulate on top, unclipped
optimizer.step()
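(The ordering matters here: clip_grad_value_ clips whatever is currently in the .grad buffers in place, so calling it after the critic backward but before the actor backward means only the critic's contribution gets clipped, while the actor gradients added afterwards are left untouched.)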

Best,
Didier

Hello, this was very helpful, thank you very much!