Kaixhin/ACER

Doubts

random-user-x opened this issue · 7 comments

Could you please let me know why there is a negative sign? I think that since we have already defined the KL divergence in the step before, we do not need a negative sign here. Please let me know how you see this.

ACER/train.py

Line 81 in f22b07c

(-kl).backward(retain_graph=True)

The trust region function should be a translation of the one from ChainerRL, apart from the fact that the trust region here involves some of the parameters. If you think the sign should be the other way around, then you should raise an issue there to let them know too.

I think that since we have already defined the KL divergence before, we do not really need the negative sign; I am not sure why it is there.

Please let me know your views. Do you think it should be the other way round?

I'm not sure either; I just ported this part of the code from Chainer.
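
For what it's worth, here is a minimal sketch of one reading of the sign convention, based on the trust-region update in the ACER paper rather than on this repo's exact code; the names (`policy_stats`, `avg_policy_stats`, `delta`) are illustrative. Since the policy loss is something PyTorch minimises (i.e. the negated objective), its gradient `g` already points in the descent direction, so taking gradients of `-kl` keeps `k` in the same convention: the scalar `k·g - δ` is unchanged and the whole update simply flips sign along with the loss.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins, not the repo's variables
policy_stats = torch.softmax(torch.randn(1, 4), dim=1).requires_grad_()  # pi(.|s; theta)
avg_policy_stats = torch.softmax(torch.randn(1, 4), dim=1)               # pi(.|s; theta_avg)
delta = 1.0

# g: gradient of the (minimised) policy loss w.r.t. the policy statistics
loss = -(policy_stats[0, 0] + 1e-8).log()  # stand-in for the actual ACER policy loss
g = torch.autograd.grad(loss, policy_stats)[0]

# k: gradient of D_KL[pi_avg || pi]; negating the KL keeps k in the same
# (descent) sign convention as g
kl = F.kl_div(policy_stats.log(), avg_policy_stats, reduction='sum')
k = torch.autograd.grad(-kl, policy_stats)[0]

# z* = g - max(0, (k.g - delta) / ||k||^2) * k, as in the paper up to an overall sign
scale = torch.clamp((k * g).sum() - delta, min=0.0) / (k * k).sum().clamp(min=1e-8)
z_star_p = g - scale * k
```

If that reading is right, the negative sign is a convention choice rather than a bug, which would be consistent with the ChainerRL code this was ported from.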

Could you let me know the differences between your implementation and the OpenAI Baselines one?

I've not compared the two at all, and this is very low priority for me at the moment.

@Kaixhin, thank you for your input. I will look into the ChainerRL and OpenAI Baselines implementations to get more insight, and I will send a PR if needed.

Thanks

I am closing this issue because the current implementation seems to be correct. However, I think we need to detach z_star_p for better results. Please refer to #13.
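
For context on the detach point, a minimal sketch of why detaching matters when the corrected gradient is folded back into a surrogate loss; the names here are illustrative, not this repo's exact code.

```python
import torch

# Illustrative stand-ins, not the repo's variables
phi = torch.softmax(torch.randn(1, 4), dim=1).requires_grad_()  # policy statistics
# Stand-in for the trust-region-corrected gradient z*; with create_graph=True it
# still carries a computation graph back to phi
z_star_p = torch.autograd.grad((phi ** 2).sum(), phi, create_graph=True)[0]

# Detaching treats z* as a constant, so backward() does not also differentiate
# through the correction term itself
surrogate = (z_star_p.detach() * phi).sum()
surrogate.backward()
print(torch.allclose(phi.grad, z_star_p))  # True: phi receives exactly z* as its gradient
```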