MG2033/A2C

Does A2C support experience replay?

ShaoyuanLi opened this issue · 2 comments

I read your code and implemented a version with experience replay.
However, I find that the losses explode after a few frames (around 1000): the value loss becomes very large and the action loss becomes a very large negative number. Is this an error in my code, or does A2C not support experience replay in theory?

It is an on-policy method. Old data is practically from another policy, so it isn't a very good idea to update the policy network on old samples. I'm not quite sure about the value estimator, though. You might get away with using a replay buffer to train only the value network.
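A rough sketch of that idea (hypothetical names, not code from this repo): sample old transitions from a replay buffer and fit the value estimator with a one-step TD target, while the policy itself is still updated only from fresh rollouts.

```python
import random
import torch

def replay_value_update(value_net, value_optimizer, replay_buffer,
                        batch_size=64, gamma=0.99):
    # Sample a batch of old transitions; each entry is assumed to be a dict
    # with "state", "reward", "next_state", "done" tensors/scalars.
    batch = random.sample(replay_buffer, batch_size)
    states = torch.stack([t["state"] for t in batch])
    rewards = torch.tensor([t["reward"] for t in batch], dtype=torch.float32)
    next_states = torch.stack([t["next_state"] for t in batch])
    dones = torch.tensor([t["done"] for t in batch], dtype=torch.float32)

    # One-step TD target; the bootstrap value is not backpropagated through.
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * value_net(next_states).squeeze(-1)

    values = value_net(states).squeeze(-1)
    value_loss = (targets - values).pow(2).mean()

    # Update only the critic; the policy network is left untouched here.
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()
```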

csxeba is right: A2C and A3C are on-policy methods. Old data was sampled by an old policy, so it is clearly not from the same distribution as the current one. We usually use a buffer only to store data sampled by the current policy, and we clear it after each update. See the sketch below.
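A minimal sketch of that pattern (assumed `policy` and `value_net` modules, not the repo's actual code): fill a rollout buffer with transitions from the current policy, do one A2C update from it, then throw it away.

```python
import torch

class RolloutBuffer:
    """Short-lived, on-policy storage: filled, used once, then cleared."""
    def __init__(self):
        self.states, self.actions, self.rewards, self.dones = [], [], [], []

    def add(self, state, action, reward, done):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.dones.append(done)

    def clear(self):
        self.states, self.actions, self.rewards, self.dones = [], [], [], []

def a2c_update(policy, value_net, optimizer, buffer, last_value, gamma=0.99):
    # Discounted returns computed backwards over the fresh rollout.
    returns, R = [], last_value
    for r, d in zip(reversed(buffer.rewards), reversed(buffer.dones)):
        R = r + gamma * R * (1.0 - d)
        returns.insert(0, R)

    states = torch.stack(buffer.states)
    actions = torch.tensor(buffer.actions)
    returns = torch.tensor(returns, dtype=torch.float32)

    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()  # advantage from on-policy data only

    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()

    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()

    # The policy has just changed, so this rollout is now off-policy: discard it.
    buffer.clear()
```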