MG2033/A2C

Does A2C support experience replay?

ShaoyuanLi opened this issue · 2 comments

I read your code and implemented a version with experience replay.
However, I find that the losses explode after a few frames (around 1000): the value loss becomes very large and the action loss becomes a very large negative number. Is this an error in my code, or does A2C not support experience replay in theory?

It is an on-policy method. Old data is practically from another policy, so it isn't a very good idea to update the policy network on old samples. I'm not quite sure about the value estimator, though. You might get away with using a replay buffer to train only the value network.
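A rough sketch of that idea (hypothetical names, not code from this repo): sample old transitions from a replay buffer and fit the value estimator with a one-step TD target, while the policy itself is still updated only from fresh rollouts.

```python
import random
import torch

def replay_value_update(value_net, value_optimizer, replay_buffer,
                        batch_size=64, gamma=0.99):
    # Sample a batch of old transitions; each entry is assumed to be a dict
    # with "state", "reward", "next_state", "done" tensors/scalars.
    batch = random.sample(replay_buffer, batch_size)
    states = torch.stack([t["state"] for t in batch])
    rewards = torch.tensor([t["reward"] for t in batch], dtype=torch.float32)
    next_states = torch.stack([t["next_state"] for t in batch])
    dones = torch.tensor([t["done"] for t in batch], dtype=torch.float32)

    # One-step TD target; the bootstrap value is not backpropagated through.
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * value_net(next_states).squeeze(-1)

    values = value_net(states).squeeze(-1)
    value_loss = (targets - values).pow(2).mean()

    # Update only the critic; the policy network is left untouched here.
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()
```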

csxeba is right: A2C and A3C are on-policy methods. Old data was sampled by an old policy, so it is clearly not from the same distribution as the current one. We usually use a buffer only to store data sampled by the current policy, and we clear it after each update. See the sketch below.
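A minimal sketch of that pattern (assumed `policy` and `value_net` modules, not the repo's actual code): fill a rollout buffer with transitions from the current policy, do one A2C update from it, then throw it away.

```python
import torch

class RolloutBuffer:
    """Short-lived, on-policy storage: filled, used once, then cleared."""
    def __init__(self):
        self.states, self.actions, self.rewards, self.dones = [], [], [], []

    def add(self, state, action, reward, done):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.dones.append(done)

    def clear(self):
        self.states, self.actions, self.rewards, self.dones = [], [], [], []

def a2c_update(policy, value_net, optimizer, buffer, last_value, gamma=0.99):
    # Discounted returns computed backwards over the fresh rollout.
    returns, R = [], last_value
    for r, d in zip(reversed(buffer.rewards), reversed(buffer.dones)):
        R = r + gamma * R * (1.0 - d)
        returns.insert(0, R)

    states = torch.stack(buffer.states)
    actions = torch.tensor(buffer.actions)
    returns = torch.tensor(returns, dtype=torch.float32)

    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()  # advantage from on-policy data only

    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()

    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()

    # The policy has just changed, so this rollout is now off-policy: discard it.
    buffer.clear()
```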