The goal is to build an efficient learner that I can reuse in my other projects.
We use:
- the 'soft Watkins' TD update (from Human-level Atari 200x faster) to correct for off-policy actions while still allowing multi-step returns; see the sketch after this list.
- an exponential moving average target network to help stabilise training, also sketched below (I haven't seen this elsewhere, but haven't looked properly; it still needs to be evaluated -- WIP).
- (TODOs) uncertainty + discount / exploration / multi-agent / reward normalisation / etc.
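For reference, here is a minimal sketch of the soft Watkins update in rlax terms, assuming the MEME-style trace coefficient: the trace is kept wherever the behaviour action's value is within a tolerance `kappa` of the greedy value, rather than hard-cut on every non-greedy action. The function name, `kappa`, and the default values are illustrative, not this repo's actual API.

```python
import jax.numpy as jnp
import rlax

def soft_watkins_td_error(q_tm1, a_tm1, r_t, discount_t, q_t, a_t,
                          lambda_=0.7, kappa=0.01):
  """TD errors for a soft Watkins Q(lambda) target over one [T]-step sequence.

  q_tm1: [T, A] online Q-values at states s_0..s_{T-1}.
  a_tm1: [T]    actions taken at s_0..s_{T-1}.
  r_t, discount_t: [T] rewards and discounts for each transition.
  q_t:   [T, A] target-network Q-values at states s_1..s_T.
  a_t:   [T]    actions taken at s_1..s_T.
  """
  v_t = jnp.max(q_t, axis=-1)  # greedy bootstrap values
  q_a_t = jnp.take_along_axis(q_t, a_t[:, None], axis=-1)[:, 0]
  # Soft trace cutting: keep lambda where the taken action is near-greedy
  # (within kappa * |max Q|), instead of cutting on any non-greedy action
  # as classic Watkins Q(lambda) does.
  trace = lambda_ * (q_a_t >= v_t - kappa * jnp.abs(v_t))
  target_t = rlax.lambda_returns(r_t, discount_t, v_t, lambda_=trace,
                                 stop_target_gradients=True)
  q_a_tm1 = jnp.take_along_axis(q_tm1, a_tm1[:, None], axis=-1)[:, 0]
  return target_t - q_a_tm1
```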
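The target update itself is Polyak-style averaging of the online parameters into the target parameters (soft updates of this form also appear in e.g. DDPG); a minimal sketch using optax, where the function name and `tau` value are illustrative:

```python
import optax

def ema_target_update(online_params, target_params, tau=0.005):
  """EMA target update: target <- tau * online + (1 - tau) * target.

  Called once per learner step, in place of a periodic hard copy.
  """
  return optax.incremental_update(online_params, target_params, step_size=tau)
```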
There are also some replay buffers implemented using Reverb:
- a replay buffer supporting multi-step returns (sketched after this list),
- a multi-agent replay buffer,
- a replay buffer supporting offline / prior data (from Efficient Online Reinforcement Learning with Offline Data); see the mixed-sampling sketch below.
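For illustration, a multi-step table might be wired up roughly like this; the table name, sizes, selectors, and dummy data are assumptions, not this repo's actual configuration:

```python
import numpy as np
import reverb

N_STEP = 5  # illustrative

server = reverb.Server(tables=[
    reverb.Table(
        name='n_step_replay',
        sampler=reverb.selectors.Uniform(),
        remover=reverb.selectors.Fifo(),
        max_size=100_000,
        rate_limiter=reverb.rate_limiters.MinSize(1_000),
    )
])
client = reverb.Client(f'localhost:{server.port}')

# The trajectory writer keeps the last N_STEP appended steps referenceable,
# so each create_item call can cover an overlapping n-step window.
with client.trajectory_writer(num_keep_alive_refs=N_STEP) as writer:
  for t in range(10):
    writer.append({
        'obs': np.zeros(4, np.float32),
        'action': np.int32(0),
        'reward': np.float32(1.0),
    })
    if t >= N_STEP - 1:
      writer.create_item(
          table='n_step_replay',
          priority=1.0,
          trajectory={
              'obs': writer.history['obs'][-N_STEP:],
              'action': writer.history['action'][-N_STEP:],
              'reward': writer.history['reward'][-N_STEP:],
          })
```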
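The offline / prior-data buffer follows that paper's symmetric-sampling idea: each training batch is drawn half from online replay and half from a table pre-loaded with the prior data. A sketch continuing the setup above (the `offline_replay` table name, address, and 50/50 split are illustrative):

```python
import reverb

BATCH_SIZE = 256
client = reverb.Client('localhost:8000')  # illustrative address

# Half the batch from online experience, half from the pre-loaded offline
# table, mixed before computing the TD update.
online = list(client.sample('n_step_replay', num_samples=BATCH_SIZE // 2))
offline = list(client.sample('offline_replay', num_samples=BATCH_SIZE // 2))
batch = online + offline
```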
Code is inspired in style by (/ copied from) the rlax examples.