Multi-Step Bootstrapping with ReLAx
Example N-step TD3 implementation with ReLAx
Performance versus vanilla 1-step TD3 is measured by averaging learning curves (evaluated in a separate evaluation environment) over 4 experiments with random environment seeds.
The results are summarized in the following plot:
The only difference in hyper-parameter settings between N-step TD3 and vanilla TD3 is the presence of multi-step bootstrapping. The averaged curves show a substantial advantage for the 3-step version in both training speed and asymptotic performance, which suggests that N-step TD is often the cheapest way to improve the performance of an RL agent. Note that the incremental benefit of N-step TD may vary from task to task. For example, early experiments show that for MuJoCo's Ant-v2 environment the 3-step Bellman update works worse than the 1-step version.
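As a rough illustration of the idea, the N-step bootstrapped target replaces the single-reward TD target with a discounted sum of the next N rewards plus a bootstrapped critic value at step t+N. The sketch below is a minimal, hedged example and is not the ReLAx API; the function name and signature are purely illustrative.

```python
def n_step_target(rewards, dones, bootstrap_value, gamma=0.99):
    """Compute an N-step TD target:

        G_t = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1}
              + gamma^n * Q(s_{t+n}, a_{t+n})

    rewards:         list of the next n rewards [r_t, ..., r_{t+n-1}]
    dones:           list of done flags (1.0 terminates bootstrapping)
    bootstrap_value: critic estimate Q(s_{t+n}, a_{t+n})
    """
    target = bootstrap_value
    # Walk backwards through the n-step window, cutting off the
    # bootstrap (and any later rewards) at terminal transitions.
    for reward, done in zip(reversed(rewards), reversed(dones)):
        target = reward + gamma * (1.0 - done) * target
    return target


# Example: 3-step window, gamma = 0.5, no terminals, zero bootstrap
# G = 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(n_step_target([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=0.5))
```

Setting the window length to 1 recovers the vanilla 1-step TD3 target, which is why multi-step bootstrapping can be toggled with a single hyper-parameter.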
Resulting Policy