Multi-Step Bootstrapping with ReLAx
Example N-step TD3 implementation with ReLAx
Performance versus vanilla 1-step TD3 is measured by averaging learning curves (evaluated in a separate evaluation environment) over 4 experiments with random environment seeds.
The results are summarized in the following plot:
The only difference in hyper-parameter settings between N-step TD3 and vanilla TD3 is the presence of multi-step bootstrapping. The averaged curves show a substantial advantage for the 3-step version in both training speed and asymptotic performance, which suggests that N-step TD is often the cheapest way to improve the performance of an RL agent. Note that the incremental benefit of N-step TD may vary from task to task. For example, early experiments show that for MuJoCo's Ant-v2 environment the 3-step Bellman update works worse than the 1-step version.
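As a rough illustration of the idea, the N-step bootstrapped target replaces the single-reward TD target with a discounted sum of the next N rewards plus a bootstrapped critic value at step t+N. The sketch below is a minimal, hedged example and is not the ReLAx API; the function name and signature are purely illustrative.

```python
def n_step_target(rewards, dones, bootstrap_value, gamma=0.99):
    """Compute an N-step TD target:

        G_t = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1}
              + gamma^n * Q(s_{t+n}, a_{t+n})

    rewards:         list of the next n rewards [r_t, ..., r_{t+n-1}]
    dones:           list of done flags (1.0 terminates bootstrapping)
    bootstrap_value: critic estimate Q(s_{t+n}, a_{t+n})
    """
    target = bootstrap_value
    # Walk backwards through the n-step window, cutting off the
    # bootstrap (and any later rewards) at terminal transitions.
    for reward, done in zip(reversed(rewards), reversed(dones)):
        target = reward + gamma * (1.0 - done) * target
    return target


# Example: 3-step window, gamma = 0.5, no terminals, zero bootstrap
# G = 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(n_step_target([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=0.5))
```

Setting the window length to 1 recovers the vanilla 1-step TD3 target, which is why multi-step bootstrapping can be toggled with a single hyper-parameter.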
Resulting Policy