I made various changes to the MBPO codebase to facilitate my research.
- The actor and critic can use separate replay buffers, e.g. one with model-free data and the other with only model-augmented data (see the replay-buffer sketch after this list).
- Rollouts can be vine-style, i.e. several actions are branched from a single given state (see the vine-rollout sketch below).
- Reproduced REDQ (https://arxiv.org/abs/2101.05982), so a Q ensemble can be used together with MBPO. Each Q network in the ensemble can also be trained on a different batch of data (see the Q-ensemble sketch below).
- The learning of the dynamics model can be delayed, e.g. model training only begins after a set number of environment steps.
- Checkpoints for the model network and the policy network can be saved separately and reloaded separately (see the checkpointing sketch below).
- The standard deviations of the policy and of the Q ensemble can be evaluated and plotted (covered in the Q-ensemble sketch below).
- Added the Gym Reacher and Fetch environments.
- The experiment name and checkpoint frequency can be set from the command line (see the CLI sketch below).
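
Below is a minimal sketch of the separate-buffer idea. The `ReplayBuffer` class, buffer names, and dimensions are hypothetical stand-ins, not the identifiers actually used in this repo:

```python
import numpy as np

class ReplayBuffer:
    """Simple FIFO replay buffer (a stand-in for the repo's own class)."""
    def __init__(self, capacity, obs_dim, act_dim):
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rew = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.ptr, self.size, self.capacity = 0, 0, capacity

    def add(self, o, a, r, o2):
        self.obs[self.ptr], self.act[self.ptr] = o, a
        self.rew[self.ptr], self.next_obs[self.ptr] = r, o2
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = np.random.randint(0, self.size, size=batch_size)
        return self.obs[idx], self.act[idx], self.rew[idx], self.next_obs[idx]

# Two buffers: one for real environment transitions, one for model rollouts
# (dims are arbitrary here, e.g. HalfCheetah-like obs_dim=17, act_dim=6).
env_buffer = ReplayBuffer(100_000, obs_dim=17, act_dim=6)
model_buffer = ReplayBuffer(400_000, obs_dim=17, act_dim=6)

# Fill with dummy transitions so the sketch runs end to end.
rng = np.random.default_rng(0)
for _ in range(256):
    o, a = rng.normal(size=17), rng.normal(size=6)
    env_buffer.add(o, a, rng.normal(), rng.normal(size=17))
    model_buffer.add(o, a, rng.normal(), rng.normal(size=17))

critic_batch = env_buffer.sample(128)   # e.g. the critic sees model-free data
actor_batch = model_buffer.sample(128)  # while the actor sees model rollouts
```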
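A sketch of a vine-style rollout, where several actions branch from the same start state. `model` and `policy` here are toy stand-ins for the repo's dynamics model and SAC policy:

```python
import numpy as np

def vine_rollout(model, policy, start_state, n_branches=4, horizon=3):
    """Branch several actions from one given state, then roll each branch
    forward with the learned model for a short horizon."""
    transitions = []
    for _ in range(n_branches):
        s = start_state
        for _ in range(horizon):
            a = policy(s)            # sample one action for this branch
            s_next, r = model(s, a)  # one-step model prediction
            transitions.append((s, a, r, s_next))
            s = s_next
    return transitions

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
policy = lambda s: rng.normal(size=2)
model = lambda s, a: (s + 0.1 * rng.normal(size=s.shape), float(rng.normal()))
print(len(vine_rollout(model, policy, np.zeros(4))))  # 4 branches * 3 steps = 12
```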
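A simplified sketch of the REDQ-style Q ensemble, assuming PyTorch. It shows the in-target minimization over a random subset of the ensemble and the ensemble standard deviation that can be plotted; SAC's entropy term and target networks are omitted, and all names and dimensions here are illustrative, not the repo's actual API:

```python
import torch
import torch.nn as nn

N_ENSEMBLE, M_SUBSET = 10, 2  # REDQ defaults: ensemble size N=10, in-target subset M=2

def make_q(obs_dim, act_dim):
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                         nn.Linear(256, 1))

# Each member can be updated on its own minibatch to decorrelate the ensemble.
q_ensemble = [make_q(17, 6) for _ in range(N_ENSEMBLE)]

def redq_target(next_obs, next_act, rew, gamma=0.99):
    """REDQ-style target: min over a random subset of the Q ensemble
    (simplified; entropy term and target networks omitted)."""
    idx = torch.randperm(N_ENSEMBLE)[:M_SUBSET]
    sa = torch.cat([next_obs, next_act], dim=-1)
    qs = torch.stack([q_ensemble[i](sa) for i in idx.tolist()], dim=0)
    return rew.unsqueeze(-1) + gamma * qs.min(dim=0).values

def q_ensemble_std(obs, act):
    """Std across ensemble Q estimates -- the disagreement that can be plotted."""
    sa = torch.cat([obs, act], dim=-1)
    qs = torch.stack([q(sa) for q in q_ensemble], dim=0)  # (N, batch, 1)
    return qs.std(dim=0).mean().item()

obs, act, rew = torch.randn(32, 17), torch.randn(32, 6), torch.randn(32)
print(redq_target(obs, act, rew).shape, q_ensemble_std(obs, act))
```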
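A sketch of separate checkpointing, again assuming PyTorch. The file layout and function names are hypothetical, not the repo's actual ones:

```python
import os
import torch

def save_separately(dynamics_model, policy, step, ckpt_dir="checkpoints"):
    """Save model and policy weights to separate files."""
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(dynamics_model.state_dict(),
               os.path.join(ckpt_dir, f"model_{step}.pt"))
    torch.save(policy.state_dict(),
               os.path.join(ckpt_dir, f"policy_{step}.pt"))

def load_policy_only(policy, step, ckpt_dir="checkpoints"):
    """Reload just the policy, e.g. to pair a saved policy with a fresh model."""
    policy.load_state_dict(
        torch.load(os.path.join(ckpt_dir, f"policy_{step}.pt")))
```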
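And a sketch of the command-line interface; the flag names below are placeholders, so check the repo's entry point for the actual ones:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--exp-name", type=str, default="mbpo_run",
                    help="name used for the log/checkpoint directory")
parser.add_argument("--checkpoint-freq", type=int, default=10000,
                    help="save a checkpoint every N environment steps")
args = parser.parse_args()
print(args.exp_name, args.checkpoint_freq)
```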