Example TD3 implementation with ReLAx

This repository contains an implementation of Twin Delayed Deep Deterministic Policy Gradient (TD3) with ReLAx.
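For readers new to TD3, the critic target is the step that most clearly separates it from DDPG: it smooths the target policy with clipped noise and takes the minimum of two target critics. The following is a minimal NumPy sketch of that target only. It is not the ReLAx API; `td3_target`, the callables `target_actor`, `target_q1`, `target_q2`, and the default hyperparameter values are all hypothetical names and assumptions chosen for illustration.

```python
import numpy as np


def td3_target(rewards, dones, next_obs, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0,
               rng=None):
    """Clipped double-Q target with target policy smoothing (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng

    # Target policy smoothing: perturb the target action with clipped noise.
    next_act = target_actor(next_obs)
    noise = rng.normal(0.0, noise_std, size=next_act.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    next_act = np.clip(next_act + noise, -act_limit, act_limit)

    # Clipped double Q-learning: use the smaller of the two target critics.
    q_min = np.minimum(target_q1(next_obs, next_act),
                       target_q2(next_obs, next_act))

    # One-step bootstrapped target; terminal transitions do not bootstrap.
    return rewards + gamma * (1.0 - dones) * q_min


# Toy usage with constant stand-in networks (assumed shapes: batch of 2).
obs = np.zeros((2, 3))
target = td3_target(
    rewards=np.array([1.0, 1.0]),
    dones=np.array([0.0, 1.0]),
    next_obs=obs,
    target_actor=lambda o: np.zeros((len(o), 1)),
    target_q1=lambda o, a: np.full(len(o), 2.0),
    target_q2=lambda o, a: np.full(len(o), 1.0),
)
```

The "delayed" part of TD3 is not shown here: the actor and the target networks are typically updated once for every few critic updates.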

The TD3 actor was trained on the Walker2d-v2 MuJoCo Gym environment for 1M environment steps.

The graph of average return vs. environment step is shown below (logged every 10k steps):

The distribution of estimated Q-values vs data Q-values is shown below:
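One plausible reading of "data Q-values" is the discounted Monte Carlo return observed from each step of a collected trajectory, which can then be compared against the critic's estimate for the same state-action pair. The sketch below computes those returns under that assumption; `discounted_returns` and the choice of `gamma` are hypothetical and not taken from this repository.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Accumulate backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy episode of three steps with reward 1 each.
data_q = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

If the critic's estimates sit well above these returns, that suggests overestimation, which is the failure mode the twin critics in TD3 are meant to reduce.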

Resulting Policy:

td3_run.mp4

nslyubaykin/relax_td3_example