After DDPG, several improvements were proposed; taken together, they form a new model called Twin-Delayed DDPG (TD3) [1]. Theoretical background on the TD3 model can be found in the td3.pdf file.
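As a quick illustration of the three improvements TD3 adds on top of DDPG (clipped double-Q learning, target policy smoothing, and delayed policy updates), here is a minimal plain-Python sketch. Scalars stand in for PyTorch tensors, and the default noise and delay values are common choices, not necessarily the ones used in this repo:

```python
import random

def smoothed_target_action(mu_next, noise_std=0.2, noise_clip=0.5,
                           act_low=-2.0, act_high=2.0):
    """Target policy smoothing: add clipped Gaussian noise to the target
    actor's action, then clip the result to the action bounds."""
    noise = max(-noise_clip, min(noise_clip, random.gauss(0.0, noise_std)))
    return max(act_low, min(act_high, mu_next + noise))

def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped double-Q learning: the TD target uses the minimum of the
    two target critics, which limits overestimation bias."""
    q_min = min(q1_next, q2_next)
    return reward + (0.0 if done else gamma * q_min)

# Delayed policy updates: the critics are updated at every step, while the
# actor (and the target networks) are updated only every `policy_delay` steps.
policy_delay = 2
actor_update_schedule = [step % policy_delay == 0 for step in range(1, 7)]
```

The `min` over the two critics is what gives the algorithm its "twin" name, and the update schedule above is why it is "delayed".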
- launcher_grid_calibration.py: grid search for hyperparameter calibration.
- launcher_agent_training.py: complete training of an agent.
- launcher_optimal_agent_playing.py: watch an optimal agent interact with the environment.
The chosen environment is Gym's 'Pendulum-v0', a simple one where the agent has to swing up an inverted pendulum. To transpose the code to other environments, a modular implementation is provided: the main points to adapt are the Preprocessor class and the hyperparameters.
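To give an idea of what adapting the Preprocessor to a new environment involves, here is a hypothetical sketch; the class and method names are illustrative and the actual Preprocessor in this repo may differ. Plain lists stand in for the PyTorch tensors the real code would build:

```python
class PendulumPreprocessor:
    """Illustrative environment adapter. It is the single place that
    knows the observation/action shapes of the chosen Gym environment,
    so swapping environments means swapping this class (and re-tuning
    the hyperparameters)."""

    state_dim = 3     # Pendulum-v0 observation: (cos theta, sin theta, theta_dot)
    action_dim = 1    # a single torque value
    action_high = 2.0 # Pendulum-v0 torque bound: [-2, 2]

    def state_to_agent(self, observation):
        # The real code would build a PyTorch tensor here;
        # a plain list of floats stands in for brevity.
        return [float(x) for x in observation]

    def action_to_env(self, action):
        # Clip the actor's output to the environment's action bounds.
        return [max(-self.action_high, min(self.action_high, a))
                for a in action]
```

For a different environment, only `state_dim`, `action_dim`, the bounds, and the two conversion methods would need to change.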
A simple hyperparameter optimization has been performed: a grid search on the learning rates of the three neural networks. With the current code, it is easy to extend the search to other hyperparameters if necessary.
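The grid search over the three learning rates can be sketched as follows; the grid values and the `evaluate` callback are assumptions for illustration, not the values used in launcher_grid_calibration.py:

```python
from itertools import product

# Hypothetical grid over the three networks' learning rates
# (actor + two critics); the values actually searched may differ.
grid = {
    "lr_actor":   [1e-4, 1e-3],
    "lr_critic1": [1e-4, 1e-3],
    "lr_critic2": [1e-4, 1e-3],
}

def grid_search(grid, evaluate):
    """Return the best configuration found over the Cartesian product
    of the grid. `evaluate` maps a config dict to a scalar score,
    e.g. the mean return of an agent trained with that config."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Adding another hyperparameter to the search only means adding another key to `grid`, at the cost of multiplying the number of training runs.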
The environment will provide
+-------------+ a new state and a new reward +---------------------------------------------+
| Environment | at each new step | TD3 Agent |
+-------------+ +---------------------------------------------+
| +----------------------------------------> +-----------------------------------------+ |
| | | | Actor | Critic | |
| <----------------------------------------+ +------------------+----------------------+ |
+-------------+ The actor will choose an action | |
with respect to the state | |
+------------------+--^-----------------------+
| |
+ + The agent will store| |
| | new experiences in | | The agent will ask for
+----------------------------------+ the memory when | | past experiences in
If necessary, the tackling a step | | order to learn
communication with the | |
environment will require (PyTorch tensors) | |
to convert in PyTorch tensors | | (PyTorch tensors)
the information at each step, | |
and vice versa +-------------------v--+----------------------+
| Memory |
+---------------------------------------------+
| |
| |
+---------------------------------------------+
The memory only interacts with the agent;
experiences are stored as PyTorch
tensors
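The memory box in the diagram above can be sketched as a minimal uniform replay buffer; this is an illustrative version (the real implementation stores PyTorch tensors rather than plain tuples):

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal sketch of the memory in the diagram: a bounded FIFO
    buffer of transitions, sampled uniformly at random. Old
    experiences are discarded automatically once capacity is reached."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Called by the agent when tackling a step.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Called by the agent when it asks for past experiences to learn.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The `deque(maxlen=...)` gives the FIFO eviction for free, which keeps the sketch short.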
The current optimal agent has a mean score over 100 episodes that generally oscillates between -150 and -170.
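The metric quoted above (mean score over the last 100 episodes) can be tracked with a small rolling window; the class name here is illustrative:

```python
from collections import deque

class ScoreTracker:
    """Rolling mean of episode returns over the last `window` episodes,
    the usual Gym-style evaluation metric."""

    def __init__(self, window=100):
        self.scores = deque(maxlen=window)

    def add(self, episode_return):
        self.scores.append(episode_return)

    def mean(self):
        return sum(self.scores) / len(self.scores)
```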
Some hyperparameters appear to have a strong influence on the convergence speed. It would be worthwhile in the future to explore further hyperparameter optimization methods, such as Bayesian optimization. However, this is very challenging due to the high dimension of the hyperparameter space (the agent's hyperparameters plus those of the three neural networks), and a rigorous hyperparameter optimization would require a lot of computing power.
Another possible path is to implement Prioritized Experience Replay, in order to accelerate convergence by choosing the most useful samples during the training process.
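The core idea of Prioritized Experience Replay can be sketched as proportional sampling on the TD error; this is a simplified illustration (a production version would use a sum-tree for efficiency and importance-sampling weights to correct the induced bias):

```python
import random

class PrioritizedReplay:
    """Sketch of proportional prioritized experience replay:
    transitions are sampled with probability proportional to
    (|TD error| + eps) ** alpha, so surprising transitions are
    replayed more often than well-predicted ones."""

    def __init__(self, alpha=0.6, eps=1e-6):
        self.alpha, self.eps = alpha, eps
        self.transitions, self.priorities = [], []

    def push(self, transition, td_error):
        self.transitions.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        total = sum(self.priorities)
        weights = [p / total for p in self.priorities]
        return random.choices(self.transitions, weights=weights,
                              k=batch_size)
```

Compared with the uniform memory currently used, only `push` (which now needs a TD error) and `sample` change, so it would slot into the existing agent/memory interface shown in the diagram.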