This repo contains simple PyTorch implementations of several Reinforcement Learning algorithms:
- Advantage Actor Critic (A2C) - a synchronous variant of A3C
- Proximal Policy Optimization (PPO) - one of the most popular RL algorithms; see the PPO, Truly PPO, Implementation Matters, and A Large-Scale Empirical Study of PPO papers
- On-policy Maximum A Posteriori Policy Optimization (V-MPO) - the algorithm DeepMind used in their recent works; see the V-MPO paper (not working yet...)
- Behavior Cloning (BC) - a simple technique for cloning expert behaviour into a new policy
Each algorithm supports vector/image/dict observation spaces and discrete/continuous action spaces.
When I started this project and repo, I thought that Imitation Learning would be my main focus, and model-free methods would be used only at the beginning to train 'experts'. However, the PPO implementation (and its tricks) took more time than I expected. As a result, most of the code is now related to PPO, but I am still interested in Imitation Learning and plan to add several related algorithms.
For now this repo contains some model-free on-policy algorithm implementations: A2C, PPO, V-MPO and BC. Each algorithm supports discrete (Categorical, Bernoulli, GumbelSoftmax) and continuous (Beta, Normal, tanh(Normal)) policy distributions, and vector or image observation environments. Beta and tanh(Normal) work best in my experiments (tested on the BipedalWalker and Humanoid environments).
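For reference, a tanh(Normal) distribution can be assembled from torch.distributions primitives roughly like this (a minimal sketch, not the exact class used in this repo; shapes are illustrative):

```python
import torch
from torch.distributions import Normal, Independent, TransformedDistribution
from torch.distributions.transforms import TanhTransform

def make_tanh_normal(mean, log_std):
    # Squash a diagonal Gaussian through tanh so samples lie in (-1, 1).
    base = Independent(Normal(mean, log_std.exp()), 1)
    return TransformedDistribution(base, [TanhTransform(cache_size=1)])

dist = make_tanh_normal(torch.zeros(2, 6), torch.full((2, 6), -0.5))
action = dist.rsample()           # reparameterized sample in (-1, 1)
log_prob = dist.log_prob(action)  # includes the tanh log-det-Jacobian correction
```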
As found in the Implementation Matters paper, PPO works largely because of "code-level" optimizations. Most of them are implemented here (a sketch of one of them follows the list):
- Value function clipping (works better without it)
- Observation normalization & clipping
- Reward normalization/scaling & clipping
- Orthogonal initialization of neural network weights
- Gradient clipping
- Learning rate annealing (will be added)
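As an illustration, value function clipping (the first item above) is usually implemented roughly like this; treat it as a sketch rather than the exact code from this repo:

```python
import torch

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    # Limit how far the new value prediction may move away from the
    # value recorded at rollout time, mirroring the policy ratio clip.
    values_clipped = old_values + (values - old_values).clamp(-clip_eps, clip_eps)
    loss_unclipped = (values - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```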
In addition, I implemented the roll-back loss from the Truly PPO paper, which works very well, and the 'advantage recompute' option from the A Large-Scale Empirical Study of PPO paper.
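My reading of the roll-back surrogate, as a rough sketch (the exact form and the default alpha follow my understanding of the Truly PPO paper and may differ from this repo's code):

```python
import torch

def rollback_surrogate(ratio, advantage, clip_eps=0.2, alpha=0.3):
    # Roll-back objective (to be maximized): inside the clip range it matches
    # the usual PPO surrogate; outside it, the slope is reversed (-alpha),
    # so gradients actively push the probability ratio back into the range.
    rollback_hi = -alpha * ratio + (1.0 + alpha) * (1.0 + clip_eps)
    rollback_lo = -alpha * ratio + (1.0 + alpha) * (1.0 - clip_eps)
    clipped = torch.where(ratio > 1.0 + clip_eps, rollback_hi,
                          torch.where(ratio < 1.0 - clip_eps, rollback_lo, ratio))
    return torch.min(ratio * advantage, clipped * advantage).mean()
```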
For image-observation environments I added a special regularization similar to the CURL paper, but instead of a contrastive loss between features from the convolutional feature extractor, I directly minimize the KL divergence D_KL between the policies on augmented and non-augmented images.
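Schematically, that regularizer can be written like this (a sketch under assumed names: policy is assumed to return a torch distribution, and augment_fn is a hypothetical image augmentation):

```python
import torch
from torch.distributions import kl_divergence

def augmentation_kl_loss(policy, obs, augment_fn):
    # Penalize the policy for changing its action distribution when the
    # image observation is augmented (random shift/crop, etc.).
    with torch.no_grad():
        pi_clean = policy(obs)        # distribution on original images
    pi_aug = policy(augment_fn(obs))  # distribution on augmented images
    return kl_divergence(pi_clean, pi_aug).mean()
```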
As for Imitation Learning algorithms, there is only Behavior Cloning for now, but more will be added.
.
├── algorithms
├── agents # A2C, PPO, V-MPO, BC, and any other agent algorithms...
└── ... # all different algorithm parts: neural networks, probability distributions, etc.
├── experts # checkpoints of trained models: *.pth files with the nn model and weights, and *.py scripts with the model definition.
├── train_scripts # folder with train scripts.
├── ppo/humanoid.py # train script for the Humanoid environment.
├── ... # other train configs.
└── test.py # script for testing trained agent.
├── trainers # implementation of trainers for different algorithms. Trainer is a manager that controls data-collection, model optimization and testing.
└── utils # all other 'support' functions that do not fit in any other folder.
Each experiment is described in a config; look at the examples in the train_scripts folder.
Example of training PPO agent on CartPole-v1 env:
python train_scripts/ppo/cart_pole.py
Training results (including the training config, tensorboard logs and model checkpoints) will be saved in the log_dir folder.
Obtained policy:
Results of a trained policy can be shown with the train_scripts/test.py script.
This script is able to:
- just show how policy acts in environment
- measure mean reward and episode length over some number of episodes
- record demo file with trajectories
Type python train_scripts/test.py -h in the terminal to see a detailed description of the available arguments.
The demo file for BC is expected to be a .pickle with a list of episodes inside. An episode is a list [observations, actions, rewards], where observations = [obs_0, obs_1, ..., obs_T], and the same layout is used for actions and rewards.
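For illustration, a demo file with that layout can also be written by hand; a minimal sketch with toy data (the file name and values are placeholders):

```python
import pickle

# One toy episode in the layout described above:
# an episode is [observations, actions, rewards] with aligned lists.
observations = [[0.0, 0.1], [0.0, 0.2], [0.1, 0.3]]  # obs_0 .. obs_T
actions = [0, 1, 0]
rewards = [1.0, 1.0, 1.0]
episodes = [[observations, actions, rewards]]

with open('my_demo.pickle', 'wb') as f:
    pickle.dump(episodes, f)

with open('my_demo.pickle', 'rb') as f:
    loaded = pickle.load(f)
assert len(loaded[0]) == 3  # observations, actions, rewards
```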
- Record demo file from trained policy:
python train_scripts/test.py -f logs/cart_pole/a2c_exp_0/ -p 10 -n 10 -d demo_files/cartpole_demo_10_ep.pickle -t -1 -r
- Prepare a config to train BC (see train_scripts/bc/cart_pole_10_episodes.py for an example)
- Run BC training script:
python train_scripts/bc/cart_pole_10_episodes.py
- ???
- Enjoy policy:
python train_scripts/test.py -f logs_py/cart_pole/bc_10_episodes/ -p 1
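Under the hood, BC boils down to supervised learning on the demo data: maximizing the policy's log-probability of the expert actions. A minimal sketch (it assumes the policy returns a torch distribution):

```python
def bc_loss(policy, expert_obs, expert_actions):
    # Behavior Cloning objective: maximize the log-likelihood of the expert
    # actions under the current policy (written here as a loss to minimize).
    dist = policy(expert_obs)  # assumed to return a torch.distributions object
    return -dist.log_prob(expert_actions).mean()
```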
Each agent has optional 'observation_encoder' and 'observation_normalizer' arguments. The observation encoder is a neural network (i.e. an nn.Module) applied directly to the observation, typically an image. The observation normalizer is a running mean-variance estimator that standardizes observations; it is applied after the encoder. Sometimes an actor-critic trains better on such zero-mean unit-variance observations or embeddings.
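Such a running mean-variance normalizer can be sketched as follows (a generic Welford-style implementation, not the exact class from this repo):

```python
import torch

class RunningNormalizer:
    # Tracks a running mean/variance of observations (or embeddings)
    # and standardizes them to roughly zero mean and unit variance.
    def __init__(self, size, eps=1e-8):
        self.mean = torch.zeros(size)
        self.var = torch.ones(size)
        self.count = eps

    def update(self, batch):
        batch_mean = batch.mean(0)
        batch_var = batch.var(0, unbiased=False)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta.pow(2) * self.count * batch_count / total) / total
        self.count = total

    def __call__(self, x):
        return (x - self.mean) / (self.var.sqrt() + 1e-8)
```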
To train your own neural network architecture you can just import it in a config, initialize it in the 'make_agent' function, and pass it as the 'actor_critic' argument into the agent.
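Hypothetically, a custom model could look like the toy module below; the forward() interface expected by the agents in this repo may differ, so follow the existing model definitions in the algorithms folder:

```python
import torch
import torch.nn as nn

class MyActorCritic(nn.Module):
    # Toy two-headed actor-critic; adapt the output format to whatever
    # the repo's agents expect (distribution parameters + value estimate).
    def __init__(self, obs_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, action_dim)  # e.g. Categorical logits
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h)
```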
GIFs of some of the results:
BipedalWalker-v3: mean reward ~333, 0 fails over 1000 episodes, config.
Humanoid-v3: mean reward ~11.3k, 14 fails over 1000 episodes, config.
The Humanoid experiments were done with mujoco v2, which has an integration bug that makes the environment easier. For academic purposes it is correct to use mujoco v1.5.
CarRacing-v0: mean reward = 894 ± 32, 26 fails over 100 episodes (episode is considered failed if reward < 900), config.
The V-MPO implementation trains slower than A2C, probably because of non-optimal hyper-parameters or sampling; this needs further investigation.
- Add logging where it is possible
- Add the Motion Imitation algorithm from the DeepMimic paper
- Add self-play trainer with PPO as backbone algo
- Support recurrent policy models and training
- ...